Closed by chaoSefat 1 month ago
Perhaps a general text cleaner can and should be written that cleans high-resource languages out of scraped text, rather than a site-specific cleaner.
A beta cleaner has been made on Colab. It can be integrated once the text dataset is formed.
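As a rough illustration of what a site-independent cleaner could look like, here is a minimal sketch that drops lines which look mostly English, judged by the share of common English stopwords. This is not the Colab beta cleaner mentioned in this thread. The stopword list, the `threshold` value, and the function names `looks_english` and `drop_english_lines` are all assumptions made for illustration.

```python
import re

# Hypothetical, deliberately small stopword list (an assumption, not from the
# issue). A real cleaner would need a fuller list per high-resource language.
ENGLISH_STOPWORDS = frozenset({
    "the", "of", "and", "to", "in", "is", "for", "that",
    "with", "this", "are", "on", "it", "as",
})


def looks_english(line, threshold=0.3):
    """Return True if at least `threshold` of the ASCII word tokens are English stopwords."""
    tokens = re.findall(r"[a-z]+", line.lower())
    if not tokens:
        return False
    hits = sum(1 for token in tokens if token in ENGLISH_STOPWORDS)
    return hits / len(tokens) >= threshold


def drop_english_lines(text, threshold=0.3):
    """Keep only the lines that do not look English under the heuristic above."""
    kept = [line for line in text.splitlines() if not looks_english(line, threshold)]
    return "\n".join(kept)
```

A stopword ratio is only a heuristic: a low-resource language written in Latin script may share short words such as "to" or "in" with English, so the threshold would need tuning on real scraped text.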
Texts scraped from Glosbe contain recurring English artifacts, which include:
Being a dictionary, Glosbe provides translations from one language to another. For example, "https://glosbe.com/en/ada/yard" translates "yard" from English to Adangme. Another page, "https://glosbe.com/fi/ady/koskea", translates the Finnish word "koskea" to Adyghe. Furthermore, each page provides translated example sentences.
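The two URLs above suggest a path layout of source language, target language, then the term. A minimal sketch of pulling those parts out of a URL, assuming every entry page follows that three-segment layout (the helper name `parse_glosbe_url` is hypothetical):

```python
from urllib.parse import unquote, urlparse


def parse_glosbe_url(url):
    """Split a Glosbe entry URL into (source_lang, target_lang, term).

    Assumes the path has exactly three segments, as in the two example
    URLs from this issue. Other page types are rejected.
    """
    segments = [s for s in urlparse(url).path.split("/") if s]
    if len(segments) != 3:
        raise ValueError(f"unexpected Glosbe path: {url!r}")
    source_lang, target_lang, term = segments
    # Terms may be percent-encoded in the URL, so decode them.
    return source_lang, target_lang, unquote(term)
```

Knowing the source language of a page could help a cleaner decide which language's text to treat as an artifact on that page.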
Possible Options: