cisnlp / GlotWeb

GlotWeb: Web Indexing for Low-Resource Languages -- under construction.
https://cis-lmu-glotweb.hf.space
Creative Commons Zero v1.0 Universal

glosbe cleaning #6

Closed chaoSefat closed 1 month ago

chaoSefat commented 1 month ago

Texts scraped from Glosbe contain recurring English artifacts, which include:

Being a dictionary, Glosbe provides translations from one language to another. For example, "https://glosbe.com/en/ada/yard" translates "yard" from English to Adangme, and "https://glosbe.com/fi/ady/koskea" translates the Finnish word "koskea" to Adyghe. It also provides translated example sentences.
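The two example URLs suggest a path layout of source language, target language, then headword. A minimal sketch of reading that layout, assuming it holds for every entry page (the function name `parse_glosbe_url` is hypothetical and not part of GlotWeb):

```python
from urllib.parse import urlparse, unquote


def parse_glosbe_url(url):
    """Split a Glosbe entry URL into (source_lang, target_lang, headword).

    Assumes the path looks like /<source>/<target>/<headword>, as in the
    two example URLs above; other page types raise ValueError.
    """
    parts = [p for p in urlparse(url).path.split("/") if p]
    if len(parts) != 3:
        raise ValueError(f"unexpected Glosbe path: {url!r}")
    source_lang, target_lang, headword = parts
    # Headwords may be percent-encoded in the URL, so decode them.
    return source_lang, target_lang, unquote(headword)


print(parse_glosbe_url("https://glosbe.com/en/ada/yard"))
print(parse_glosbe_url("https://glosbe.com/fi/ady/koskea"))
```

Knowing the source language of a page could help a cleaner decide which high-resource language to expect in the artifacts, since it is not always English.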

Possible Options:

chaoSefat commented 1 month ago

Perhaps a general text cleaner can and should be written that would clean high-resource languages out of scraped text, rather than a site-specific cleaner.
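One way such a general cleaner might look is a line filter that takes any language detector as a parameter. This is only a sketch under stated assumptions, not the cleaner described in this thread: `toy_detect_english`, its stopword list, and the 0.3 threshold are hypothetical stand-ins for a real language identifier.

```python
import re

# Tiny illustrative stopword list (an assumption for this sketch);
# a real cleaner would plug in a proper language identifier instead.
ENGLISH_STOPWORDS = frozenset(
    {"the", "of", "and", "to", "in", "is", "for", "with", "that", "this", "from", "are"}
)


def toy_detect_english(line, threshold=0.3):
    """Hypothetical stand-in detector.

    Returns "en" when at least `threshold` of the ASCII word tokens are
    English stopwords, otherwise None.
    """
    tokens = re.findall(r"[a-z']+", line.lower())
    if not tokens:
        return None
    hits = sum(token in ENGLISH_STOPWORDS for token in tokens)
    return "en" if hits / len(tokens) >= threshold else None


def clean_text(text, detect=toy_detect_english, high_resource=frozenset({"en"})):
    """Drop every line whose detected language is in `high_resource`."""
    kept = [line for line in text.splitlines() if detect(line) not in high_resource]
    return "\n".join(kept)


sample = "Translation of yard in the dictionary is shown for this page\nlorem ipsum dolor"
print(clean_text(sample))
```

Because the detector is passed in as `detect`, the same filter works for any site and any set of high-resource languages; only the detector and the `high_resource` set need to change.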

chaoSefat commented 1 month ago

A beta cleaner has been made on Colab. It can be integrated once the text dataset is formed.

chaoSefat commented 1 month ago

https://github.com/fedelopez77/langdetect