cisnlp / GlotLID

GlotLID: Language Identification with Support for More Than 2000 Labels -- EMNLP 2023
https://arxiv.org/abs/2310.16248
Apache License 2.0

add source #3

Open kargaranamir opened 5 months ago

kargaranamir commented 5 months ago

Group A: Please add any ideas here for obtaining cleaner sources and evaluation data.

Group B: Please add any possible new sources here, especially for languages not yet included.

kargaranamir commented 5 months ago

Group A:

kargaranamir commented 4 months ago

Group B:

MedAymenF commented 2 months ago

> Group B:
>
> * add domain and multiple langs from [Pontoon-Translations](https://huggingface.co/datasets/ayymen/Pontoon-Translations): cleaning is a bit challenging

Are you talking about cleaning the data itself or the metadata (lang codes)? I intend to release new versions of both Pontoon Translations and Weblate Translations (which has more languages BTW, but is probably lower quality for LID), but I'm not really sure how I'm going to fix the lang codes.

kargaranamir commented 2 months ago

> > Group B:
> >
> > * add domain and multiple langs from [Pontoon-Translations](https://huggingface.co/datasets/ayymen/Pontoon-Translations): cleaning is a bit challenging
>
> Are you talking about cleaning the data itself or the metadata (lang codes)? I intend to release new versions of both Pontoon Translations and Weblate Translations (which has more languages BTW, but probably less quality for LID), but I'm not really sure how I'm going to fix lang codes.

About the cleaning: I meant the tags and placeholders such as `<playIcon>` or `{$goal}`. For LID they should be removed; otherwise the model learns bad features. It's not too difficult, but it needs to be done. I will check your HF page every once in a while to see if you publish anything new.
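To make the kind of cleaning meant here concrete, a minimal sketch in Python follows. It is not the preprocessing actually used for GlotLID; the two regular expressions and the `strip_markup` helper are hypothetical, and they assume the noise is limited to XML-like tags and brace-delimited placeholders of the form shown above.

```python
import re

# Assumed noise patterns (illustrative only, not taken from the dataset docs):
#   <playIcon>, </a>, <br/>  -> XML/HTML-like tags
#   {$goal}, {name}          -> brace-delimited placeholders
TAG_RE = re.compile(r"</?[A-Za-z][^<>]*>")
PLACEHOLDER_RE = re.compile(r"\{[^{}]*\}")
WHITESPACE_RE = re.compile(r"\s+")


def strip_markup(text: str) -> str:
    """Remove tags and placeholders, then collapse leftover whitespace."""
    # Replace with a space so neighbouring words are not glued together.
    text = TAG_RE.sub(" ", text)
    text = PLACEHOLDER_RE.sub(" ", text)
    return WHITESPACE_RE.sub(" ", text).strip()


if __name__ == "__main__":
    print(strip_markup("Click <playIcon> to reach {$goal} today"))
```

Sentences that end up empty or very short after this step would probably be dropped as well, since they carry little signal for LID. Other placeholder styles (for example `%s` or `%(name)s`) would need their own patterns.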

laubonghaudoi commented 2 months ago

Can you clarify why https://github.com/facebookresearch/flores/issues/61 is solved? I don't see any update in their data.

kargaranamir commented 2 months ago

@laubonghaudoi For my project (GlotLID), the issue is resolved because I removed the yue data from my Flores benchmark. GlotLID trains a better language identification system, and Flores-200 is one of the benchmarks I used.

To answer your question more generally: this issue is not resolved at its root in Flores-200. They started another project to maintain Flores, https://github.com/openlanguagedata/flores, but it does not address this issue either. Someone may need to raise it again in the new project.