facebookresearch / LASER

Language-Agnostic SEntence Representations

Problem with wet_lines #228

Closed vmenan closed 1 year ago

vmenan commented 1 year ago

Hi, I am a researcher working on low-resource languages native to Sri Lanka (Sinhala and Tamil). The NLLB mined dataset is an excellent starting point for us, so I am following the instructions provided for downloading the mined dataset using the metadata provided here. The issue I'm facing is that the metadata contains data from ParaCrawl as well, but the scripts and instructions provided work only for Common Crawl data. Am I going wrong in how I obtain the mined data from NLLB200?

heffernankevin commented 1 year ago

Hi @vmenan, an easier entry point to the mined data might be here. Hopefully this helps!

vmenan commented 1 year ago

@heffernankevin you are a lifesaver! I was struggling with the download for a week. Wow, this really helps. Thank you so much! I'm wondering why it wasn't mentioned here?

heffernankevin commented 1 year ago

No problem! Will make a TODO to add this.

vmenan commented 1 year ago

That's great to hear. Once again, thank you so much for your help! Appreciate it! Also, props to the FAIR research team for open-sourcing their excellent work to the community. Thank you!