Open sxjscience opened 3 years ago
Marked it as a good first issue because we have documentation on how to add a new dataset at https://github.com/dmlc/gluon-nlp/tree/master/scripts/datasets
Hello,
The author of the OSCAR corpus here. After an increase in downloads in recent weeks and continued abuse by some users, I had to take the corpus down, since the server could not handle all that traffic.
We're currently working on transferring all the data to a new infrastructure. In the meantime, could you wait a few weeks before adding OSCAR to GluonNLP? The entire corpus "distribution" is going to change, so you might lose some work if you add it right now.
Best, Pedro
@pjox thanks for letting us know, Pedro. Is there anything we can help with in terms of distribution? Let us know!
Hello once again!
We have managed to bring the corpus back online, but we had to cut each subcorpus into smaller files and put everything behind a login. However, creating an account is free, and you can actually use your existing accounts to log in.
We're already working on some collaborations to make the corpus even more accessible.
Thank you for your patience.
Best, Pedro.
Hello @szha and @sxjscience, I would like to work on this issue. To start, I'll add only 3 languages from the OSCAR Corpus as a prototype, and then gradually move to the whole corpus once I get it in full from @pjox.
@utkarshsharma00 thanks for picking it up. Let us know if you need any help.
Description
OSCAR Corpus: https://oscar-corpus.com/