AI4Bharat / IndicBERT

Pretraining, fine-tuning and evaluation scripts for IndicBERT-v2 and IndicXTREME
https://ai4bharat.iitm.ac.in/language-understanding
MIT License

Non-shuffled data access #6

Open kektobiologist opened 8 months ago

kektobiologist commented 8 months ago

I looked at the Hindi monolingual corpus (this), and it appears to contain shuffled lines rather than contiguous news articles. (This was explicitly stated for the v1 public release here, but I can't find it mentioned for the v2 release.) E.g., there are numbered points scattered throughout the file that are probably related to each other, but that context is lost due to shuffling. Is there a non-shuffled dataset available anywhere, or one with more metadata such as the scraping URL, date/time, etc.?