I looked at the Hindi monolingual corpus (this), and its lines appear to be shuffled rather than kept as contiguous news articles. (This was explicitly stated for the v1 public release here, but I can't find it mentioned for the v2 release.) E.g., there are numbered points scattered through the file that are probably related to each other, but that context is lost due to the shuffling.
Is there a non-shuffled dataset available anywhere, or a version with more metadata, such as the scraped URL, date/time, etc.?