Open StephennFernandes opened 2 years ago
The version of IndicCorpus does not contain OSCAR. However, the newer version, which you can find here, contains OSCAR as a subset: https://indicnlp.ai4bharat.org/corpora/
@anoopkunchukuttan How do I find the previous version of the corpus (the one that doesn't contain OSCAR)? Btw, the OSCAR corpus is generated from the Common Crawl corpus. When you said that IndicCorpus does not contain OSCAR, does that mean IndicCorpus contains no content from Common Crawl?
@anoopkunchukuttan Hello Sir, just a follow up on the previous question.
I would be adding content from the OSCAR corpus separately. Also, is there a way I could get the corpus in unshuffled format?
Hey there, do IndicCorpus and the OSCAR corpus come from the same source, i.e. Common Crawl? I have been thinking of combining OSCAR + IndicCorpus to get a better and bigger corpus (with deduplication). I just wanted to confirm whether IndicCorpus and OSCAR are the same corpus at source or not.
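For reference, the kind of merge with deduplication I have in mind could look roughly like the sketch below. It is only an illustration, not how either corpus is actually built: it assumes both corpora are available as one-sentence-per-line text, it only catches exact duplicates after light normalization (no near-duplicate detection), and the function names `normalize` and `dedup_lines` are made up for this example.

```python
import hashlib
import unicodedata
from typing import Iterable, Iterator


def normalize(line: str) -> str:
    # NFC-normalize and collapse whitespace so trivially different copies match.
    return " ".join(unicodedata.normalize("NFC", line).split())


def dedup_lines(*sources: Iterable[str]) -> Iterator[str]:
    """Yield each distinct normalized line once, in first-seen order.

    `sources` are iterables of lines (assumption: one sentence per line),
    for example open file handles for IndicCorpus and OSCAR shards.
    """
    seen = set()  # hash digests of lines already emitted
    for source in sources:
        for line in source:
            text = normalize(line)
            if not text:
                continue  # skip blank lines
            digest = hashlib.sha1(text.encode("utf-8")).digest()
            if digest not in seen:
                seen.add(digest)
                yield text


if __name__ == "__main__":
    # Tiny in-memory stand-ins for the two corpora (hypothetical data).
    indic = ["hello  world\n", "first sentence\n"]
    oscar = ["hello world\n", "second sentence\n", "\n"]
    for sentence in dedup_lines(indic, oscar):
        print(sentence)
```

Storing only the digests keeps memory lower than holding every sentence, but for a corpus at this scale the `seen` set would still be large, so a sorted or sharded on-disk approach may be needed in practice.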