EleutherAI / the-pile

MIT License

Fix CommonCrawlDataset #93

Open researcher2 opened 2 years ago

researcher2 commented 2 years ago

It looks like the end of the file was cut off (the changed checksum gives it away). I have updated datasets.py to account for this:

Updated the size, num_docs, and checksum values. Added handling for the error on the last document.
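For readers unfamiliar with this kind of fix, here is a minimal sketch of the two ideas involved: verifying a file against a recorded checksum, and reading documents while tolerating a cut-off final record. It is not the code from this PR. It assumes a JSON Lines layout (one document per line), and the names `sha256sum` and `iter_documents` are hypothetical helpers, not functions from datasets.py.

```python
import hashlib
import json
from typing import Iterable, Iterator


def sha256sum(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, read in fixed-size chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def iter_documents(lines: Iterable[str]) -> Iterator[dict]:
    """Yield parsed documents, skipping a truncated final line.

    Assumption: one JSON document per line. A line that fails to parse is
    only forgiven if no further non-blank line follows it.
    """
    pending_error = None
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        if pending_error is not None:
            # A bad line followed by more data is corruption, not truncation.
            raise pending_error
        try:
            yield json.loads(line)
        except json.JSONDecodeError as err:
            pending_error = err
    # Reaching this point with pending_error set means the bad line was the
    # last one, so it is treated as a truncated document and dropped.
```

With a helper like this, the number of documents yielded can be compared against the recorded document count, and `sha256sum(path)` against the recorded checksum, to confirm that the local copy matches the (truncated) upstream file.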

CLAassistant commented 1 year ago

Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.