EleutherAI / the-pile


Mismatched data size problem #114

Closed jaywaer closed 1 year ago

jaywaer commented 1 year ago

Hello,

I would like to express my gratitude for your outstanding contribution, and for sharing your code and dataset publicly.

However, after downloading and using the dataset, I found some discrepancies in its size. For example, Appendix C.4 of The Pile paper states that a total of 17103059 documents were collected for OpenWebText2. When I obtained the data from https://pile.eleuther.ai/ and counted documents by the "meta": {"pile_set_name": ...} field, OpenWebText2 came to over 30000000 documents, which is significantly larger. I found similar inconsistencies in other subsets. As a result, I am unsure whether my counting method is wrong or whether I have misinterpreted the document counts reported in the paper.
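For reference, the kind of count described above can be sketched as follows. This is only a minimal illustration, not the exact script that was run: it assumes each record is one JSON object per line with the subset name stored under `meta.pile_set_name`, and that the shard has already been decompressed to plain text. The function name `count_documents_by_set` and the sample records are hypothetical.

```python
import json
from collections import Counter
from typing import Iterable


def count_documents_by_set(lines: Iterable[str]) -> Counter:
    """Count documents per subset from JSON Lines records.

    Assumes each non-empty line is a JSON object shaped like
    {"text": ..., "meta": {"pile_set_name": ...}} (an assumption
    about the record layout, not a verified specification).
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        # Records without the expected field are grouped under "<unknown>".
        set_name = record.get("meta", {}).get("pile_set_name", "<unknown>")
        counts[set_name] += 1
    return counts


# Tiny made-up sample standing in for a decompressed shard.
sample = [
    '{"text": "a", "meta": {"pile_set_name": "OpenWebText2"}}',
    '{"text": "b", "meta": {"pile_set_name": "OpenWebText2"}}',
    '{"text": "c", "meta": {"pile_set_name": "Pile-CC"}}',
]
sample_counts = count_documents_by_set(sample)
print(sample_counts)
```

On a real shard, the same function could be given an open file handle instead of the sample list. Summing the per-shard counters would then give totals to compare against the figures in the paper.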

Could anyone kindly assist me in identifying the cause of this issue? I would be deeply appreciative.