Asunatan closed this issue 6 months ago
Hi, thanks for your interest in our paper! Could you please provide more details about the errors in the listed data source figures?
Taking LLaMA as an example, we refer to the original paper and categorize:
Thank you very much for your answer. In the original LLaMA paper, the sources for the dataset are listed as follows: English CommonCrawl (67%), C4 (15%), Github (4.5%), Wikipedia (4.5%), Gutenberg and Books3 (4.5%), ArXiv (2.5%), and Stack Exchange (2%), adding up to a total of 100%. I'm confused as to why in Fig. 6 the sum is 102%.
This is because the software we used to draw the figure automatically rounds each value and displays only the rounded number. Four of the categories end in .5%, and each of them is rounded up by 0.5 percentage points, so the displayed values add up to 102% instead of 100%.
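The arithmetic can be checked with a short sketch. It uses the LLaMA percentages quoted earlier in this thread and assumes the plotting software rounds halves upward (the tool is not named here, so this rule is an assumption that matches the explanation above). The names `shares` and `round_half_up` are illustrative only. Note that Python's built-in `round()` rounds halves to the nearest even integer, so it would not reproduce the effect; the `decimal` module is used instead.

```python
from decimal import Decimal, ROUND_HALF_UP

# Percentages as listed in this thread for the original LLaMA paper.
shares = {
    "English CommonCrawl": "67",
    "C4": "15",
    "Github": "4.5",
    "Wikipedia": "4.5",
    "Gutenberg and Books3": "4.5",
    "ArXiv": "2.5",
    "Stack Exchange": "2",
}


def round_half_up(value: str) -> int:
    """Round to the nearest integer, with .5 always going up (assumed rule)."""
    return int(Decimal(value).quantize(Decimal("1"), rounding=ROUND_HALF_UP))


# Sum of the exact values versus sum of the individually rounded values.
exact_total = sum(Decimal(v) for v in shares.values())
rounded = {name: round_half_up(v) for name, v in shares.items()}
rounded_total = sum(rounded.values())

# Each entry ending in .5 gains 0.5 points when rounded up.
halves = [name for name, v in shares.items() if v.endswith(".5")]

print(f"exact total:   {exact_total}%")
print(f"rounded total: {rounded_total}%")
print(f"entries ending in .5: {len(halves)}")
```

Under this assumption, four entries each gain 0.5 points, which accounts for the 2-point gap between the exact and displayed totals. Showing one decimal place in the figure labels would be one way to avoid the discrepancy.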