There may be some errors in Figure 6

RUCAIBox / LLMSurvey

The official GitHub page for the survey paper "A Survey of Large Language Models".

https://arxiv.org/abs/2303.18223


There may be some errors in Figure 6 #83

Closed Asunatan closed 6 months ago

Asunatan commented 6 months ago

error

hyp1231 commented 6 months ago

Hi, thanks for your interest in our paper! Could you please provide more details about the errors in the listed data source figures?

Taking LLaMA as an example, we follow the original paper and categorize:

(1) CommonCrawl (67%), C4 (15%), and Wikipedia (4.5%) as "Webpages" (86.5%),
(2) GitHub (4.5%) as "Code" (4.5%),
(3) Gutenberg and Books3 (4.5%) as "Books & News" (4.5%),
(4) ArXiv (2.5%) as "Scientific Data" (2.5%), and
(5) Stack Exchange (2%) as "Conversational Data" (2%).
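The grouping above can be checked with a short sketch. The percentages are the ones quoted in this thread from the LLaMA paper; the dictionary names and the `aggregate` helper are hypothetical, chosen only for illustration, and are not how the figure was actually produced.

```python
from collections import defaultdict

# Source percentages as quoted in this thread from the LLaMA paper.
SOURCES = {
    "CommonCrawl": 67.0,
    "C4": 15.0,
    "Wikipedia": 4.5,
    "GitHub": 4.5,
    "Gutenberg and Books3": 4.5,
    "ArXiv": 2.5,
    "Stack Exchange": 2.0,
}

# Mapping from each source to the survey's category, as listed above.
CATEGORY_OF = {
    "CommonCrawl": "Webpages",
    "C4": "Webpages",
    "Wikipedia": "Webpages",
    "GitHub": "Code",
    "Gutenberg and Books3": "Books & News",
    "ArXiv": "Scientific Data",
    "Stack Exchange": "Conversational Data",
}


def aggregate(sources, category_of):
    """Sum the source percentages within each category (hypothetical helper)."""
    totals = defaultdict(float)
    for name, pct in sources.items():
        totals[category_of[name]] += pct
    return dict(totals)


category_totals = aggregate(SOURCES, CATEGORY_OF)
for category, pct in category_totals.items():
    print(f"{category}: {pct}%")
print("Total:", sum(category_totals.values()))
```

Before any rounding, the five category totals add up to exactly 100%.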

Asunatan commented 6 months ago

Thank you very much for your answer. In the original LLaMA paper, the sources for the dataset are listed as follows: English CommonCrawl (67%), C4 (15%), GitHub (4.5%), Wikipedia (4.5%), Gutenberg and Books3 (4.5%), ArXiv (2.5%), and Stack Exchange (2%), adding up to a total of 100%. I'm confused as to why the percentages in Fig. 6 sum to 102%.

hyp1231 commented 6 months ago

This is because the software we use to draw the figure automatically rounds each value and displays only the rounded number. Four of the five categories end in .5% (86.5%, 4.5%, 4.5%, and 2.5%), and each is rounded up by 0.5%, so the displayed values (87%, 5%, 5%, 3%, and 2%) sum to 102% instead of 100%.
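The arithmetic can be reproduced with a minimal sketch. It assumes the plotting software rounds halves up, as described above; the actual tool and its exact rounding rule are not named in this thread, and `round_half_up` is a hypothetical helper.

```python
from decimal import Decimal, ROUND_HALF_UP

# Category totals for LLaMA as given earlier in this thread (in percent).
CATEGORY_TOTALS = {
    "Webpages": "86.5",
    "Code": "4.5",
    "Books & News": "4.5",
    "Scientific Data": "2.5",
    "Conversational Data": "2",
}


def round_half_up(value: str) -> int:
    """Round to the nearest integer, with ties going up (assumed display rule)."""
    return int(Decimal(value).quantize(Decimal("1"), rounding=ROUND_HALF_UP))


displayed = {name: round_half_up(pct) for name, pct in CATEGORY_TOTALS.items()}
exact_sum = sum(Decimal(pct) for pct in CATEGORY_TOTALS.values())
displayed_sum = sum(displayed.values())

print("Exact sum:", exact_sum)
print("Displayed values:", displayed)
print("Sum of displayed values:", displayed_sum)
```

Python's built-in `round` is not a substitute here: it rounds ties to the nearest even integer, so the same inputs would sum to 98 rather than 102.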