RUCAIBox / LLMSurvey

The official GitHub page for the survey paper "A Survey of Large Language Models".
https://arxiv.org/abs/2303.18223

How to get the ratios of various data sources in the pre-training data? #18

Closed · jarheadjoe closed this 1 year ago

jarheadjoe commented 1 year ago

How did you get the ratios of various data sources in the pre-training data for the existing LLMs in Fig. 2? The data in Fig. 2 differs from the papers I have read. For example, the GPT-3 paper (https://arxiv.org/abs/2005.14165) does not mention conversation or code data, yet Fig. 2 shows GPT-3 using conversation and code data for pre-training. For PaLM, the data proportions in Table 2 (https://arxiv.org/pdf/2204.02311.pdf) also differ from your ratios.

hyp1231 commented 1 year ago

Thanks so much for pointing out this issue! It seems that the bug was introduced while editing the raw figure file.

We will fix it, check all the other ratios again, and update our arXiv paper ASAP. Thanks again!

Wangpeiyi9979 commented 1 year ago

Hello, how is the percentage of code data counted? Is it the percentage of GitHub data?

hyp1231 commented 1 year ago

> Hello, how is the percentage of code data counted? Is it the percentage of GitHub data?

Yes. Typically, data collected from GitHub is categorized as "code".
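For reference, here is a minimal sketch of how per-source token counts can be aggregated into the coarse category ratios shown in the figure. The source names, the category mapping, and the token counts below are illustrative placeholders, not the actual numbers behind Fig. 2:

```python
# Minimal sketch (not the survey's actual pipeline): aggregate reported
# per-source token counts into coarse category percentages.
# All names and numbers here are illustrative placeholders.

# Map raw data-source names to the coarse categories used in the figure.
CATEGORY = {
    "CommonCrawl": "webpages",
    "C4": "webpages",
    "Wikipedia": "webpages",
    "Books": "books",
    "GitHub": "code",  # data collected from GitHub counts toward "code"
    "StackExchange": "conversation",
}

def category_ratios(source_tokens: dict[str, float]) -> dict[str, float]:
    """Return each category's share (in %) of the total token count."""
    totals: dict[str, float] = {}
    for source, tokens in source_tokens.items():
        category = CATEGORY.get(source, "other")
        totals[category] = totals.get(category, 0.0) + tokens
    grand_total = sum(totals.values())
    return {c: 100.0 * t / grand_total for c, t in totals.items()}

if __name__ == "__main__":
    # Made-up token counts in billions, for illustration only.
    mixture = {"CommonCrawl": 410, "Wikipedia": 3, "Books": 67, "GitHub": 95}
    for category, pct in category_ratios(mixture).items():
        print(f"{category}: {pct:.1f}%")
```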

Wangpeiyi9979 commented 1 year ago

Thanks!