Closed jarheadjoe closed 1 year ago
Thanks so much for pointing out this issue! It seems that the bug was introduced while editing the raw figure file.
We will fix it, check all the other ratios again, and update our arXiv paper ASAP. Thanks again!
Hello, how is the percentage of code data counted? Is it the percentage of GitHub data?
Yes. Typically, data collected from GitHub is categorized as "code".
thanks
How did you get the ratios of various data sources in the pre-training data for existing LLMs in Fig. 2? The data in Fig. 2 differs from the papers I have read. For example, the GPT-3 paper (https://arxiv.org/abs/2005.14165) does not mention conversation or code data, but in Fig. 2 GPT-3 is shown as using conversation and code data for pre-training. And for PaLM, the data proportions in Table 2 (https://arxiv.org/pdf/2204.02311.pdf) also differ from your ratios.