Closed: LudwigStumpp closed this 1 year ago
Thanks for this @LudwigStumpp. One small issue: the RedPajama dataset size of 1.2T refers to tokens, not TB, so we need to unify these somehow. Perhaps revert to number of tokens where applicable?
For fine-tuning datasets they do not disclose token counts, so file size makes sense imho 😀
Good that you spotted this, @Muhtasham. And for starcoderdata? Is that also a number of tokens?
But yes, you are right. If it is confusing to me, then it is probably confusing to many.
Generally speaking, we can choose between:
Here is my suggestion:
What do you think?
@Muhtasham, apart from the still-open dataset points: do you think the new specification for the model size is clearer now, or does it degrade readability?
@LudwigStumpp starcoderdata was in gigabytes; in terms of tokens it is 1 trillion tokens. I think the new specification for model sizes is clearer and more readable.
Here are my two cents regarding reporting the number of tokens: although I agree that it depends on the tokenizer, it has become a common convention since the Chinchilla-optimal movement.
So I suggest:
- Datasets for pretraining: # tokens
- Datasets for instruction-tuning: # samples or storage size
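As a rough illustration of that convention (the dataset names and figures below are placeholders, not values from the actual tables), the two tables could look something like this:

```markdown
<!-- Pretraining datasets: report size in tokens -->
| Dataset       | Size (tokens) |
|---------------|---------------|
| ExampleCorpus | 1.2T          |

<!-- Instruction-tuning datasets: report # samples or storage size -->
| Dataset          | Samples | Storage |
|------------------|---------|---------|
| ExampleInstructs | 52K     | 1.1 GB  |
```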
@Muhtasham, are you sure that the size of starcoderdata is 1 trillion tokens? The text only says "we trained a ~15B parameter model for 1 trillion tokens". I think the 1 trillion tokens here is the number of tokens in the dataset times the number of epochs trained, so not necessarily the size of the dataset. According to https://huggingface.co/datasets/bigcode/starcoderdata, the dataset is of size 783 + 54 + 13 + 32 = 882 GB. But I can't find any information about the number of tokens yet.
And RedPajama is 1.2 trillion tokens and 5 TB of storage (just noting this here so I remember).
@Muhtasham, can you please check again? I also decided to add a new table for alignment-tuning and moved the OpenAssistant Conversations Dataset there.
LGTM @LudwigStumpp, thanks. I will double-check the starcoderdata token count with some folks from HF and get back to you.
Awesome, will merge for now (to not run into any merge conflicts). Let's then create a new PR.
Wow, this is a very thoughtful discussion. Thank you both for thinking hard on this! I can see how the collaboration has landed us at a better outcome than the initial proposal.
My two cents:
I would also suggest a tenet: We will do our best to curate useful and accurate information. As much as possible, we link to sources. Nonetheless, errors are inevitable and we will work with the community to resolve them.
For example, the starcoder data: I think it's okay to link to this post that states 1T tokens. It's the best data point we have now, and if we get more information that updates our understanding, we can correct it then.
Once again, thank you both! You rock 💪
Removed the units (M and B) from the entries of the Model Size column and instead specified the unit (B) in the column name. Did the same for the dataset table. To me, it improves readability and simplifies the table.
@eugeneyan or @Muhtasham: what do you think? This is probably highly debatable, which is why I opened it as a PR. It's best to take a look at the rendered markdown preview of the branch.
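For illustration, here is a minimal before/after sketch of the column change (the model names and sizes are hypothetical placeholders, not entries from the actual table):

```markdown
<!-- Before: unit repeated in every entry -->
| Model    | Model Size |
|----------|------------|
| ExampleA | 7B         |
| ExampleB | 560M       |

<!-- After: unit stated once in the column name -->
| Model    | Model Size (B) |
|----------|----------------|
| ExampleA | 7              |
| ExampleB | 0.56           |
```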