Closed: LudwigStumpp closed this 1 year ago
Thanks for this @LudwigStumpp. One small issue: the RedPajama dataset size of 1.2T refers to tokens, not TB, so we need to unify these somehow. Perhaps revert to number of tokens where applicable?
For fine-tuning datasets they do not disclose token counts, so file size makes sense imho 😀
Good that you spotted this, @Muhtasham. And for starcoderdata? Is that also a number of tokens?
But yes, you are right. If it is confusing to me, then it is probably confusing to many.
Generally speaking, we can choose between:
Here is my suggestion:
What do you think?
@Muhtasham, apart from the still-open dataset points: do you think the new specification for the model size is clearer now, or does it degrade readability?
@LudwigStumpp starcoderdata was in gigabytes; in terms of tokens it is 1 trillion tokens. I think the new specification for model sizes is clearer and more readable.
Here are my two cents regarding reporting the number of tokens: although I agree that it depends on the tokenizer, it has become a common convention since the Chinchilla-optimal movement.
So I suggest:
- Datasets for pretraining: # tokens
- Datasets for instruction-tuning: # samples or storage size
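As a rough illustration of that convention (the dataset names and figures below are placeholders, not values from the actual tables), the two tables could look something like this:

```markdown
<!-- Pretraining datasets: report size in tokens -->
| Dataset       | Size (tokens) |
|---------------|---------------|
| ExampleCorpus | 1.2T          |

<!-- Instruction-tuning datasets: report # samples or storage size -->
| Dataset          | Samples | Storage |
|------------------|---------|---------|
| ExampleInstructs | 52K     | 1.1 GB  |
```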
@Muhtasham, are you sure that the size of starcoderdata is 1 trillion tokens? The text only says "we trained a ~15B parameter model for 1 trillion tokens". I think the 1 trillion tokens here is the number of tokens in the dataset times the number of epochs trained, so not necessarily the size of the dataset. According to https://huggingface.co/datasets/bigcode/starcoderdata, the dataset is of size 783 + 54 + 13 + 32 = 882 GB. But I can't find any information about the number of tokens yet.
And RedPajama is 1.2 trillion tokens and 5 TB of storage (just noting this here so I remember).
@Muhtasham, can you please check again? I also decided to add a new table for alignment-tuning and moved the OpenAssistant Conversations Dataset there.
LGTM @LudwigStumpp, thanks. I will double-check the starcoderdata token count with some folks from HF and get back to you.
Awesome, will merge for now (to not run into any merge conflicts). Let's then create a new PR.
Wow, this is a very thoughtful discussion. Thank you both for thinking hard on this! I can see how the collaboration has landed us at a better outcome than the initial proposal.
My two cents:
I would also suggest a tenet: We will do our best to curate useful and accurate information. As much as possible, we link to sources. Nonetheless, errors are inevitable and we will work with the community to resolve them.
For example, the starcoder data: I think it's okay to link to this post that states 1T tokens. It's the best data point we have now, and if we get more information that updates our understanding, we can correct it then.
Once again, thank you both! You rock 💪
Removed the units (M and B) from the entries of the Model Size column and instead specified the unit (B) in the column name. Did the same for the dataset table. To me, it improves readability and simplifies the table.
@eugeneyan or @Muhtasham: what do you think? This is probably highly debatable, which is why I opened it as a PR. It's best to take a look at the rendered markdown preview of the branch.
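For illustration, here is a minimal before/after sketch of the column change (the model names and sizes are hypothetical placeholders, not entries from the actual table):

```markdown
<!-- Before: unit repeated in every entry -->
| Model    | Model Size |
|----------|------------|
| ExampleA | 7B         |
| ExampleB | 560M       |

<!-- After: unit stated once in the column name -->
| Model    | Model Size (B) |
|----------|----------------|
| ExampleA | 7              |
| ExampleB | 0.56           |
```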