bigcode-project / Megatron-LM


Create data composition #25

Open · RaymondLi0 opened this issue 1 year ago

RaymondLi0 commented 1 year ago

Once The Stack v1.2 is done (#24), assign a weight to each data source (PL and NL, i.e. programming-language and natural-language data).

Some PLs, such as HTML and CSS, should probably be down-sampled. On the other hand, we might want to up-sample some low-resource PLs.
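
One way such re-weighting could be expressed is to start from weights proportional to source size, scale selected sources, and renormalize. A minimal sketch, where the source names, sizes, and multipliers are hypothetical placeholders rather than the actual Stack v1.2 statistics:

```python
# Hypothetical per-source sizes in tokens (illustrative only).
SOURCE_TOKENS = {"python": 50e9, "html": 30e9, "css": 10e9, "lua": 1e9}

# Illustrative sampling multipliers: down-sample HTML/CSS, up-sample Lua.
MULTIPLIERS = {"html": 0.3, "css": 0.3, "lua": 3.0}

# Scale each source's size-proportional share, then renormalize so the
# mixture weights sum to 1.
scaled = {s: t * MULTIPLIERS.get(s, 1.0) for s, t in SOURCE_TOKENS.items()}
total = sum(scaled.values())
weights = {s: round(v / total, 4) for s, v in scaled.items()}
print(weights)
```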

For the final model, the goal would be to cap each data source at a maximum number of epochs, for example 5, and check that every source stays under that limit. The estimated training budget of 600B tokens can be used to perform this check.
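
A minimal sketch of that check, assuming hypothetical per-source token counts and mixture weights (neither the names nor the numbers reflect the real composition):

```python
# Check a per-source epoch cap against the overall training budget.
# All sizes and weights below are placeholders.
SOURCE_TOKENS = {
    "python": 50e9,
    "html": 30e9,
    "css": 10e9,
    "lua": 1e9,
    "natural_language": 80e9,
}
WEIGHTS = {  # must sum to 1
    "python": 0.30,
    "html": 0.10,
    "css": 0.05,
    "lua": 0.02,
    "natural_language": 0.53,
}

TOTAL_TRAINING_TOKENS = 600e9  # estimated budget from this issue
MAX_EPOCHS = 5                 # proposed per-source limit

for source, weight in WEIGHTS.items():
    tokens_seen = weight * TOTAL_TRAINING_TOKENS  # tokens drawn from source
    epochs = tokens_seen / SOURCE_TOKENS[source]  # passes over the source
    status = "ok" if epochs <= MAX_EPOCHS else "over limit"
    print(f"{source:>16}: {epochs:6.2f} epochs ({status})")
```

In this toy setup the up-sampled low-resource source ("lua") is the one that trips the cap, which is exactly the failure mode the check is meant to catch.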

RaymondLi0 commented 1 year ago

This is addressed by https://github.com/bigcode-project/bigcode-data-mix and #32. It would be good to have a summary of the number of epochs done on each data source for the final training.
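
The requested summary could be produced the same way, from the tokens actually consumed per source in the final run; a sketch with hypothetical bookkeeping values:

```python
# Hypothetical final-run bookkeeping: dataset size and tokens actually
# consumed per source (placeholder values only).
SOURCE_TOKENS = {"python": 50e9, "html": 30e9, "natural_language": 80e9}
CONSUMED_TOKENS = {"python": 180e9, "html": 60e9, "natural_language": 318e9}

for source, consumed in CONSUMED_TOKENS.items():
    epochs_done = consumed / SOURCE_TOKENS[source]
    print(f"{source:>16}: {epochs_done:.2f} epochs over the final training")
```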