apple / ml-sigmoid-attention


reproducing Language Modeling results #4

Closed Golovneva closed 3 days ago

Golovneva commented 4 days ago

Hi! Thank you for releasing the code. In the paper you report training a Llama2-style recipe on 300M tokens of the RedPajama dataset. However, in this code I only found examples using Wikitext. Since the RedPajama dataset is quite large, with an estimated 50.6T tokens, I was wondering how you selected 300M tokens for training?

jramapuram commented 4 days ago

Hi there, thanks for your interest. We didn't use this code for the 1B results. Take a look at the axlearn code for that: https://github.com/apple/axlearn/blob/main/axlearn/experiments/text/gpt/pajama_sigmoid_trainer.py
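As a side note on the token-budget question: a minimal sketch of one common approach when streaming a large corpus like RedPajama is to meet a fixed token budget by capping the number of training steps rather than pre-selecting a subset. This is not code from the repo or from axlearn; the sequence length and batch size below are hypothetical placeholders.

```python
# Minimal sketch (not from the repo or axlearn): reach a fixed token budget
# by choosing how many optimizer steps to run, rather than carving out a
# 300M-token subset of the corpus up front. All values are hypothetical.

TOKEN_BUDGET = 300_000_000   # 300M tokens, as mentioned in the question
SEQ_LEN = 4096               # assumed sequence length per example
GLOBAL_BATCH_SIZE = 64       # assumed global batch size (sequences per step)

tokens_per_step = SEQ_LEN * GLOBAL_BATCH_SIZE
num_steps = TOKEN_BUDGET // tokens_per_step

print(f"~{tokens_per_step} tokens/step -> train for ~{num_steps} steps to consume the budget")
```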

Golovneva commented 3 days ago

Thanks a lot!