Thanks for your interest. We will release the LongMIT dataset. As shown in the figure below, the token count distribution of the LongMIT dataset is nearly uniform from 4K to 128K. Thanks again for the reminder; a description of the dataset distribution will be added in the revised paper.
Thanks! Looking forward to the release.
We have released our LongMIT dataset (128K version). Feel free to try it, and please open an issue if you run into problems or have suggestions.
This is great, thanks! What does "128K" version mean? Is that a subset of the full dataset where each sample is 128K tokens?
Additionally, in your paper you report in the Appendix that:
All models were trained using 64 A800*80G GPUs with the DeepSpeed+ZeRO-1 framework. The maximum sequence length was set to 32K, with any sequences exceeding this length truncated from the right. The training process utilized the Adam optimizer with a learning rate of 3×10^-5, β1 = 0.9, and β2 = 0.95. To enhance training efficiency, we employed a packing strategy that concatenates training samples to reach the maximum sequence length. Additionally, Flash Attention (Dao et al., 2022; Dao, 2024) is used to accelerate the computation of the attention mechanism. The global batch size consisted of 4 million tokens, and the entire dataset is trained over one epoch.
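The quoted appendix describes two preprocessing steps: right-truncation at the 32K maximum sequence length, and a packing strategy that concatenates samples up to that length. A minimal sketch of how such packing typically works (names are illustrative, not from the LongMIT codebase):

```python
# Greedy sample packing, as commonly done for long-context finetuning:
# concatenate tokenized samples into fixed-capacity packs, truncating
# any over-long sample from the right. This is a sketch of the general
# technique, not the authors' actual implementation.

MAX_LEN = 32 * 1024  # 32K-token maximum sequence length

def pack_samples(tokenized_samples, max_len=MAX_LEN):
    """Concatenate lists of token ids into packs of at most max_len tokens."""
    packs, current = [], []
    for sample in tokenized_samples:
        sample = sample[:max_len]  # truncate from the right, per the paper
        if len(current) + len(sample) > max_len:
            packs.append(current)  # current pack is full; start a new one
            current = []
        current.extend(sample)
    if current:
        packs.append(current)
    return packs
```

A real implementation would also track per-sample boundaries so the attention mask (or position ids) can be reset between packed documents, but the greedy fill above captures the core idea.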
But it seems like your dataset does cover up to 128K tokens. Did you finetune on the full dataset? Or did you finetune on a subset that contains up to 32K tokens?
Sorry for the confusion. This work is still in progress. In our first arXiv version, we only finetuned the models on a subset containing sequences of up to 32K tokens. Recently, we expanded the 32K version into a 128K version for better research exploration.
I see, that makes sense. So by "128K version" you mean the full dataset containing sequences between 4K-128K tokens right?
Yes, the 128K dataset contains sequences of 4K-128K tokens, with a nearly uniform length distribution.
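One way to sanity-check the near-uniform claim is to bucket per-sample token counts and compare bucket sizes. A small sketch, using synthetic lengths as a stand-in for the real per-sample counts (which would come from tokenizing the released dataset):

```python
# Sketch: bucket per-sample token counts into coarse ranges and print a
# histogram. The `lengths` list is synthetic stand-in data drawn uniformly
# from 4K-128K; with the real dataset it would be computed by tokenizing
# each sample.
import random
from collections import Counter

random.seed(0)
lengths = [random.randint(4_000, 128_000) for _ in range(10_000)]  # stand-in

def bucket_histogram(lengths, edges=(4_000, 32_000, 64_000, 96_000, 128_000)):
    """Count samples falling into [edges[i], edges[i+1]) buckets."""
    counts = Counter()
    for n in lengths:
        for lo, hi in zip(edges, edges[1:]):
            if lo <= n < hi:
                counts[(lo, hi)] += 1
                break
        else:
            if n == edges[-1]:  # fold the inclusive upper edge into the last bucket
                counts[(edges[-2], edges[-1])] += 1
    return counts

hist = bucket_histogram(lengths)
for (lo, hi), c in sorted(hist.items()):
    print(f"{lo // 1000:>3}K-{hi // 1000:>3}K: {c}")
```

For a truly uniform distribution, each bucket's count should be roughly proportional to its width, so the four buckets above should come out close to equal.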
Thanks for the amazing work! I was wondering if you have plans to release the LongMIT dataset you used to finetune models in your experiments? Also, what is the token count distribution for this dataset?