Thanks for your interest. We will release the LongMIT dataset. As shown in the figure below, the token count distribution of the LongMIT dataset is nearly uniform from 4K to 128K. Thanks again for the reminder; a description of the dataset distribution will be added in the revised paper.
Thanks! Looking forward to the release.
We have released our LongMIT dataset (128K version). Feel free to try it, and please open an issue if you run into problems or have suggestions.
This is great, thanks! What does "128K" version mean? Is that a subset of the full dataset where each sample is 128K tokens?
Additionally, in your paper you report in the Appendix that:
All models were trained using 64 A800*80G GPUs with the DeepSpeed+ZeRO-1 framework. The maximum sequence length was set to 32K, with any sequences exceeding this length truncated from the right. The training process utilized the Adam optimizer with a learning rate of 3×10^-5, β1 = 0.9, and β2 = 0.95. To enhance training efficiency, we employed a packing strategy that concatenates training samples to reach the maximum sequence length. Additionally, Flash Attention (Dao et al., 2022; Dao, 2024) is used to accelerate the computation of the attention mechanism. The global batch size consisted of 4 million tokens, and the entire dataset is trained over one epoch.
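The quoted appendix describes two preprocessing steps: right-truncation at the 32K maximum sequence length, and a packing strategy that concatenates samples up to that length. A minimal sketch of how such packing typically works (names are illustrative, not from the LongMIT codebase):

```python
# Greedy sample packing, as commonly done for long-context finetuning:
# concatenate tokenized samples into fixed-capacity packs, truncating
# any over-long sample from the right. This is a sketch of the general
# technique, not the authors' actual implementation.

MAX_LEN = 32 * 1024  # 32K-token maximum sequence length

def pack_samples(tokenized_samples, max_len=MAX_LEN):
    """Concatenate lists of token ids into packs of at most max_len tokens."""
    packs, current = [], []
    for sample in tokenized_samples:
        sample = sample[:max_len]  # truncate from the right, per the paper
        if len(current) + len(sample) > max_len:
            packs.append(current)  # current pack is full; start a new one
            current = []
        current.extend(sample)
    if current:
        packs.append(current)
    return packs
```

A real implementation would also track per-sample boundaries so the attention mask (or position ids) can be reset between packed documents, but the greedy fill above captures the core idea.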
But it seems like your dataset does cover up to 128K tokens. Did you finetune on the full dataset? Or did you finetune on a subset that contains up to 32K tokens?
Sorry for the confusion. This work is still in progress. In our first arXiv version, we only finetuned the models on a subset containing sequences of up to 32K tokens. Recently, we expanded the 32K version into a 128K version for better research exploration.
I see, that makes sense. So by "128K version" you mean the full dataset containing sequences between 4K-128K tokens right?
Yes, the 128K dataset contains sequences of 4K-128K tokens, with a nearly uniform length distribution.
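One way to sanity-check the near-uniform claim is to bucket per-sample token counts and compare bucket sizes. A small sketch, using synthetic lengths as a stand-in for the real per-sample counts (which would come from tokenizing the released dataset):

```python
# Sketch: bucket per-sample token counts into coarse ranges and print a
# histogram. The `lengths` list is synthetic stand-in data drawn uniformly
# from 4K-128K; with the real dataset it would be computed by tokenizing
# each sample.
import random
from collections import Counter

random.seed(0)
lengths = [random.randint(4_000, 128_000) for _ in range(10_000)]  # stand-in

def bucket_histogram(lengths, edges=(4_000, 32_000, 64_000, 96_000, 128_000)):
    """Count samples falling into [edges[i], edges[i+1]) buckets."""
    counts = Counter()
    for n in lengths:
        for lo, hi in zip(edges, edges[1:]):
            if lo <= n < hi:
                counts[(lo, hi)] += 1
                break
        else:
            if n == edges[-1]:  # fold the inclusive upper edge into the last bucket
                counts[(edges[-2], edges[-1])] += 1
    return counts

hist = bucket_histogram(lengths)
for (lo, hi), c in sorted(hist.items()):
    print(f"{lo // 1000:>3}K-{hi // 1000:>3}K: {c}")
```

For a truly uniform distribution, each bucket's count should be roughly proportional to its width, so the four buckets above should come out close to equal.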
Thanks for the amazing work! I was wondering if you have plans to release the LongMIT dataset you used to finetune models in your experiments? Also, what is the token count distribution for this dataset?