test_tinyllama issue with LitData and `iterate_over_all`

Lightning-AI / litgpt

Pretrain, finetune, deploy 20+ LLMs on your own data. Uses state-of-the-art techniques: flash attention, FSDP, 4-bit, LoRA, and more.

https://lightning.ai

Apache License 2.0


test_tinyllama issue with LitData and `iterate_over_all` #1399

Status: Closed (Andrei-Aksionov closed this issue 3 weeks ago)

Andrei-Aksionov commented 3 weeks ago

Hi there 👋

Apparently there is an issue with the tinyllama test and the newest version of LitData (0.2.6). The release notes show that `iterate_over_all` has just been added:

Add support for iterate_over_all for the CombinedDataset by @tchaton in https://github.com/Lightning-AI/litdata/pull/122

and that's why the issue didn't appear before.

I don't know whether this issue is on the LitGPT or the LitData side. Maybe @awaelchli has some thoughts?

awaelchli commented 3 weeks ago

LitData made the decision to enforce iterate_over_all by default as a breaking change. LitGPT will have to set iterate_over_all=False explicitly now and require litdata>=0.2.6. The error message needs to be fixed though.

tchaton commented 3 weeks ago

Yes, the default behaviour was confusing to some users. It felt more natural that all the samples should be seen, especially when computing validation metrics.

As @awaelchli shared, let's add iterate_over_all to LitGPT where needed.
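
For readers unfamiliar with the flag, here is a minimal pure-Python sketch of the behaviour discussed above. It is not LitData's implementation: the `combine` helper and its parameters are hypothetical names made up for illustration, and it assumes that `iterate_over_all=False` stops as soon as one of the combined datasets is exhausted, while `iterate_over_all=True` keeps drawing from the remaining datasets until every sample has been seen.

```python
import random
from typing import Iterable, Iterator, Sequence, TypeVar

T = TypeVar("T")


def combine(
    datasets: Sequence[Iterable[T]],
    weights: Sequence[float],
    iterate_over_all: bool,
    seed: int = 42,
) -> Iterator[T]:
    """Toy stand-in for a weighted combined dataset (hypothetical helper)."""
    rng = random.Random(seed)
    iterators = [iter(d) for d in datasets]
    active = list(range(len(iterators)))  # indices of datasets still in play
    while active:
        # Pick one of the still-active datasets according to its weight.
        idx = rng.choices(active, weights=[weights[i] for i in active])[0]
        try:
            yield next(iterators[idx])
        except StopIteration:
            if not iterate_over_all:
                # Assumed old-style behaviour: stop at the first exhausted dataset.
                return
            # Assumed new default: drop the exhausted dataset and carry on.
            active.remove(idx)


small = [1, 2, 3]
large = list(range(100, 120))

everything = list(combine([small, large], [0.5, 0.5], iterate_over_all=True))
partial = list(combine([small, large], [0.5, 0.5], iterate_over_all=False))

print(len(everything), len(partial))
```

With `iterate_over_all=True` every sample from both lists appears exactly once; with `False` the run may end early, so some samples of the larger list can go unseen. In LitGPT the fix described above would presumably amount to passing `iterate_over_all=False` wherever the combined dataset is constructed, together with requiring `litdata>=0.2.6`.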