Lightning-AI / litdata

Streamline data pipelines for AI. Process datasets across 1000s of machines, and optimize data for blazing fast model training.
Apache License 2.0
249 stars 24 forks source link

Dataset weightage bug #185

Closed yhl48 closed 4 days ago

yhl48 commented 4 days ago

It doesn’t look like this line is doing what’s intended based on the comment, all datasets are given equal weightage here

https://github.com/Lightning-AI/litdata/blob/d5eff393cd17ba4f789fa846788f40b5ca4d0779/src/litdata/streaming/combined.py#L74

cc: @tchaton

yhl48 commented 4 days ago

For context, I started looking into this code because I consistently got periodic sine-like validation loss when combining two datasets, but the issue disappeared when I manually shuffled and combined them into a single dataset. The two datasets have a 1:3 size ratio.

I am not sure if L74 is a bug, and if it is indeed a bug, I am not sure if correcting that would solve the validation loss issue. This could be due to the two datasets that I have being very different, prompting me to think about if there is a good sampling strategy that is universally acceptable. Anyways, I thought it would be good to have a discussion here 🙂.

tchaton commented 4 days ago

Oh yes, good catch, we should take the length of each dataset normalized by the total.

dataset_lens = [len(d) for d in datasets]
total = sum(dataset_lens)
self._weights = [l / total for l in dataset_lens]