Dataset weightage bug - GitHub Issues

yhl48 commented 4 days ago

This line doesn’t appear to do what its comment describes: all datasets are given equal weightage here.

https://github.com/Lightning-AI/litdata/blob/d5eff393cd17ba4f789fa846788f40b5ca4d0779/src/litdata/streaming/combined.py#L74

cc: @tchaton

yhl48 commented 4 days ago

For context, I started looking into this code because I consistently got a periodic, sine-like validation loss when combining two datasets, but the issue disappeared when I manually shuffled and combined them into a single dataset. The two datasets have a 1:3 size ratio.

I am not sure whether L74 is a bug, and if it is, I am not sure whether correcting it would solve the validation loss issue. The issue could also be caused by my two datasets being very different, which makes me wonder whether there is a sampling strategy that is universally acceptable. Anyway, I thought it would be good to have a discussion here 🙂.
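To make the 1:3 case concrete, here is a small sketch of what equal dataset weights imply for individual samples. The sizes 1000 and 3000 are made-up stand-ins for the real datasets, and `per_sample_probability` is a hypothetical helper, not part of litdata. It only does the arithmetic and does not say whether this explains the periodic validation loss.

```python
from fractions import Fraction

# Hypothetical sizes with the 1:3 ratio mentioned above (not the real datasets).
sizes = [1000, 3000]


def per_sample_probability(sizes, weights):
    """Chance that one draw returns a specific sample of each dataset,
    assuming a dataset is picked by weight and then a sample uniformly."""
    return [w / n for w, n in zip(weights, sizes)]


# Equal weights: every dataset is picked with probability 1 / num_datasets.
equal = [Fraction(1, len(sizes))] * len(sizes)
# Proportional weights: each dataset is picked according to its share of samples.
proportional = [Fraction(n, sum(sizes)) for n in sizes]

equal_p = per_sample_probability(sizes, equal)
prop_p = per_sample_probability(sizes, proportional)

# How much more often a sample of the small dataset is drawn than one of the large.
print("equal weights ratio:", equal_p[0] / equal_p[1])
print("proportional weights ratio:", prop_p[0] / prop_p[1])
```

Under these assumptions, equal weights oversample each item of the smaller dataset by the size ratio, while proportional weights treat every sample alike, which matches what manually concatenating and shuffling does.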

tchaton commented 4 days ago

Oh yes, good catch. We should weight each dataset by its length normalized by the total:

dataset_lens = [len(d) for d in datasets]
total = sum(dataset_lens)
# Weight each dataset by its share of the total number of samples.
self._weights = [length / total for length in dataset_lens]
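As a standalone illustration of the normalization suggested above, the following sketch computes length-proportional weights and uses them to choose which dataset each draw comes from. The function names `proportional_weights` and `sample_dataset_indices` are hypothetical, plain lists stand in for the streaming datasets, and the sampling uses the standard library's `random.choices` rather than litdata's actual iterator, so treat it as an approximation of the idea and not the library's implementation.

```python
import random
from collections import Counter


def proportional_weights(lengths):
    """Return one weight per dataset, proportional to its length."""
    total = sum(lengths)
    if total == 0:
        raise ValueError("cannot compute weights for datasets that are all empty")
    return [length / total for length in lengths]


def sample_dataset_indices(weights, num_draws, seed=0):
    """Pick a dataset index for each draw according to the given weights."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    return rng.choices(range(len(weights)), weights=weights, k=num_draws)


# Stand-in datasets with a 1:3 size ratio (assumed sizes, for illustration only).
datasets = [list(range(1000)), list(range(3000))]
weights = proportional_weights([len(d) for d in datasets])

draws = sample_dataset_indices(weights, num_draws=10_000)
counts = Counter(draws)
print("weights:", weights)
print("share of draws from the larger dataset:", counts[1] / len(draws))
```

With proportional weights, the share of draws from each dataset should roughly track its share of the samples. Whether that is the right policy depends on the use case; equal weights may still be preferable when a small dataset is meant to be deliberately upsampled.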

Lightning-AI / litdata

Dataset weightage bug #185