Closed yhl48 closed 4 days ago
For context, I started looking into this code because I consistently got periodic sine-like validation loss when combining two datasets, but the issue disappeared when I manually shuffled and combined them into a single dataset. The two datasets have a 1:3 size ratio.
I am not sure whether L74 is a bug, and if it is, I am not sure whether correcting it would solve the validation loss issue. The behaviour could also be due to my two datasets being very different, which made me wonder whether there is a good sampling strategy that is universally acceptable. Anyway, I thought it would be good to have a discussion here 🙂.
Oh yes, good catch. We should weight each dataset by its length normalized by the total length.
dataset_lens = [len(d) for d in datasets]
total = sum(dataset_lens)
self._weights = [l / total for l in dataset_lens]
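To make the difference concrete for the 1:3 case described in this thread, here is a minimal standalone sketch. It is not litdata's implementation: the helper names `proportional_weights`, `equal_weights`, and `simulate_draws` are hypothetical, the dataset sizes 1000 and 3000 are made up to match the 1:3 ratio, and sampling with replacement via `random.choices` is a simplifying assumption.

```python
import random
from collections import Counter


def proportional_weights(dataset_lens):
    """Weight each dataset by its share of the total number of samples."""
    total = sum(dataset_lens)
    if total == 0:
        raise ValueError("at least one dataset must be non-empty")
    return [n / total for n in dataset_lens]


def equal_weights(dataset_lens):
    """Give every dataset the same weight, regardless of its size."""
    return [1 / len(dataset_lens)] * len(dataset_lens)


def simulate_draws(weights, num_draws, seed=0):
    """Count how often each dataset index is picked (with replacement)."""
    rng = random.Random(seed)
    picks = rng.choices(range(len(weights)), weights=weights, k=num_draws)
    counts = Counter(picks)
    return [counts[i] for i in range(len(weights))]


# Hypothetical sizes with the 1:3 ratio mentioned in the issue.
lens = [1000, 3000]
print(proportional_weights(lens))  # [0.25, 0.75]
print(equal_weights(lens))         # [0.5, 0.5]

# Over one "epoch" worth of draws, equal weights ask the smaller dataset
# for roughly twice as many samples as it holds, while proportional
# weights ask each dataset for roughly its own size.
print(simulate_draws(equal_weights(lens), sum(lens)))
print(simulate_draws(proportional_weights(lens), sum(lens)))
```

With equal weights the smaller dataset is oversampled relative to its size, so the mix of samples seen in training differs from a manually shuffled concatenation. Whether that explains the periodic validation loss is not something this sketch can confirm.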
Based on the comment above it, this line doesn’t look like it is doing what’s intended: all datasets are given equal weight here.
https://github.com/Lightning-AI/litdata/blob/d5eff393cd17ba4f789fa846788f40b5ca4d0779/src/litdata/streaming/combined.py#L74
cc: @tchaton