Open esivonxay-cognitiv opened 6 days ago
Hey @esivonxay-cognitiv, thanks for the reproducible script. I will have a look into it.
Thanks Thomas!
Hey @esivonxay-cognitiv, I am curious: what's your interest in and usage of LitData?
Yeah, I'm interested in LitData primarily for the ability to sample from multiple streams. I've got 2 datasets which are quite imbalanced (one is 100,000x larger than the other) and I'm trying to downsample one dataset to reduce the imbalance by a couple orders of magnitude.
Naively, I could do this when constructing the dataset by throwing out datapoints. However, doing so would mean throwing out 90% or 99% of the data (to decrease the imbalance by 10x or 100x, respectively), and important samples could be lost in the process.
My thought was to do this downsampling/rebalancing during dataloading so the model at least has a chance to see each sample, just at a lower rate.
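To make that concrete, here's a rough sketch of the load-time rebalancing I have in mind (the paths and weights are placeholders rather than my actual setup, and argument names may vary a bit between LitData versions):

```python
from litdata import StreamingDataset, CombinedStreamingDataset, StreamingDataLoader

# Placeholder paths for two pre-optimized streams of very different sizes.
large = StreamingDataset("s3://my-bucket/large_dataset")  # ~100,000x more samples
small = StreamingDataset("s3://my-bucket/small_dataset")

# Bias the sampling at load time instead of discarding samples up front:
# draw roughly 1 in 1,000 samples from the small stream rather than the
# raw ~1 in 100,000, so every sample still has a chance to be seen.
combined = CombinedStreamingDataset(
    datasets=[large, small],
    weights=[0.999, 0.001],
    iterate_over_all=False,  # assumption: explicit weights require this flag
)

loader = StreamingDataLoader(combined, batch_size=64, num_workers=4)
```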
🐛 Bug
I have two unbalanced datasets, where one is 1000x larger than the other. I would like to sample from the two datasets such that the ratio of samples drawn from each is 1:100. When doing so, batches of irregular size are returned during iteration.
I think there are 2 issues which this test surfaces:

1) The first batch returned by each worker is not properly sized.
2) `drop_last` does not appear to work as intended, since the last batch is not a full-sized batch.

I don't think this is related to #179, but it's possible.
I've been attempting to fix this, but I'm not sure what the root of the issue is. I would be very appreciative if you could fix this or point me in the right direction.
Thanks!
To Reproduce
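A minimal, hypothetical sketch of the setup described above (dataset sizes, paths, chunk sizes, and weights are illustrative placeholders, and argument names may vary slightly between LitData versions):

```python
import torch
from litdata import optimize, StreamingDataset, CombinedStreamingDataset, StreamingDataLoader


def make_sample(index):
    # Each sample is just its index wrapped in a tensor.
    return {"index": torch.tensor(index)}


if __name__ == "__main__":
    # Two datasets with a 1000:1 size imbalance (sizes are illustrative).
    optimize(fn=make_sample, inputs=list(range(100_000)), output_dir="data/large", chunk_bytes="64MB")
    optimize(fn=make_sample, inputs=list(range(100)), output_dir="data/small", chunk_bytes="64MB")

    large_ds = StreamingDataset("data/large")
    small_ds = StreamingDataset("data/small")

    # Sample at roughly 100:1 (large:small) instead of the raw 1000:1.
    combined = CombinedStreamingDataset(
        datasets=[large_ds, small_ds],
        weights=[100 / 101, 1 / 101],
        iterate_over_all=False,  # assumption: explicit weights require this flag
    )

    loader = StreamingDataLoader(combined, batch_size=32, num_workers=2, drop_last=True)

    batch_sizes = [len(batch["index"]) for batch in loader]
    print(batch_sizes)
    # With drop_last=True every batch should have exactly 32 samples, but the
    # first batch from each worker and the last batch come back smaller.
    assert all(size == 32 for size in batch_sizes), batch_sizes
```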
Expected behavior
All batch sizes should be the same.
Additional context
This issue is independent of whether `drop_last`, `shuffle`, and `persistent_workers` are set to True or False.