Open zye1996 opened 2 months ago
Looks like some batches are processed twice, which seems more like a multi-processing issue.
Thanks for reporting @zye1996. I'll take a look.
@gabrielmbmb should this line return False? Otherwise, if the last batch arrives earlier than the previous batches, data are forced to be sent to the next step and some data could be missing if they cannot be created for another batch. Let me know if a PR is needed
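To make the concern in the comment above concrete, here is a toy model in plain Python. It is not distilabel's code: the `consume` function and the `(rows, is_last)` tuples are hypothetical, and the sizes (60 rows, upstream batches of 50, downstream batch size 17) are borrowed from the report only for illustration. It shows how a forced flush on the "last" batch can strand rows when that batch arrives before earlier ones. It is not meant to reproduce the exact row counts seen in the issue.

```python
from typing import List, Tuple

Rows = List[int]


def consume(arrivals: List[Tuple[Rows, bool]], input_batch_size: int) -> Tuple[List[Rows], Rows]:
    """Toy consumer (hypothetical, not distilabel's implementation).

    Buffers incoming rows, emits full batches, and force-flushes the
    buffer as soon as a batch flagged as the last one arrives.
    Returns the emitted batches and any rows left stranded in the buffer.
    """
    buffer: Rows = []
    emitted: List[Rows] = []
    for rows, is_last in arrivals:
        buffer.extend(rows)
        # Emit as many full batches as the buffer allows.
        while len(buffer) >= input_batch_size:
            emitted.append(buffer[:input_batch_size])
            buffer = buffer[input_batch_size:]
        # Forced flush: whatever is buffered goes out when the last batch shows up.
        if is_last and buffer:
            emitted.append(buffer)
            buffer = []
    return emitted, buffer


rows = list(range(60))
first, last = rows[:50], rows[50:]

# Batches arrive in order: every row is emitted.
in_order, in_order_leftover = consume([(first, False), (last, True)], 17)
in_order_sizes = [len(b) for b in in_order]

# The last batch arrives early: the flush fires too soon and the tail
# of the earlier batch never gets flushed.
early, early_leftover = consume([(last, True), (first, False)], 17)
early_sizes = [len(b) for b in early]

print(in_order_sizes, len(in_order_leftover))
print(early_sizes, len(early_leftover))
```

In this toy, the rows left in the buffer after the early flush are the ones that would never reach the next step, which is the kind of loss the comment describes.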
I've also started noticing this on a pipeline I've created. Using an input_batch_size of one on some text generation tasks led to the final dataset containing only one row for each processed batch of the previous output, which had been created using a step mixin and could not have an enforced batch size. @gabrielmbmb I have some code that exhibits the issue and can share it as well.
Describe the bug The behavior is a bit random. When the text generation input batch size is smaller than the batch size of the previous step and replicas > 1, the final output can be missing some samples. This does not happen every time, but it happens frequently. I suspect it has something to do with batch/multi-processing scheduling.
In the following case, the default LoadDataFromDicts batch size is 50, and the text generation batch size is set lower than that, here 17. The total number of input samples is 60; however, when saving the data to disk, only 52 samples are saved. When the text generation batch size is set greater than 50, all samples are saved successfully.
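For reference, the expected arithmetic can be checked with a small standalone sketch. This is plain Python, not the distilabel API; `rebatch` is a hypothetical helper that only models re-slicing upstream batches of 50 into downstream batches of 17. With correct re-batching, no rows should be lost or duplicated.

```python
from typing import Iterable, Iterator, List


def rebatch(upstream_batches: Iterable[List[int]], input_batch_size: int) -> Iterator[List[int]]:
    """Re-slice upstream batches into batches of `input_batch_size` rows.

    Hypothetical helper for illustration only; the final partial batch
    is emitted once all upstream batches have been consumed.
    """
    buffer: List[int] = []
    for batch in upstream_batches:
        buffer.extend(batch)
        while len(buffer) >= input_batch_size:
            yield buffer[:input_batch_size]
            buffer = buffer[input_batch_size:]
    if buffer:
        yield buffer


samples = list(range(60))                   # 60 input samples, as in the report
upstream = [samples[:50], samples[50:]]     # upstream batch size of 50 -> 50 + 10
downstream = list(rebatch(upstream, 17))    # downstream batch size of 17

sizes = [len(batch) for batch in downstream]
total = sum(sizes)
print(sizes, total)
```

Under these assumptions the downstream step should see batches of 17, 17, 17 and 9 rows, 60 in total, so a saved dataset of 52 rows means 8 rows went missing somewhere between the steps.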
To Reproduce Code to reproduce