carbonscott opened 1 month ago
Here's what might have happened: say I have 10 events in a run and I requested 3 workers. Workers 1, 2, and 3 each process 3 events during the first pass, and worker 0 should then process just the last event. However, I suspect PyTorch's dataloader pads the event list with fake event numbers, so a worker ends up processing events 10, 11, and 12. This may explain why the xtc reader complains about having no events (or sometimes negative event numbers): events 11 and 12 don't exist (an overshooting problem).
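The overshoot arithmetic above can be sketched in plain Python. This is a simplified illustration, not torch's actual code (the real `DistributedSampler` pads by repeating indices from the start of the list rather than extending past the end, but the total index count is the same either way); the overshoot-style padding shown here matches the fake-event-number hypothesis:

```python
import math

def padded_indices(num_events, num_replicas, rank):
    """Sketch of an evenly-divisible distributed sampler's index math.

    Each replica must get the same number of indices, so the index list
    is padded up to ceil(num_events / num_replicas) * num_replicas.
    """
    per_replica = math.ceil(num_events / num_replicas)
    total_size = per_replica * num_replicas
    indices = list(range(num_events))
    # Hypothetical overshoot-style padding: extend past the last real event.
    # (torch's DistributedSampler instead repeats indices from the start,
    # but total_size exceeds num_events in both cases.)
    indices += list(range(num_events, total_size))
    # Interleaved assignment, as DistributedSampler does: rank r takes
    # every num_replicas-th index starting at r.
    return indices[rank:total_size:num_replicas]

# 10 events, 3 replicas: ranks 1 and 2 each receive one fake index (>= 10).
print(padded_indices(10, 3, 1))  # [1, 4, 7, 10]
print(padded_indices(10, 3, 2))  # [2, 5, 8, 11]
```

With 10 events and 3 replicas, each replica gets 4 indices (12 total), so 2 of the 12 indices refer to events that don't exist, which would reproduce the xtc reader's complaint.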
This behavior is documented at https://pytorch.org/docs/stable/data.html#torch.utils.data.distributed.DistributedSampler, which says about the `drop_last` argument:

> If False, the sampler will add extra indices to make the data evenly divisible across the replicas.
Need to check what these padded event numbers actually are.
Consult the Data System team about this issue.