Open ben-8878 opened 5 months ago
I encountered the same issue. I resolved it by putting a torch.distributed.barrier() call before dataset loading, so that all processes have completed their earlier work before entering the data preparation step. In theory, this reduces the chance of running into the issue. To further avoid it, I also put a time.sleep(10) before the zip file integrity check, in the same file where the error pops up. As of today, I haven't run into this issue since making these two modifications.
@disperaller do you have some sample code? Thanks.
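A minimal sketch of the two modifications described above, not the actual patch. The helper names `wait_for_all_ranks` and `zip_is_intact` and the `settle_seconds` parameter are hypothetical, and the integrity check uses the standard-library `zipfile` module as a stand-in for whatever check the failing file performs. The barrier is skipped when no process group is initialized, so the sketch also runs in a single process.

```python
import time
import zipfile

try:
    import torch.distributed as dist
except ImportError:  # lets the sketch run where PyTorch is not installed
    dist = None


def wait_for_all_ranks():
    """Block until every process reaches this point.

    Does nothing when torch.distributed is unavailable or uninitialized.
    """
    if dist is not None and dist.is_available() and dist.is_initialized():
        dist.barrier()


def zip_is_intact(path, settle_seconds=10.0):
    """Hypothetical helper: wait briefly, then verify the archive.

    The pause gives a process that is still writing the file time to finish.
    """
    time.sleep(settle_seconds)
    try:
        with zipfile.ZipFile(path) as zf:
            # testzip() returns the first corrupt member name, or None
            return zf.testzip() is None
    except (zipfile.BadZipFile, FileNotFoundError):
        return False
```

In the training script, `wait_for_all_ranks()` would be called right before the dataset is loaded, and `zip_is_intact(path)` would replace the direct integrity check. The sleep only narrows the race window and does not remove it, so the barrier is the more reliable of the two changes.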