Closed by braceal 3 years ago
The problem happens during distributed training.
Potential cause:
Each rank was creating its own virtual h5 file for training, all under the same file name. One rank was likely writing the file while another was reading it.
Solution:
Create the virtual h5 file on rank 0 only, then broadcast it to the other ranks.
The above solution fixed the problem.
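
For anyone hitting the same thing, here is a rough sketch of the rank 0 pattern. It is not the code from this repo. The names `make_virtual_h5` and `virtual_file_from_rank0` are made up for illustration, and it assumes that what gets broadcast is the path of the virtual file and that the source files share one dataset name and are concatenated along the first axis. The `broadcast` argument stands for whatever collective your backend provides.

```python
from typing import Callable, Optional, Sequence


def make_virtual_h5(virtual_path: str, source_paths: Sequence[str], dataset: str) -> str:
    """Hypothetical helper: stack `dataset` from every source file along axis 0."""
    import h5py  # imported lazily so the coordination logic has no hard dependency

    if not source_paths:
        raise ValueError("need at least one source file")

    # Collect the shape and dtype of the dataset in each source file.
    shapes = []
    for path in source_paths:
        with h5py.File(path, "r") as f:
            shapes.append(f[dataset].shape)
            dtype = f[dataset].dtype

    total = sum(shape[0] for shape in shapes)
    layout = h5py.VirtualLayout(shape=(total, *shapes[0][1:]), dtype=dtype)

    # Map each source dataset onto its slice of the virtual dataset.
    start = 0
    for path, shape in zip(source_paths, shapes):
        layout[start:start + shape[0]] = h5py.VirtualSource(path, dataset, shape=shape)
        start += shape[0]

    with h5py.File(virtual_path, "w", libver="latest") as f:
        f.create_virtual_dataset(dataset, layout)
    return virtual_path


def virtual_file_from_rank0(
    rank: int,
    create_fn: Callable[[], str],
    broadcast: Callable[[Optional[str]], str],
) -> str:
    """Only rank 0 writes the file; every rank gets the path from the broadcast.

    `broadcast` must be a blocking collective that returns rank 0's value on
    all ranks, so the other ranks cannot open the file before it is written.
    """
    path = create_fn() if rank == 0 else None
    return broadcast(path)
```

With `torch.distributed`, the `broadcast` callable could wrap `dist.broadcast_object_list([path], src=0)` and return the first element of the list. With mpi4py it could be `lambda p: comm.bcast(p, root=0)`. Every rank then opens the returned path read-only.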