braceal / molecules

Machine learning for molecular dynamics.
MIT License

HDF5 intermittent failure 2 #82

Closed braceal closed 3 years ago

braceal commented 3 years ago

Traceback (most recent call last):
  File "/p/gpfs1/brace3/src/DeepDriveMD-pipeline/deepdrivemd/models/aae/train.py", line 390, in <module>
    main(cfg, args.encoder_gpu, args.generator_gpu, args.decoder_gpu, args.distributed)
  File "/p/gpfs1/brace3/src/DeepDriveMD-pipeline/deepdrivemd/models/aae/train.py", line 259, in main
    cms_transform=False,
  File "/p/gpfs1/brace3/src/DeepDriveMD-pipeline/deepdrivemd/models/aae/train.py", line 118, in get_dataset
    cms_transform=cms_transform,
  File "/p/gpfs1/brace3/src/molecules/molecules/ml/datasets/point_cloud.py", line 62, in __init__
    with open_h5(self.file_path, 'r', libver = 'latest', swmr = False) as f:
  File "/p/gpfs1/brace3/src/molecules/molecules/utils/read_file.py", line 20, in open_h5
    return h5py.File(h5_file, mode, **kwargs)
  File "/g/g15/brace3/.conda/envs/conda-pytorch/lib/python3.7/site-packages/h5py/_hl/files.py", line 408, in __init__
    swmr=swmr)
  File "/g/g15/brace3/.conda/envs/conda-pytorch/lib/python3.7/site-packages/h5py/_hl/files.py", line 173, in make_fid
    fid = h5f.open(name, flags, fapl=fapl)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5f.pyx", line 88, in h5py.h5f.open
OSError: Unable to open file (file is already open for write (may use <h5clear file> to clear file consistency flags))

braceal commented 3 years ago

This happens during distributed training.

braceal commented 3 years ago

Potential cause:

Each rank was creating its own virtual h5 file for training, and all ranks used the same file name. One rank was likely still writing the file while another tried to open it for reading.

Solution:

Create the virtual h5 file on rank 0 only and broadcast it to the other ranks.
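
For anyone hitting the same error, here is a minimal sketch of the rank-0 pattern. It is not the actual change made in this repo. `prepare_shared_file`, `build`, and `simulate` are hypothetical names, threads stand in for distributed ranks, and a plain placeholder file stands in for the virtual h5 file so the sketch runs with the standard library only. A barrier is used here so that no rank opens the file before rank 0 has finished writing it.

```python
import os
import tempfile
import threading
from typing import Callable, List, Optional, Tuple


def prepare_shared_file(
    rank: int,
    path: str,
    build: Callable[[str], None],
    barrier: Callable[[], object],
) -> str:
    """Build `path` on rank 0 only; every rank returns once the file is complete.

    Hypothetical helper: `build` is whatever writes the file, and `barrier`
    is any callable that blocks until all ranks have reached it.
    """
    if rank == 0:
        # Only one writer, so no rank can collide with another writer.
        build(path)
    # Every rank waits here, so readers never open a file that is mid-write.
    barrier()
    return path


def simulate(world_size: int = 4) -> Tuple[int, List[Optional[bytes]]]:
    """Run `world_size` threads as stand-in ranks; report build count and reads."""
    build_calls: List[str] = []
    results: List[Optional[bytes]] = [None] * world_size
    sync = threading.Barrier(world_size)

    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "virtual.h5")

        def build(p: str) -> None:
            # Placeholder for creating the real virtual h5 file.
            build_calls.append(p)
            with open(p, "wb") as f:
                f.write(b"placeholder")

        def worker(rank: int) -> None:
            shared = prepare_shared_file(rank, path, build, sync.wait)
            # Read-only access, safe because the writer has already finished.
            with open(shared, "rb") as f:
                results[rank] = f.read()

        threads = [
            threading.Thread(target=worker, args=(r,)) for r in range(world_size)
        ]
        for t in threads:
            t.start()
        for t in threads:
            t.join()

    return len(build_calls), results


if __name__ == "__main__":
    print(simulate(4))
```

In a real distributed run, assuming `torch.distributed` is already initialized, `rank` would come from `torch.distributed.get_rank()` and `barrier` could be `torch.distributed.barrier`, with `build` replaced by the function that writes the virtual h5 file.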

braceal commented 3 years ago

The above solution fixed the problem.