braceal / molecules

Machine learning for molecular dynamics.
MIT License

SystemError: Negative size passed to PyBytes_FromStringAndSize #77

Open lee212 opened 3 years ago

This happened when I tried to aggregate 240 DCD files across 40 Summit nodes:

jsrun -n 40 -r 1 -a 6 -c 7 -d packed /gpfs/alpine/proj-shared/med110/conda/pytorch/bin/python 
"/gpfs/alpine/proj-shared/med110/hrlee/git/braceal/molecules/scripts/traj_to_dset.py" 
"-t" "/gpfs/alpine/proj-shared/med110/hrlee/pasc/exp1/test.200frames.480/MD_to_CVAE/tmp.2VrCh27TOx"
 "-p" "/gpfs/alpine/proj-shared/med110/hrlee/pasc/exp1/test.200frames.480/Parameters/input_protein/prot.pdb"
 "-r" "/gpfs/alpine/proj-shared/med110/hrlee/pasc/exp1/test.200frames.480/Parameters/input_protein/prot.pdb"
 "-o" "/gpfs/alpine/proj-shared/med110/hrlee/pasc/exp1/test.200frames.480/MD_to_CVAE/cvae_input.h5" 
"--contact_maps_parameters" "kernel_type=threshold,threshold=16" "-s" "protein and name CA" "--rmsd" "--fnc"
"--contact_map" "--point_cloud" "--num_workers" "2" "--distributed" "--verbose"

and the error:

Traceback (most recent call last):
  File "/gpfs/alpine/proj-shared/med110/hrlee/git/braceal/molecules/scripts/traj_to_dset.py", line 99, in <module>
    main()
  File "/gpfs/alpine/proj-shared/med110/conda/pytorch/lib/python3.6/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/gpfs/alpine/proj-shared/med110/conda/pytorch/lib/python3.6/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/gpfs/alpine/proj-shared/med110/conda/pytorch/lib/python3.6/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/gpfs/alpine/proj-shared/med110/conda/pytorch/lib/python3.6/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/gpfs/alpine/proj-shared/med110/hrlee/git/braceal/molecules/scripts/traj_to_dset.py", line 94, in main
    sel=selection, cm_format=cm_format, num_workers=num_workers, comm=mpi_comm, verbose=verbose)
  File "/gpfs/alpine/med110/proj-shared/hrlee/git/braceal/molecules/molecules/sim/dataset.py", line 547, in traj_to_dset
    rows_ = comm.gather(rows_, 0)
  File "mpi4py/MPI/Comm.pyx", line 1262, in mpi4py.MPI.Comm.gather
  File "mpi4py/MPI/msgpickle.pxi", line 680, in mpi4py.MPI.PyMPI_gather
  File "mpi4py/MPI/msgpickle.pxi", line 685, in mpi4py.MPI.PyMPI_gather
  File "mpi4py/MPI/msgpickle.pxi", line 148, in mpi4py.MPI.Pickle.allocv
  File "mpi4py/MPI/msgpickle.pxi", line 139, in mpi4py.MPI.Pickle.alloc
SystemError: Negative size passed to PyBytes_FromStringAndSize

I tried adding an exception handler around line 547 of dataset.py that sets rows and cols to 0 so that a corrupted gather is ignored, but that doesn't seem like a correct patch. I will dig further, but wanted to report this first.
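
A possible direction, offered as an assumption rather than a confirmed diagnosis: the traceback ends in mpi4py's pickle-based gather, which allocates the receive buffer from C int byte counts. If the combined pickled payload arriving at rank 0 exceeds 2**31 - 1 bytes, the count can wrap to a negative value, which would match the "Negative size" message. The sketch below only illustrates that wrap-around and a way to split the data into smaller pieces before gathering. The helper names as_c_int and chunk_by_pickled_size are hypothetical and are not part of molecules or mpi4py.

```python
import pickle

INT32_MAX = 2**31 - 1  # largest byte count a signed 32-bit C int can hold


def as_c_int(n):
    """Interpret n as a wrapped signed 32-bit integer (illustration only)."""
    n &= 0xFFFFFFFF
    return n - 2**32 if n > INT32_MAX else n


def chunk_by_pickled_size(items, max_bytes):
    """Split items into consecutive chunks of bounded estimated size.

    The size of a chunk is estimated as the sum of the pickled sizes of
    its items. An item larger than max_bytes gets a chunk of its own.
    """
    chunks, current, current_size = [], [], 0
    for item in items:
        size = len(pickle.dumps(item, protocol=pickle.HIGHEST_PROTOCOL))
        # Start a new chunk when adding this item would exceed the budget.
        if current and current_size + size > max_bytes:
            chunks.append(current)
            current, current_size = [], 0
        current.append(item)
        current_size += size
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    # A 3 GB total no longer fits in a C int and wraps to a negative number.
    print(as_c_int(3 * 10**9))

    # Hypothetical stand-in for the per-rank rows in traj_to_dset.
    rows = [b"x" * 100 for _ in range(10)]
    for chunk in chunk_by_pickled_size(rows, max_bytes=250):
        print(len(chunk))
```

If this is the cause, each rank could call comm.gather once per chunk instead of once for everything. Two caveats apply: the limit concerns the total received at rank 0, so the per-rank budget would need to stay below INT32_MAX divided by the number of ranks, and all ranks must make the same number of gather calls, so they would have to agree on the round count first.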