Closed: GianFree closed this issue 4 years ago
cc @mtiberti @wouterboomsma @tbengtsen
Thanks @GianFree for the report.
I've been investigating this; it seems to be related to using `joblib` to parallelise the calculation. The crash happens when `joblib` automatically converts one of our `numpy.array` arguments, the one where we store the calculated values, to a `numpy.memmap` (quoting from their docs: "Parallel provides a special handling for large arrays to automatically dump them on the filesystem and pass a reference to the worker to open them as memory map on that file using the numpy.memmap subclass of numpy.ndarray. This makes it possible to share a segment of data between all the worker processes.") and the workers then can't write to it. This behaviour is only triggered when the size of the input array exceeds the default `max_nbytes`, which is why we don't see it in all cases.
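A minimal sketch of that failure mode (the `try_write` worker is a made-up stand-in, and `max_nbytes` is lowered artificially so a modest array already triggers the dump):

```python
import numpy as np
from joblib import Parallel, delayed

def try_write(out):
    """Attempt the kind of in-place write our real worker would do."""
    try:
        out[0] = 1.0
        return "wrote", type(out).__name__
    except ValueError:
        # joblib reopens the dumped array with mmap_mode='r' by default,
        # so the worker receives a read-only numpy.memmap
        return "read-only", type(out).__name__

out = np.zeros(1_000_000)  # ~8 MB, well above the lowered threshold below
(status, kind), = Parallel(n_jobs=2, max_nbytes="1M")(
    delayed(try_write)(out) for _ in range(1)
)
print(status, kind)
```

With the default `max_nbytes` (1M) and a large enough result matrix, the same conversion happens silently, which matches it only showing up for some trajectories.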
The fix could be as easy as passing `max_nbytes=None` (which disables this behaviour) or making the memmaps writable (`mmap_mode='w+'`). The latter seems to work, however at the end of the calculation we get back the initialised array instead of the results (!), which means the array is not properly shared. The suggested way to handle this scenario is to create a `numpy.memmap` explicitly instead of a `numpy.array` and pass it as an argument to the parallel job. This seems to work flawlessly; however, I'm a bit unsure how to do this in practice, since `numpy.memmap` writes an actual file on the filesystem. As a reference, `joblib` transparently does this:
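Roughly the idea of what happens under the hood (a sketch, not joblib's actual code): the parent dumps the array to a temporary file, and each worker reopens that file as a `numpy.memmap`:

```python
import os
import tempfile
import numpy as np

arr = np.arange(6, dtype=np.float64)

# "parent" side: dump the array to a file on the filesystem
fd, path = tempfile.mkstemp(suffix=".mmap")
os.close(fd)
arr.tofile(path)

# "worker" side: reopen the same file as a read-only memory map
view = np.memmap(path, dtype=arr.dtype, mode="r", shape=arr.shape)
same = bool(np.array_equal(view, arr))

del view  # release the mapping before removing the backing file
os.remove(path)
print(same)
```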
However, I would prefer to avoid creating a file on the filesystem in the first place; the best option would of course be to keep everything in memory. Any ideas on how we could proceed with this?
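For completeness, a sketch of the explicit-memmap workaround described above (the temporary file handling and the toy `fill` worker are mine, not actual MDAnalysis code):

```python
import os
import tempfile
import numpy as np
from joblib import Parallel, delayed

# back the output buffer with a real file so that all workers share it;
# joblib passes memmap arguments to workers by filename, not by copy
fd, path = tempfile.mkstemp(suffix=".mmap")
os.close(fd)
out = np.memmap(path, dtype=np.float64, mode="w+", shape=(4,))

def fill(i, buf):
    buf[i] = i * i  # every call writes a distinct slot: no write races
    buf.flush()

Parallel(n_jobs=2)(delayed(fill)(i, out) for i in range(4))

result = np.asarray(out).tolist()
del out  # release the mapping before removing the backing file
os.remove(path)
print(result)
```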
Using a shared writable memmap will very likely cause random crashes or randomly wrong results. If two processes happen to write to the same memmap at the same time and location, data corruption is a very real possibility, so I would be careful with a shared writable array. Shared writes are an issue anyway; this works best if you can guarantee that no two processes ever write to the same memory location, which is easiest to achieve if every iteration of the loop body updates a different entry.
**Expected behavior**

Parallel computation (with 2 or more cores) of an RMSD matrix with `MDAnalysis.analysis.encore.confdistmatrix.get_distance_matrix()`.

**Actual behavior**

Running the following command with my PSF and DCD, I get this error.
Output:
**Code to reproduce the behavior**

I cannot reproduce the same error if I use the test files or a different number of cores, and nothing changed in my environment between the two computations. I also checked my data: they are read correctly by VMD and do not appear to be corrupted. Using only one core, everything works fine.
Output:
**Current version of MDAnalysis**

Which version of Python (`python -V`)? Python 3.6.5 | packaged by conda-forge | (default, Apr 6 2018, 13:39:56), on jupyter-notebook