deeptools / HiCExplorer

HiCExplorer is a powerful and easy to use set of tools to process, normalize and visualize Hi-C data.
https://hicexplorer.readthedocs.org
GNU General Public License v3.0

hicCorrectMatrix sometimes crashes with cool format #706

Closed cgirardot closed 3 years ago

cgirardot commented 3 years ago

Hi @joachimwolff. Using the latest 3.6 version, I tried to change my workflow (in Galaxy) to output .cool instead of .h5 files (at the hicBuildMatrix step), reasoning that this would save space and be handier. When I ran my updated workflow, I had many crashes, most of them in hicCorrectMatrix (diag & KR norm), with weird errors. I switched back to .h5 and this fixed all issues, so I kind of ignored it... but now I am having the same issues with cooler files again, so there might be something wrong here.

The first error I have is

INFO:hicexplorer.hicCorrectMatrix:matrix contains 36357817 data points. Sparsity 0.002.
Traceback (most recent call last):
  File "/g/funcgen/galaxy-production/database/dependencies/_conda/envs/__hicexplorer@3.6/bin/hicCorrectMatrix", line 7, in <module>
    main()
  File "/g/funcgen/galaxy-production/database/dependencies/_conda/envs/__hicexplorer@3.6/lib/python3.8/site-packages/hicexplorer/hicCorrectMatrix.py", line 767, in main
    ma.save(args.outFileName, pApplyCorrection=False)
  File "/g/funcgen/galaxy-production/database/dependencies/_conda/envs/__hicexplorer@3.6/lib/python3.8/site-packages/hicmatrix/HiCMatrix.py", line 103, in save
    self.matrixFileHandler.save(pMatrixName, pSymmetric=pSymmetric, pApplyCorrection=pApplyCorrection)
  File "/g/funcgen/galaxy-production/database/dependencies/_conda/envs/__hicexplorer@3.6/lib/python3.8/site-packages/hicmatrix/lib/matrixFileHandler.py", line 63, in save
    self.matrixFile.save(pName, pSymmetric, pApplyCorrection)
  File "/g/funcgen/galaxy-production/database/dependencies/_conda/envs/__hicexplorer@3.6/lib/python3.8/site-packages/hicmatrix/lib/cool.py", line 413, in save
    cooler.create_cooler(cool_uri=pFileName,
  File "/g/funcgen/galaxy-production/database/dependencies/_conda/envs/__hicexplorer@3.6/lib/python3.8/site-packages/cooler/create/_create.py", line 1021, in create_cooler
    create(
  File "/g/funcgen/galaxy-production/database/dependencies/_conda/envs/__hicexplorer@3.6/lib/python3.8/site-packages/cooler/create/_create.py", line 643, in create
    nnz, ncontacts = write_pixels(
  File "/g/funcgen/galaxy-production/database/dependencies/_conda/envs/__hicexplorer@3.6/lib/python3.8/site-packages/cooler/create/_create.py", line 225, in write_pixels
    with h5py.File(filepath, "r+") as fw:
  File "/g/funcgen/galaxy-production/database/dependencies/_conda/envs/__hicexplorer@3.6/lib/python3.8/site-packages/h5py/_hl/files.py", line 424, in __init__
    fid = make_fid(name, mode, userblock_size,
  File "/g/funcgen/galaxy-production/database/dependencies/_conda/envs/__hicexplorer@3.6/lib/python3.8/site-packages/h5py/_hl/files.py", line 192, in make_fid
    fid = h5f.open(name, h5f.ACC_RDWR, fapl=fapl)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5f.pyx", line 96, in h5py.h5f.open
OSError: Unable to open file (unable to lock file, errno = 11, error message = 'Resource temporarily unavailable')

I first assumed it was an NFS issue, but it happens too frequently for that to be the right explanation, and it never happens when I use h5.

If I re-run the same job, the second error is:

Traceback (most recent call last):
  File "/g/funcgen/galaxy-production/database/dependencies/_conda/envs/__hicexplorer@3.6/bin/hicCorrectMatrix", line 7, in <module>
    main()
  File "/g/funcgen/galaxy-production/database/dependencies/_conda/envs/__hicexplorer@3.6/lib/python3.8/site-packages/hicexplorer/hicCorrectMatrix.py", line 600, in main
    ma = hm.hiCMatrix(args.matrix)
  File "/g/funcgen/galaxy-production/database/dependencies/_conda/envs/__hicexplorer@3.6/lib/python3.8/site-packages/hicmatrix/HiCMatrix.py", line 55, in __init__
    matrixFileHandler_load = self.matrixFileHandler.load()
  File "/g/funcgen/galaxy-production/database/dependencies/_conda/envs/__hicexplorer@3.6/lib/python3.8/site-packages/hicmatrix/lib/matrixFileHandler.py", line 57, in load
    return self.matrixFile.load()
  File "/g/funcgen/galaxy-production/database/dependencies/_conda/envs/__hicexplorer@3.6/lib/python3.8/site-packages/hicmatrix/lib/cool.py", line 67, in load
    matrixDataFrame = cooler_file.matrix(balance=False, sparse=True, as_pixels=True)
  File "/g/funcgen/galaxy-production/database/dependencies/_conda/envs/__hicexplorer@3.6/lib/python3.8/site-packages/cooler/api.py", line 390, in matrix
    return RangeSelector2D(field, _slice, _fetch, (self._info["nbins"],) * 2)
KeyError: 'nbins'

I should also mention that I am processing 6 matrices in parallel and 3 go through without issue. They were all produced from raw h5 matrices that were then normalized with hicNormalize as described in #704. Since the failure happens with only some of the matrices, I think my code is OK. The matrices also look OK in higlass.

Any idea what could be wrong?

joachimwolff commented 3 years ago

Hi,

OSError: Unable to open file (unable to lock file, errno = 11, error message = 'Resource temporarily unavailable')

That one says the file is already opened by some other process and is therefore locked.

The second error comes from the cooler library and states that one crucial data entry is not present. What could have happened: for some reason you now have a corrupted file. Having the first crash and then the second, I think I have seen this before. The only solution is to throw away the file and use a non-corrupted copy.

The question arises: why (or better, how?) do you access the same file in parallel? Opening any HDF5 file in parallel might cause these issues and can lead to corrupted files. To be clear here: the problem is not that you process six matrices in parallel, but that for unknown reasons at least two of them are identical and opened by two parallel processes.

Best,

Joachim

cgirardot commented 3 years ago

Thank you for your answer. It can indeed happen in a Galaxy workflow that 2 jobs using the same input would be launched in parallel. This never caused an issue when using h5 format but does with cool. This suggests to me that h5 supports concurrent access while cool needs a lock. Is this what you are suggesting? Did I get this right?

joachimwolff commented 3 years ago

Both h5 and cool are HDF5-based file formats. It surprises me a bit that only cool files are causing issues and h5 files are not. If you want to have parallel access, you should implement a lock for both file types to be sure.
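
For readers wondering what such a lock could look like, here is a minimal sketch, not something taken from HiCExplorer or Galaxy. It assumes a POSIX system and that every job touching a matrix agrees to take the lock first. The helper name `exclusive_lock` and the `.lock` sidecar naming are hypothetical. The lock is advisory only, and `flock` may not behave reliably on some network file systems, so treat this as an illustration rather than a fix.

```python
import fcntl
import os
from contextlib import contextmanager


@contextmanager
def exclusive_lock(matrix_path):
    """Hold an exclusive advisory lock on a sidecar file next to matrix_path.

    Hypothetical helper: the sidecar name (<matrix>.lock) is an assumption,
    and the lock only protects against processes that also use this helper.
    """
    lock_path = os.fspath(matrix_path) + ".lock"
    with open(lock_path, "w") as lock_file:
        # Blocks until no other process holds the lock on the sidecar file.
        fcntl.flock(lock_file, fcntl.LOCK_EX)
        try:
            yield
        finally:
            # Release explicitly; closing the file would also release it.
            fcntl.flock(lock_file, fcntl.LOCK_UN)
```

A job would then wrap whatever reads or writes the matrix, for example `with exclusive_lock("matrix.cool"): ...`, so that two jobs started on the same input run one after the other instead of opening the file at the same time.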