broadinstitute / CellBender

CellBender is a software package for eliminating technical artifacts from high-throughput single-cell RNA sequencing (scRNA-seq) data.
https://cellbender.rtfd.io
BSD 3-Clause "New" or "Revised" License

Docker image: Could not save checkpoint (python3.7), due to [enforce fail at inline_container.cc:445] . PytorchStreamWriter failed #301

Open plijnzaad opened 10 months ago

plijnzaad commented 10 months ago

Dear all,

We are having great difficulty installing and running CellBender (see also issues #212, #275 and #296).

I had hoped that the Docker image would be fail-safe, but unfortunately that is not the case either. Using this image:

us.gcr.io/broad-dsde-methods/cellbender   latest    56439f37d58e   2 months ago   4.98GB

and converting it to a Singularity image (we do not have root access on our HPC) results in the following crash (full log appended). Does anyone know a combination of versions of (1) cellbender, (2) torch and (3) python that is likely to work? And is this issue specific to the cellbender remove-background invocation?

[LX385-err.txt](https://github.com/broadinstitute/CellBender/files/13215090/LX385-err.txt)

cellbender:remove-background: Could not save checkpoint
cellbender:remove-background: Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/torch/serialization.py", line 423, in save
    _save(obj, opened_zipfile, pickle_module, pickle_protocol)
  File "/opt/conda/lib/python3.7/site-packages/torch/serialization.py", line 650, in _save
    zip_file.write_record(name, storage.data_ptr(), num_bytes)
RuntimeError: [enforce fail at inline_container.cc:445] . PytorchStreamWriter failed writing file data/9: file write failed

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/cellbender/remove_background/checkpoint.py", line 115, in save_checkpoint
    torch.save(model_obj, filebase + '_model.torch')
  File "/opt/conda/lib/python3.7/site-packages/torch/serialization.py", line 424, in save
    return
  File "/opt/conda/lib/python3.7/site-packages/torch/serialization.py", line 290, in __exit__
    self.file_like.write_end_of_file()
RuntimeError: [enforce fail at inline_container.cc:325] . unexpected pos 90329472 vs 90329404

plijnzaad commented 10 months ago

LX385-err.txt (forgot to append the log)

tilofrei commented 7 months ago

Dear @plijnzaad, I got the same error running from a Singularity container. Did you find a workaround in the meantime? Thanks!
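
One thing that may be worth ruling out: the first exception in the traceback is a plain "file write failed" raised while torch.save was writing the checkpoint. In other setups, that message often goes along with a destination that is full, over quota, or not writable. Whether that is the cause here is only an assumption; nothing in the log confirms it. The sketch below is a minimal, standard-library-only probe that could be run inside the container, from the directory where cellbender is started. The helper name `probe_output_dir` and the 200 MiB threshold are hypothetical choices for illustration, not part of CellBender or torch.

```python
import os
import shutil
import tempfile


def probe_output_dir(path, required_bytes=200 * 1024 ** 2):
    """Report whether `path` is writable and has at least `required_bytes` free.

    The 200 MiB default is an arbitrary illustrative threshold (assumption),
    chosen only because the failing write in the log was around 90 MB in.
    """
    free_bytes = shutil.disk_usage(path).free
    try:
        # Actually write and sync a small file: a permission check alone can
        # be misleading on network or overlay filesystems.
        with tempfile.NamedTemporaryFile(dir=path) as handle:
            handle.write(b"x" * 4096)
            handle.flush()
            os.fsync(handle.fileno())
        writable = True
    except OSError:
        writable = False
    return {
        "path": os.path.abspath(path),
        "writable": writable,
        "free_bytes": free_bytes,
        "enough_space": free_bytes >= required_bytes,
    }


if __name__ == "__main__":
    # Probe both the working directory and the system temp directory, since
    # it is not established here which of them the checkpoint is written to.
    for candidate in (os.getcwd(), tempfile.gettempdir()):
        print(probe_output_dir(candidate))
```

If either location reports `writable: False` or very little free space, starting the run from a writable directory with enough room would be a cheap experiment. Note that `shutil.disk_usage` reports filesystem-level free space and may not reflect per-user quotas on an HPC system.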