broadinstitute / CellBender

CellBender is a software package for eliminating technical artifacts from high-throughput single-cell RNA sequencing (scRNA-seq) data.
https://cellbender.rtfd.io
BSD 3-Clause "New" or "Revised" License

Report issue saving checkpoint #386

Open acerdenno opened 1 month ago

acerdenno commented 1 month ago

When running CellBender under Slurm, I get two different errors:

1. After "Saving a checkpoint...": cellbender:remove-background: Could not save checkpoint
2. TypeError: cannot pickle 'weakref' object

Any clues on how to solve them? Thanks!

JThomasWatson commented 1 month ago

I'm encountering the same error as item 2 above (the TypeError). Below is the error message, in case it's helpful.

cellbender:remove-background: Saving a checkpoint...
cellbender:remove-background: Could not save checkpoint
cellbender:remove-background: Traceback (most recent call last):
  File "/projects/b1169/thomas/CellbenderEnv/CellBender/cellbender/remove_background/checkpoint.py", line 115, in save_checkpoint
    torch.save(model_obj, filebase + '_model.torch')
  File "/projects/b1169/thomas/CellbenderEnv/env/Cellbender/lib/python3.9/site-packages/torch/serialization.py", line 652, in save
    _save(obj, opened_zipfile, pickle_module, pickle_protocol, _disable_byteorder_record)
  File "/projects/b1169/thomas/CellbenderEnv/env/Cellbender/lib/python3.9/site-packages/torch/serialization.py", line 864, in _save
    pickler.dump(obj)
TypeError: cannot pickle 'weakref' object

cellbender:remove-background: 2024-10-22 14:36:38
cellbender:remove-background: Inference procedure complete.
Traceback (most recent call last):
  File "/projects/b1169/thomas/CellbenderEnv/env/Cellbender/bin/cellbender", line 8, in <module>
    sys.exit(main())
  File "/projects/b1169/thomas/CellbenderEnv/CellBender/cellbender/base_cli.py", line 118, in main
    cli_dict[args.tool].run(args)
  File "/projects/b1169/thomas/CellbenderEnv/CellBender/cellbender/remove_background/cli.py", line 193, in run
    return main(args)
  File "/projects/b1169/thomas/CellbenderEnv/CellBender/cellbender/remove_background/cli.py", line 227, in main
    posterior = run_remove_background(args)
  File "/projects/b1169/thomas/CellbenderEnv/CellBender/cellbender/remove_background/run.py", line 123, in run_remove_background
    posterior = load_or_compute_posterior_and_save(
  File "/projects/b1169/thomas/CellbenderEnv/CellBender/cellbender/remove_background/posterior.py", line 59, in load_or_compute_posterior_and_save
    assert os.path.exists(args.input_checkpoint_tarball), \
AssertionError: Checkpoint file ckpt.tar.gz does not exist, presumably because saving of the checkpoint file has been manually interrupted. load_or_compute_posterior_and_save() will not work properly without an existing checkpoint file. Please re-run and allow a checkpoint file to be saved.

Could this be an issue with torch version?
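
For context on the TypeError in the traceback: torch.save hands the object to a pickler (the pickler.dump(obj) frame), and Python's pickler cannot serialize weak references. The sketch below reproduces that failure with the standard library only. It is not CellBender code; the dictionary is a hypothetical stand-in for object state, and the thread does not establish which attribute of the real model object holds the weak reference.

```python
import pickle
import weakref


class Target:
    """Plain object that supports weak references."""


def try_pickle(obj):
    """Return None if obj pickles cleanly, else the TypeError message."""
    try:
        pickle.dumps(obj)
    except TypeError as exc:
        return str(exc)
    return None


target = Target()

# A weak reference on its own cannot be pickled.
message_direct = try_pickle(weakref.ref(target))

# Hypothetical stand-in for object state that holds a weak reference
# somewhere inside it; the failure is the same when the reference is nested.
state = {"loss": 2895.4787, "ref": weakref.ref(target)}
message_nested = try_pickle(state)

print(message_direct)
print(message_nested)
```

The exact wording of the message differs between Python versions, which matches the two variants reported in this thread ('weakref' under Python 3.9, 'weakref.ReferenceType' under Python 3.12).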

mcsimenc commented 3 weeks ago

I ran cellbender for the first time, on CPU and not on a cluster, and got the same error. No output was produced, although the end of the log says "Inference procedure complete." The call and the log file output are below.

cellbender remove-background \
        --input raw_feature_bc_matrix.h5 \
        --output raw_feature_bc_matrix.nuclei.h5 \
        --cpu-threads 24 \
        >cb.out 2>cb.err
(base) [msimenc@KIWI outs]$ cat raw_feature_bc_matrix.nuclei.log 
cellbender:remove-background: Command:
cellbender remove-background --input raw_feature_bc_matrix.h5 --output raw_feature_bc_matrix.nuclei.h5 --cpu-threads 24
cellbender:remove-background: CellBender 0.3.0
cellbender:remove-background: (Workflow hash 8ebc86ffdb)
cellbender:remove-background: 2024-11-01 17:16:03
cellbender:remove-background: Running remove-background
cellbender:remove-background: Loading data from raw_feature_bc_matrix.h5
cellbender:remove-background: CellRanger v3 format
cellbender:remove-background: Features in dataset: 30940 Gene Expression
cellbender:remove-background: Trimming features for inference.
cellbender:remove-background: 24319 features have nonzero counts.
cellbender:remove-background: Prior on counts for cells is 911
cellbender:remove-background: Prior on counts for empty droplets is 198
cellbender:remove-background: Excluding 1976 features that are estimated to have <= 0.1 background counts in cells.
cellbender:remove-background: Including 22343 features in the analysis.
cellbender:remove-background: Trimming barcodes for inference.
cellbender:remove-background: Excluding barcodes with counts below 99
cellbender:remove-background: Using 3155 probable cell barcodes, plus an additional 9078 barcodes, and 49577 empty droplets.
cellbender:remove-background: Largest surely-empty droplet has 343 UMI counts.
cellbender:remove-background: Attempting to unpack tarball "ckpt.tar.gz" to /tmp/tmphjs6xrze
cellbender:remove-background: No saved checkpoint.
cellbender:remove-background: No checkpoint loaded.
cellbender:remove-background: Running inference...
cellbender:remove-background: [epoch 001]  average training loss: 2895.4787
cellbender:remove-background: [epoch 002]  average training loss: 2773.4995  (100.7 seconds per epoch)
cellbender:remove-background: Will checkpoint every 5 epochs
cellbender:remove-background: [epoch 003]  average training loss: 2684.2793
cellbender:remove-background: [epoch 004]  average training loss: 2610.8373
cellbender:remove-background: [epoch 005]  average training loss: 2557.5633
cellbender:remove-background: [epoch 005] average test loss: 2566.8680
cellbender:remove-background: Saving a checkpoint...
cellbender:remove-background: Could not save checkpoint
cellbender:remove-background: Traceback (most recent call last):
  File "/home/msimenc/software/miniforge3/envs/snake-cellranger/lib/python3.12/site-packages/cellbender/remove_background/checkpoint.py", line 115, in save_checkpoint
    torch.save(model_obj, filebase + '_model.torch')
  File "/home/msimenc/software/miniforge3/envs/snake-cellranger/lib/python3.12/site-packages/torch/serialization.py", line 850, in save
    _save(
  File "/home/msimenc/software/miniforge3/envs/snake-cellranger/lib/python3.12/site-packages/torch/serialization.py", line 1088, in _save
    pickler.dump(obj)
TypeError: cannot pickle 'weakref.ReferenceType' object
...
(more epoch reports, each followed by the same error)
...
TypeError: cannot pickle 'weakref.ReferenceType' object

cellbender:remove-background: 2024-11-01 20:00:22
cellbender:remove-background: Inference procedure complete.

The /tmp dir is writable:

(base) [msimenc@KIWI outs]$ ls -l /
drwxrwxrwt.   16 root root    20480 Nov  1 22:13 tmp

I just installed cellbender using pip this afternoon. Any ideas?

ezgiisenn commented 3 weeks ago

I've been successfully running cellbender versions 0.3.0 and 0.3.2 on our LSF-based computing cluster without issues until recently. However, in the past month, I've also started encountering the same error: TypeError: cannot pickle 'weakref.ReferenceType' object. Any suggestions for tackling the issue would be appreciated. Thank you in advance!
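
Since the same CellBender versions reportedly worked before, it may help to compare installed package versions between an environment that works and one that fails. The sketch below prints the installed versions using only the standard library. The list of distribution names is an assumption (torch appears in the tracebacks above; pyro-ppl is included on the assumption that it is installed as a CellBender dependency), and nothing in this thread confirms which package, if any, is responsible.

```python
from importlib import metadata


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Assumed list of packages worth comparing; adjust to your environment.
packages = ["cellbender", "torch", "pyro-ppl"]
report = installed_versions(packages)

for name, version in report.items():
    print(f"{name}: {version or 'not installed'}")
```

Running this in both environments and comparing the output shows whether a dependency changed underneath an unchanged CellBender version.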

GFrosi commented 3 weeks ago

Hi,

I am getting the same error using cellbender 0.3.0. I installed it via pip (python 3.11.5) on our HPC cluster. I have not run it on my own data yet; I am just trying the example data from the GitHub repository, and the error appears there too.

Any updates about the issue? It would be super helpful.

AssertionError: Checkpoint file ckpt.tar.gz does not exist, presumably because saving of the checkpoint file has been manually interrupted. load_or_compute_posterior_and_save() will not work properly without an existing checkpoint file. Please re-run and allow a checkpoint file to be saved.

Thank you.