broadinstitute / CellBender

CellBender is a software package for eliminating technical artifacts from high-throughput single-cell RNA sequencing (scRNA-seq) data.
https://cellbender.rtfd.io
BSD 3-Clause "New" or "Revised" License

Not saving ckpt.tar.gz checkpoint #371

Open IsauraMaria96 opened 2 months ago

IsauraMaria96 commented 2 months ago

Hi,

Thanks for the great tool. I recently installed CellBender on an Ubuntu server, and I've been having a problem in which the ckpt checkpoint is not saved, so the tool is unable to complete the process. Has anyone else had this problem? Thanks a lot.

Full log is attached: Error.log

System description:

Log:

cellbender:remove-background: Command: cellbender remove-background --cuda --input /home/neurofisiologia/SRR19792156/outs/raw_feature_bc_matrix.h5 --output /home/neurofisiologia/DatosRefinados.h5
cellbender:remove-background: CellBender 0.3.2
cellbender:remove-background: (Workflow hash 346ca8efb8)
cellbender:remove-background: 2024-07-03 09:51:30
cellbender:remove-background: Running remove-background
cellbender:remove-background: Loading data from /home/neurofisiologia/SRR19792156/outs/raw_feature_bc_matrix.h5
cellbender:remove-background: CellRanger v3 format
cellbender:remove-background: Features in dataset: 38606 Gene Expression
cellbender:remove-background: Trimming features for inference.
cellbender:remove-background: 33572 features have nonzero counts.
cellbender:remove-background: Prior on counts for cells is 3741
cellbender:remove-background: Prior on counts for empty droplets is 295
cellbender:remove-background: Excluding 8942 features that are estimated to have <= 0.1 background counts in cells.
cellbender:remove-background: Including 24630 features in the analysis.
cellbender:remove-background: Trimming barcodes for inference.
cellbender:remove-background: Excluding barcodes with counts below 147
cellbender:remove-background: Using 2575 probable cell barcodes, plus an additional 10272 barcodes, and 71346 empty droplets.
cellbender:remove-background: Largest surely-empty droplet has 343 UMI counts.
cellbender:remove-background: Attempting to unpack tarball "ckpt.tar.gz" to /tmp/tmprc18nole
cellbender:remove-background: No saved checkpoint.
cellbender:remove-background: No checkpoint loaded.
cellbender:remove-background: Running inference...
cellbender:remove-background: [epoch 001] average training loss: 6661.9639
cellbender:remove-background: [epoch 002] average training loss: 6034.6147 (3.7 seconds per epoch)
cellbender:remove-background: Will checkpoint every 114 epochs
cellbender:remove-background: [epoch 003] average training loss: 5427.5748
cellbender:remove-background: [epoch 004] average training loss: 5108.2541
cellbender:remove-background: [epoch 005] average training loss: 4923.2049
cellbender:remove-background: [epoch 005] average test loss: 4961.6450
cellbender:remove-background: [epoch 006] average training loss: 4714.6100
cellbender:remove-background: [epoch 007] average training loss: 4658.0975
cellbender:remove-background: [epoch 008] average training loss: 4686.5494
cellbender:remove-background: [epoch 009] average training loss: 4636.7682
cellbender:remove-background: [epoch 010] average training loss: 4599.6784
cellbender:remove-background: [epoch 010] average test loss: 4674.5524
cellbender:remove-background: [epoch 011] average training loss: 4629.7057
cellbender:remove-background: [epoch 012] average training loss: 4552.1350
cellbender:remove-background: [epoch 013] average training loss: 4496.8647
cellbender:remove-background: [epoch 014] average training loss: 4308.9377
cellbender:remove-background: [epoch 015] average training loss: 4275.1747
cellbender:remove-background: [epoch 015] average test loss: 4324.3197
cellbender:remove-background: [epoch 016] average training loss: 4261.2428
cellbender:remove-background: [epoch 017] average training loss: 4251.0613
cellbender:remove-background: [epoch 018] average training loss: 4228.1749
cellbender:remove-background: [epoch 019] average training loss: 4206.0814
cellbender:remove-background: [epoch 020] average training loss: 4197.3849
cellbender:remove-background: [epoch 020] average test loss: 4191.5520
cellbender:remove-background: [epoch 021] average training loss: 4190.3577
cellbender:remove-background: [epoch 022] average training loss: 4154.5904
cellbender:remove-background: [epoch 023] average training loss: 4119.1000
cellbender:remove-background: [epoch 024] average training loss: 4101.0069
cellbender:remove-background: [epoch 025] average training loss: 4077.4471
cellbender:remove-background: [epoch 025] average test loss: 4076.8579
cellbender:remove-background: [epoch 026] average training loss: 4079.1548
cellbender:remove-background: [epoch 027] average training loss: 4060.0420
cellbender:remove-background: [epoch 028] average training loss: 4041.2950
cellbender:remove-background: [epoch 029] average training loss: 4023.0368
cellbender:remove-background: [epoch 030] average training loss: 4001.7430
cellbender:remove-background: [epoch 030] average test loss: 3975.9369
cellbender:remove-background: [epoch 031] average training loss: 3994.5689
cellbender:remove-background: [epoch 032] average training loss: 3992.0950
cellbender:remove-background: [epoch 033] average training loss: 3986.7607
cellbender:remove-background: [epoch 034] average training loss: 3997.4167
cellbender:remove-background: [epoch 035] average training loss: 3991.3141
cellbender:remove-background: [epoch 035] average test loss: 3993.9262
cellbender:remove-background: [epoch 036] average training loss: 3998.2393
cellbender:remove-background: [epoch 037] average training loss: 3989.8854
cellbender:remove-background: [epoch 038] average training loss: 3982.2416
cellbender:remove-background: [epoch 039] average training loss: 3980.3234
cellbender:remove-background: [epoch 040] average training loss: 3984.4739
cellbender:remove-background: [epoch 040] average test loss: 3973.4658
cellbender:remove-background: [epoch 041] average training loss: 3974.9065
cellbender:remove-background: [epoch 042] average training loss: 3984.2641
cellbender:remove-background: [epoch 043] average training loss: 3975.1879
cellbender:remove-background: [epoch 044] average training loss: 3971.4374
cellbender:remove-background: [epoch 045] average training loss: 3974.2532
cellbender:remove-background: [epoch 045] average test loss: 3950.0547
cellbender:remove-background: [epoch 046] average training loss: 3970.9828
cellbender:remove-background: [epoch 047] average training loss: 3964.1729
cellbender:remove-background: [epoch 048] average training loss: 3962.0764
cellbender:remove-background: [epoch 049] average training loss: 3971.4048
cellbender:remove-background: [epoch 050] average training loss: 3970.0651
cellbender:remove-background: [epoch 050] average test loss: 3958.7704
cellbender:remove-background: [epoch 051] average training loss: 3973.9497
cellbender:remove-background: [epoch 052] average training loss: 3970.4156
cellbender:remove-background: [epoch 053] average training loss: 3965.1261
cellbender:remove-background: [epoch 054] average training loss: 3975.3828
cellbender:remove-background: [epoch 055] average training loss: 3969.5423
cellbender:remove-background: [epoch 055] average test loss: 3932.6834
cellbender:remove-background: [epoch 056] average training loss: 3964.7342
cellbender:remove-background: [epoch 057] average training loss: 3967.4058
cellbender:remove-background: [epoch 058] average training loss: 3971.9959
cellbender:remove-background: [epoch 059] average training loss: 3960.5551
cellbender:remove-background: [epoch 060] average training loss: 3964.4331
cellbender:remove-background: [epoch 060] average test loss: 3967.9076
cellbender:remove-background: [epoch 061] average training loss: 3965.4153
cellbender:remove-background: [epoch 062] average training loss: 3962.5914
cellbender:remove-background: [epoch 063] average training loss: 3965.0319
cellbender:remove-background: [epoch 064] average training loss: 3965.6907
cellbender:remove-background: [epoch 065] average training loss: 3960.0795
cellbender:remove-background: [epoch 065] average test loss: 3945.7927
cellbender:remove-background: [epoch 066] average training loss: 3964.4541
cellbender:remove-background: [epoch 067] average training loss: 3968.9065
cellbender:remove-background: [epoch 068] average training loss: 3958.4191
cellbender:remove-background: [epoch 069] average training loss: 3963.3575
cellbender:remove-background: [epoch 070] average training loss: 3954.3709
cellbender:remove-background: [epoch 070] average test loss: 4007.2453
cellbender:remove-background: [epoch 071] average training loss: 3958.2268
cellbender:remove-background: [epoch 072] average training loss: 3961.9567
cellbender:remove-background: [epoch 073] average training loss: 3968.9788
cellbender:remove-background: [epoch 074] average training loss: 3962.2250
cellbender:remove-background: [epoch 075] average training loss: 3967.0552
cellbender:remove-background: [epoch 075] average test loss: 3997.2249
cellbender:remove-background: [epoch 076] average training loss: 3955.0682
cellbender:remove-background: [epoch 077] average training loss: 3960.1321
cellbender:remove-background: [epoch 078] average training loss: 3966.0317
cellbender:remove-background: [epoch 079] average training loss: 3953.0031
cellbender:remove-background: [epoch 080] average training loss: 3957.0243
cellbender:remove-background: [epoch 080] average test loss: 4002.7144
cellbender:remove-background: [epoch 081] average training loss: 3963.4742
cellbender:remove-background: [epoch 082] average training loss: 3964.5696
cellbender:remove-background: [epoch 083] average training loss: 3967.0997
cellbender:remove-background: [epoch 084] average training loss: 3967.0555
cellbender:remove-background: [epoch 085] average training loss: 3969.6566
cellbender:remove-background: [epoch 085] average test loss: 4005.2764
cellbender:remove-background: [epoch 086] average training loss: 3979.3970
cellbender:remove-background: [epoch 087] average training loss: 3971.2706
cellbender:remove-background: [epoch 088] average training loss: 3979.9692
cellbender:remove-background: [epoch 089] average training loss: 3991.1880
cellbender:remove-background: [epoch 090] average training loss: 3984.4977
cellbender:remove-background: [epoch 090] average test loss: 4012.9531
cellbender:remove-background: [epoch 091] average training loss: 3979.9608
cellbender:remove-background: [epoch 092] average training loss: 3980.1110
cellbender:remove-background: [epoch 093] average training loss: 3990.6269
cellbender:remove-background: [epoch 094] average training loss: 3987.8105
cellbender:remove-background: [epoch 095] average training loss: 4003.3267
cellbender:remove-background: [epoch 095] average test loss: 4026.3201
cellbender:remove-background: [epoch 096] average training loss: 4011.9168
cellbender:remove-background: [epoch 097] average training loss: 4001.7220
cellbender:remove-background: [epoch 098] average training loss: 4002.4815
cellbender:remove-background: [epoch 099] average training loss: 4014.8439
cellbender:remove-background: [epoch 100] average training loss: 4009.8107
cellbender:remove-background: [epoch 100] average test loss: 4034.6981
cellbender:remove-background: [epoch 101] average training loss: 4001.8132
cellbender:remove-background: [epoch 102] average training loss: 4000.4273
cellbender:remove-background: [epoch 103] average training loss: 4000.4040
cellbender:remove-background: [epoch 104] average training loss: 3996.6345
cellbender:remove-background: [epoch 105] average training loss: 4007.3502
cellbender:remove-background: [epoch 105] average test loss: 4046.1299
cellbender:remove-background: [epoch 106] average training loss: 3994.2900
cellbender:remove-background: [epoch 107] average training loss: 4018.2631
cellbender:remove-background: [epoch 108] average training loss: 3995.7133
cellbender:remove-background: [epoch 109] average training loss: 3984.8872
cellbender:remove-background: [epoch 110] average training loss: 4008.2703
cellbender:remove-background: [epoch 110] average test loss: 4043.1757
cellbender:remove-background: [epoch 111] average training loss: 4017.7784
cellbender:remove-background: [epoch 112] average training loss: 4017.0501
cellbender:remove-background: [epoch 113] average training loss: 4021.3158
cellbender:remove-background: [epoch 114] average training loss: 3994.4110
cellbender:remove-background: Saving a checkpoint...
cellbender:remove-background: Could not save checkpoint
cellbender:remove-background: Traceback (most recent call last):
  File "/home/neurofisiologia/CellBender/cellbender/remove_background/checkpoint.py", line 115, in save_checkpoint
    torch.save(model_obj, filebase + '_model.torch')
  File "/home/neurofisiologia/anaconda3/envs/cellbender/lib/python3.11/site-packages/torch/serialization.py", line 628, in save
    _save(obj, opened_zipfile, pickle_module, pickle_protocol, _disable_byteorder_record)
  File "/home/neurofisiologia/anaconda3/envs/cellbender/lib/python3.11/site-packages/torch/serialization.py", line 840, in _save
    pickler.dump(obj)
TypeError: cannot pickle 'weakref.ReferenceType' object

cellbender:remove-background: [epoch 115] average training loss: 4016.8244
cellbender:remove-background: [epoch 115] average test loss: 4036.3020
cellbender:remove-background: [epoch 116] average training loss: 4017.2557
cellbender:remove-background: [epoch 117] average training loss: 3996.7196
cellbender:remove-background: [epoch 118] average training loss: 4004.9664
cellbender:remove-background: [epoch 119] average training loss: 4022.4710
cellbender:remove-background: [epoch 120] average training loss: 4019.5331
cellbender:remove-background: [epoch 120] average test loss: 4067.2432
cellbender:remove-background: [epoch 121] average training loss: 4008.7457
cellbender:remove-background: [epoch 122] average training loss: 4001.0307
cellbender:remove-background: [epoch 123] average training loss: 3998.2867
cellbender:remove-background: [epoch 124] average training loss: 4001.8232
cellbender:remove-background: [epoch 125] average training loss: 4055.3543
cellbender:remove-background: [epoch 125] average test loss: 4058.0449
cellbender:remove-background: [epoch 126] average training loss: 4003.1687
cellbender:remove-background: [epoch 127] average training loss: 4017.3536
cellbender:remove-background: [epoch 128] average training loss: 4019.2687
cellbender:remove-background: [epoch 129] average training loss: 4028.9802
cellbender:remove-background: [epoch 130] average training loss: 4018.2229
cellbender:remove-background: [epoch 130] average test loss: 4026.8101
cellbender:remove-background: [epoch 131] average training loss: 4018.8546
cellbender:remove-background: [epoch 132] average training loss: 4002.1382
cellbender:remove-background: [epoch 133] average training loss: 4011.3291
cellbender:remove-background: [epoch 134] average training loss: 4009.5174
cellbender:remove-background: [epoch 135] average training loss: 3999.1352
cellbender:remove-background: [epoch 135] average test loss: 4015.5564
cellbender:remove-background: [epoch 136] average training loss: 3996.2076
cellbender:remove-background: [epoch 137] average training loss: 3995.8721
cellbender:remove-background: [epoch 138] average training loss: 4017.0538
cellbender:remove-background: [epoch 139] average training loss: 4017.7493
cellbender:remove-background: [epoch 140] average training loss: 3998.2958
cellbender:remove-background: [epoch 140] average test loss: 4049.0232
cellbender:remove-background: [epoch 141] average training loss: 3991.3952
cellbender:remove-background: [epoch 142] average training loss: 4022.6591
cellbender:remove-background: [epoch 143] average training loss: 3992.5597
cellbender:remove-background: [epoch 144] average training loss: 4008.8651
cellbender:remove-background: [epoch 145] average training loss: 3992.5097
cellbender:remove-background: [epoch 145] average test loss: 4121.4365
cellbender:remove-background: [epoch 146] average training loss: 4005.6093
cellbender:remove-background: [epoch 147] average training loss: 4021.3828
cellbender:remove-background: [epoch 148] average training loss: 3995.0772
cellbender:remove-background: [epoch 149] average training loss: 3985.9057
cellbender:remove-background: [epoch 150] average training loss: 4004.1677
cellbender:remove-background: [epoch 150] average test loss: 4030.2060
cellbender:remove-background: Saving a checkpoint...
cellbender:remove-background: Could not save checkpoint
cellbender:remove-background: Traceback (most recent call last):
  File "/home/neurofisiologia/CellBender/cellbender/remove_background/checkpoint.py", line 115, in save_checkpoint
    torch.save(model_obj, filebase + '_model.torch')
  File "/home/neurofisiologia/anaconda3/envs/cellbender/lib/python3.11/site-packages/torch/serialization.py", line 628, in save
    _save(obj, opened_zipfile, pickle_module, pickle_protocol, _disable_byteorder_record)
  File "/home/neurofisiologia/anaconda3/envs/cellbender/lib/python3.11/site-packages/torch/serialization.py", line 840, in _save
    pickler.dump(obj)
TypeError: cannot pickle 'weakref.ReferenceType' object

cellbender:remove-background: 2024-07-03 10:01:02
cellbender:remove-background: Inference procedure complete.

Sepidehsheybani commented 2 months ago

Same problem here; I am not able to save checkpoints:

Traceback (most recent call last):
  File "/home2/s225139/.conda/envs/CellBender/lib/python3.8/site-packages/cellbender/remove_background/checkpoint.py", line 115, in save_checkpoint
    torch.save(model_obj, filebase + '_model.torch')
  File "/home2/s225139/.conda/envs/CellBender/lib/python3.8/site-packages/torch/serialization.py", line 628, in save
    _save(obj, opened_zipfile, pickle_module, pickle_protocol, _disable_byteorder_record)
  File "/home2/s225139/.conda/envs/CellBender/lib/python3.8/site-packages/torch/serialization.py", line 840, in _save
    pickler.dump(obj)
TypeError: cannot pickle 'weakref' object
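
For anyone unfamiliar with this class of error, here is a minimal sketch that uses only the Python standard library. It does not touch CellBender or torch, and the names `Target`, `picklable_state`, `unpicklable_state`, and `pickle_error` are made up for illustration. It only shows that the pickler rejects any object graph that contains a `weakref.ref`, which matches the message in the tracebacks above. Whether this is what actually happens inside `torch.save(model_obj, ...)` is an assumption that the logs do not confirm.

```python
import pickle
import weakref


class Target:
    """Any weak-referenceable object; it is never pickled itself."""


def pickle_error(obj):
    """Return None if obj pickles cleanly, otherwise the TypeError message."""
    try:
        pickle.dumps(obj)
    except TypeError as exc:
        return str(exc)
    return None


target = Target()

# Hypothetical checkpoint-like payloads: the second one hides a weak reference.
picklable_state = {"epoch": 114, "loss": 3994.4110}
unpicklable_state = {"epoch": 114, "ref": weakref.ref(target)}

print(pickle_error(picklable_state))    # plain numbers and strings pickle fine
print(pickle_error(unpicklable_state))  # message mentions the weakref type
```

The exact wording of the message depends on the Python version, which may explain why one traceback in this thread says 'weakref.ReferenceType' (Python 3.11) and the other says 'weakref' (Python 3.8).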

lesolano commented 2 months ago

No solution, but I am encountering the same error. I have tested on v0.3.2, v0.3.0 and v0.2.2. Version 0.2.2 produces expected outputs, while the more recent versions produce the errors seen above.
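
Since the behaviour seems to differ between versions, it may help if everyone reports exactly what is installed. The following is a small sketch for collecting that information; the function name `installed_versions` is my own, and it relies on `importlib.metadata`, which is part of the standard library only from Python 3.8 onward (an assumption about the reader's environment).

```python
import platform
from importlib import metadata  # standard library from Python 3.8 onward


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {"python": platform.python_version()}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Distribution names as they appear on PyPI.
    report = installed_versions(["cellbender", "torch", "scipy"])
    for name, version in report.items():
        print(f"{name}: {version or 'not installed'}")
```

Running this inside the conda environment used for CellBender prints one line per package, which can be pasted directly into a comment here.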

abbey-green commented 1 month ago

I am also experiencing the same issue.

aimutishammy commented 1 month ago

Same error.

abbey-green commented 1 month ago

I got it working. It's a version issue. I am using Python 3.7.12, CellBender 0.3.0, and torch 1.13.1.
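
To compare an environment against this combination, something like the sketch below could be used. The names `REPORTED_WORKING` and `version_mismatches` are hypothetical, and the "+cu117" suffix in the usage example is an invented build tag used only to show how local suffixes are ignored. The combination is simply what is reported in this thread, not an official compatibility statement.

```python
# Combination reported as working in this thread (not an official requirement).
REPORTED_WORKING = {"python": "3.7.12", "cellbender": "0.3.0", "torch": "1.13.1"}


def _base(version):
    """Drop a local build suffix such as '+cu117'; None stays None."""
    return None if version is None else version.split("+", 1)[0]


def version_mismatches(installed, expected=REPORTED_WORKING):
    """Return {name: (found, wanted)} for every entry that differs."""
    mismatches = {}
    for name, wanted in expected.items():
        found = _base(installed.get(name))
        if found != wanted:
            mismatches[name] = (found, wanted)
    return mismatches


# Hypothetical environment that differs only in the CellBender version.
example = {"python": "3.7.12", "cellbender": "0.3.2", "torch": "1.13.1+cu117"}
print(version_mismatches(example))
```

An empty result means the environment matches the reported combination exactly; any entry lists the installed version first and the reported one second.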

Sepidehsheybani commented 1 month ago

> I got it working- it's a version error. I am using python 3.7.12, cellbender version 0.3.0, torch 1.13.1

Thank you, I will try it.

aimutishammy commented 1 month ago

> I got it working- it's a version error. I am using python 3.7.12, cellbender version 0.3.0, torch 1.13.1

This combination works for me. Thanks!

antsmer commented 2 weeks ago

> I got it working- it's a version error. I am using python 3.7.12, cellbender version 0.3.0, torch 1.13.1

Would anyone who got it working mind sharing which scipy version they are using? With these three packages at the versions listed, I get the error 'ValueError: row index exceeds matrix dimensions', which I'm hoping will be a quick fix once I switch to the correct scipy version. Thanks!