broadinstitute / CellBender

CellBender is a software package for eliminating technical artifacts from high-throughput single-cell RNA sequencing (scRNA-seq) data.
https://cellbender.rtfd.io
BSD 3-Clause "New" or "Revised" License

cellbender v3.0 doesn't generate most of the output files, but doesn't have any errors #356

Open cswoboda opened 6 months ago

cswoboda commented 6 months ago

Hello! Amazing tool. I've been using an older version of CellBender and installed a newer version on our university's computational cluster to make use of the new output files. The run completes successfully, but the metrics output file and the HTML summary are missing. Here's the full run. conda activate cellbender03 should be activating version 3.0.

LSBATCH: User input

#BSUB -L /bin/bash
#BSUB -W 10000
#BSUB -n 2
#BSUB -M 32000
#BSUB -q gpu-v100
#BSUB -gpu "num=1"
#BSUB -R "span[hosts=1]"

module load anaconda3
module load cuda/10.1
conda init
conda activate cellbender03
cellbender remove-background --input /data/cellbender-inputs/Control3_raw_feature_bc_matrix.h5 --output /data/cellbender-outs/Control3_cellbender_corrected.h5 --cuda --fpr 0.01 --epochs 150


Successfully completed.

Resource usage summary:

CPU time :                                   6978.86 sec.
Max Memory :                                 4443 MB
Average Memory :                             3590.09 MB
Total Requested Memory :                     32000.00 MB
Delta Memory :                               27557.00 MB
Max Swap :                                   2 MB
Max Processes :                              5
Max Threads :                                47
Run time :                                   3654 sec.
Turnaround time :                            3664 sec.

The output (if any) follows:

cellbender:remove-background: Command: cellbender remove-background --input /data/cellbender-inputs/Control3_raw_feature_bc_matrix.h5 --output /data/cellbender-outs/Control3_cellbender_corrected.h5 --cuda --fpr 0.01 --epochs 150
cellbender:remove-background: 2024-05-08 13:59:08
cellbender:remove-background: Running remove-background
cellbender:remove-background: Loading data from file /data/cellbender-inputs/Control3_raw_feature_bc_matrix.h5
cellbender:remove-background: CellRanger v3 format
cellbender:remove-background: Trimming dataset for inference.
cellbender:remove-background: Including 26514 genes that have nonzero counts.
cellbender:remove-background: Prior on counts in empty droplets is 734
cellbender:remove-background: Prior on counts for cells is 7208
cellbender:remove-background: Excluding barcodes with counts below 367
cellbender:remove-background: Using 11265 probable cell barcodes, plus an additional 13735 barcodes, and 57467 empty droplets.
cellbender:remove-background: Largest surely-empty droplet has 895 UMI counts.
cellbender:remove-background: Running inference...
cellbender:remove-background: [epoch 001] average training loss: 5347.6744
cellbender:remove-background: [epoch 002] average training loss: 4374.9999 (17.9 seconds per epoch)
cellbender:remove-background: [epoch 003] average training loss: 4150.1302
cellbender:remove-background: [epoch 004] average training loss: 4071.7820
cellbender:remove-background: [epoch 005] average training loss: 4022.4898
cellbender:remove-background: [epoch 005] average test loss: 3973.0700
cellbender:remove-background: [epoch 006] average training loss: 3964.3670
cellbender:remove-background: [epoch 007] average training loss: 3945.6980
cellbender:remove-background: [epoch 008] average training loss: 3941.6377
cellbender:remove-background: [epoch 009] average training loss: 3928.8291
cellbender:remove-background: [epoch 010] average training loss: 3924.5242
cellbender:remove-background: [epoch 010] average test loss: 3912.0716
cellbender:remove-background: [epoch 011] average training loss: 3927.8465
cellbender:remove-background: [epoch 012] average training loss: 3922.8986
cellbender:remove-background: [epoch 013] average training loss: 3915.8853
cellbender:remove-background: [epoch 014] average training loss: 3914.1459
cellbender:remove-background: [epoch 015] average training loss: 3916.1158
cellbender:remove-background: [epoch 015] average test loss: 3902.9800
cellbender:remove-background: [epoch 016] average training loss: 3904.8473
cellbender:remove-background: [epoch 017] average training loss: 3855.8081
cellbender:remove-background: [epoch 018] average training loss: 3834.8974
cellbender:remove-background: [epoch 019] average training loss: 3823.8857
cellbender:remove-background: [epoch 020] average training loss: 3810.1713
cellbender:remove-background: [epoch 020] average test loss: 3789.7216
cellbender:remove-background: [epoch 021] average training loss: 3797.5985
cellbender:remove-background: [epoch 022] average training loss: 3791.8554
cellbender:remove-background: [epoch 023] average training loss: 3781.5092
cellbender:remove-background: [epoch 024] average training loss: 3772.7266
cellbender:remove-background: [epoch 025] average training loss: 3768.8021
cellbender:remove-background: [epoch 025] average test loss: 3755.8819
cellbender:remove-background: [epoch 026] average training loss: 3764.1117
cellbender:remove-background: [epoch 027] average training loss: 3759.5244
cellbender:remove-background: [epoch 028] average training loss: 3756.0278
cellbender:remove-background: [epoch 029] average training loss: 3749.2436
cellbender:remove-background: [epoch 030] average training loss: 3748.2135
cellbender:remove-background: [epoch 030] average test loss: 3734.8273
cellbender:remove-background: [epoch 031] average training loss: 3743.8361
cellbender:remove-background: [epoch 032] average training loss: 3740.2271
cellbender:remove-background: [epoch 033] average training loss: 3737.6663
cellbender:remove-background: [epoch 034] average training loss: 3735.4438
cellbender:remove-background: [epoch 035] average training loss: 3733.5313
cellbender:remove-background: [epoch 035] average test loss: 3727.7530
cellbender:remove-background: [epoch 036] average training loss: 3731.4035
cellbender:remove-background: [epoch 037] average training loss: 3729.3962
cellbender:remove-background: [epoch 038] average training loss: 3726.4831
cellbender:remove-background: [epoch 039] average training loss: 3720.7083
cellbender:remove-background: [epoch 040] average training loss: 3718.1618
cellbender:remove-background: [epoch 040] average test loss: 3709.2272
cellbender:remove-background: [epoch 041] average training loss: 3715.1231
cellbender:remove-background: [epoch 042] average training loss: 3712.0890
cellbender:remove-background: [epoch 043] average training loss: 3709.8681
cellbender:remove-background: [epoch 044] average training loss: 3707.4289
cellbender:remove-background: [epoch 045] average training loss: 3704.8866
cellbender:remove-background: [epoch 045] average test loss: 3694.5016
cellbender:remove-background: [epoch 046] average training loss: 3701.9518
cellbender:remove-background: [epoch 047] average training loss: 3698.5169
cellbender:remove-background: [epoch 048] average training loss: 3697.1491
cellbender:remove-background: [epoch 049] average training loss: 3696.5791
cellbender:remove-background: [epoch 050] average training loss: 3694.7941
cellbender:remove-background: [epoch 050] average test loss: 3688.0470
cellbender:remove-background: [epoch 051] average training loss: 3692.0785
cellbender:remove-background: [epoch 052] average training loss: 3689.6233
cellbender:remove-background: [epoch 053] average training loss: 3688.2671
cellbender:remove-background: [epoch 054] average training loss: 3683.3029
cellbender:remove-background: [epoch 055] average training loss: 3680.9655
cellbender:remove-background: [epoch 055] average test loss: 3672.7299
cellbender:remove-background: [epoch 056] average training loss: 3679.0682
cellbender:remove-background: [epoch 057] average training loss: 3677.7769
cellbender:remove-background: [epoch 058] average training loss: 3677.3356
cellbender:remove-background: [epoch 059] average training loss: 3680.4859
cellbender:remove-background: [epoch 060] average training loss: 3678.1786
cellbender:remove-background: [epoch 060] average test loss: 3673.8649
cellbender:remove-background: [epoch 061] average training loss: 3682.1170
cellbender:remove-background: [epoch 062] average training loss: 3688.7590
cellbender:remove-background: [epoch 063] average training loss: 3682.2961
cellbender:remove-background: [epoch 064] average training loss: 3679.7014
cellbender:remove-background: [epoch 065] average training loss: 3681.1230
cellbender:remove-background: [epoch 065] average test loss: 3666.8923
cellbender:remove-background: [epoch 066] average training loss: 3687.8158
cellbender:remove-background: [epoch 067] average training loss: 3679.9039
cellbender:remove-background: [epoch 068] average training loss: 3676.7172
cellbender:remove-background: [epoch 069] average training loss: 3679.4219
cellbender:remove-background: [epoch 070] average training loss: 3686.8794
cellbender:remove-background: [epoch 070] average test loss: 3681.0164
cellbender:remove-background: [epoch 071] average training loss: 3683.1021
cellbender:remove-background: [epoch 072] average training loss: 3681.1868
cellbender:remove-background: [epoch 073] average training loss: 3679.5698
cellbender:remove-background: [epoch 074] average training loss: 3679.2296
cellbender:remove-background: [epoch 075] average training loss: 3676.3138
cellbender:remove-background: [epoch 075] average test loss: 3682.0727
cellbender:remove-background: [epoch 076] average training loss: 3683.8161
cellbender:remove-background: [epoch 077] average training loss: 3673.5035
cellbender:remove-background: [epoch 078] average training loss: 3675.6524
cellbender:remove-background: [epoch 079] average training loss: 3673.3778
cellbender:remove-background: [epoch 080] average training loss: 3670.5102
cellbender:remove-background: [epoch 080] average test loss: 3677.3251
cellbender:remove-background: [epoch 081] average training loss: 3672.2921
cellbender:remove-background: [epoch 082] average training loss: 3669.9947
cellbender:remove-background: [epoch 083] average training loss: 3678.6621
cellbender:remove-background: [epoch 084] average training loss: 3665.4948
cellbender:remove-background: [epoch 085] average training loss: 3665.8716
cellbender:remove-background: [epoch 085] average test loss: 3696.9737
cellbender:remove-background: [epoch 086] average training loss: 3664.5997
cellbender:remove-background: [epoch 087] average training loss: 3664.0920
cellbender:remove-background: [epoch 088] average training loss: 3662.7945
cellbender:remove-background: [epoch 089] average training loss: 3662.9274
cellbender:remove-background: [epoch 090] average training loss: 3661.2173
cellbender:remove-background: [epoch 090] average test loss: 3679.8258
cellbender:remove-background: [epoch 091] average training loss: 3660.9971
cellbender:remove-background: [epoch 092] average training loss: 3659.0103
cellbender:remove-background: [epoch 093] average training loss: 3658.9914
cellbender:remove-background: [epoch 094] average training loss: 3659.7554
cellbender:remove-background: [epoch 095] average training loss: 3658.9406
cellbender:remove-background: [epoch 095] average test loss: 3670.1030
cellbender:remove-background: [epoch 096] average training loss: 3659.8987
cellbender:remove-background: [epoch 097] average training loss: 3656.0618
cellbender:remove-background: [epoch 098] average training loss: 3658.0610
cellbender:remove-background: [epoch 099] average training loss: 3656.7072
cellbender:remove-background: [epoch 100] average training loss: 3655.7021
cellbender:remove-background: [epoch 100] average test loss: 3665.2102
cellbender:remove-background: [epoch 101] average training loss: 3655.3289
cellbender:remove-background: [epoch 102] average training loss: 3655.0225
cellbender:remove-background: [epoch 103] average training loss: 3653.9965
cellbender:remove-background: [epoch 104] average training loss: 3654.9325
cellbender:remove-background: [epoch 105] average training loss: 3655.3952
cellbender:remove-background: [epoch 105] average test loss: 3663.0204
cellbender:remove-background: [epoch 106] average training loss: 3653.4235
cellbender:remove-background: [epoch 107] average training loss: 3653.4157
cellbender:remove-background: [epoch 108] average training loss: 3654.6665
cellbender:remove-background: [epoch 109] average training loss: 3652.7165
cellbender:remove-background: [epoch 110] average training loss: 3652.5705
cellbender:remove-background: [epoch 110] average test loss: 3657.3691
cellbender:remove-background: [epoch 111] average training loss: 3653.1412
cellbender:remove-background: [epoch 112] average training loss: 3651.6951
cellbender:remove-background: [epoch 113] average training loss: 3651.6466
cellbender:remove-background: [epoch 114] average training loss: 3652.0693
cellbender:remove-background: [epoch 115] average training loss: 3650.7289
cellbender:remove-background: [epoch 115] average test loss: 3658.8186
cellbender:remove-background: [epoch 116] average training loss: 3651.3572
cellbender:remove-background: [epoch 117] average training loss: 3649.8415
cellbender:remove-background: [epoch 118] average training loss: 3650.6730
cellbender:remove-background: [epoch 119] average training loss: 3649.3739
cellbender:remove-background: [epoch 120] average training loss: 3651.6230
cellbender:remove-background: [epoch 120] average test loss: 3651.8215
cellbender:remove-background: [epoch 121] average training loss: 3650.3769
cellbender:remove-background: [epoch 122] average training loss: 3650.4961
cellbender:remove-background: [epoch 123] average training loss: 3649.4580
cellbender:remove-background: [epoch 124] average training loss: 3648.7183
cellbender:remove-background: [epoch 125] average training loss: 3648.5544
cellbender:remove-background: [epoch 125] average test loss: 3651.8633
cellbender:remove-background: [epoch 126] average training loss: 3649.0808
cellbender:remove-background: [epoch 127] average training loss: 3649.6059
cellbender:remove-background: [epoch 128] average training loss: 3648.7723
cellbender:remove-background: [epoch 129] average training loss: 3649.6041
cellbender:remove-background: [epoch 130] average training loss: 3648.0021
cellbender:remove-background: [epoch 130] average test loss: 3651.2660
cellbender:remove-background: [epoch 131] average training loss: 3647.5236
cellbender:remove-background: [epoch 132] average training loss: 3645.7249
cellbender:remove-background: [epoch 133] average training loss: 3648.4092
cellbender:remove-background: [epoch 134] average training loss: 3647.1331
cellbender:remove-background: [epoch 135] average training loss: 3647.2023
cellbender:remove-background: [epoch 135] average test loss: 3648.2057
cellbender:remove-background: [epoch 136] average training loss: 3647.3049
cellbender:remove-background: [epoch 137] average training loss: 3646.0022
cellbender:remove-background: [epoch 138] average training loss: 3647.8876
cellbender:remove-background: [epoch 139] average training loss: 3646.8940
cellbender:remove-background: [epoch 140] average training loss: 3647.3962
cellbender:remove-background: [epoch 140] average test loss: 3648.7842
cellbender:remove-background: [epoch 141] average training loss: 3647.2094
cellbender:remove-background: [epoch 142] average training loss: 3645.5910
cellbender:remove-background: [epoch 143] average training loss: 3647.5731
cellbender:remove-background: [epoch 144] average training loss: 3647.6937
cellbender:remove-background: [epoch 145] average training loss: 3646.7066
cellbender:remove-background: [epoch 145] average test loss: 3649.2645
cellbender:remove-background: [epoch 146] average training loss: 3646.4411
cellbender:remove-background: [epoch 147] average training loss: 3646.0331
cellbender:remove-background: [epoch 148] average training loss: 3646.6610
cellbender:remove-background: [epoch 149] average training loss: 3646.0654
cellbender:remove-background: [epoch 150] average training loss: 3645.8741
cellbender:remove-background: [epoch 150] average test loss: 3651.3797
cellbender:remove-background: Inference procedure complete.
cellbender:remove-background: 2024-05-08 14:45:48
cellbender:remove-background: Preparing to write outputs to file...
cellbender:remove-background: Optimal posterior regularization factor = 2.37
cellbender:remove-background: Succeeded in writing CellRanger v3 format output to file /data/cellbender-outs/Control3_cellbender_corrected.h5
cellbender:remove-background: Succeeded in writing CellRanger v3 format output to file /data/cellbender-outs/Control3_cellbender_corrected_filtered.h5
cellbender:remove-background: Saved cell barcodes in /data/cellbender-outs/Control3_cellbender_corrected_cell_barcodes.csv
cellbender:remove-background: Saved summary plots as /data/cellbender-outs/Control3_cellbender_corrected.pdf
cellbender:remove-background: Completed remove-background.
cellbender:remove-background: 2024-05-08 14:59:50
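
The log above reports four files being written (two .h5 files, a barcodes CSV, and a PDF), while the metrics file and HTML report are the ones described as missing. A minimal Python sketch for listing which outputs are absent after a run follows. The suffix list is an assumption based on the file names described in the CellBender v0.3.0 documentation, not something confirmed in this thread, and missing_outputs is a hypothetical helper name.

```python
from pathlib import Path

# ASSUMPTION: these suffixes follow the output names listed in the
# CellBender v0.3.0 documentation; adjust them if your version differs.
EXPECTED_SUFFIXES = [
    ".h5",
    "_filtered.h5",
    "_cell_barcodes.csv",
    ".pdf",
    ".log",
    "_metrics.csv",
    "_report.html",
    "_posterior.h5",
]


def missing_outputs(output_arg):
    """Return the expected output paths that do not exist on disk.

    `output_arg` is the value passed to --output, e.g. ".../sample.h5".
    """
    stem = Path(output_arg).with_suffix("")  # drop the trailing ".h5"
    candidates = [Path(f"{stem}{suffix}") for suffix in EXPECTED_SUFFIXES]
    return [str(path) for path in candidates if not path.exists()]


if __name__ == "__main__":
    # Path taken from the command shown in the log above.
    output = "/data/cellbender-outs/Control3_cellbender_corrected.h5"
    for path in missing_outputs(output):
        print("missing:", path)
```

If only the older set of files exists, that is consistent with either an older CellBender version having run or the run stopping partway through output writing; the check alone cannot tell these apart.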

RobStrasser commented 3 weeks ago

Fellow CellBender user here. I saw the same behavior: the error ("RuntimeError: CUDA out of memory. Tried to allocate 2.29 GiB (GPU 0; 23.50 GiB total capacity; 2.59 GiB already allocated; 1.59 GiB free; 2.64 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF") was only visible outside the CellBender log, i.e. in the output printed directly to a screen session or in the log file of the SLURM job.

If that is indeed the case for you too, your sample may be too large for the GPU memory (your resource summary seems to show CPU figures only). Funnily enough, what worked for me was simply re-running the tool. I still don't know why.
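
Since the error only showed up in the scheduler's own output, one way to check for it is to scan that file for the message. This is a minimal sketch, assuming the PyTorch wording quoted above; find_cuda_oom is a hypothetical helper, and the file name in the usage note is a placeholder for wherever your scheduler writes stdout and stderr.

```python
import re

# Matches the PyTorch message quoted above and captures the requested
# allocation, e.g. "CUDA out of memory. Tried to allocate 2.29 GiB".
OOM_PATTERN = re.compile(
    r"CUDA out of memory\. Tried to allocate ([\d.]+) (MiB|GiB)"
)


def find_cuda_oom(log_text):
    """Return the requested allocation (in GiB) for each CUDA OOM message found."""
    sizes_gib = []
    for amount, unit in OOM_PATTERN.findall(log_text):
        value = float(amount)
        # Normalise MiB to GiB so the results are comparable.
        sizes_gib.append(value / 1024 if unit == "MiB" else value)
    return sizes_gib
```

For example, find_cuda_oom(open("job_output.txt", errors="replace").read()) returns an empty list when no such error was logged.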

Btw: My env is configured like this:

name: cellbender
channels:
  - conda-forge
  - pytorch
  - nvidia
dependencies:
# picked specifically according to https://github.com/broadinstitute/CellBender/issues/230#issuecomment-1689039233
  - python=3.7.12
  - pytorch=1.12.1
  - pytorch-cuda=11.7
  - pytorch-mutex=1.0
  - torchaudio=0.12.1
  - torchvision=0.13.0
  - pytables=3.7.0
  - pip
  - pip:
    - cellbender==0.3.0
    - lxml_html_clean # to solve issue with report writing
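
With an environment like the one above, it can be worth confirming which CellBender version the activated environment actually resolves to, since the original post only assumes that cellbender03 activates the new version. A minimal sketch, assuming the distribution is named cellbender as in the pip entry above; installed_version is a hypothetical helper.

```python
import sys


def installed_version(package):
    """Return the installed version string of `package`, or None if absent."""
    try:
        # Available from Python 3.8 onwards.
        from importlib.metadata import PackageNotFoundError, version
    except ImportError:
        # Python 3.7 (as pinned above) lacks importlib.metadata,
        # so fall back to setuptools' pkg_resources.
        import pkg_resources

        try:
            return pkg_resources.get_distribution(package).version
        except pkg_resources.DistributionNotFound:
            return None
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Shows which interpreter is active and what it sees installed.
    print("python:", sys.executable)
    print("cellbender:", installed_version("cellbender"))
```

Running this inside the batch job itself, after the conda activate line, shows what the job sees, which may differ from an interactive shell.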