broadinstitute / CellBender

CellBender is a software package for eliminating technical artifacts from high-throughput single-cell RNA sequencing (scRNA-seq) data.
https://cellbender.rtfd.io
BSD 3-Clause "New" or "Revised" License

reuse of checkpoint file #266

Open qianzhengzong opened 1 year ago

qianzhengzong commented 1 year ago

Hi, great authors. I ran CellBender on Colab. The training part succeeded, but output generation failed, so I downloaded the "ckpt.tar.gz" file to generate the output locally. However, I cannot seem to reuse the checkpoint file locally, even with the "--checkpoint" and "--fpr" flags specified: the log reports "Workflow hash does not match that of checkpoint". Since I have no GPU on my local machine, do you have any idea how to reuse the checkpoint file? :) Many thanks!

piece of local log:

/tmp/tmpncpo07v2/312105cc26_params.pyro
/tmp/tmpncpo07v2/312105cc26_train.loaderstate
/tmp/tmpncpo07v2/312105cc26_test.loaderstate
/tmp/tmpncpo07v2/312105cc26_args.npy
cellbender:remove-background: Workflow hash does not match that of checkpoint.
cellbender:remove-background: No checkpoint loaded.
cellbender:remove-background: Running inference...
^Ccellbender:remove-background: Inference procedure stopped by keyboard interrupt... will save a checkpoint.
cellbender:remove-background: Saving a checkpoint...
^Ccellbender:remove-background: Keyboard interrupt: will not save checkpoint

piece of colab log:

cellbender:remove-background: Working on chunk (21/255)
cellbender:remove-background: Working on chunk (22/255)
cellbender:remove-background: Working on chunk (23/255)
cellbender:remove-background: Working on chunk (24/255)
cellbender:remove-background: Working on chunk (25/255)
cellbender:remove-background: Working on chunk (26/255)
cellbender:remove-background: Working on chunk (27/255)
cellbender:remove-background: Working on chunk (28/255)
cellbender:remove-background: Working on chunk (29/255)
cellbender:remove-background: Working on chunk (30/255)

CalledProcessError Traceback (most recent call last)
in ()
----> 1 get_ipython().run_cell_magic('shell', '', 'eval "$(conda shell.bash hook)"\nconda activate cellbender\npython --version\npip install -q cellbender\ncellbender remove-background --cuda --input /content/drive/MyDrive/raw_feature_bc_matrix.h5 --output result.h5\n')

3 frames
/usr/local/lib/python3.10/dist-packages/google/colab/_system_commands.py in check_returncode(self)
    135   def check_returncode(self):
    136     if self.returncode:
--> 137       raise subprocess.CalledProcessError(
    138           returncode=self.returncode, cmd=self.args, output=self.output
    139       )

CalledProcessError: Command 'eval "$(conda shell.bash hook)"
conda activate cellbender
python --version
pip install -q cellbender
cellbender remove-background --cuda --input /content/drive/MyDrive/raw_feature_bc_matrix.h5 --output result.h5
' died with .

sjfleming commented 1 year ago

Hi @qianzhengzong , I think your best option would be to retry training on the colab GPU. I think what's happening (based on the log file) is the error in #251, which is related to the work in #263.

That "SIGKILL: 9" error likely means the machine ran out of memory. If you can, please try to install cellbender (in the notebook), with the following command instead of pip install cellbender:

pip install --no-cache-dir -U git+https://github.com/broadinstitute/CellBender.git@7fd0dac8fe5c37e705cdd50fa5767064f8f4b980

(note to self: this is a commit from the sf_memory_efficient_posterior_generation branch on Aug 30, 2023)

Hopefully these code changes will help the process complete successfully on the colab GPU.
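For readers wondering where the "SIGKILL: 9" wording comes from: Python's subprocess module reports a child that was terminated by a signal with a negative return code, and CalledProcessError turns that into "died with <Signals.SIGKILL: 9>". The sketch below is only an illustration of that convention on a POSIX system (such as the Linux machines behind colab); the helper name describe_returncode is made up for this example and is not part of CellBender or colab. A SIGKILL by itself does not prove an out-of-memory condition, it is just the most common cause in this situation.

```python
import signal
import subprocess
import sys


def describe_returncode(returncode):
    """Translate a subprocess return code into a readable message.

    Hypothetical helper for illustration. On POSIX, a negative return
    code -N means the child was terminated by signal N.
    """
    if returncode < 0:
        return f"killed by {signal.Signals(-returncode).name}"
    return f"exited with status {returncode}"


if __name__ == "__main__":
    # Start a child Python process that sends SIGKILL to itself, which is
    # roughly what the kernel's out-of-memory killer does to a process.
    child_code = "import os, signal; os.kill(os.getpid(), signal.SIGKILL)"
    result = subprocess.run([sys.executable, "-c", child_code])
    print(describe_returncode(result.returncode))
```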

qianzhengzong commented 1 year ago

Hi @sjfleming , yes, the sf_memory_efficient_posterior_generation branch got me further on colab, but the process was still killed after all chunks finished. Since colab has a limit of 13 GB of free CPU memory, I tried a smaller .h5 file: the result_posterior.h5 file was produced, but the run still did not fully succeed. Finally, with a much smaller .h5 file, the run completed successfully.

Below is the log in case you want to improve the memory usage part based on it :)

large .h5 colab logs:

cellbender:remove-background: Working on chunk (254/255)
cellbender:remove-background: Working on chunk (255/255)

CalledProcessError Traceback (most recent call last)
in <cell line: 1>()
----> 1 get_ipython().run_cell_magic('shell', '', 'eval "$(conda shell.bash hook)"\nconda activate cellbender\npython --version\n# pip install -q cellbender\npip install --no-cache-dir -U git+https://github.com/broadinstitute/CellBender.git@7fd0dac8fe5c37e705cdd50fa5767064f8f4b980\ncellbender remove-background --cuda --input /content/drive/MyDrive/raw_feature_bc_matrix.h5 --output result.h5\n')

3 frames
/usr/local/lib/python3.10/dist-packages/google/colab/_system_commands.py in check_returncode(self)
    135   def check_returncode(self):
    136     if self.returncode:
--> 137       raise subprocess.CalledProcessError(
    138           returncode=self.returncode, cmd=self.args, output=self.output
    139       )

CalledProcessError: Command 'eval "$(conda shell.bash hook)"
conda activate cellbender
python --version
pip install --no-cache-dir -U git+https://github.com/broadinstitute/CellBender.git@7fd0dac8fe5c37e705cdd50fa5767064f8f4b980
cellbender remove-background --cuda --input /content/drive/MyDrive/raw_feature_bc_matrix.h5 --output result.h5
' died with <Signals.SIGKILL: 9>.

smaller .h5 colab log:

cellbender:remove-background: Working on chunk (73/73)
cellbender:remove-background: Writing full posterior to result_posterior.h5
cellbender:remove-background: Succeeded in writing posterior to file result_posterior.h5
cellbender:remove-background: Added posterior object to checkpoint file.
cellbender:remove-background: 2023-09-01 05:59:17

cellbender:remove-background: Saved summary plots as result.pdf
cellbender:remove-background: Saved cell barcodes in result_cell_barcodes.csv
cellbender:remove-background: Computing target noise counts per gene for MCKP estimator

CalledProcessError Traceback (most recent call last)
in <cell line: 1>()
----> 1 get_ipython().run_cell_magic('shell', '', 'eval "$(conda shell.bash hook)"\nconda activate cellbender\npython --version\n# pip install -q cellbender\npip install --no-cache-dir -U git+https://github.com/broadinstitute/CellBender.git@7fd0dac8fe5c37e705cdd50fa5767064f8f4b980\ncellbender remove-background --cuda --input /content/drive/MyDrive/raw_feature_bc_matrix.h5 --output result.h5\n')

3 frames
/usr/local/lib/python3.10/dist-packages/google/colab/_system_commands.py in check_returncode(self)
    135   def check_returncode(self):
    136     if self.returncode:
--> 137       raise subprocess.CalledProcessError(
    138           returncode=self.returncode, cmd=self.args, output=self.output
    139       )

CalledProcessError: Command 'eval "$(conda shell.bash hook)"
conda activate cellbender
python --version
pip install --no-cache-dir -U git+https://github.com/broadinstitute/CellBender.git@7fd0dac8fe5c37e705cdd50fa5767064f8f4b980
cellbender remove-background --cuda --input /content/drive/MyDrive/raw_feature_bc_matrix.h5 --output result.h5
' died with <Signals.SIGKILL: 9>.

sjfleming commented 1 year ago

Hi @qianzhengzong , thanks for the reply. I didn't realize colab had a CPU memory limit of 13GB. Yes this will be hard to use for large samples...

Okay so your real question was: if you do the training on a colab GPU, can you finish the job on CPU by re-using the checkpoint. Let me try to answer that question!

The answer is, yes, I hope so! But it might not be quite so simple in practice. I use a "workflow hash code" to try to ensure that a checkpoint that's being re-used is appropriate for re-use (because the run parameters and cellbender source code are identical). I don't actually know if this will work appropriately if you run on one machine and then try to run on another machine. It might! I hope it does! Try using the --checkpoint input argument to specify your checkpoint file saved from the colab run.
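To make the idea of a "workflow hash" more concrete, here is a toy sketch of how such a code could be derived from run parameters and a version string. This is not CellBender's actual hashing code, which this thread does not show; the function name toy_workflow_hash, the choice of SHA-256, the 10-character length, and the example arguments are all assumptions made purely for illustration.

```python
import hashlib
import json


def toy_workflow_hash(version, args, length=10):
    """Return a short hex digest of a version string plus run arguments.

    Hypothetical illustration only: it shows the general idea that the
    same inputs always give the same code, while any change gives a
    different one.
    """
    # sort_keys makes the result independent of dictionary ordering.
    payload = json.dumps({"version": version, "args": args}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:length]


if __name__ == "__main__":
    # Example arguments are invented for this sketch.
    first = toy_workflow_hash("0.3.0", {"input": "raw.h5", "cuda": True})
    second = toy_workflow_hash("0.3.0", {"input": "raw.h5", "cuda": False})
    print(first, second, first == second)
```

Under a scheme like this toy one, changing any hashed argument between two runs would produce two different codes, which is one plausible way a checkpoint made on one machine could fail the match on another. Whether that is what happens in CellBender depends on which inputs it actually hashes.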

If it says "workflow hashcode does not match" and starts to re-do the training, then we will have to hack our way around it. The easiest way to hack around it would be the following:

The log file starts with lines that look like this:

cellbender:remove-background: CellBender 0.3.0
cellbender:remove-background: (Workflow hash ee55b84ac9)

Then it will show the workflow hash of the checkpoint file when it tries to open the checkpoint:

cellbender:remove-background: Attempting to unpack tarball "ckpt.tar.gz" to /var/folders/p4/dmtz_tld60z0n_xnqfx61j64rjplbx/T/tmp73q8u8ku
cellbender:remove-background: Successfully unpacked tarball to /var/folders/p4/dmtz_tld60z0n_xnqfx61j64rjplbx/T/tmp73q8u8ku
/var/folders/p4/dmtz_tld60z0n_xnqfx61j64rjplbx/T/tmp73q8u8ku/284932b0a1_optim.pyro
/var/folders/p4/dmtz_tld60z0n_xnqfx61j64rjplbx/T/tmp73q8u8ku/284932b0a1_params.pyro
/var/folders/p4/dmtz_tld60z0n_xnqfx61j64rjplbx/T/tmp73q8u8ku/284932b0a1_random.pyro
/var/folders/p4/dmtz_tld60z0n_xnqfx61j64rjplbx/T/tmp73q8u8ku/284932b0a1_train.loaderstate
/var/folders/p4/dmtz_tld60z0n_xnqfx61j64rjplbx/T/tmp73q8u8ku/284932b0a1_optim.torch
/var/folders/p4/dmtz_tld60z0n_xnqfx61j64rjplbx/T/tmp73q8u8ku/284932b0a1_args.npy
/var/folders/p4/dmtz_tld60z0n_xnqfx61j64rjplbx/T/tmp73q8u8ku/284932b0a1_model.torch
/var/folders/p4/dmtz_tld60z0n_xnqfx61j64rjplbx/T/tmp73q8u8ku/284932b0a1_test.loaderstate

When you run on CPU, look for the workflow hash in the log (Workflow hash ee55b84ac9 in this example) and make a note of it. Also note the workflow hash from the GPU run, which is the first part of the filenames inside the tarball (284932b0a1 in this example). CellBender demands that they match.
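Reading the two hashes by eye works fine, but they can also be pulled out with a few lines of Python. The sketch below assumes the log line format and the "<hash>_<name>" filename pattern shown in the example output above; the function names hash_from_log and hashes_in_tarball are made up for this example and are not part of CellBender.

```python
import os
import re
import tarfile

# Matches the "(Workflow hash ee55b84ac9)" line near the top of the log.
HASH_IN_LOG = re.compile(r"\(Workflow hash ([0-9a-f]+)\)")


def hash_from_log(log_text):
    """Return the first workflow hash found in the log text, or None."""
    match = HASH_IN_LOG.search(log_text)
    return match.group(1) if match else None


def hashes_in_tarball(path):
    """Return the set of filename prefixes (text before the first "_")
    of the regular files inside a gzipped checkpoint tarball."""
    with tarfile.open(path, "r:gz") as tar:
        names = [os.path.basename(m.name) for m in tar.getmembers() if m.isfile()]
    return {name.split("_", 1)[0] for name in names if "_" in name}
```

If hashes_in_tarball("ckpt.tar.gz") returns a single value and it differs from what hash_from_log finds in the CPU run's log, the checkpoint will be rejected.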

If you manually go in and change those filenames to match the CPU workflow hash (ee55b84ac9 in this case), then CellBender will be able to use that checkpoint.
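The renaming can be done by hand (unpack, rename, re-pack), or with a short script. Below is a minimal sketch, assuming the checkpoint is a gzipped tarball whose files are named "<hash>_<name>" as in the listing above. The function name rehash_checkpoint is hypothetical and not part of CellBender, and the script only changes filenames: it does not touch file contents.

```python
import os
import tarfile


def rehash_checkpoint(src, dst, old_hash, new_hash):
    """Copy tarball src to dst, renaming files "<old_hash>_*" to "<new_hash>_*".

    src and dst must be different paths. Returns the list of file names
    written to dst.
    """
    prefix = old_hash + "_"
    written = []
    with tarfile.open(src, "r:gz") as tar_in, tarfile.open(dst, "w:gz") as tar_out:
        for member in tar_in.getmembers():
            if not member.isfile():
                # Directories and other entries are copied through unchanged.
                tar_out.addfile(member)
                continue
            dirname, base = os.path.split(member.name)
            if base.startswith(prefix):
                base = new_hash + base[len(old_hash):]
            # Build a fresh header so the new name is what gets stored.
            info = tarfile.TarInfo(name=os.path.join(dirname, base))
            info.size = member.size
            info.mtime = member.mtime
            info.mode = member.mode
            tar_out.addfile(info, tar_in.extractfile(member))
            written.append(info.name)
    return written
```

For the example hashes above, this would be called as rehash_checkpoint("ckpt.tar.gz", "newckpt.tar.gz", "284932b0a1", "ee55b84ac9"), and the new file would then be passed to --checkpoint.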

sjfleming commented 1 year ago

I hope it "just works" automatically by specifying --checkpoint and that you don't have to go through that trouble.

qianzhengzong commented 1 year ago

Hi @sjfleming , many thanks for the reply. I'll hack around it when I get time, but I think the better choice is to get a GPU machine with plenty of CPU memory to run this tool properly :)

kiklata commented 1 year ago

Hi @sjfleming , after obtaining the ckpt.tar.gz file on colab using a GPU, I followed your advice and manually changed the filenames of the checkpoint files in order to use it on a local, CPU-only machine. However, it seems a GPU is required to calculate the posterior.

cellbender:remove-background: Loaded partially-trained checkpoint from newckpt.tar.gz  
cellbender:remove-background: Checkpoint loaded successfully. 
cellbender:remove-background: Running inference... 
cellbender:remove-background: 2023-11-18 00:02:03 
cellbender:remove-background: Inference procedure complete. 
cellbender:remove-background: Attempting to unpack tarball "newckpt.tar.gz" to /tmp/tmpu__d5kz8 
cellbender:remove-background: Successfully unpacked tarball to /tmp/tmpu__d5kz8 
/tmp/tmpu__d5kz8/df7718350c_model.torch 
/tmp/tmpu__d5kz8/df7718350c_optim.torch
/tmp/tmpu__d5kz8/df7718350c_train.loaderstate
/tmp/tmpu__d5kz8/df7718350c_params.pyro 
/tmp/tmpu__d5kz8/df7718350c_test.loaderstate 
/tmp/tmpu__d5kz8/df7718350c_random.cuda 
/tmp/tmpu__d5kz8/df7718350c_random.pyro 
/tmp/tmpu__d5kz8/df7718350c_optim.pyro 
/tmp/tmpu__d5kz8/df7718350c_args.npy 
cellbender:remove-background: Posterior not currently included in checkpoint. 
....
RuntimeError: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx