BayraktarLab / cell2location

Comprehensive mapping of tissue cell architecture via integrated single cell and spatial transcriptomics (cell2location model)
https://cell2location.readthedocs.io/en/latest/
Apache License 2.0

"OutOfMemoryError: CUDA out of memory." in GPU mode #372

Open asmlgkj opened 4 months ago

asmlgkj commented 4 months ago

Thanks a lot. Here are the install steps:

export PYTHONNOUSERSITE="aaaaa"
conda create -y -n cell2location_cuda118_torch22 python=3.10
conda activate cell2location_cuda118_torch22

pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
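A quick sanity check (my addition, not part of the original steps) that the cu118 build of PyTorch actually sees the GPU could look like this:

import torch

# Confirm the CUDA build is installed and a GPU is visible.
print(torch.__version__, torch.version.cuda)   # e.g. 2.2.x and 11.8
print(torch.cuda.is_available())               # should print True
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))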

When running mod.train(max_epochs=250), I get:

GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
/home/aa/miniconda3/envs/cell2location/lib/python3.10/site-packages/lightning/pytorch/trainer/configuration_validator.py:72: You passed in a `val_dataloader` but have no `validation_step`. Skipping val loop.
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
/home/aa/miniconda3/envs/cell2location/lib/python3.10/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:441: The 'train_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument to `num_workers=23` in the `DataLoader` to improve performance.
Epoch 1/250:   0%| | 0/250 [00:00<?, ?it/s]

OutOfMemoryError                          Traceback (most recent call last)
OutOfMemoryError: CUDA out of memory. Tried to allocate 98.00 MiB. GPU

avpatel18 commented 3 months ago

I am getting the same error, is there any solution to this issue? Thanks!

vitkl commented 3 months ago

Are you referring to the Regression model or the Cell2location model?

The Regression model should not have any issues with this. You can check the availability of GPU memory with the nvidia-smi command.
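If useful, free GPU memory can also be queried from inside the Python session; this is a minimal sketch using PyTorch's built-in query (it assumes the GPU you train on is the current CUDA device):

import torch

# Free vs total memory on the current CUDA device, in GiB
# (roughly the same numbers nvidia-smi reports).
free, total = torch.cuda.mem_get_info()
print(f"free: {free / 1024**3:.1f} GiB, total: {total / 1024**3:.1f} GiB")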


avpatel18 commented 3 months ago

I am getting 'OutOfMemoryError: CUDA out of memory. Tried to allocate 1.85 GiB. GPU' with the Cell2location model. And it's not a memory issue, because I am assigning a lot more HPC resources than it needs.

It's actually very weird, because it works fine with the same object where I applied 'median_abs_deviation' filtering on 'log1p_total_counts' to each sample before concatenating; the difference between the two objects is only some 1700 spots. Do you know why some (outlier) spots would cause this error? Thanks @vitkl for your help!
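For context, a minimal sketch of what such per-sample MAD filtering could look like (the function name, the nmads=5 cutoff and the use of scanpy's QC metrics are assumptions for illustration, not the exact code used above):

import numpy as np
import scanpy as sc
from scipy.stats import median_abs_deviation

def drop_count_outliers(adata, nmads=5):
    # Compute QC metrics, including obs["log1p_total_counts"].
    sc.pp.calculate_qc_metrics(adata, percent_top=None, log1p=True, inplace=True)
    x = adata.obs["log1p_total_counts"].to_numpy()
    # Keep spots within nmads median absolute deviations of the median.
    keep = np.abs(x - np.median(x)) <= nmads * median_abs_deviation(x)
    return adata[keep].copy()

# Applied to each sample before concatenation, e.g.:
# filtered = [drop_count_outliers(a) for a in per_sample_adatas]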

LiuXintongPKU commented 3 months ago

I got the same problem... I found a large usage of GPU memory by "/usr/lib/rstudio-server/bin/rsession"; after ending this process with "kill -9 PID", the memory was released. But after running mod.train(max_epochs=30000, batch_size=None, train_size=1), a similar process popped up again and took up ~7000 MiB! I repeated this several times, which left me confused...
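To see which processes are holding GPU memory (and their PIDs) before deciding what to kill, something like the following should work (it assumes nvidia-smi is on the PATH; check your driver's nvidia-smi documentation for the exact query fields):

import subprocess

# List processes currently holding GPU memory, with PID and usage.
out = subprocess.run(
    ["nvidia-smi", "--query-compute-apps=pid,process_name,used_memory", "--format=csv"],
    capture_output=True, text=True, check=True,
)
print(out.stdout)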


LiuXintongPKU commented 3 months ago

I met the problem while running the Cell2location model:

mod = cell2location.models.Cell2location(
    adata_vis, cell_state_df=inf_aver, N_cells_per_location=10, detection_alpha=20
)
mod.train(max_epochs=30000, batch_size=None, train_size=1)

GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
/home/lxt/miniconda3/envs/cell2loc_env/lib/python3.9/site-packages/lightning/pytorch/trainer/configuration_validator.py:72: You passed in a `val_dataloader` but have no `validation_step`. Skipping val loop.
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
/home/lxt/miniconda3/envs/cell2loc_env/lib/python3.9/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:441: The 'train_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument to `num_workers=103` in the `DataLoader` to improve performance.
/home/lxt/miniconda3/envs/cell2loc_env/lib/python3.9/site-packages/lightning/pytorch/loops/fit_loop.py:293: The number of training batches (1) is smaller than the logging interval Trainer(log_every_n_steps=10). Set a lower value for log_every_n_steps if you want to see logs for the training epoch.
Epoch 1/30000:   0%| | 0/30000 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/lxt/miniconda3/envs/cell2loc_env/lib/python3.9/site-packages/cell2location/models/_cell2location_model.py", line 209, in train
    super().train(**kwargs)
  File "/home/lxt/miniconda3/envs/cell2loc_env/lib/python3.9/site-packages/scvi/model/base/_pyromixin.py", line 191, in train
    return runner()
  File "/home/lxt/miniconda3/envs/cell2loc_env/lib/python3.9/site-packages/scvi/train/_trainrunner.py", line 98, in __call__
    self.trainer.fit(self.training_plan, self.data_splitter)
  File "/home/lxt/miniconda3/envs/cell2loc_env/lib/python3.9/site-packages/scvi/train/_trainer.py", line 220, in fit
    super().fit(*args, **kwargs)
  File "/home/lxt/miniconda3/envs/cell2loc_env/lib/python3.9/site-packages/lightning/pytorch/trainer/trainer.py", line 544, in fit
    call._call_and_handle_interrupt(
  File "/home/lxt/miniconda3/envs/cell2loc_env/lib/python3.9/site-packages/lightning/pytorch/trainer/call.py", line 44, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/home/lxt/miniconda3/envs/cell2loc_env/lib/python3.9/site-packages/lightning/pytorch/trainer/trainer.py", line 580, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/home/lxt/miniconda3/envs/cell2loc_env/lib/python3.9/site-packages/lightning/pytorch/trainer/trainer.py", line 989, in _run
    results = self._run_stage()
  File "/home/lxt/miniconda3/envs/cell2loc_env/lib/python3.9/site-packages/lightning/pytorch/trainer/trainer.py", line 1035, in _run_stage
    self.fit_loop.run()
  File "/home/lxt/miniconda3/envs/cell2loc_env/lib/python3.9/site-packages/lightning/pytorch/loops/fit_loop.py", line 202, in run
    self.advance()
  File "/home/lxt/miniconda3/envs/cell2loc_env/lib/python3.9/site-packages/lightning/pytorch/loops/fit_loop.py", line 359, in advance
    self.epoch_loop.run(self._data_fetcher)
  File "/home/lxt/miniconda3/envs/cell2loc_env/lib/python3.9/site-packages/lightning/pytorch/loops/training_epoch_loop.py", line 136, in run
    self.advance(data_fetcher)
  File "/home/lxt/miniconda3/envs/cell2loc_env/lib/python3.9/site-packages/lightning/pytorch/loops/training_epoch_loop.py", line 242, in advance
    batch_output = self.manual_optimization.run(kwargs)
  File "/home/lxt/miniconda3/envs/cell2loc_env/lib/python3.9/site-packages/lightning/pytorch/loops/optimization/manual.py", line 92, in run
    self.advance(kwargs)
  File "/home/lxt/miniconda3/envs/cell2loc_env/lib/python3.9/site-packages/lightning/pytorch/loops/optimization/manual.py", line 112, in advance
    training_step_output = call._call_strategy_hook(trainer, "training_step", *kwargs.values())
  File "/home/lxt/miniconda3/envs/cell2loc_env/lib/python3.9/site-packages/lightning/pytorch/trainer/call.py", line 309, in _call_strategy_hook
    output = fn(*args, **kwargs)
  File "/home/lxt/miniconda3/envs/cell2loc_env/lib/python3.9/site-packages/lightning/pytorch/strategies/strategy.py", line 382, in training_step
    return self.lightning_module.training_step(*args, **kwargs)
  File "/home/lxt/miniconda3/envs/cell2loc_env/lib/python3.9/site-packages/scvi/train/_trainingplans.py", line 1048, in training_step
    loss = torch.Tensor([self.svi.step(*args, **kwargs)])
  File "/home/lxt/miniconda3/envs/cell2loc_env/lib/python3.9/site-packages/pyro/infer/svi.py", line 145, in step
    loss = self.loss_and_grads(self.model, self.guide, *args, **kwargs)
  File "/home/lxt/miniconda3/envs/cell2loc_env/lib/python3.9/site-packages/pyro/infer/trace_elbo.py", line 140, in loss_and_grads
    for model_trace, guide_trace in self._get_traces(model, guide, args, kwargs):
  File "/home/lxt/miniconda3/envs/cell2loc_env/lib/python3.9/site-packages/pyro/infer/elbo.py", line 237, in _get_traces
    yield self._get_trace(model, guide, args, kwargs)
  File "/home/lxt/miniconda3/envs/cell2loc_env/lib/python3.9/site-packages/pyro/infer/trace_elbo.py", line 57, in _get_trace
    model_trace, guide_trace = get_importance_trace(
  File "/home/lxt/miniconda3/envs/cell2loc_env/lib/python3.9/site-packages/pyro/infer/enum.py", line 75, in get_importance_trace
    model_trace.compute_log_prob()
  File "/home/lxt/miniconda3/envs/cell2loc_env/lib/python3.9/site-packages/pyro/poutine/trace_struct.py", line 264, in compute_log_prob
    log_p = site["fn"].log_prob(
  File "/home/lxt/miniconda3/envs/cell2loc_env/lib/python3.9/site-packages/pyro/distributions/conjugate.py", line 283, in log_prob
    -log_beta(self.concentration, value + 1)
  File "/home/lxt/miniconda3/envs/cell2loc_env/lib/python3.9/site-packages/pyro/ops/special.py", line 68, in log_beta
    return x.lgamma() + y.lgamma() - (x + y).lgamma()
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 548.00 MiB. GPU


vitkl commented 3 months ago

@avpatel18 You probably need to look into GPU memory settings rather than RAM settings on your cluster.

vitkl commented 3 months ago

The cell2location.models.Cell2location model needs a large amount of GPU memory. For example, in 80 GB of GPU memory you can fit a dataset with n_obs ~ 60k and n_vars ~ 18k; 7 GB would be enough for only 1-2 Visium sections.
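A crude back-of-the-envelope check before calling mod.train(), extrapolating from the 80 GB / ~60k x ~18k figure above (the linear scaling is my assumption, not a documented rule):

import torch

n_obs, n_vars = adata_vis.n_obs, adata_vis.n_vars  # the AnnData passed to Cell2location
free_gib = torch.cuda.mem_get_info()[0] / 1024**3
needed_gib = 80 * (n_obs / 60_000) * (n_vars / 18_000)  # naive linear extrapolation
print(f"rough estimate: ~{needed_gib:.0f} GiB needed, {free_gib:.0f} GiB free")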