asmlgkj opened 4 months ago
I am getting the same error, is there any solution to this issue? Thanks!
Are you referring to the Regression model or the Cell2location model? The Regression model should not have any issues with this. You can check the availability of GPU memory with the `nvidia-smi` command.
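For a quick check from inside Python instead of `nvidia-smi`, something like this should also work (a minimal sketch; assumes a CUDA build of PyTorch and at least one visible GPU):

```python
import torch

# free and total memory in bytes on the current CUDA device
free, total = torch.cuda.mem_get_info()
print(f"free: {free / 1024**3:.1f} GiB / total: {total / 1024**3:.1f} GiB")
```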
I am getting `OutOfMemoryError: CUDA out of memory. Tried to allocate 1.85 GiB. GPU` with the Cell2location model. And it's not a memory issue, because I am assigning far more HPC resources than it needs.

It's actually very weird, because it works fine with the same object where I applied `median_abs_deviation` filtering on `log1p_total_counts` to each sample before concatenating; the difference between the two is only some 1,700 spots. Do you know why some (outlier) spots cause this error? Thanks @vitkl for your help!
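For reference, a minimal sketch of the kind of per-sample MAD filtering described above (names like `samples` are hypothetical; assumes `log1p_total_counts` was computed with `sc.pp.calculate_qc_metrics`, and `nmads=5` is an arbitrary cutoff):

```python
import anndata as ad
import numpy as np
from scipy.stats import median_abs_deviation

def is_mad_outlier(adata, metric="log1p_total_counts", nmads=5):
    # flag spots more than nmads MADs away from the per-sample median
    vals = adata.obs[metric].to_numpy()
    med = np.median(vals)
    mad = median_abs_deviation(vals)
    return (vals < med - nmads * mad) | (vals > med + nmads * mad)

# filter each sample separately, then concatenate
filtered = [s[~is_mad_outlier(s)].copy() for s in samples]
adata_vis = ad.concat(filtered)
```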
I got the same problem... I found a large usage of GPU memory by `/usr/lib/rstudio-server/bin/rsession`; after ending this process with `kill -9 PID`, the memory was released. But after running `mod.train(max_epochs=30000, batch_size=None, train_size=1)`, a similar process popped up again and took up ~7000 MiB again! I repeated this several times, which left me confused...
> Are you referring to the Regression model or the Cell2location model? The Regression model should not have any issues with this. You can check the availability of GPU memory with the `nvidia-smi` command.
I met the problem while running the Cell2location model:
```python
mod = cell2location.models.Cell2location(
    adata_vis, cell_state_df=inf_aver,
    N_cells_per_location=10,
    detection_alpha=20
)
mod.train(max_epochs=30000, batch_size=None, train_size=1)
```
```
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
/home/lxt/miniconda3/envs/cell2loc_env/lib/python3.9/site-packages/lightning/pytorch/trainer/configuration_validator.py:72: You passed in a `val_dataloader` but have no `validation_step`. Skipping val loop.
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
/home/lxt/miniconda3/envs/cell2loc_env/lib/python3.9/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:441: The 'train_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument to `num_workers=103` in the `DataLoader` to improve performance.
/home/lxt/miniconda3/envs/cell2loc_env/lib/python3.9/site-packages/lightning/pytorch/loops/fit_loop.py:293: The number of training batches (1) is smaller than the logging interval Trainer(log_every_n_steps=10). Set a lower value for log_every_n_steps if you want to see logs for the training epoch.
Epoch 1/30000:   0%| | 0/30000 [00:00<?, ?it/s]Traceback (most recent call last):
  File "
```
> Are you referring to the Regression model or the Cell2location model? The Regression model should not have any issues with this. You can check the availability of GPU memory with the `nvidia-smi` command.
@avpatel18 You probably need to look into GPU memory settings rather than RAM settings on your cluster. The `cell2location.models.Cell2location` model needs a large amount of GPU memory. For example, with 80 GB of GPU memory you can fit a dataset with n_obs ~ 60k and n_vars ~ 18k; 7 GB would be enough for only 1-2 Visium sections.
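For rough sizing, a back-of-envelope helper anchored to the figures quoted above (this assumes memory scales roughly linearly with n_obs × n_vars, which is only a heuristic, not an exact model of cell2location's footprint):

```python
def rough_gpu_gib_needed(n_obs, n_vars):
    # anchored to the quoted figure: ~80 GB for n_obs ~60k, n_vars ~18k;
    # assumes roughly linear scaling in n_obs * n_vars (heuristic only)
    return 80.0 * (n_obs * n_vars) / (60_000 * 18_000)

print(rough_gpu_gib_needed(adata_vis.n_obs, adata_vis.n_vars))
```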
Thanks a lot. Here are the install steps:

```bash
export PYTHONNOUSERSITE="aaaaa"
conda create -y -n cell2location_cuda118_torch22 python=3.10
conda activate cell2location_cuda118_torch22
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
```
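After installing, it may be worth confirming that the CUDA 11.8 build of torch is the one actually active in the environment, e.g.:

```python
import torch

print(torch.__version__)          # should show a +cu118 build
print(torch.version.cuda)         # expected: 11.8
print(torch.cuda.is_available())  # expected: True
```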
When running `mod.train(max_epochs=250)`:
```
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
/home/aa/miniconda3/envs/cell2location/lib/python3.10/site-packages/lightning/pytorch/trainer/configuration_validator.py:72: You passed in a `val_dataloader` but have no `validation_step`. Skipping val loop.
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
/home/aa/miniconda3/envs/cell2location/lib/python3.10/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:441: The 'train_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument to `num_workers=23` in the `DataLoader` to improve performance.
Epoch 1/250:   0%| | 0/250 [00:00<?, ?it/s]
OutOfMemoryError                          Traceback (most recent call last)
OutOfMemoryError: CUDA out of memory. Tried to allocate 98.00 MiB. GPU
```