biomap-research / scFoundation

Apache License 2.0

CUDA Out of Memory Error when Running get_embedding.py on Small Dataset #33

Open 00dylan00 opened 1 week ago

00dylan00 commented 1 week ago

I encountered a CUDA Out of Memory error when running the script get_embedding.py with a small dataset containing 2 rows. Below are the details of the error and the command used to run the script.

Also what is your suggested environment for running scFoundation? how much GPU capacity is recommended?

Command Used:

sbatch test.3.sh /home/sbnb/ddalton/projects/scFoundation/model/get_embedding.py --task_name SCAD_bulk_Etoposide --input_type bulk --output_type cell --pool_type all --tgthighres f1 --data_path X_df_sample.csv --save_path ./ --pre_normalized F --version ce --demo

X_df_sample.csv contains the same data as X_df.csv but with only 2 rows.

Error Log:

Traceback (most recent call last):
  File "/home/sbnb/ddalton/projects/scFoundation/model/get_embedding.py", line 305, in <module>
    main()
  File "/home/sbnb/ddalton/projects/scFoundation/model/get_embedding.py", line 232, in main
    geneemb = pretrainmodel.encoder(x,x_padding)
  File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/sbnb/ddalton/projects/scFoundation/model/pretrainmodels/transformer.py", line 42, in forward
    x = mod(x, src_key_padding_mask=padding_mask) # , src_mask=mask, src_key_padding_mask=src_key_padding_mask)
  File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/transformer.py", line 506, in forward
    return torch._transformer_encoder_layer_fwd(
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 10.45 GiB (GPU 0; 23.69 GiB total capacity; 21.57 GiB already allocated; 980.06 MiB free; 21.62 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
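For reference, the allocator setting the error message mentions is applied via an environment variable before launching the script; a sketch (note that this only helps when reserved memory far exceeds allocated memory, i.e. fragmentation, so it may not fix this particular failure, where a single 10.45 GiB allocation is requested):

```shell
# Allocator hint from the OOM message; mitigates fragmentation only,
# it does not reduce the size of any individual tensor allocation
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
```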

Memory Tracking I also tracked memory usage with this function:

import torch

def get_cuda_info():
    # Report allocated / reserved / peak-reserved memory on GPU 0
    mem_alloc = "%fGB" % (torch.cuda.memory_allocated(0)/1024/1024/1024)
    mem_reserved = "%fGB" % (torch.cuda.memory_reserved(0)/1024/1024/1024)
    max_memory_reserved = "%fGB" % (torch.cuda.max_memory_reserved(0)/1024/1024/1024)
    return "GPU alloc: {}. Reserved: {}. MaxReserved: {}".format(mem_alloc, mem_reserved, max_memory_reserved)

I called it at various steps in the get_embedding.py script, just before geneemb = pretrainmodel.encoder(x, x_padding):

            #Cell embedding
            if args.output_type=='cell':
                position_gene_ids, _ = gatherData(data_gene_ids, value_labels, pretrainconfig['pad_token_id'])

                print(get_cuda_info())

                x = pretrainmodel.token_emb(torch.unsqueeze(x, 2).float(), output_weight = 0)
                print(x.shape)

                print(get_cuda_info())

                position_emb = pretrainmodel.pos_emb(position_gene_ids)
                x += position_emb

                print(get_cuda_info())

With the following output:

  0%|          | 0/2 [00:03<?, ?it/s]
GPU alloc: 0.445247GB. Reserved: 0.494141GB. MaxReserved: 0.494141GB
torch.Size([1, 15291, 768])
GPU alloc: 0.488881GB. Reserved: 0.558594GB. MaxReserved: 0.558594GB
GPU alloc: 0.532629GB. Reserved: 0.603516GB. MaxReserved: 0.603516GB
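A plausible explanation for the jump from ~0.6 GB tracked here to a single 10.45 GiB allocation inside the encoder is the full self-attention score matrix over all 15291 non-padded gene tokens. A back-of-the-envelope check (assuming 12 attention heads and fp32 scores, which I have not verified against pretrainconfig):

```python
seq_len = 15291          # token dimension from torch.Size([1, 15291, 768])
heads = 12               # assumed head count; not confirmed from the model config
bytes_per_float = 4      # fp32

# One attention score matrix per head: heads * seq_len^2 floats
attn_bytes = heads * seq_len * seq_len * bytes_per_float
print(attn_bytes / 1024**3)  # ~10.45 GiB, matching the failed allocation
```

Under those assumptions the attention scores alone account for the 10.45 GiB request, which is why even a 2-row input can exhaust a 24 GB card when nearly all genes are non-zero.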

Environment Details: PyTorch 1.13.1+cu117, CUDA 11.7, GPU with 24 GB total capacity.

Thanks in advance!

WhirlFirst commented 5 days ago

Hi, the GPU memory required by scFoundation depends on the sparsity of the cell expression vector (i.e., how many genes are expressed per cell), not on the number of cells. We recommend an A100 40 GB or 80 GB for local inference.
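Since memory scales with the number of expressed (non-zero) genes per row rather than the row count, a quick sparsity check on the input predicts whether a sample will fit. A minimal sketch (the 19264-gene vector length and the contrast between bulk and single-cell sparsity are illustrative assumptions, not read from the user's data):

```python
import numpy as np

def nonzero_genes_per_row(expr):
    """Count expressed (non-zero) genes per cell/sample row."""
    expr = np.asarray(expr)
    return (expr != 0).sum(axis=1)

# Bulk profiles typically express far more genes than single cells, so even
# 2 bulk rows produce very long token sequences after padding is gathered out.
bulk_row = np.ones((1, 19264))            # nearly all genes non-zero
sc_row = np.zeros((1, 19264))
sc_row[0, :2000] = 1.0                    # ~2k expressed genes, sparse cell

print(nonzero_genes_per_row(bulk_row))    # [19264]
print(nonzero_genes_per_row(sc_row))      # [2000]
```

Rows with tens of thousands of non-zero genes (as in the bulk input above) are the ones that blow up encoder memory, regardless of how few rows the file has.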