ActiveVisionLab / gaussctrl

[ECCV 2024] GaussCtrl: Multi-View Consistent Text-Driven 3D Gaussian Splatting Editing
https://gaussctrl.active.vision/
BSD 3-Clause "New" or "Revised" License

CUDA out of memory error #4

Closed jungeun122333 closed 1 month ago

jungeun122333 commented 1 month ago

Dear authors, thank you for your impressive work.

I was trying to reproduce your results using the provided script for the bear dataset. However, when I run the following script,

CUDA_VISIBLE_DEVICES=0 ns-train gaussctrl --load-checkpoint unedited_models/bear/splatfacto/2024-09-06_145346/nerfstudio_models/step-000029999.ckpt --experiment-name bear --output-dir outputs --pipeline.datamanager.data ../dataset/bear --pipeline.prompt "a photo of a polar bear in the forest" --pipeline.guidance_scale 5 --pipeline.chunk_size 3 --pipeline.langsam_obj 'bear' --viewer.quit-on-train-completion True 

It fails with a CUDA out-of-memory error, and it seems to be trying to allocate another 27 GiB (!).

I'm confused since you said you used a 24 GB NVIDIA RTX 6000, and I'm also using a 24 GB NVIDIA RTX 3090.

Do you have any idea why this issue is happening? Any kind of advice would be very helpful.

Here is the full error trace:

Done Reset Attention Processor
#############################
Start Editing:
Reference views are [5, 12, 30, 32]
#############################
Generating view: [0, 1, 2]
  0%|                                                               | 0/20 [00:00<?, ?it/s]
Traceback (most recent call last):                                                                                                                                                             
  File "/home/server43/miniconda3/envs/je_3dedit/bin/ns-train", line 8, in <module>                                                                                                            
    sys.exit(entrypoint())              
             ^^^^^^^^^^^^                                                                                                                                                                      
  File "/home/server43/miniconda3/envs/je_3dedit/lib/python3.11/site-packages/nerfstudio/scripts/train.py", line 262, in entrypoint                                                            
    main(                                  
  File "/home/server43/miniconda3/envs/je_3dedit/lib/python3.11/site-packages/nerfstudio/scripts/train.py", line 247, in main      
    launch(                                                                                                                                                                                    
  File "/home/server43/miniconda3/envs/je_3dedit/lib/python3.11/site-packages/nerfstudio/scripts/train.py", line 189, in launch
    main_func(local_rank=0, world_size=world_size, config=config)                                                                                                                              
  File "/home/server43/miniconda3/envs/je_3dedit/lib/python3.11/site-packages/nerfstudio/scripts/train.py", line 99, in train_loop                                                             
    trainer.setup()             
  File "/home/server43/jungeun_workspace/3D_Edit/gaussctrl/gaussctrl/gc_trainer.py", line 78, in setup                             
    self.pipeline.edit_images()                                                                                    
 File "/home/server43/jungeun_workspace/3D_Edit/gaussctrl/gaussctrl/gc_pipeline.py", line 205, in edit_images                                                                                 
    chunk_edited = self.pipe(                                                                                                                                                                  
                   ^^^^^^^^^^
  File "/home/server43/miniconda3/envs/je_3dedit/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/server43/miniconda3/envs/je_3dedit/lib/python3.11/site-packages/diffusers/pipelines/controlnet/pipeline_controlnet.py", line 1234, in __call__
    down_block_res_samples, mid_block_res_sample = self.controlnet(
                                                   ^^^^^^^^^^^^^^^^
  File "/home/server43/miniconda3/envs/je_3dedit/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/server43/miniconda3/envs/je_3dedit/lib/python3.11/site-packages/diffusers/models/controlnet.py", line 804, in forward
    sample, res_samples = downsample_block(
                          ^^^^^^^^^^^^^^^^^
  File "/home/server43/miniconda3/envs/je_3dedit/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/server43/miniconda3/envs/je_3dedit/lib/python3.11/site-packages/diffusers/models/unets/unet_2d_blocks.py", line 1199, in forward
    hidden_states = attn(
                    ^^^^^
File "/home/server43/miniconda3/envs/je_3dedit/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)         
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^         
  File "/home/server43/miniconda3/envs/je_3dedit/lib/python3.11/site-packages/diffusers/models/transformers/transformer_2d.py", line 391, in forward
    hidden_states = block(                       
                    ^^^^^^                       
  File "/home/server43/miniconda3/envs/je_3dedit/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)         
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^         
  File "/home/server43/miniconda3/envs/je_3dedit/lib/python3.11/site-packages/diffusers/models/attention.py", line 329, in forward
    attn_output = self.attn1(                    
                  ^^^^^^^^^^^                    
  File "/home/server43/miniconda3/envs/je_3dedit/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)         
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^         
  File "/home/server43/miniconda3/envs/je_3dedit/lib/python3.11/site-packages/diffusers/models/attention_processor.py", line 512, in forward
    return self.processor(                       
           ^^^^^^^^^^^^^^^                       
  File "/home/server43/jungeun_workspace/3D_Edit/gaussctrl/gaussctrl/utils.py", line 90, in __call__
    attention_probs = attn.get_attention_scores(query, key_self, attention_mask)                  
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                  
  File "/home/server43/miniconda3/envs/je_3dedit/lib/python3.11/site-packages/diffusers/models/attention_processor.py", line 580, in get_attention_scores
    baddbmm_input = torch.empty(                 
                    ^^^^^^^^^^^^                 
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 27.15 GiB (GPU 0; 23.69 GiB total capacity; 8.79 GiB already allocated; 13.99 GiB free; 9.36 GiB reserved in total by PyTorch) If 
reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
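For reference, the allocator setting mentioned at the end of the error message can be passed as an environment variable in front of the same command. This is only a sketch (the 128 MB split size is an arbitrary example, not a value from the repository), and it mainly mitigates fragmentation rather than a single allocation larger than the remaining memory:

PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128 CUDA_VISIBLE_DEVICES=0 ns-train gaussctrl --load-checkpoint unedited_models/bear/splatfacto/2024-09-06_145346/nerfstudio_models/step-000029999.ckpt --experiment-name bear --output-dir outputs --pipeline.datamanager.data ../dataset/bear --pipeline.prompt "a photo of a polar bear in the forest" --pipeline.guidance_scale 5 --pipeline.chunk_size 3 --pipeline.langsam_obj 'bear' --viewer.quit-on-train-completion True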
jingwu2121 commented 1 month ago

Hi there, I'm not sure whether something else was loaded on the GPU at the same time, causing this error; it's worth checking. One way to work around it is to reduce the chunk_size. Try setting it to 1; it doesn't affect the results. An example command is shown below.
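For example, the bear command from the first post with only the chunk size lowered (all other flags unchanged):

CUDA_VISIBLE_DEVICES=0 ns-train gaussctrl --load-checkpoint unedited_models/bear/splatfacto/2024-09-06_145346/nerfstudio_models/step-000029999.ckpt --experiment-name bear --output-dir outputs --pipeline.datamanager.data ../dataset/bear --pipeline.prompt "a photo of a polar bear in the forest" --pipeline.guidance_scale 5 --pipeline.chunk_size 1 --pipeline.langsam_obj 'bear' --viewer.quit-on-train-completion True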

jungeun122333 commented 1 month ago

I'm encountering the same error at the same location when I try with chunk_size=1.

The only notable difference seems to be the amount of memory required:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 19.39 GiB (GPU 0; 23.69 GiB total capacity; 8.49 GiB already allocated; 14.59 GiB free; 8.76 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Have you had a chance to clone the code from GitHub and run it yourself? It seems unusual to allocate over 19 GB at once, and I'm curious whether you've seen this as well.

jingwu2121 commented 1 month ago

Hi, I ran the code myself before and it worked fine. Could you please try adding this argument: --pipeline.diffusion_ckpt "jinggogogo/gaussctrl-sd15", and also reducing the number of reference views to 2 with --pipeline.ref_view_num 2? An example command is shown below.
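For example, the same bear command with both suggestions applied on top of chunk_size 1 (checkpoint and data paths as in the original post):

CUDA_VISIBLE_DEVICES=0 ns-train gaussctrl --load-checkpoint unedited_models/bear/splatfacto/2024-09-06_145346/nerfstudio_models/step-000029999.ckpt --experiment-name bear --output-dir outputs --pipeline.datamanager.data ../dataset/bear --pipeline.prompt "a photo of a polar bear in the forest" --pipeline.guidance_scale 5 --pipeline.chunk_size 1 --pipeline.langsam_obj 'bear' --pipeline.diffusion_ckpt "jinggogogo/gaussctrl-sd15" --pipeline.ref_view_num 2 --viewer.quit-on-train-completion True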

jungeun122333 commented 1 month ago

Sorry, it was my fault. I just realized that I used non-preprocessed data. Thank you for your kind answer.