ayushtewari / DFM

Implementation of "Diffusion with Forward Models: Solving Stochastic Inverse Problems Without Direct Supervision"
https://diffusion-with-forward-models.github.io/

Out of GPU memory when running train_3D_diffusion.py #4

Open thucz opened 8 months ago

thucz commented 8 months ago

Hi! DFM is great work! I'm trying it for my research.

But when I ran the following command on 4 A100 (40 GB) GPUs, I got an out-of-GPU-memory error. I have already changed "batch_size" to 1 * ngpus in the get_train_settings function of train_3D_diffusion.py, but the error still appears. Do you know how to fix it?

ngpus=4

torchrun  --nnodes 1 --nproc_per_node $ngpus experiment_scripts/train_3D_diffusion.py dataset=realestate setting_name=re name=re10k mode=cond feats_cond=true wandb=local ngpus=$ngpus use_guidance=true image_size=64

The log is:

......
  File "/group/30042/ozhengchen/ft_local/anaconda3/envs/dfm/lib/python3.9/site-packages/hydra/_internal/utils.py", line 223, in run_and_report                          [37/1785]
    raise ex
  File "/group/30042/ozhengchen/ft_local/anaconda3/envs/dfm/lib/python3.9/site-packages/hydra/_internal/utils.py", line 220, in run_and_report
    return func()
  File "/group/30042/ozhengchen/ft_local/anaconda3/envs/dfm/lib/python3.9/site-packages/hydra/_internal/utils.py", line 458, in <lambda>
    lambda: hydra.run(
  File "/group/30042/ozhengchen/ft_local/anaconda3/envs/dfm/lib/python3.9/site-packages/hydra/_internal/hydra.py", line 132, in run
    _ = ret.return_value
  File "/group/30042/ozhengchen/ft_local/anaconda3/envs/dfm/lib/python3.9/site-packages/hydra/core/utils.py", line 260, in return_value
    raise self._return_value
  File "/group/30042/ozhengchen/ft_local/anaconda3/envs/dfm/lib/python3.9/site-packages/hydra/core/utils.py", line 186, in run_job
    ret.return_value = task_function(task_cfg)
  File "/group/30042/ozhengchen/pano_aigc/DFM/experiment_scripts/train_3D_diffusion.py", line 109, in train
    trainer.train()
  File "/group/30042/ozhengchen/pano_aigc/DFM/denoising_diffusion_pytorch/denoising_diffusion_pytorch.py", line 1218, in train
    losses, misc = self.model(data, render_video=render_video)
  File "/group/30042/ozhengchen/ft_local/anaconda3/envs/dfm/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/group/30042/ozhengchen/ft_local/anaconda3/envs/dfm/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1156, in forward
    output = self._run_ddp_forward(*inputs, **kwargs)
  File "/group/30042/ozhengchen/ft_local/anaconda3/envs/dfm/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1110, in _run_ddp_forward
    return module_to_run(*inputs[0], **kwargs[0])  # type: ignore[index]
  File "/group/30042/ozhengchen/ft_local/anaconda3/envs/dfm/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/group/30042/ozhengchen/pano_aigc/DFM/denoising_diffusion_pytorch/denoising_diffusion_pytorch.py", line 905, in forward
    return self.p_losses(inp, t, *args, **kwargs)
  File "/group/30042/ozhengchen/pano_aigc/DFM/denoising_diffusion_pytorch/denoising_diffusion_pytorch.py", line 722, in p_losses
    model_out, depth, misc = self.model(
  File "/group/30042/ozhengchen/ft_local/anaconda3/envs/dfm/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/group/30042/ozhengchen/pano_aigc/DFM/PixelNeRF/pixelnerf_model_cond.py", line 675, in forward
    rgbfeats, depth, misc = self.renderer(trgt_c2w, intrinsics, new_xy, rf)
  File "/group/30042/ozhengchen/ft_local/anaconda3/envs/dfm/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/group/30042/ozhengchen/pano_aigc/DFM/PixelNeRF/renderer.py", line 345, in forward
    sigma_all, feats_all, _ = radiance_field(pts_all, viewdirs_all, fine=True)
  File "/group/30042/ozhengchen/pano_aigc/DFM/PixelNeRF/pixelnerf_model_cond.py", line 722, in <lambda>
    return lambda x, v, fine: self.pixelNeRF_joint(
  File "/group/30042/ozhengchen/ft_local/anaconda3/envs/dfm/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/group/30042/ozhengchen/pano_aigc/DFM/PixelNeRF/pixelnerf_helpers.py", line 277, in forward
    mlp_output = self.mlp_fine(mlp_in, ns=num_context, time_emb=t)
  File "/group/30042/ozhengchen/ft_local/anaconda3/envs/dfm/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/group/30042/ozhengchen/pano_aigc/DFM/PixelNeRF/resnetfc_time_embed.py", line 246, in forward
    x = self.blocks[blkid](x, time_emb=time_emb)
  File "/group/30042/ozhengchen/ft_local/anaconda3/envs/dfm/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/group/30042/ozhengchen/pano_aigc/DFM/PixelNeRF/resnetfc_time_embed.py", line 94, in forward
    return x_s + dx
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 288.00 MiB (GPU 0; 39.59 GiB total capacity; 36.42 GiB already allocated; 191.19 MiB free; 36.74 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
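
A side note on the allocator hint at the end of that message: reserved memory (36.74 GiB) is only slightly above allocated memory (36.42 GiB) here, so fragmentation is probably not the main culprit, but the suggested setting is cheap to try. A minimal sketch, assuming the run is launched via torchrun as above (the 128 MiB split size is an arbitrary example, not a recommended value):

import os

# The allocator option mentioned in the error message. It has to be set before
# CUDA is first initialized, e.g. at the very top of train_3D_diffusion.py, or
# equivalently exported in the shell before calling torchrun:
#   export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:128")
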
1ssb commented 8 months ago

Are you facing the same issue for fine tuning?

thucz commented 8 months ago

Are you facing the same issue for fine tuning?

Yes. The same issue appears.

1ssb commented 8 months ago

I ran the fine-tuning as well, but it seems it's too large even for 4 A6000s. Any help?

lukasHoel commented 7 months ago

I run into the same issue when training the second stage (128x128 image resolution) on a single A100 80 GB GPU. I set the batch size to 1. The first iteration works (including backward + opt.step); afterwards, I get OOM in the second iteration somewhere in the model forward pass.

@tianweiy do you have any idea what could be going on?
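
(A general debugging sketch, not part of the DFM code: logging the CUDA allocator counters once per iteration can help localize this. If the per-step peak keeps climbing across iterations, something from the previous step, e.g. a tensor kept around for logging, is still holding memory when the next forward pass starts. The helper below is hypothetical.)

import torch

def log_cuda_memory(step, device=0):
    # Print allocator stats so growth between steps is visible in the logs.
    allocated = torch.cuda.memory_allocated(device) / 2**30
    reserved = torch.cuda.memory_reserved(device) / 2**30
    peak = torch.cuda.max_memory_allocated(device) / 2**30
    print(f"step {step}: allocated {allocated:.2f} GiB, reserved {reserved:.2f} GiB, peak {peak:.2f} GiB")
    # Reset the peak counter so the next report is per-step rather than global.
    torch.cuda.reset_peak_memory_stats(device)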

tianweiy commented 7 months ago

Could you share the full command? The ngpus argument also adapts the batch size.

lukasHoel commented 7 months ago
python experiment_scripts/train_3D_diffusion.py \
use_abs_pose=true \
dataset=CO3D \
lr=2e-5 \
ngpus=1 \
setting_name=co3d_3ctxt \
feats_cond=True \
dataset.lpips_loss_weight=0.2 \
name=co3d_128res \
scale_aug_ratio=0.2 \
image_size=128 \
checkpoint_path=... \
tianweiy commented 7 months ago

@ayushtewari

lukasHoel commented 7 months ago

And the logs print this, so I guess the batch size is correctly set to 1:

using settings {'n_coarse': 64, 'n_fine': 64, 'n_coarse_coarse': 64, 'n_coarse_fine': 0, 'num_pixels': 576,
 'batch_size': 1, 'num_context': 3, 'num_target': 2, 'n_feats_out': 64, 'use_viewdir': False,
 'sampling': 'patch', 'self_condition': False, 'cnn_refine': False, 'lindisp': False}
tianweiy commented 7 months ago

could you try setting this https://github.com/ayushtewari/DFM/blob/50c6e20db124147f37ba44b256000de6ce524270/experiment_scripts/train_3D_diffusion.py#L117 to 32?
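
(For anyone following along, a hypothetical sketch of that edit; the exact code at the linked line is not reproduced here, and the key names are taken from the settings printed in the logs above.)

# Hypothetical: drop the per-ray sample counts from 64 to 32 in get_train_settings
reduced_sample_counts = {
    "n_coarse": 32,         # was 64
    "n_fine": 32,           # was 64
    "n_coarse_coarse": 32,  # was 64
}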

lukasHoel commented 7 months ago

I set all of these values to 32 (n_*=32). Now it got 6 iterations in, but then failed again with OOM in the model forward pass.

{'n_coarse': 32, 'n_fine': 32, 'n_coarse_coarse': 32, 'n_coarse_fine': 0, 'num_pixels': 576,
 'batch_size': 1, 'num_context': 3, 'num_target': 2, 'n_feats_out': 64, 'use_viewdir': False,
 'sampling': 'patch', 'self_condition': False, 'cnn_refine': False, 'lindisp': False}

loss: 0.4962:   0%|                                                                                 | 6/100000 [01:51<516:21:37, 18.59s/it]

...OOM Error...
tianweiy commented 7 months ago

how about this one

    return {
        "n_coarse": 64,
        "n_fine": 64,
        "n_coarse_coarse": 32,
        "n_coarse_fine": 0,
        "num_pixels": int(18 ** 2),
        "batch_size": 1 * ngpus,
        "num_context": 2,
        "num_target": 2,
        "n_feats_out": 64,
        "use_viewdir": False,
        "sampling": "patch",
        # "lindisp": True,
    }
tianweiy commented 7 months ago

So the settings that matter the most for memory are "n_coarse_coarse", "num_pixels", and "num_context".
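
(A rough back-of-the-envelope sketch of why those three settings dominate; this is an assumed scaling model, not DFM's actual memory accounting. If the renderer's activation memory grows roughly with the number of (pixel, sample, context-view) queries per forward pass, the failing 128-resolution config above issues roughly 2.7x more queries than the working one:)

# Assumed proxy for renderer activation memory: pixels * samples per ray * context views.
failing = {"num_pixels": 576, "n_coarse_coarse": 32, "num_context": 3}
working = {"num_pixels": 18 ** 2, "n_coarse_coarse": 32, "num_context": 2}

def queries(cfg, batch_size=1, num_target=2):
    return batch_size * num_target * cfg["num_pixels"] * cfg["n_coarse_coarse"] * cfg["num_context"]

print(queries(failing) / queries(working))  # ~2.67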

lukasHoel commented 7 months ago

Yes, this config seems to work, thanks! Btw, this is still with batch_size=1 (instead of 3). Let's see if it still converges to equally good results :)

1ssb commented 7 months ago

I tried this on a new, smaller dataset on 4 A6000s, but when fine-tuning on that other data distribution the depth maps are not consistent; it completely loses semantic consistency as well. Hi @lukasHoel, please post here how it went for you on a data distribution beyond RealEstate. Important to note: I fine-tuned, I did not retrain (I do not have the compute for that).
