thucz opened this issue 1 year ago
Are you facing the same issue for fine tuning?
> Are you facing the same issue for fine tuning?
Yes. The same issue appears.
I ran the fine-tuning as well, but it seems it's too large even for 4 A6000s. Any help?
I run into the same issue when training the second stage (128x128 image resolution) on 1x A100 80 GB GPU. I set the batch size to 1. The first iteration works (including backward + opt.step). Afterwards, I get OOM in the second iteration, somewhere in the model forward pass. Adding `torch.cuda.empty_cache()` after every iteration did not solve the issue.
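Roughly where I added it (a minimal sketch, not the repo's actual training loop; `model`, `optimizer`, and `dataloader` are placeholders), together with peak-memory logging to check whether usage really grows between iterations:

```python
# Minimal sketch of the loop-level change (model, optimizer, and dataloader
# are placeholders, not the DFM code): log peak memory per step and clear the
# allocator cache afterwards.
import torch

def train_loop(model, optimizer, dataloader, device="cuda"):
    for step, batch in enumerate(dataloader):
        optimizer.zero_grad(set_to_none=True)
        loss = model(batch)                # stands in for the real forward pass
        loss.backward()
        optimizer.step()

        peak_gb = torch.cuda.max_memory_allocated(device) / 1024 ** 3
        print(f"step {step}: peak allocated {peak_gb:.1f} GiB")
        torch.cuda.reset_peak_memory_stats(device)
        torch.cuda.empty_cache()           # the call that did not help here
```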
I tried to change the model hparams `num_pixels=int(18**2)`, `n_coarse=32`, `n_fine=32`. However, it still fails with OOM.
I tried to change `image_size=92`. However, it still fails with OOM.
Even with `image_size=64`, it already shows 73 GB of memory in use. That is probably why it did not fit into the ~40-48 GB A100/A6000 GPUs, as reported by @thucz and @1ssb.
@tianweiy do you have any idea what could be going on?
Could you share the full command? The `ngpus` argument will also adapt the batch size.
python experiment_scripts/train_3D_diffusion.py \
use_abs_pose=true \
dataset=CO3D \
lr=2e-5 \
ngpus=1 \
setting_name=co3d_3ctxt \
feats_cond=True \
dataset.lpips_loss_weight=0.2 \
name=co3d_128res \
scale_aug_ratio=0.2 \
image_size=128 \
checkpoint_path=...
@ayushtewari
And in the logs it prints this, so I guess the batch size is correctly set to 1:
using settings {'n_coarse': 64, 'n_fine': 64, 'n_coarse_coarse': 64, 'n_coarse_fine': 0, 'num_pixels': 576, 'batch_size': 1, 'num_context': 3, 'num_target': 2, 'n_feats_out': 64, 'use_viewdir': False, 'sampling': 'patch', 'self_condition': False, 'cnn_refine': False, 'lindisp': False}
Could you try setting this to 32? https://github.com/ayushtewari/DFM/blob/50c6e20db124147f37ba44b256000de6ce524270/experiment_scripts/train_3D_diffusion.py#L117
I set all of these values to 32 (`n_*=32`). Now it got 6 iterations in, but then failed with OOM in the model forward pass again:
{'n_coarse': 32, 'n_fine': 32, 'n_coarse_coarse': 32, 'n_coarse_fine': 0, 'num_pixels': 576, 'batch_size': 1, 'num_context': 3, 'num_target': 2, 'n_feats_out': 64, 'use_viewdir': False, 'sampling': 'patch', 'self_condition': False, 'cnn_refine': False, 'lindisp': False}
loss: 0.4962: 0%| | 6/100000 [01:51<516:21:37, 18.59s/it]
...OOM Error...
How about this one?
return {
    "n_coarse": 64,
    "n_fine": 64,
    "n_coarse_coarse": 32,
    "n_coarse_fine": 0,
    "num_pixels": int(18 ** 2),
    "batch_size": 1 * ngpus,
    "num_context": 2,
    "num_target": 2,
    "n_feats_out": 64,
    "use_viewdir": False,
    "sampling": "patch",
    # "lindisp": True,
}
So the few settings that matter most here are `n_coarse_coarse`, `num_pixels`, and `num_context`.
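As a rough back-of-envelope (assuming, as an approximation, that activation memory grows roughly linearly with the number of ray samples per optimizer step), this suggested config cuts the sample count by about 4-5x compared to the settings logged above:

```python
# Back-of-envelope only: count ray samples per step for a settings dict.
# The assumption that memory scales roughly with this count is an
# approximation, not something taken from the repo.
def ray_samples_per_step(s):
    views = s["num_context"] + s["num_target"]
    samples_per_ray = s["n_coarse_coarse"] + s["n_coarse_fine"]
    return s["batch_size"] * views * s["num_pixels"] * samples_per_ray

logged = {"batch_size": 1, "num_context": 3, "num_target": 2,
          "num_pixels": 576, "n_coarse_coarse": 64, "n_coarse_fine": 0}
suggested = {"batch_size": 1, "num_context": 2, "num_target": 2,
             "num_pixels": int(18 ** 2), "n_coarse_coarse": 32, "n_coarse_fine": 0}

print(ray_samples_per_step(logged))     # 184320
print(ray_samples_per_step(suggested))  # 41472
```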
Yes, this config seems to work, thanks! Btw, this is still with `batch-size=1` (instead of 3). Let's see if this still converges to equally good results :)
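If `batch-size=1` ends up hurting convergence, one generic workaround I might try (not something I know the training script supports out of the box) is gradient accumulation to recover an effective batch size of 3:

```python
# Hypothetical gradient-accumulation sketch (not part of the repo's script);
# model, optimizer, and dataloader are placeholders for the real objects.
def train_with_accumulation(model, optimizer, dataloader, accum_steps=3):
    optimizer.zero_grad(set_to_none=True)
    for step, batch in enumerate(dataloader):
        loss = model(batch) / accum_steps  # scale so accumulated grads match batch_size=3
        loss.backward()
        if (step + 1) % accum_steps == 0:
            optimizer.step()               # one update per 3 micro-batches
            optimizer.zero_grad(set_to_none=True)
```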
I tried this on new data (a smaller dataset) on 4 A6000s, but the depth maps are not consistent for that other fine-tuning dataset, and it completely loses semantic consistency as well. Hi @lukasHoel, please post here how it went for you on a data distribution beyond RealEstate. Important to note: I fine-tuned, I did not retrain (I do not have the compute for that).
Hi! DFM is great work! I'm trying it for my research.
But when I ran the following command on 4 A100 (40 GB) GPUs, I got an out-of-GPU-memory error. I have already revised `batch_size` to `1 * ngpus` in the `get_train_settings` function of train_3D_diffusion.py, but the error still appears. Do you know how to fix it?
The log is: