I'm trying to train the model on a custom dataset using 4 A6000 GPUs (49GB each), but each GPU uses about 27GB when training with a batch size of 1.
Here are my config file and GPU status:
```yaml
model:
  base_learning_rate: 1.0e-04
  target: ldm.models.diffusion.ddpm.LatentDiffusion
  params:
    linear_start: 0.00085
    linear_end: 0.0120
    num_timesteps_cond: 1
    log_every_t: 200
    timesteps: 1000
    first_stage_key: "image_target"
    cond_stage_key: "image_cond"
    image_size: 32
    channels: 4
    cond_stage_trainable: false # Note: different from the one we trained before
    conditioning_key: hybrid
    monitor: val/loss_simple_ema
    scale_factor: 0.18215

data:
  target: ldm.data.simple.ObjaverseDataModuleFromConfig
  params:
    root_dir: my_path
    batch_size: 1
    num_workers: 8
    total_view: 4
    train:
      validation: False
      image_transforms:
        size: 256
    validation:
      validation: True
      image_transforms:
        size: 256

lightning:
  find_unused_parameters: false
  metrics_over_trainsteps_checkpoint: True
  modelcheckpoint:
    params:
      every_n_train_steps: 5000
  callbacks:
    image_logger:
      target: main.ImageLogger
      params:
        batch_frequency: 500
        max_images: 32
        increase_log_steps: False
        log_first_step: True
        log_images_kwargs:
          use_ema_scope: False
          inpaint: False
          plot_progressive_rows: False
          plot_diffusion_rows: False
          N: 32
          unconditional_guidance_scale: 3.0
          unconditional_guidance_label: [""]

  trainer:
    benchmark: True
    val_check_interval: 5000000 # really sorry
    num_sanity_val_steps: 0
    accumulate_grad_batches: 5
```

```
Wed Apr 24 06:47:00 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.154.05             Driver Version: 535.154.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA RTX A6000               Off | 00000000:1D:00.0 Off |                  Off |
| 48%   71C    P2             203W / 300W |  27238MiB / 49140MiB |     92%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA RTX A6000               Off | 00000000:1E:00.0 Off |                  Off |
| 46%   70C    P2             204W / 300W |  27242MiB / 49140MiB |     93%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA RTX A6000               Off | 00000000:1F:00.0 Off |                  Off |
| 49%   73C    P2             202W / 300W |  27242MiB / 49140MiB |     94%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA RTX A6000               Off | 00000000:20:00.0 Off |                  Off |
| 47%   70C    P2             194W / 300W |  27222MiB / 49140MiB |     94%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
```

Is it normal for batch size 1 to consume this much GPU memory?
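
To put the numbers above in proportion, here is a minimal sketch in plain Python that computes the fraction of memory in use on each GPU and the effective batch size per optimizer step. The memory readings and config values are copied from this post. The helper name `memory_fraction` is made up for illustration, and the effective batch size formula assumes data-parallel training with one process per GPU, which I have not verified for this setup.

```python
# Values copied from the nvidia-smi output above (MiB used per GPU, MiB total).
used_mib = [27238, 27242, 27242, 27222]
total_mib = 49140

# Settings copied from the config above.
batch_size = 1               # data.params.batch_size
accumulate_grad_batches = 5  # lightning.trainer.accumulate_grad_batches
num_gpus = 4


def memory_fraction(used: int, total: int) -> float:
    """Return the fraction of a GPU's memory that is in use (hypothetical helper)."""
    return used / total


fractions = [memory_fraction(u, total_mib) for u in used_mib]

# Assumption: one data-parallel process per GPU, so each optimizer step
# aggregates batch_size * num_gpus * accumulate_grad_batches samples.
effective_batch = batch_size * num_gpus * accumulate_grad_batches

for gpu, (used, frac) in enumerate(zip(used_mib, fractions)):
    print(f"GPU {gpu}: {used} MiB / {total_mib} MiB ({frac:.1%})")
print(f"Effective batch size per optimizer step: {effective_batch}")
```

If that assumption holds, the 27GB figure corresponds to a single sample per GPU per forward pass; gradient accumulation only changes how many samples contribute to each optimizer step.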