train, 8 RTX3090, batch size=1, but CUDA out of memory

m1nt07 commented 9 months ago

Thanks for your great work! I want to follow your work to reproduce the result in paper, my hardware environment is 7 RTX 3090(24G Memory)(1 card is being used, so I use 7 cards), but I met the error "CUDA out of memory" even though I set batch size=1 in configs/config_charinpaint.yaml

This is my train command:

python train.py --base configs/config_charinpaint.yaml --stage fit --name reproduce --project diffste_repro --base_logdir logs

This is my configs/config_charinpaint.yaml:

# train on combination of all datas
data:
  target: "src.trainers.WrappedDataModule"
  batch_size: 1
  scene_data: data/ocr-dataset/
  synth_data: data/ocr-dataset/SynthText/synth/
  train:
    size: 256
    max_num: 2000000 # diffly choose this number
    augconf:
      synth:
        center: 0.1
        pad: false
      scene:
        expand_mask:
          center_mask: 0.6
          additional_mask: 0.4
        crop:
          mask_image_ratio: 15
        rotate:
          cat_prob: [1, 0, 0]
          angle_list: [-15, -30, -45, -60, -90, 15, 30, 45, 60, 90]
          rotate_range: 90

    dataconfs:
      ArT:
        type: scene
        label_path: ${data.scene_data}/ArT/train_labels.json
        image_dir: ${data.scene_data}/ArT/train_images/

      # COCO:
      #   type: scene
      #   label_path: ${data.scene_data}/COCO/cocotext.v2.json
      #   image_dir: ${data.scene_data}/COCO/train2014/

      # TextOCR:
      #   type: scene
      #   label_path: ${data.scene_data}/TextOCR/TextOCR_0.1_train.json
      #   image_dir: ${data.scene_data}/TextOCR/train_images/

      # Synthtiger:
      #   type: synth
      #   label_path: ${data.synth_data}/train_data.csv
      #   image_dir: ${data.synth_data}/
      #   style_mode: same-same
      #   use_textbbox: false
      #   style_dropout: [0.5, 0.5]
      #   rand_mask_text: true

  validation:
    size: 256
    # max_num: 6400 # diffly choose this number
    augconf:
      synth:
        center: 1.
        pad: false
      scene:
        expand_mask:
          center_mask: 0.
          additional_mask: 0.
        crop:
          mask_image_ratio: 30
        rotate:
          cat_prob: [1, 0, 0]
          angle_list: [-15, -30, -45, -60, -90, 15, 30, 45, 60, 90]
          rotate_range: 90

    dataconfs:
      # ArT:
      #   type: scene
      #   label_path: ${data.scene_data}/ArT/val_split.json
      #   image_dir: ${data.scene_data}/ArT/train_images/

      # COCO:
      #   type: scene
      #   label_path: ${data.scene_data}/COCO/cocotext.v2.val.json
      #   image_dir: ${data.scene_data}/COCO/train2014/

      TextOCR:
        type: scene
        label_path: ${data.scene_data}/TextOCR/TextOCR_0.1_val.json
        image_dir: ${data.scene_data}/TextOCR/train_images/

model:
  source: raw
  target: "src.trainers.CharInpaintModelWrapper"
  pretrained_model_path: runwayml/stable-diffusion-inpainting
  loss_type: MaskMSELoss
  loss_alpha: 5
  base_learning_rate: 5.0e-5
  precision: 16
  weight_decay: 0.0
  adam_epsilon: 1.0e-8
  freeze_char_embedder: false
  optimize_vae: false
  vae:

  tokenizer:
    model_max_length: 20
  char_tokenizer:
    pretrained_path: checkpoints/chartokenizer
    pad_token: " "
    unk_token: " "
    model_max_length: 20
  char_embedder:
    vocab_size: 95 # by default
    embedding_dim: 32
    max_length: 20
    padding_idx: 0
    attention_head_dim: 2
  unet:
    attention_head_dim: { "text": 8, "char": 2 }
    cross_attention_dim: { "text": 768, "char": 32 }
  noise_scheduler: diffusers.DDIMScheduler

lightning:
  logger:
  callbacks:
    checkpoint_callback:
      params:
        save_top_k: -1
    image_logger:
      target: "src.trainers.CharInpaintImageLogger"
      params:
        # train_batch_frequency: 2400
        # valid_batch_frequency: 500
        train_batch_frequency: 2
        valid_batch_frequency: 2
        disable_wandb: true
        generation_kwargs:
          num_inference_steps: 30
          num_sample_per_image: 3
          guidance_scale: 7.5
          seed: 42

    # NOTE: Download pretrained ABINet model from https://github.com/FangShancheng/ABINet.git and
    #       put model checkpoints in checkpoints/abinet to use this callback
    # ocracc_logger:
    #   target: "src.trainers.OCRAccLogger"
    #   params:
    #     train_eval_conf:
    #       size: 256
    #       augconf: ${data.validation.augconf}
    #       max_num: 5
    #       dataconfs:
    #         TextOCR:
    #           type: scene
    #           label_path: ${data.scene_data}/TextOCR/TextOCR_0.1_train.json
    #           image_dir: ${data.scene_data}/TextOCR/train_images/
    #           len_counter:
    #             eachnum: 10

    #     val_eval_conf:
    #       size: 256
    #       augconf: ${data.validation.augconf}
    #       max_num: 5
    #       dataconfs:
    #         TextOCR:
    #           type: scene
    #           label_path: ${data.scene_data}/TextOCR/TextOCR_0.1_val.json
    #           image_dir: ${data.scene_data}/TextOCR/train_images/
    #           max_num: 1000
    #     base_log_dir: ${base_log_dir}/ocrlogs # will be set in code

  trainer:
    accelerator: gpu
    devices: [0, 1, 2, 3, 4, 5, 6,]
    strategy: ddp
    amp_backend: native
    log_every_n_steps: 16 # this is global step
    precision: 16
    max_epochs: 15
    check_val_every_n_epoch: 1
    accumulate_grad_batches: 8
    gradient_clip_val: 3.
    gradient_clip_algorithm: norm
    benchmark: true

This is the training log:

Sanity Checking:   0%|          | 0/2 [00:00<?, ?it/s]
Sanity Checking DataLoader 0:   0%|          | 0/2 [00:00<?, ?it/s]
Sanity Checking DataLoader 0:  50%|█████     | 1/2 [00:04<00:04,  4.00s/it]
Sanity Checking DataLoader 0: 100%|██████████| 2/2 [00:04<00:00,  2.06s/it]

Training: 0it [00:00, ?it/s]***** Start training *****
  Num examples = 33917
  Num Epochs = 15
  Total GPU device number: 7
  Gradient Accumulation steps = 8
  Instant batch size: 392
  Total train batch size (w. parallel, distributed & accumulation) = 56
  Total optimization steps = 9090

Training:   0%|          | 0/14290 [00:00<?, ?it/s]
Epoch 0:   0%|          | 0/14290 [00:00<?, ?it/s] 
Epoch 0:   0%|          | 1/14290 [00:02<11:16:10,  2.84s/it]
Epoch 0:   0%|          | 1/14290 [00:02<11:16:21,  2.84s/it, loss=0.408, v_num=0, train_traj/loss=0.456]Log images at: train/0

Epoch 0:   0%|          | 2/14290 [00:10<20:23:39,  5.14s/it, loss=0.408, v_num=0, train_traj/loss=0.456]
Epoch 0:   0%|          | 2/14290 [00:10<20:27:57,  5.16s/it, loss=0.28, v_num=0, train_traj/loss=0.169] 
Epoch 0:   0%|          | 3/14290 [00:11<14:55:59,  3.76s/it, loss=0.28, v_num=0, train_traj/loss=0.169]
Epoch 0:   0%|          | 3/14290 [00:11<14:56:02,  3.76s/it, loss=0.201, v_num=0, train_traj/loss=0.0412]Log images at: train/0

Epoch 0:   0%|          | 4/14290 [00:15<15:32:40,  3.92s/it, loss=0.201, v_num=0, train_traj/loss=0.0412]
Epoch 0:   0%|          | 4/14290 [00:15<15:34:49,  3.93s/it, loss=0.157, v_num=0, train_traj/loss=0.0283]
Epoch 0:   0%|          | 5/14290 [00:16<12:42:16,  3.20s/it, loss=0.157, v_num=0, train_traj/loss=0.0283]
Epoch 0:   0%|          | 5/14290 [00:16<12:42:18,  3.20s/it, loss=0.13, v_num=0, train_traj/loss=0.0235] Log images at: train/0

Epoch 0:   0%|          | 6/14290 [00:20<13:32:52,  3.41s/it, loss=0.13, v_num=0, train_traj/loss=0.0235]
Epoch 0:   0%|          | 6/14290 [00:20<13:34:06,  3.42s/it, loss=0.14, v_num=0, train_traj/loss=0.301] 
Epoch 0:   0%|          | 7/14290 [00:20<11:49:38,  2.98s/it, loss=0.14, v_num=0, train_traj/loss=0.301]
Epoch 0:   0%|          | 7/14290 [00:20<11:49:40,  2.98s/it, loss=0.202, v_num=0, train_traj/loss=0.292]
Training failed due to CUDA out of memory. Tried to allocate 192.00 MiB (GPU 4; 23.69 GiB total capacity; 21.78 GiB already allocated; 74.94 MiB free; 22.17 GiB reserved in total by PyTorch) 
If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

I found that error occurs when iters achieve 7, it‘s related to "Gradient Accumulation steps = 8", maybe backward consumes too much memory? Have you ever met this problem? Or what should I do to solve this problem? Thanks a lot.

Question406 commented 9 months ago

Hi, i tested the training on a A6000 GPU with CUDA 12.1 and torch 2.1, and it takes 23.61GB memory. I guess changing CUDA version is worth a try to reduce the memory usage.

m1nt07 commented 9 months ago

Thank you, I'll try it.

m1nt07 commented 7 months ago

Unfortunately, it didn't work when only changing CUDA version. I tried to use checkpoint method in pytorch to reduce vram, some codes were changed as below:

self.enable_gradient_checkpointing() was added in src/model/unet_2d_multicondition.py

Line 421 was modified, src/model/unet_2d_blocks.py

        if self.training and self.gradient_checkpointing:

            # def create_custom_forward(module, return_dict=None):
            #     def custom_forward(*inputs):
            #         if return_dict is not None:
            #             return module(*inputs, return_dict=return_dict)
            #         else:
            #             return module(*inputs)
            def create_custom_forward(module):
                def custom_forward(*inputs):
                    return module(*inputs)

                return custom_forward

            hidden_states = torch.utils.checkpoint.checkpoint(
                create_custom_forward(resnet), hidden_states, temb)
            cond_hidden_states = {
                k: torch.utils.checkpoint.checkpoint(
                    create_custom_forward(
                        # attns[k], return_dict=False), hidden_states,
                        attns[k]), hidden_states,
                    encoder_hidden_states[k]
                )[0] for k in attns.keys() if encoder_hidden_states[k] is not None
            }
            hidden_states = torch.mean(torch.stack(
                list(cond_hidden_states.values())), dim=0)

Finally, the repo can be trained on RTX 3090.

However, the reproduced results do not look satisfactory. Referring to the issue checkpoints/new_tunedvae, I tried three methods:

'Initialize vae from pretrained...'. Then, train the UNet.
First, save the new_tunedvae from diffste.ckpt. Second, 'Initialize vae from finetuned...', and modify the configs/config_charinpaint.yaml as below. Then, train the UNet.
```
vae:
normalizer: 0.21966682713
pretrained_model_path: checkpoints/new_tunedvae
```
First, save the new_tunedvae from diffste.ckpt. Second, 'Initialize vae from finetuned...', and modify the configs/config_charinpaint.yaml as below. Then, train the UNet and VAE.
```
optimize_vae: true
vae:
normalizer: 0.21966682713
pretrained_model_path: checkpoints/new_tunedvae
```

But none of these results look good. One generated sample is shown below (seed=12897398647) 0-wizards-grid_finetune_vae