UCSB-NLP-Chang / DiffSTE


train, 8 RTX3090, batch size=1, but CUDA out of memory #17

Closed m1nt07 closed 6 months ago

m1nt07 commented 9 months ago

Thanks for your great work! I want to reproduce the results in the paper. My hardware is 8 RTX 3090s (24 GB memory each); one card is already in use, so I train on the other 7. However, I get a "CUDA out of memory" error even though I set batch size=1 in configs/config_charinpaint.yaml.

This is my train command:

python train.py --base configs/config_charinpaint.yaml --stage fit --name reproduce --project diffste_repro --base_logdir logs

This is my configs/config_charinpaint.yaml:

# train on a combination of all data
data:
  target: "src.trainers.WrappedDataModule"
  batch_size: 1
  scene_data: data/ocr-dataset/
  synth_data: data/ocr-dataset/SynthText/synth/
  train:
    size: 256
    max_num: 2000000 # diffly choose this number
    augconf:
      synth:
        center: 0.1
        pad: false
      scene:
        expand_mask:
          center_mask: 0.6
          additional_mask: 0.4
        crop:
          mask_image_ratio: 15
        rotate:
          cat_prob: [1, 0, 0]
          angle_list: [-15, -30, -45, -60, -90, 15, 30, 45, 60, 90]
          rotate_range: 90

    dataconfs:
      ArT:
        type: scene
        label_path: ${data.scene_data}/ArT/train_labels.json
        image_dir: ${data.scene_data}/ArT/train_images/

      # COCO:
      #   type: scene
      #   label_path: ${data.scene_data}/COCO/cocotext.v2.json
      #   image_dir: ${data.scene_data}/COCO/train2014/

      # TextOCR:
      #   type: scene
      #   label_path: ${data.scene_data}/TextOCR/TextOCR_0.1_train.json
      #   image_dir: ${data.scene_data}/TextOCR/train_images/

      # Synthtiger:
      #   type: synth
      #   label_path: ${data.synth_data}/train_data.csv
      #   image_dir: ${data.synth_data}/
      #   style_mode: same-same
      #   use_textbbox: false
      #   style_dropout: [0.5, 0.5]
      #   rand_mask_text: true

  validation:
    size: 256
    # max_num: 6400 # diffly choose this number
    augconf:
      synth:
        center: 1.
        pad: false
      scene:
        expand_mask:
          center_mask: 0.
          additional_mask: 0.
        crop:
          mask_image_ratio: 30
        rotate:
          cat_prob: [1, 0, 0]
          angle_list: [-15, -30, -45, -60, -90, 15, 30, 45, 60, 90]
          rotate_range: 90

    dataconfs:
      # ArT:
      #   type: scene
      #   label_path: ${data.scene_data}/ArT/val_split.json
      #   image_dir: ${data.scene_data}/ArT/train_images/

      # COCO:
      #   type: scene
      #   label_path: ${data.scene_data}/COCO/cocotext.v2.val.json
      #   image_dir: ${data.scene_data}/COCO/train2014/

      TextOCR:
        type: scene
        label_path: ${data.scene_data}/TextOCR/TextOCR_0.1_val.json
        image_dir: ${data.scene_data}/TextOCR/train_images/

model:
  source: raw
  target: "src.trainers.CharInpaintModelWrapper"
  pretrained_model_path: runwayml/stable-diffusion-inpainting
  loss_type: MaskMSELoss
  loss_alpha: 5
  base_learning_rate: 5.0e-5
  precision: 16
  weight_decay: 0.0
  adam_epsilon: 1.0e-8
  freeze_char_embedder: false
  optimize_vae: false
  vae:

  tokenizer:
    model_max_length: 20
  char_tokenizer:
    pretrained_path: checkpoints/chartokenizer
    pad_token: " "
    unk_token: " "
    model_max_length: 20
  char_embedder:
    vocab_size: 95 # by default
    embedding_dim: 32
    max_length: 20
    padding_idx: 0
    attention_head_dim: 2
  unet:
    attention_head_dim: { "text": 8, "char": 2 }
    cross_attention_dim: { "text": 768, "char": 32 }
  noise_scheduler: diffusers.DDIMScheduler

lightning:
  logger:
  callbacks:
    checkpoint_callback:
      params:
        save_top_k: -1
    image_logger:
      target: "src.trainers.CharInpaintImageLogger"
      params:
        # train_batch_frequency: 2400
        # valid_batch_frequency: 500
        train_batch_frequency: 2
        valid_batch_frequency: 2
        disable_wandb: true
        generation_kwargs:
          num_inference_steps: 30
          num_sample_per_image: 3
          guidance_scale: 7.5
          seed: 42

    # NOTE: Download pretrained ABINet model from https://github.com/FangShancheng/ABINet.git and
    #       put model checkpoints in checkpoints/abinet to use this callback
    # ocracc_logger:
    #   target: "src.trainers.OCRAccLogger"
    #   params:
    #     train_eval_conf:
    #       size: 256
    #       augconf: ${data.validation.augconf}
    #       max_num: 5
    #       dataconfs:
    #         TextOCR:
    #           type: scene
    #           label_path: ${data.scene_data}/TextOCR/TextOCR_0.1_train.json
    #           image_dir: ${data.scene_data}/TextOCR/train_images/
    #           len_counter:
    #             eachnum: 10

    #     val_eval_conf:
    #       size: 256
    #       augconf: ${data.validation.augconf}
    #       max_num: 5
    #       dataconfs:
    #         TextOCR:
    #           type: scene
    #           label_path: ${data.scene_data}/TextOCR/TextOCR_0.1_val.json
    #           image_dir: ${data.scene_data}/TextOCR/train_images/
    #           max_num: 1000
    #     base_log_dir: ${base_log_dir}/ocrlogs # will be set in code

  trainer:
    accelerator: gpu
    devices: [0, 1, 2, 3, 4, 5, 6,]
    strategy: ddp
    amp_backend: native
    log_every_n_steps: 16 # this is global step
    precision: 16
    max_epochs: 15
    check_val_every_n_epoch: 1
    accumulate_grad_batches: 8
    gradient_clip_val: 3.
    gradient_clip_algorithm: norm
    benchmark: true
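
(With batch_size: 1, 7 devices, and accumulate_grad_batches: 8, the effective batch size is 1 × 7 × 8 = 56, which matches the "Total train batch size" reported in the log below.)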

This is the training log:

Sanity Checking:   0%|          | 0/2 [00:00<?, ?it/s]
Sanity Checking DataLoader 0:   0%|          | 0/2 [00:00<?, ?it/s]
Sanity Checking DataLoader 0:  50%|█████     | 1/2 [00:04<00:04,  4.00s/it]
Sanity Checking DataLoader 0: 100%|██████████| 2/2 [00:04<00:00,  2.06s/it]

Training: 0it [00:00, ?it/s]***** Start training *****
  Num examples = 33917
  Num Epochs = 15
  Total GPU device number: 7
  Gradient Accumulation steps = 8
  Instant batch size: 392
  Total train batch size (w. parallel, distributed & accumulation) = 56
  Total optimization steps = 9090

Training:   0%|          | 0/14290 [00:00<?, ?it/s]
Epoch 0:   0%|          | 0/14290 [00:00<?, ?it/s] 
Epoch 0:   0%|          | 1/14290 [00:02<11:16:10,  2.84s/it]
Epoch 0:   0%|          | 1/14290 [00:02<11:16:21,  2.84s/it, loss=0.408, v_num=0, train_traj/loss=0.456]Log images at: train/0

Epoch 0:   0%|          | 2/14290 [00:10<20:23:39,  5.14s/it, loss=0.408, v_num=0, train_traj/loss=0.456]
Epoch 0:   0%|          | 2/14290 [00:10<20:27:57,  5.16s/it, loss=0.28, v_num=0, train_traj/loss=0.169] 
Epoch 0:   0%|          | 3/14290 [00:11<14:55:59,  3.76s/it, loss=0.28, v_num=0, train_traj/loss=0.169]
Epoch 0:   0%|          | 3/14290 [00:11<14:56:02,  3.76s/it, loss=0.201, v_num=0, train_traj/loss=0.0412]Log images at: train/0

Epoch 0:   0%|          | 4/14290 [00:15<15:32:40,  3.92s/it, loss=0.201, v_num=0, train_traj/loss=0.0412]
Epoch 0:   0%|          | 4/14290 [00:15<15:34:49,  3.93s/it, loss=0.157, v_num=0, train_traj/loss=0.0283]
Epoch 0:   0%|          | 5/14290 [00:16<12:42:16,  3.20s/it, loss=0.157, v_num=0, train_traj/loss=0.0283]
Epoch 0:   0%|          | 5/14290 [00:16<12:42:18,  3.20s/it, loss=0.13, v_num=0, train_traj/loss=0.0235] Log images at: train/0

Epoch 0:   0%|          | 6/14290 [00:20<13:32:52,  3.41s/it, loss=0.13, v_num=0, train_traj/loss=0.0235]
Epoch 0:   0%|          | 6/14290 [00:20<13:34:06,  3.42s/it, loss=0.14, v_num=0, train_traj/loss=0.301] 
Epoch 0:   0%|          | 7/14290 [00:20<11:49:38,  2.98s/it, loss=0.14, v_num=0, train_traj/loss=0.301]
Epoch 0:   0%|          | 7/14290 [00:20<11:49:40,  2.98s/it, loss=0.202, v_num=0, train_traj/loss=0.292]
Training failed due to CUDA out of memory. Tried to allocate 192.00 MiB (GPU 4; 23.69 GiB total capacity; 21.78 GiB already allocated; 74.94 MiB free; 22.17 GiB reserved in total by PyTorch) 
If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

I found that the error occurs once the iteration count reaches 7, which seems related to "Gradient Accumulation steps = 8"; maybe the backward pass consumes too much memory? Have you ever met this problem, or what should I do to solve it? Thanks a lot.

Question406 commented 9 months ago

Hi, I tested training on an A6000 GPU with CUDA 12.1 and torch 2.1, and it takes 23.61 GB of memory. I guess changing the CUDA version is worth a try to reduce memory usage.
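
For what it's worth, two small things can make this easier to chase: applying the max_split_size_mb hint from the OOM message, and logging peak memory so different CUDA/torch combinations can be compared directly. Below is a minimal sketch only; the log_peak_memory helper is hypothetical and not part of DiffSTE.

import os

# Allocator hint from the OOM message: cap split size to reduce fragmentation.
# The value 128 is only an example; this must be set before CUDA is initialized
# (e.g. exported in the shell before launching train.py).
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:128")

import torch


def log_peak_memory(tag: str) -> None:
    # Print peak allocated/reserved memory on the current device, in GiB.
    allocated = torch.cuda.max_memory_allocated() / 2**30
    reserved = torch.cuda.max_memory_reserved() / 2**30
    print(f"[{tag}] peak allocated: {allocated:.2f} GiB, peak reserved: {reserved:.2f} GiB")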

m1nt07 commented 9 months ago

Thank you, I'll try it.

m1nt07 commented 7 months ago

Unfortunately, changing only the CUDA version didn't work. I then tried PyTorch's gradient checkpointing to reduce VRAM usage; the code was changed as shown below:

[screenshot of the modified code]
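
Roughly, the change enables gradient checkpointing on the UNet and routes each block's forward through torch.utils.checkpoint, so activations are recomputed during backward instead of being stored. Below is a minimal sketch of that pattern only; it is not the exact code from the screenshot, and CheckpointedBlock is a hypothetical stand-in for the blocks in src/model/unet_2d_blocks.py.

import torch
from torch.utils.checkpoint import checkpoint

# Change 1: in src/model/unet_2d_multicondition.py, at the end of the UNet's
# __init__ (assuming it inherits diffusers' ModelMixin, which provides the
# method), turn checkpointing on:
#
#     self.enable_gradient_checkpointing()


def create_custom_forward(module, return_dict=None):
    # torch.utils.checkpoint expects a callable taking positional args only.
    def custom_forward(*inputs):
        if return_dict is not None:
            return module(*inputs, return_dict=return_dict)
        return module(*inputs)
    return custom_forward


# Change 2: in src/model/unet_2d_blocks.py, wrap the sub-module calls with
# checkpoint(). The block below is a generic example, not DiffSTE's actual code.
class CheckpointedBlock(torch.nn.Module):
    def __init__(self, resnet: torch.nn.Module, attn: torch.nn.Module):
        super().__init__()
        self.resnet = resnet
        self.attn = attn
        # Flipped to True when checkpointing is enabled on the parent model.
        self.gradient_checkpointing = False

    def forward(self, hidden_states, temb=None, encoder_hidden_states=None):
        if self.training and self.gradient_checkpointing:
            hidden_states = checkpoint(create_custom_forward(self.resnet), hidden_states, temb)
            hidden_states = checkpoint(
                create_custom_forward(self.attn), hidden_states, encoder_hidden_states
            )
        else:
            hidden_states = self.resnet(hidden_states, temb)
            hidden_states = self.attn(hidden_states, encoder_hidden_states)
        return hidden_states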
However, the reproduced results do not look satisfactory. Referring to the issue about checkpoints/new_tunedvae, I tried three methods:

  1. 'Initialize vae from pretrained...', then train the UNet.
  2. First, save the new_tunedvae from diffste.ckpt. Second, 'Initialize vae from finetuned...' and modify configs/config_charinpaint.yaml as below. Then train the UNet.

       vae:
         normalizer: 0.21966682713
         pretrained_model_path: checkpoints/new_tunedvae

  3. First, save the new_tunedvae from diffste.ckpt. Second, 'Initialize vae from finetuned...' and modify configs/config_charinpaint.yaml as below. Then train both the UNet and the VAE.

       optimize_vae: true
       vae:
         normalizer: 0.21966682713
         pretrained_model_path: checkpoints/new_tunedvae
But none of these results look good. One generated sample is shown below (seed=12897398647):

[generated sample: 0-wizards-grid_finetune_vae]

Any advice on this problem? Thanks a lot.

kd-scki3011 commented 7 months ago

I'm having the same problem as you, and I'm also training on an RTX 3090. What exactly did you modify in that part of the code?

m1nt07 commented 7 months ago

I'm having the same problem as you, and I'm also training on an RTX 3090. What exactly did you modify in that part of the code?

As mentioned in my earlier comment, two changes were made:

  1. In src/model/unet_2d_multicondition.py, self.enable_gradient_checkpointing() was added at the end of __init__.
  2. In src/model/unet_2d_blocks.py, using the original create_custom_forward function caused an error, so it was modified to the code shown in the screenshot above. With these adjustments, training could proceed.

kd-scki3011 commented 7 months ago

[screenshot] Like this? Thank you very much; I followed the same method as you and can already train. It's just that I only have one RTX 3090, and I encountered this situation at the beginning of training. Is it normal?

[screenshot]

kd-scki3011 commented 7 months ago

(quoting the two changes described above)

https://github.com/UCSB-NLP-Chang/DiffSTE/issues/17#issuecomment-1976206836

kd-scki3011 commented 7 months ago

(quoting the two changes described above)

#17 (comment)

I put runwayml/stable-diffusion-inpainting locally: [screenshot]

kd-scki3011 commented 7 months ago

Have you guys encountered this issue: AttributeError: 'UNet2DConditionModel' object has no attribute 'encoder'

m1nt07 commented 7 months ago

@Question406 @kd-scki3011 After training on 2 A100 cards (80 GB memory) without using the checkpoint method, the results were as expected. So there might be some bug in the checkpointing changes.