Use gradient accumulation during training to address CUDA out of memory issues?

sumorday commented 7 months ago

I am attempting to use gradient accumulation during training to address CUDA out of memory issues. I referred to the example code in the Accelerate documentation as well as the example code in the Accelerate repository and attempted to incorporate gradient accumulation into my training process. However, I encountered some confusion and would appreciate some guidance and advice.

Issue Details:

The original training batch size is 112 (CelebA), but because of CUDA out-of-memory errors I would like to reduce it. I halved the batch size to 56 and set gradient accumulation to 2 to maintain the same number of training steps (batch size 56 × 2 accumulation steps = an effective batch size of 112). However, I found that the final batch size still equals 112, which did not alleviate my CUDA memory pressure.

I would like to receive some guidance on:

1. How do I properly configure gradient accumulation so that the final batch size does not exceed the value I set?
2. Do I need to adjust other parameters or other parts of the training process to accommodate gradient accumulation?

I consulted the following resources:

- Gradient Accumulation example in the Accelerate documentation
- Gradient Accumulation example code in the Accelerate repository

However, I still feel confused and would appreciate some guidance. Thank you for your assistance!

I am having trouble configuring gradient accumulation during training to reduce CUDA memory pressure. Although I consulted the Accelerate documentation and examples, when I reduce the batch size to 56 and set gradient accumulation to 2, the final batch size is still 112. I need guidance on how to configure gradient accumulation correctly so that the final batch size does not exceed the value I set, and on whether other training parameters need to be adjusted. Thank you.

hao-pt commented 6 months ago

First of all, you are not required to use bs=112 to train on CelebA, so you can lower the batch size below 112 (e.g. 32). Alternatively, you can set bs below 56 and set grad_accum correspondingly, e.g. bs=28, grad_accum=4. In general, feel free to adjust bs to fit your GPU's memory.
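
To make the trade-off concrete, here is a minimal sketch of the arithmetic. The per-step (micro) batch size is what has to fit in GPU memory, while the effective batch size is the micro-batch size times the number of accumulation steps. The helper name `accumulation_options` is hypothetical and not part of LFM or Accelerate.

```python
def accumulation_options(target_batch_size, max_micro_batch_size):
    """List (micro_batch_size, accumulation_steps) pairs whose product equals
    target_batch_size, with the micro-batch capped by a memory budget."""
    options = []
    for micro in range(max_micro_batch_size, 0, -1):
        # Only exact divisors keep the effective batch size unchanged.
        if target_batch_size % micro == 0:
            options.append((micro, target_batch_size // micro))
    return options


options = accumulation_options(112, 56)
print(options[:2])  # [(56, 2), (28, 4)]
```

Every pair in the list gives the same effective batch size of 112; smaller micro-batches need less memory per step but more forward/backward passes per optimizer update.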

sumorday commented 6 months ago

Thank you very much. Based on the source code, I know that the modification should be made in run.sh:

# ADM ~ CelebA 256
# --batch_size 112, 56, or 28: I think this is the value to change, right?
accelerate launch --num_processes 1 train_flow_latent.py --exp celeb256_f8_adm \
     --dataset celeba_256 --datadir ../cnf_flow/data/celeba/celeba-lmdb \
     --batch_size 56 --num_epoch 500 \
     --image_size 256 --f 8 --num_in_channels 4 --num_out_channels 4 \
     --nf 256 --ch_mult 1 2 2 2 --attn_resolution 16 8 --num_res_blocks 2 \
     --lr 2e-5 --scale_factor 0.18215 \
     --save_content --save_content_every 10 \
     --use_origin_adm

But where should gradient accumulation be added? Do I need to modify the code? Do I have to change the epoch loop in train_flow_latent.py to add gradient accumulation? I tried once and found it somewhat complicated. I recall that Accelerate can apply gradient accumulation = 2 automatically, but it still does not seem to work. (Of course, changing the batch size directly is the simple and effective approach, just change 112 to 56 or 28, but then a fair comparison with the benchmark does not seem possible.)

hao-pt commented 6 months ago

Yes, you need to modify the training code for gradient accumulation. Specifically, pass the argument at L52 as follows: accelerator = Accelerator(gradient_accumulation_steps=2). Then wrap the training code in with accelerator.accumulate(model):, as in this tutorial: https://huggingface.co/docs/accelerate/en/usage_guides/gradient_accumulation

sumorday commented 6 months ago

    # Setup accelerator:
    accelerator = Accelerator(gradient_accumulation_steps=args.gradient_accumulation_steps)  # The default value passed in is 2.
    effective_batch_size = args.batch_size // args.gradient_accumulation_steps

    # If the batch size is 32 and gradient accumulation = 2, it is divided by 2 here and becomes 16.
    # But I feel that simply reducing the batch size on this line is what actually lets the code run;
    # the gradient accumulation below does not seem to be invoked.
    # Adjusting the batch size here is equivalent to adjusting the batch size in run.sh.
    # Adding this line to train_flow_latent.py just makes that more convenient.

    for epoch in range(init_epoch, args.num_epoch + 1):
        for iteration, (x, y) in enumerate(data_loader):
          with accelerator.accumulate(model):  # Here I followed the advice to "wrap the training code in with accelerator.accumulate(model)"
            x_0 = x.to(device, dtype=dtype, non_blocking=True)
            y = None if not use_label else y.to(device, non_blocking=True)
            # model.zero_grad()  On the PyTorch forum, someone said gradients must not be cleared here, otherwise gradient accumulation will not work, so I commented this line out.
            if is_latent_data:
                z_0 = x_0 * args.scale_factor
            else:
                z_0 = first_stage_model.encode(x_0).latent_dist.sample().mul_(args.scale_factor)
            # sample t
            t = torch.rand((z_0.size(0),), dtype=dtype, device=device)
            t = t.view(-1, 1, 1, 1)
            z_1 = torch.randn_like(z_0)
            # 1 is real noise, 0 is real data
            z_t = (1 - t) * z_0 + (1e-5 + (1 - 1e-5) * t) * z_1
            u = (1 - 1e-5) * z_1 - z_0
            # estimate velocity
            v = model(t.squeeze(), z_t, y)
            loss = F.mse_loss(v, u)
            loss = loss.mean()
            accelerator.backward(loss)             
            # After computing the loss, you need to clear the optimizer's gradients using optimizer.zero_grad()
            optimizer.step()
            scheduler.step()
            global_step += 1
            log_steps += 1
            optimizer.zero_grad()
            model.zero_grad()

if __name__ == "__main__":
    parser.add_argument("--gradient_accumulation_steps", type=int, default=2, help="Gradient accumulation")

However, I am still not sure whether I invoked gradient accumulation correctly, because simply reducing the batch size also lets the code run. What I want is, for example, 8 (bs) × 4 (gradient accumulation) = 32 (effective bs), but I suspect that even after adding this code the batch size is still 8 and the gradient accumulation of 4 has no effect...
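
For reference, the equivalence I am hoping for can be checked on paper with a tiny example. This is a pure-Python sketch with a one-parameter model and hypothetical helper names (`grad_mse`, `accumulated_grad`), not the LFM training code: summing the gradients of 4 micro-batches of 8, each scaled by 1/4, should reproduce the gradient of one batch of 32, provided the micro-batches are of equal size.

```python
def grad_mse(w, xs, ys):
    """Gradient of mean((w * x - y) ** 2) with respect to the scalar weight w."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n


def accumulated_grad(w, xs, ys, micro_batch_size):
    """Accumulate per-micro-batch gradients, each scaled by 1 / num_chunks,
    which is the effect of dividing the loss by the accumulation steps."""
    chunks = [
        (xs[i:i + micro_batch_size], ys[i:i + micro_batch_size])
        for i in range(0, len(xs), micro_batch_size)
    ]
    total = 0.0
    for chunk_x, chunk_y in chunks:
        total += grad_mse(w, chunk_x, chunk_y) / len(chunks)
    return total


xs = [float(i) for i in range(32)]
ys = [3.0 * x + 1.0 for x in xs]
full = grad_mse(0.5, xs, ys)              # one batch of 32
accum = accumulated_grad(0.5, xs, ys, 8)  # 4 micro-batches of 8
print(abs(full - accum) < 1e-9)  # True
```

So if accumulation is wired up correctly, the update after 4 micro-batches of 8 should match a single batch of 32, even though only 8 samples are in memory at a time.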

sumorday commented 6 months ago

Ok, below is a guide for adding gradient accumulation.

def train(args):
    from diffusers.models import AutoencoderKL

    assert torch.cuda.is_available(), "Training currently requires at least one GPU."

    # Setup accelerator:
    # Added: pass the number of accumulation steps to Accelerate.
    accelerator = Accelerator(gradient_accumulation_steps=args.gradient_accumulation_steps)
    device = accelerator.device
    dtype = torch.float32
    set_seed(args.seed + accelerator.process_index)

    # Added: despite its name, this is the per-step (micro) batch size;
    # args.batch_size stays the batch size seen by each optimizer update.
    effective_batch_size = args.batch_size // args.gradient_accumulation_steps

    dataset = get_dataset(args)
    data_loader = torch.utils.data.DataLoader(
        dataset,
        batch_size=effective_batch_size,  # added: load micro-batches
        shuffle=True,
        num_workers=4,
        pin_memory=True,
        drop_last=True,
    )

    ....

    for epoch in range(init_epoch, args.num_epoch + 1):
        model.train()
        for iteration, (x, y) in enumerate(data_loader):
            x_0 = x.to(device, dtype=dtype, non_blocking=True)
            y = None if not use_label else y.to(device, non_blocking=True)
            if is_latent_data:
                z_0 = x_0 * args.scale_factor
            else:
                z_0 = first_stage_model.encode(x_0).latent_dist.sample().mul_(args.scale_factor)

            # Apply gradient accumulation
            with accelerator.accumulate(model):  # added
                t = torch.rand((z_0.size(0),), dtype=dtype, device=device)
                t = t.view(-1, 1, 1, 1)
                z_1 = torch.randn_like(z_0)
                z_t = (1 - t) * z_0 + (1e-5 + (1 - 1e-5) * t) * z_1
                u = (1 - 1e-5) * z_1 - z_0
                v = model(t.squeeze(), z_t, y)
                loss = F.mse_loss(v, u)
                # accelerator.backward() already divides the loss by
                # gradient_accumulation_steps, so it is not scaled again here.
                accelerator.backward(loss)

                # Update the weights only once per accumulation window
                # (and on the last micro-batch of the epoch).
                if (iteration + 1) % args.gradient_accumulation_steps == 0 or iteration == len(data_loader) - 1:
                    optimizer.step()
                    scheduler.step()
                    model.zero_grad()
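
To see when the weights are actually updated, here is a small pure-Python sketch of the stepping condition used in the loop above. The function name `update_iterations` is a hypothetical helper for illustration only.

```python
def update_iterations(num_micro_batches, accumulation_steps):
    """Return the 0-based iterations on which the optimizer would step,
    using the same condition as the training loop."""
    return [
        iteration
        for iteration in range(num_micro_batches)
        if (iteration + 1) % accumulation_steps == 0
        or iteration == num_micro_batches - 1
    ]


print(update_iterations(10, 4))  # [3, 7, 9]
```

With 10 micro-batches and 4 accumulation steps, the last update covers only 2 micro-batches, so that final step uses a smaller effective batch than the others.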

VinAIResearch / LFM

Use gradient accumulation during training to address CUDA out of memory issues? #9