EnnengYang / AdaMerging

AdaMerging: Adaptive Model Merging for Multi-Task Learning. ICLR, 2024.
https://openreview.net/pdf?id=nZP6NgD3QY
MIT License

Couldn't Reproduce the Code #3

Open monody1 opened 3 months ago

monody1 commented 3 months ago

Hello,

I attempted to reproduce the code, but encountered some issues. Could you please provide some insights into how much memory is expected to be used? Additionally, I suspect there might be a memory leak.

dataset_name:SUN397 torch.cuda.memory_allocated:6.21 GB
0%|▏ | 1/503 [00:01<14:33, 1.74s/it]
dataset_name:Cars torch.cuda.memory_allocated:11.91 GB
0%|▎ | 1/394 [00:01<08:39, 1.32s/it]
dataset_name:RESISC45 torch.cuda.memory_allocated:17.61 GB
1%|▋ | 1/169 [00:01<03:21, 1.20s/it]
dataset_name:EuroSAT torch.cuda.memory_allocated:23.31 GB
Using downloaded and verified file: /home/monody/AdaMerging/dataset/svhn/train_32x32.mat
Using downloaded and verified file: /home/monody/AdaMerging/dataset/svhn/test_32x32.mat
0%| | 1/1627 [00:01<38:39, 1.43s/it]
dataset_name:SVHN torch.cuda.memory_allocated:29.01 GB
0%| | 0/790 [00:00<?, ?it/s]
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 38.00 MiB (GPU 0; 31.74 GiB total capacity; 29.65 GiB already allocated; 25.12 MiB free; 31.12 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Thank you for your assistance.
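
The readings in the log above rise by the same amount after every dataset. That pattern is consistent with each forward pass keeping its activations alive until the backward pass, although the log alone does not prove it. A rough extrapolation, assuming the growth stays linear for the remaining datasets (the variable names below are illustrative, not from the repository):

```python
# Readings copied from the log above (as printed by the script).
readings = [("SUN397", 6.21), ("Cars", 11.91), ("RESISC45", 17.61),
            ("EuroSAT", 23.31), ("SVHN", 29.01)]
gpu_capacity = 31.74   # "total capacity" from the OOM message
num_datasets = 8       # the thread merges 8 tasks

values = [gb for _, gb in readings]
increments = [round(b - a, 2) for a, b in zip(values, values[1:])]
per_dataset = sum(increments) / len(increments)

# Assumption: every remaining dataset adds the same amount of memory.
projected_peak = values[0] + per_dataset * (num_datasets - 1)

print("increments:", increments)
print(f"projected peak for {num_datasets} datasets: {projected_peak:.2f}")
print("fits on the GPU:", projected_peak <= gpu_capacity)
```

Under that assumption the eighth dataset would be reached well above the card's capacity, which matches the crash appearing partway through the list rather than at the first batch.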

EnnengYang commented 3 months ago

Hi,

Thank you very much for your interest in our work.

Which architecture are you currently merging: ViT-B/32, ViT-B/16, or ViT-L/14? As I recall, in my experiments both ViT-B/32 and ViT-B/16 ran on a single RTX 3090 GPU (24 GB).

If you just want to evaluate, you can directly load my trained merge coefficients, which can be found at merging_cofficient.py.

Best, Enneng

monody1 commented 3 months ago

I have been trying the ViT-B/16 architecture on 8 datasets using V100S GPUs (32 GB) with the main_task_wise_adamerging method, and I am running into problems during data loading. I also have two questions about the loss calculation in the unsupervised setting with 8 datasets. First, is it correct that the unsupervised loss is accumulated over one batch from each of the 8 datasets, followed by a single backward pass? Second, is the order of these 8 batches fixed during unsupervised training?

EnnengYang commented 3 months ago

Hi,

ViT-B/16 doesn't seem to take up much memory, as the checkpoint file for each dataset is only 426.55MB. Can you run with ViT-B/32?

Or did you change the batch size? For training the coefficients, the default is 16.

On each iteration/step, the code re-loads the unlabeled test set with shuffling enabled, so the batch data is not fixed from one iteration to the next.

Best, Enneng

monody1 commented 3 months ago

I am using the default batch size of 16, but the code crashes during the data loading phase. The issue seems to occur at

x = data['images'].to(args.device)
y = data['labels'].to(args.device)
outputs = adamerging_mtl_model(x, dataset_name)

Is this procedure correct?

I have summarized the CUDA memory usage here.

for epoch in range(epochs):
    losses = 0.
    for dataset_name in exam_datasets:
        dataset = get_dataset(dataset_name, pretrained_model.val_preprocess, location=args.data_location, batch_size=16)
        dataloader = get_dataloader_shuffle(dataset)
        data = next(iter(dataloader))
        data = maybe_dictionarize(data)
        x = data['images'].to(args.device)
        y = data['labels'].to(args.device)

        outputs = adamerging_mtl_model(x, dataset_name)
        loss = softmax_entropy(outputs).mean(0)
        losses += loss

        print(dataset_name)
        print(f'{(torch.cuda.memory_allocated()/1024/1024/1024):.2f} GB')   #collect mem usage
    optimizer.zero_grad()
    losses.backward()
    optimizer.step()

python main_task_wise_adamerging.py
TaskVector:/home/monody/AdaMerging/checkpoints/ViT-B-16/SUN397/finetuned.pt
TaskVector:/home/monody/AdaMerging/checkpoints/ViT-B-16/Cars/finetuned.pt
TaskVector:/home/monody/AdaMerging/checkpoints/ViT-B-16/RESISC45/finetuned.pt
TaskVector:/home/monody/AdaMerging/checkpoints/ViT-B-16/EuroSAT/finetuned.pt
TaskVector:/home/monody/AdaMerging/checkpoints/ViT-B-16/SVHN/finetuned.pt
TaskVector:/home/monody/AdaMerging/checkpoints/ViT-B-16/GTSRB/finetuned.pt
TaskVector:/home/monody/AdaMerging/checkpoints/ViT-B-16/MNIST/finetuned.pt
TaskVector:/home/monody/AdaMerging/checkpoints/ViT-B-16/DTD/finetuned.pt
Classification head for ViT-B-16 on SUN397 exists at /home/monody/AdaMerging/checkpoints/ViT-B-16/head_SUN397.pt
Loading classification head from /home/monody/AdaMerging/checkpoints/ViT-B-16/head_SUN397.pt
Classification head for ViT-B-16 on Cars exists at /home/monody/AdaMerging/checkpoints/ViT-B-16/head_Cars.pt
Loading classification head from /home/monody/AdaMerging/checkpoints/ViT-B-16/head_Cars.pt
Classification head for ViT-B-16 on RESISC45 exists at /home/monody/AdaMerging/checkpoints/ViT-B-16/head_RESISC45.pt
Loading classification head from /home/monody/AdaMerging/checkpoints/ViT-B-16/head_RESISC45.pt
Classification head for ViT-B-16 on EuroSAT exists at /home/monody/AdaMerging/checkpoints/ViT-B-16/head_EuroSAT.pt
Loading classification head from /home/monody/AdaMerging/checkpoints/ViT-B-16/head_EuroSAT.pt
Classification head for ViT-B-16 on SVHN exists at /home/monody/AdaMerging/checkpoints/ViT-B-16/head_SVHN.pt
Loading classification head from /home/monody/AdaMerging/checkpoints/ViT-B-16/head_SVHN.pt
Classification head for ViT-B-16 on GTSRB exists at /home/monody/AdaMerging/checkpoints/ViT-B-16/head_GTSRB.pt
Loading classification head from /home/monody/AdaMerging/checkpoints/ViT-B-16/head_GTSRB.pt
Classification head for ViT-B-16 on MNIST exists at /home/monody/AdaMerging/checkpoints/ViT-B-16/head_MNIST.pt
Loading classification head from /home/monody/AdaMerging/checkpoints/ViT-B-16/head_MNIST.pt
Classification head for ViT-B-16 on DTD exists at /home/monody/AdaMerging/checkpoints/ViT-B-16/head_DTD.pt
Loading classification head from /home/monody/AdaMerging/checkpoints/ViT-B-16/head_DTD.pt
init lambda: tensor([[1.0000, 0.3000, 0.3000, 0.3000, 0.3000, 0.3000, 0.3000, 0.3000, 0.3000]], grad_fn=)
collect_trainable_params: [Parameter containing: tensor([[0.3000, 0.3000, 0.3000, 0.3000, 0.3000, 0.3000, 0.3000, 0.3000]], requires_grad=True)]
0%| | 1/1243 [00:02<58:34, 2.83s/it]
SUN397 6.21 GB
0%|▏ | 1/503 [00:01<12:01, 1.44s/it]
Cars 11.91 GB
0%|▎ | 1/394 [00:01<08:50, 1.35s/it]
RESISC45 17.61 GB
1%|▋ | 1/169 [00:01<03:55, 1.40s/it]
EuroSAT 23.31 GB
Using downloaded and verified file: /home/monody/AdaMerging/dataset/svhn/train_32x32.mat
Using downloaded and verified file: /home/monody/AdaMerging/dataset/svhn/test_32x32.mat
0%| | 1/1627 [00:01<37:30, 1.38s/it]
Traceback (most recent call last): .....
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 30.00 MiB (GPU 0; 31.74 GiB total capacity; 28.84 GiB already allocated; 11.12 MiB free; 30.24 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

monody1 commented 3 months ago

It runs after making this change, but memory usage is still inefficient.

for epoch in range(epochs):
    losses = 0.
    for dataset_name in exam_datasets:
        dataset = get_dataset(dataset_name, pretrained_model.val_preprocess, location=args.data_location, batch_size=16)
        dataloader = get_dataloader_shuffle(dataset)
        data = next(iter(dataloader))  #get one batch
        data = maybe_dictionarize(data)
        x = data['images'].to(args.device)
        y = data['labels'].to(args.device)
        outputs = adamerging_mtl_model(x, dataset_name)
        loss = softmax_entropy(outputs).mean(0)
        losses += loss
        print(dataset_name)
        print(f'{(torch.cuda.memory_allocated()/1024/1024/1024):.2f} GB')
    optimizer.zero_grad()
    losses.backward()
    optimizer.step()

EnnengYang commented 3 months ago

Hi,

Summing the losses across multiple datasets is correct when doing backpropagation and updating the parameters.
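
Because the gradient of a sum is the sum of the gradients, calling `backward()` once per dataset accumulates the same gradient as one backward pass on the summed loss. It also lets each dataset's graph be freed before the next forward pass. The sketch below checks this on a toy problem. It is not the repository's code: `lam`, `batches`, and `toy_loss` are hypothetical stand-ins, and `softmax_entropy` is assumed to be the Shannon entropy of the softmax output.

```python
import torch

torch.manual_seed(0)

# Hypothetical stand-ins: one learnable coefficient vector and one random
# "batch" per dataset. None of this is the repository's real model.
lam = torch.nn.Parameter(torch.full((8,), 0.3))
batches = [torch.randn(16, 8) for _ in range(3)]


def softmax_entropy(logits):
    # Assumed definition: entropy of the softmax prediction, per sample.
    return -(logits.softmax(1) * logits.log_softmax(1)).sum(1)


def toy_loss(x):
    # The logits depend on lam, so the loss is differentiable w.r.t. lam.
    return softmax_entropy(x * lam).mean(0)


# Variant A (as in the thread): sum all losses, then one backward pass.
# Every per-dataset graph stays alive until backward() runs.
lam.grad = None
total = sum(toy_loss(x) for x in batches)
total.backward()
grad_summed = lam.grad.clone()

# Variant B: one backward pass per dataset. Gradients accumulate in
# lam.grad, and each graph can be released before the next forward pass.
lam.grad = None
for x in batches:
    toy_loss(x).backward()
grad_accumulated = lam.grad.clone()

print(torch.allclose(grad_summed, grad_accumulated, atol=1e-5))
```

In a real training loop, `optimizer.zero_grad()` would have to run before the per-dataset loop rather than after it, with `optimizer.step()` once all datasets have been processed. Whether this lowers peak memory for the actual model has not been measured here.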

EnnengYang commented 3 months ago

It is true that reloading the dataset on every iteration is inefficient. However, due to RAM limitations I can't keep all the dataloaders in memory, so I re-read them on each iteration.

A simple fix would be to remove the training dataloaders entirely and keep only the test-set loaders, since this project never uses the training set.

monody1 commented 3 months ago

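Hello, I still haven't been able to reproduce the ViT-B/16 results. Each time execution enters this loop

for dataset_name in exam_datasets:

the dataloader is re-initialized and then one batch is fetched, right? Is that equivalent to re-sampling from the 8 datasets in every epoch, meaning that the samples in the batch from dataset_A at epoch_i and those in the batch from dataset_A at epoch_j can repeat? Currently, after 500 epochs, λ is [[1.0000, 0.1601, 0.0774, 0.0510, 0.0546, 0.0422, 0.0992, 0.0571, 0.7357]], which is still not close to the values given in merging_cofficient.py, [[1.0000, 0.1916, 0.1585, 0.2502, 0.3093, 0.2544, 0.3543, 0.2172, 0.1538]]. Moreover, the average accuracy over the 8 datasets (Eval: Epoch: 499 Avg ACC:0.6502134075542055) is trending downward. Could you provide more details?

Thank you.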

EnnengYang commented 2 months ago

Hello,

for epoch in range(epochs):
    for dataset_name in exam_datasets:
        dataset = get_dataset(dataset_name, pretrained_model.val_preprocess, location=args.data_location, batch_size=16)
        dataloader = get_dataloader_shuffle(dataset)

That is, in the actual implementation the data_loader is re-created every time the loop is entered, and the data is shuffled when the data_loader loads it. For example, the shuffled dataloader in datasets/mnist.py is:

self.test_loader_shuffle = torch.utils.data.DataLoader(
    self.test_dataset,
    batch_size=batch_size,
    shuffle=True,
    num_workers=num_workers
)

In other words, each iteration randomly samples one batch from the test set of the corresponding dataset. Because the sampling is random, the data drawn in different iterations may overlap slightly, but will not overlap completely.
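
The sampling behaviour described above can be imitated without PyTorch. The sketch below assumes that taking the first batch from a freshly shuffled loader is equivalent to shuffling all indices and keeping the first `batch_size` of them. The dataset size and step count are made-up values, and `sample_batch` is a hypothetical helper.

```python
import random


def sample_batch(dataset_size, batch_size, rng):
    # Shuffle all indices and keep the first batch, mirroring
    # next(iter(...)) on a newly created shuffled loader.
    order = list(range(dataset_size))
    rng.shuffle(order)
    return order[:batch_size]


rng = random.Random(0)
dataset_size, batch_size, steps = 10_000, 16, 500  # illustrative values

seen = set()
repeats = 0
for _ in range(steps):
    batch = sample_batch(dataset_size, batch_size, rng)
    # Count samples that already appeared in an earlier step.
    repeats += sum(1 for i in batch if i in seen)
    seen.update(batch)

print(f"drawn: {steps * batch_size}, distinct: {len(seen)}, repeats: {repeats}")
```

Within one batch no index repeats, but across steps the same sample can be drawn again. How often that happens depends on the dataset size relative to the number of steps.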

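In short, when AdaMerging optimizes the merging coefficients, it randomly draws one batch from each dataset to compute the loss, then computes the gradient from the summed loss and updates the merging coefficients.

Best regards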
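
For readers unfamiliar with what the merging coefficients do, here is a minimal sketch of task-wise merging on plain Python lists. It assumes the merged weights are the pretrained weights plus each task vector (fine-tuned minus pretrained) scaled by its coefficient, which is how the λ values logged earlier in the thread are read here: a fixed 1.0 for the pretrained model followed by one coefficient per task. `merge_task_wise` and the toy weights are hypothetical.

```python
def merge_task_wise(pretrained, task_vectors, lambdas):
    """Return pretrained + sum_k lambdas[k] * task_vectors[k], per parameter."""
    merged = {}
    for name, base in pretrained.items():
        merged[name] = [
            b + sum(lam * tv[name][i] for lam, tv in zip(lambdas, task_vectors))
            for i, b in enumerate(base)
        ]
    return merged


# Toy weights: one parameter tensor "w" with two entries.
pretrained = {"w": [1.0, 2.0]}
finetuned_a = {"w": [1.5, 2.0]}
finetuned_b = {"w": [1.0, 3.0]}

# A task vector is the difference between fine-tuned and pretrained weights.
task_vectors = [
    {k: [f - p for f, p in zip(ft[k], pretrained[k])] for k in pretrained}
    for ft in (finetuned_a, finetuned_b)
]

# 0.3 is the initial coefficient shown in the "init lambda" log line.
merged = merge_task_wise(pretrained, task_vectors, [0.3, 0.3])
print(merged)
```

During training only the per-task coefficients would be updated, while the pretrained and fine-tuned weights stay fixed.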