EnnengYang / AdaMerging

AdaMerging: Adaptive Model Merging for Multi-Task Learning. ICLR, 2024.
https://openreview.net/pdf?id=nZP6NgD3QY
MIT License

Memory management in training process #7

Open · KIM-JAKE opened this issue 2 months ago

KIM-JAKE commented 2 months ago

Hello,

Thank you for providing the codes.

The training process in the provided code calculates the loss for each dataset and aggregates them to update the merging coefficients.

Therefore, the coefficients (lambdas) remain constant throughout the entire data iteration within a single epoch.

However, the original code (below) recomputes the merged parameters on the CPU during each forward pass:

def forward(self, inp, dataset_name):
    alph = self.lambdas()
    params = tuple(sum(tuple(pi * lambdasi for pi, lambdasi in zip(p, alph[j].cpu()))) for j, p in enumerate(zip(*self.paramslist)))
    params = tuple(p.cuda(0) for p in params)
    load_weights(self.model, self.names, params)
    feature = self.model(inp)
    layer_name = 'classifier_{}'.format(dataset_name)
    classification_head = getattr(self, layer_name)
    out = classification_head(feature)
    return out
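Reading the code above, every forward pass rebuilds each merged layer $j$ on the CPU as a weighted sum over the corresponding entries of paramslist (pretrained weights plus task vectors):

$$\theta^{\text{merged}}_j = \sum_i \lambda_{j,i} \, p_{j,i}$$

so the full set of merged parameters is re-materialized on every batch, which is exactly the repeated CPU round trip described below.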

In my environment, loading these onto the CPU repeatedly caused memory issues.

Therefore, I modified the code as follows, loading the merged weights into the model at the beginning of each epoch and processing the data accordingly.

def loading_weights(self):
    alph = self.lambdas()
    params = tuple(sum(tuple(pi * lambdasi for pi, lambdasi in zip(p, alph[j].cpu()))) for j, p in enumerate(zip(*self.paramslist)))
    params = tuple(p.cuda(0) for p in params)
    load_weights(self.model, self.names, params)

def forward(self, inp, dataset_name):
    # For memory efficiency, weights are loaded in advance (see loading_weights above).
    # alph = self.lambdas()
    # params = tuple(sum(tuple(pi * lambdasi for pi, lambdasi in zip(p, alph[j].cpu()))) for j, p in enumerate(zip(*self.paramslist)))
    # params = tuple(p.cuda(0) for p in params)
    # load_weights(self.model, self.names, params)
    feature = self.model(inp)
    layer_name = 'classifier_{}'.format(dataset_name)
    classification_head = getattr(self, layer_name)
    out = classification_head(feature)
    return out

In the training process:

for epoch in range(epochs):
    losses = 0.

    adamerging_mtl_model.loading_weights()

    for dataset_name in exam_datasets:
        # load dataset and calculate loss
        losses += loss

    optimizer.zero_grad()
    losses.backward()
    optimizer.step()
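For reference, the unsupervised loss used throughout this thread is the Shannon entropy of the model's softmax outputs; below is a minimal sketch consistent with how softmax_entropy is called later (the actual helper is defined in the AdaMerging codebase):

import torch

def softmax_entropy(x: torch.Tensor) -> torch.Tensor:
    # Per-sample entropy of the softmax distribution over logits x of shape (N, C).
    return -(x.softmax(dim=1) * x.log_softmax(dim=1)).sum(dim=1)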

Is there any aspect of this approach that differs from the author's intent, or could there be any other issues arising from this modification?

Thank you.

EnnengYang commented 2 months ago

Hello,

Thank you for your interest in our work.

It is true that the version I implemented may not be memory-efficient (in both weight merging and dataset loading).

Have you run your implementation? Do the final results differ greatly, or do you encounter the 'merging coefficients not being updated' problem you mentioned?


I briefly modified the code following the structure you described; it works, and the merging coefficients do change. However, I did not complete all steps/epochs due to time constraints.

The way I modified it is as follows:

class AlphaWrapper(torch.nn.Module):
    def __init__(self, paramslist, model, names, exam_datasets):
        ...

    def alpha(self):
        ...

    def collect_trainable_params(self):
        ...

    def get_classification_head(self, dataset_name):
        ...

    def loading_weights(self):
        alph = self.alpha()
        params = tuple(sum(tuple(pi * alphai for pi, alphai in zip(p, alph[0].cpu()))) for j, p in enumerate(zip(*self.paramslist)))
        params = tuple(p.cuda(0) for p in params)
        load_weights(self.model, self.names, params)
        # return self.model

    def get_image_encoder(self):
        alph = self.alpha()
        params = tuple(sum(tuple(pi * alphai for pi, alphai in zip(p, alph[0].cpu()))) for j, p in enumerate(zip(*self.paramslist)))
        params = tuple(p.cuda(0) for p in params)
        load_weights(self.model, self.names, params)
        return self.model

    def forward(self, inp, dataset_name):
        # alph = self.alpha()
        # params = tuple(sum(tuple(pi * alphai for pi, alphai in zip(p, alph[0].cpu()))) for j, p in enumerate(zip(*self.paramslist)))

        # params = tuple(p.cuda(0) for p in params)
        # load_weights(self.model, self.names, params)
        # feature = self.model(inp)
        feature = self.model(inp)
        layer_name = 'classifier_{}'.format(dataset_name)
        classification_head = getattr(self, layer_name)
        out = classification_head(feature)

        return out

for epoch in range(epochs):
    losses = 0.
    adamerging_mtl_model.loading_weights()

    for dataset_name in exam_datasets:
        dataset = get_dataset(dataset_name, pretrained_model.val_preprocess, location=args.data_location, batch_size=16)
        dataloader = get_dataloader_shuffle(dataset)

        for i, data in enumerate(tqdm.tqdm(dataloader)):
            data = maybe_dictionarize(data)
            x = data['images'].to(args.device)
            y = data['labels'].to(args.device)

            outputs = adamerging_mtl_model(x, dataset_name)
            loss = softmax_entropy(outputs).mean(0)
            losses += loss

            if i > 0:  # note: this processes two batches (i = 0 and i = 1) before breaking
                break

    optimizer.zero_grad()
    losses.backward()
    optimizer.step()
    print(list(adamerging_mtl_model.alpha().data))

The results are as follows (tqdm progress bars and SVHN download messages omitted):

init alpha:
tensor([[1.0000, 0.3000, 0.3000, 0.3000, 0.3000, 0.3000, 0.3000, 0.3000, 0.3000]],
       grad_fn=<CatBackward>)
collect_trainable_params:
[Parameter containing:
tensor([[0.3000, 0.3000, 0.3000, 0.3000, 0.3000, 0.3000, 0.3000, 0.3000]],
       requires_grad=True)]
Eval: init:  Avg ACC:0.0

[tensor([1.0000, 0.2990, 0.2990, 0.2990, 0.3010, 0.2990, 0.2990, 0.2990, 0.2990])]
[tensor([1.0000, 0.2980, 0.2980, 0.2980, 0.3006, 0.2980, 0.2980, 0.2980, 0.2980])]
[tensor([1.0000, 0.2970, 0.2971, 0.2970, 0.3004, 0.2970, 0.2970, 0.2970, 0.2970])]
[tensor([1.0000, 0.2961, 0.2961, 0.2961, 0.2999, 0.2961, 0.2961, 0.2961, 0.2960])]
[tensor([1.0000, 0.2951, 0.2952, 0.2952, 0.3002, 0.2951, 0.2951, 0.2951, 0.2951])]
[tensor([1.0000, 0.2941, 0.2943, 0.2943, 0.3006, 0.2941, 0.2941, 0.2941, 0.2941])]
[tensor([1.0000, 0.2931, 0.2933, 0.2934, 0.3011, 0.2932, 0.2932, 0.2931, 0.2931])]
[tensor([1.0000, 0.2921, 0.2924, 0.2925, 0.3015, 0.2922, 0.2922, 0.2921, 0.2921])]
[tensor([1.0000, 0.2911, 0.2914, 0.2917, 0.3020, 0.2912, 0.2913, 0.2911, 0.2911])]

In short, you can run your code, and if the final results are not significantly different, your way is a more efficient implementation. The logic seems consistent to me: the merged weights are rebuilt from the lambdas once per epoch, and since the backward pass flows through that merge computation, the merging coefficients still receive gradients.
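A minimal standalone demo of this point (toy tensors, not the repo's actual load_weights): the merged tensor is built once from a trainable coefficient, reused across several batches, and the accumulated loss still back-propagates to the coefficient.

import torch

lam = torch.nn.Parameter(torch.tensor(0.3))   # trainable merging coefficient
theta_pre = torch.ones(3)                     # "pretrained" weight (toy)
tau = torch.full((3,), 2.0)                   # "task vector" (toy)

merged = theta_pre + lam * tau                # merge once per "epoch"

losses = 0.
for _ in range(4):                            # several batches reuse the same merged tensor
    x = torch.randn(3)
    losses = losses + (merged * x).sum()      # stand-in for the per-batch loss

losses.backward()
print(lam.grad)                               # non-None: the gradient reached the coefficient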

Sincerely,

KIM-JAKE commented 2 months ago

Thank you for the prompt response.

It seems to be working, but due to resource constraints I used a batch size of 8 and a learning rate of 4e-5. (I'm trying to reproduce ViT-L-14 & AdaMerging layerwise++ using only 1 step.)

The results are as follows:

Eval: Epoch: 499 dataset: SUN397   ACC: 0.7684137931034483
Eval: Epoch: 499 dataset: Cars     ACC: 0.8722795672180077
Eval: Epoch: 499 dataset: RESISC45 ACC: 0.91
Eval: Epoch: 499 dataset: EuroSAT  ACC: 0.9588888888888889
Eval: Epoch: 499 dataset: SVHN     ACC: 0.9015826674861709
Eval: Epoch: 499 dataset: GTSRB    ACC: 0.9199524940617577
Eval: Epoch: 499 dataset: MNIST    ACC: 0.9907
Eval: Epoch: 499 dataset: DTD      ACC: 0.7191489361702128
Eval: Epoch: 499 Avg ACC: 0.8801207933660609

By the way, it seems that performance results for ViT-L-14 layerwise++ with 0.1% or 1% of the test data were not included in the paper. If possible, could you please share any performance data the authors obtained for these configurations? I would greatly appreciate it.

EnnengYang commented 2 months ago

Hello,

Cool, your results are an improvement over Task Arithmetic (84.5%) and Ties-Merging (86.0%) on ViT-L-14.

By the way, if you just want to evaluate our method, you can load the trained merging coefficients (merging_cofficient.py) directly.


In the paper, we only used the 0.1% or 1% evaluation for ViT-B-32, not for ViT-L-14. The configuration for ViT-B-32 during this evaluation is as follows:

  1. Modify each dataset file (taking 'datasets/cars.py' as an example):

Original version:

class Cars:
    def __init__(self,
                 preprocess,
                 location=os.path.expanduser('~/data'),
                 batch_size=32,
                 num_workers=0):
        # Data loading code

        self.train_dataset = PytorchStanfordCars(location, 'train', preprocess, download=False)
        self.train_loader = torch.utils.data.DataLoader(
            self.train_dataset,
            shuffle=True,
            batch_size=batch_size,
            num_workers=num_workers,
        )

        self.test_dataset = PytorchStanfordCars(location, 'test', preprocess, download=False)
        self.test_loader = torch.utils.data.DataLoader(
            self.test_dataset,
            batch_size=batch_size,
            num_workers=num_workers
        )
        self.test_loader_shuffle = torch.utils.data.DataLoader(
            self.test_dataset,
            shuffle=True,
            batch_size=batch_size,
            num_workers=num_workers
        )
        idx_to_class = dict((v, k) for k, v in self.train_dataset.class_to_idx.items())
        self.classnames = [idx_to_class[i].replace(
            '_', ' ') for i in range(len(idx_to_class))]

Modified version:

class Cars:
    def __init__(self,
                 preprocess,
                 location=os.path.expanduser('~/data'),
                 batch_size=32,
                 num_workers=0,
                 test_data_ratio=1.0):
        # Data loading code

        self.train_dataset = PytorchStanfordCars(location, 'train', preprocess, download=False)
        self.train_loader = torch.utils.data.DataLoader(
            self.train_dataset,
            shuffle=True,
            batch_size=batch_size,
            num_workers=num_workers,
        )

        self.test_dataset = PytorchStanfordCars(location, 'test', preprocess, download=False)
        self.test_loader = torch.utils.data.DataLoader(
            self.test_dataset,
            batch_size=batch_size,
            num_workers=num_workers
        )

        # Note: this requires 'import random' and 'from torch.utils.data import Subset' at the top of the file.
        if test_data_ratio < 1.0:
            random.seed(42)
            random_indexs = random.sample(range(len(self.test_dataset)), int(test_data_ratio * len(self.test_dataset)))
            print('cars test_data_ratio:' + str(test_data_ratio))
            print('cars random_indexs:' + str(random_indexs))
            self.test_dataset_training = Subset(self.test_dataset, random_indexs)
        else:
            self.test_dataset_training = self.test_dataset

        self.test_loader_shuffle = torch.utils.data.DataLoader(
            self.test_dataset_training,
            shuffle=True,
            batch_size=batch_size,
            num_workers=num_workers
        )
        idx_to_class = dict((v, k) for k, v in self.train_dataset.class_to_idx.items())
        self.classnames = [idx_to_class[i].replace('_', ' ') for i in range(len(idx_to_class))]
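
The subsampling logic from step 1 in isolation, as a self-contained sketch (the dummy dataset and sizes are illustrative, not from the repo):

import random

import torch
from torch.utils.data import Subset, TensorDataset

# Dummy "test set" of 1000 items; keep a reproducible 1% subset for adaptation.
test_dataset = TensorDataset(torch.arange(1000))
test_data_ratio = 0.01

random.seed(42)  # fixed seed so every run adapts on the same subset
indices = random.sample(range(len(test_dataset)), int(test_data_ratio * len(test_dataset)))
subset = Subset(test_dataset, indices)

print(len(subset))  # 10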
  2. Modify the 'datasets/registry.py' file:

Original version:

def get_dataset(dataset_name, preprocess, location, batch_size=128, num_workers=0, val_fraction=0.1, max_val_samples=5000):
    if dataset_name.endswith('Val'):
        # Handle val splits
        if dataset_name in registry:
            dataset_class = registry[dataset_name]
        else:
            base_dataset_name = dataset_name.split('Val')[0]
            base_dataset = get_dataset(base_dataset_name, preprocess, location, batch_size, num_workers)
            dataset = split_train_into_train_val(
                base_dataset, dataset_name, batch_size, num_workers, val_fraction, max_val_samples)
            return dataset
    else:
        assert dataset_name in registry, f'Unsupported dataset: {dataset_name}. Supported datasets: {list(registry.keys())}'
        dataset_class = registry[dataset_name]
    dataset = dataset_class(
        preprocess, location=location, batch_size=batch_size, num_workers=num_workers
    )
    return dataset

Modified version:

def get_dataset(dataset_name, preprocess, location, batch_size=128, num_workers=0, test_data_ratio=1.0, val_fraction=0.1, max_val_samples=5000):
    if dataset_name.endswith('Val'):
        # Handle val splits
        if dataset_name in registry:
            dataset_class = registry[dataset_name]
        else:
            base_dataset_name = dataset_name.split('Val')[0]
            base_dataset = get_dataset(base_dataset_name, preprocess, location, batch_size, num_workers, test_data_ratio)
            dataset = split_train_into_train_val(
                base_dataset, dataset_name, batch_size, num_workers, val_fraction, max_val_samples)
            return dataset
    else:
        assert dataset_name in registry, f'Unsupported dataset: {dataset_name}. Supported datasets: {list(registry.keys())}'
        dataset_class = registry[dataset_name]
    dataset = dataset_class(preprocess, location=location, batch_size=batch_size, num_workers=num_workers, test_data_ratio=test_data_ratio)
    return dataset
  3. Modify the 'main_task_wise_adamerging.py' file:

Original version:

for epoch in range(epochs):
    losses = 0.
    for dataset_name in exam_datasets:
        dataset = get_dataset(dataset_name, pretrained_model.val_preprocess, location=args.data_location, batch_size=16)
        dataloader = get_dataloader_shuffle(dataset)
        ...

Modified version:

args.test_data_ratio = 0.01 # 0.01 or 0.1
for epoch in range(epochs):
    losses = 0.
    for dataset_name in exam_datasets:
        dataset = get_dataset(dataset_name, pretrained_model.val_preprocess, location=args.data_location, batch_size=16, test_data_ratio=args.test_data_ratio)
        dataloader = get_dataloader_shuffle(dataset)
        ...

I wish you all the best.

Sincerely,

KIM-JAKE commented 2 months ago

Thank you for the detailed explanation.

I'll refer to the code you provided and run more experiments.

Thank you!

KIM-JAKE commented 2 months ago

Hello,

for epoch in range(epochs):
    losses = 0.

    adamerging_mtl_model.loading_weights()      
    for dataset_name in exam_datasets:
        dataset = get_dataset(dataset_name, pretrained_model.val_preprocess, location=args.data_location, batch_size=16)
        dataloader = get_dataloader_shuffle(dataset)

        for i, data in enumerate(tqdm.tqdm(dataloader)):
            data = maybe_dictionarize(data)
            x = data['images'].to(args.device)
            y = data['labels'].to(args.device)

            with autocast():
                outputs = adamerging_mtl_model(x, dataset_name)
                loss = softmax_entropy(outputs).mean(0)
                losses += loss

            if i > 0: # Execute only one step
                break

I believe this code is actually performing two steps instead of one: the break condition ends the loop when i = 1, meaning two batches have already been processed by that point. Is that correct?

Thank you.​

EnnengYang commented 2 months ago

Hello,

This could be interpreted as either two steps or as one step with double the batch size, but the latter is probably more appropriate: the loss is computed over two batches with no parameter update in between, so the effective batch size per task is 2 x 16 = 32. Note that optimizer.zero_grad(), losses.backward(), and optimizer.step() only occur after all tasks have accumulated their losses. (To process exactly one batch per task, the condition would be i >= 0, i.e., break after the first iteration.)

Thanks.

kasurashan commented 2 months ago

In your paper, you mentioned using a batch size of 16. Should I understand that, in the code, batch_size was set to 8 and two batches were drawn, for a total batch size of 16? That is, something like get_dataset(dataset_name, pretrained_model.val_preprocess, location=args.data_location, batch_size=8).

Thank you

EnnengYang commented 2 months ago

I think my description in the paper is not accurate enough; I am very sorry. I remembered writing $i >= 0$ in the code (that is, only one batch executed per task), but I later noticed that the condition I actually wrote was $i > 0$ (two batches per task).

So it can be understood that I ran 500 steps with batch size 32, or equivalently 1000 forward passes with batch size 16.

Sorry again for your trouble.

Sincerely

KIM-JAKE commented 2 months ago

For additional information for @EnnengYang

I am using an A6000 GPU with 48 GB of memory, and I run out of memory when processing the 4th of the 8 datasets with the provided code (batch size 16, 2 steps).

Therefore, I estimate that roughly 90 GB or more of GPU memory would be required, and I suspect the author did not use this method.

So I am using 16-bit floating-point precision and a single step per task, as below, which requires about 40 GB (batch size 16, 1 step):

for epoch in range(epochs):
    losses = 0.

    adamerging_mtl_model.loading_weights()      
    for dataset_name in exam_datasets:
        dataset = get_dataset(dataset_name, pretrained_model.val_preprocess, location=args.data_location, batch_size=args.batch_size)
        dataloader = get_dataloader_shuffle(dataset)
        data = next(iter(dataloader))  # get one batch
        data = maybe_dictionarize(data)
        x = data['images'].to(args.device)
        y = data['labels'].to(args.device)

        with autocast():
            outputs = adamerging_mtl_model(x, dataset_name)
            loss = softmax_entropy(outputs).mean(0)
            losses += loss

    optimizer.zero_grad()
    scaler.scale(losses).backward()
    scaler.step(optimizer)
    scaler.update()
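
For this snippet to run, the AMP utilities have to be created beforehand; a minimal setup sketch using PyTorch's standard torch.cuda.amp API:

from torch.cuda.amp import GradScaler, autocast

# GradScaler rescales the loss so fp16 gradients do not underflow;
# create it once, before the training loop.
scaler = GradScaler()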

Question: Could you let me know which GPU was used?

KIM-JAKE commented 2 months ago

For additional information for @EnnengYang and @kasurashan .

I successfully reproduced the results with the code above, achieving an average accuracy of 91.0 for layerwise++ / ViT-L-14 with a batch size of 16 and 1 step, trained over 800 epochs with a learning rate of 5e-3. Please take note.