Open KIM-JAKE opened 2 months ago
Hello,
Thank you for your interest in our work.
It is true that the version I implemented may not be memory efficient (including weights and dataset loading).
Have you run your implementation? Did the final results differ greatly? Or is there a 'problem with merging coefficients not being updated' as you mentioned?
I briefly modified the code along the lines you described; it works, and the merging coefficients do change. However, I did not run all steps/epochs due to time constraints.
The way I modified it is as follows:
class AlphaWrapper(torch.nn.Module):
    def __init__(self, paramslist, model, names, exam_datasets):
        ...

    def alpha(self):
        ...

    def collect_trainable_params(self):
        ...

    def get_classification_head(self, dataset_name):
        ...

    def loading_weights(self):
        alph = self.alpha()
        params = tuple(sum(tuple(pi * alphai for pi, alphai in zip(p, alph[0].cpu()))) for j, p in enumerate(zip(*self.paramslist)))
        params = tuple(p.cuda(0) for p in params)
        load_weights(self.model, self.names, params)
        # return self.model

    def get_image_encoder(self):
        alph = self.alpha()
        params = tuple(sum(tuple(pi * alphai for pi, alphai in zip(p, alph[0].cpu()))) for j, p in enumerate(zip(*self.paramslist)))
        params = tuple(p.cuda(0) for p in params)
        load_weights(self.model, self.names, params)
        return self.model

    def forward(self, inp, dataset_name):
        # alph = self.alpha()
        # params = tuple(sum(tuple(pi * alphai for pi, alphai in zip(p, alph[0].cpu()))) for j, p in enumerate(zip(*self.paramslist)))
        # params = tuple(p.cuda(0) for p in params)
        # load_weights(self.model, self.names, params)
        # feature = self.model(inp)
        feature = self.model(inp)
        layer_name = 'classifier_{}'.format(dataset_name)
        classification_head = getattr(self, layer_name)
        out = classification_head(feature)
        return out

for epoch in range(epochs):
    losses = 0.
    # Merge and load the weights once per epoch instead of in every forward pass.
    adamerging_mtl_model.loading_weights()
    for dataset_name in exam_datasets:
        dataset = get_dataset(dataset_name, pretrained_model.val_preprocess, location=args.data_location, batch_size=16)
        dataloader = get_dataloader_shuffle(dataset)
        for i, data in enumerate(tqdm.tqdm(dataloader)):
            data = maybe_dictionarize(data)
            x = data['images'].to(args.device)
            y = data['labels'].to(args.device)
            outputs = adamerging_mtl_model(x, dataset_name)
            loss = softmax_entropy(outputs).mean(0)
            losses += loss
            if i > 0:
                break
    # Update the coefficients once, after the losses of all tasks are accumulated.
    optimizer.zero_grad()
    losses.backward()
    optimizer.step()
    print(list(adamerging_mtl_model.alpha().data))
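The one-line merging expression in loading_weights can be hard to read. Below is a minimal pure-Python sketch of the same pattern, with plain floats standing in for tensors. The function name merge_params and the toy values are illustrative assumptions, not part of the repository. It shows that zip(*paramslist) regroups the per-model parameter lists so that each tuple holds the same parameter from every model, and that each merged parameter is the coefficient-weighted sum of those values.

```python
def merge_params(paramslist, alphas):
    """Coefficient-weighted sum of corresponding parameters across models.

    paramslist[k][j] is parameter j of model k (floats here, tensors in the
    real code); alphas[k] is the merging coefficient of model k.
    """
    merged = []
    # zip(*paramslist) yields, for each parameter index j, the tuple
    # (paramslist[0][j], paramslist[1][j], ...).
    for per_model_values in zip(*paramslist):
        merged.append(sum(p * a for p, a in zip(per_model_values, alphas)))
    return tuple(merged)


# Toy data (hypothetical): three models with two parameters each.
paramslist = [
    (1.0, 2.0),
    (0.5, -1.0),
    (-0.5, 3.0),
]
# First coefficient fixed at 1.0, the others at 0.3, mirroring the
# initial alpha printed in this thread.
alphas = [1.0, 0.3, 0.3]

merged = merge_params(paramslist, alphas)
print(merged)
```

In the posted code the same sum is built from tensors, so the merged weights stay connected to the coefficients in the autograd graph and the later losses.backward() can reach them.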
The results are as follows:
init alpha:
tensor([[1.0000, 0.3000, 0.3000, 0.3000, 0.3000, 0.3000, 0.3000, 0.3000, 0.3000]],
grad_fn=<CatBackward>)
collect_trainable_params:
[Parameter containing:
tensor([[0.3000, 0.3000, 0.3000, 0.3000, 0.3000, 0.3000, 0.3000, 0.3000]],
requires_grad=True)]
Eval: init: Avg ACC:0.0
0%| | 1/1243 [00:01<26:03, 1.26s/it]
0%|▎ | 1/503 [00:00<03:43, 2.24it/s]
0%|▎ | 1/394 [00:00<01:41, 3.86it/s]
1%|▊ | 1/169 [00:00<00:17, 9.39it/s]
Using downloaded and verified file: /root/autodl-tmp/dataset/svhn/train_32x32.mat
Using downloaded and verified file: /root/autodl-tmp/dataset/svhn/test_32x32.mat
0%| | 1/1627 [00:00<01:56, 14.01it/s]
0%|▏ | 1/790 [00:00<01:24, 9.30it/s]
0%|▏ | 1/625 [00:00<00:37, 16.46it/s]
1%|█▏ | 1/118 [00:00<00:33, 3.50it/s]
[tensor([1.0000, 0.2990, 0.2990, 0.2990, 0.3010, 0.2990, 0.2990, 0.2990, 0.2990])]
0%| | 1/1243 [00:00<19:03, 1.09it/s]
0%|▎ | 1/503 [00:00<05:05, 1.64it/s]
0%|▎ | 1/394 [00:00<01:34, 4.17it/s]
1%|▊ | 1/169 [00:00<00:12, 12.93it/s]
Using downloaded and verified file: /root/autodl-tmp/dataset/svhn/train_32x32.mat
Using downloaded and verified file: /root/autodl-tmp/dataset/svhn/test_32x32.mat
0%| | 1/1627 [00:00<01:40, 16.13it/s]
0%|▏ | 1/790 [00:00<01:10, 11.27it/s]
0%|▏ | 1/625 [00:00<00:27, 22.84it/s]
1%|█▏ | 1/118 [00:00<00:30, 3.83it/s]
[tensor([1.0000, 0.2980, 0.2980, 0.2980, 0.3006, 0.2980, 0.2980, 0.2980, 0.2980])]
0%| | 1/1243 [00:00<17:29, 1.18it/s]
0%|▎ | 1/503 [00:00<03:47, 2.21it/s]
0%|▎ | 1/394 [00:00<01:24, 4.67it/s]
1%|▊ | 1/169 [00:00<00:15, 11.01it/s]
Using downloaded and verified file: /root/autodl-tmp/dataset/svhn/train_32x32.mat
Using downloaded and verified file: /root/autodl-tmp/dataset/svhn/test_32x32.mat
0%| | 1/1627 [00:00<01:27, 18.62it/s]
0%|▏ | 1/790 [00:00<01:14, 10.65it/s]
0%|▏ | 1/625 [00:00<00:25, 24.62it/s]
1%|█▏ | 1/118 [00:00<00:31, 3.72it/s]
[tensor([1.0000, 0.2970, 0.2971, 0.2970, 0.3004, 0.2970, 0.2970, 0.2970, 0.2970])]
0%| | 1/1243 [00:00<15:37, 1.32it/s]
0%|▎ | 1/503 [00:00<06:29, 1.29it/s]
0%|▎ | 1/394 [00:00<00:42, 9.24it/s]
1%|▊ | 1/169 [00:00<00:10, 16.70it/s]
Using downloaded and verified file: /root/autodl-tmp/dataset/svhn/train_32x32.mat
Using downloaded and verified file: /root/autodl-tmp/dataset/svhn/test_32x32.mat
0%| | 1/1627 [00:00<01:31, 17.75it/s]
0%|▏ | 1/790 [00:00<01:23, 9.40it/s]
0%|▏ | 1/625 [00:00<00:26, 23.73it/s]
1%|█▏ | 1/118 [00:00<00:28, 4.09it/s]
[tensor([1.0000, 0.2961, 0.2961, 0.2961, 0.2999, 0.2961, 0.2961, 0.2961, 0.2960])]
0%| | 1/1243 [00:01<24:16, 1.17s/it]
0%|▎ | 1/503 [00:00<03:22, 2.48it/s]
0%|▎ | 1/394 [00:00<01:49, 3.58it/s]
1%|▊ | 1/169 [00:00<00:09, 16.92it/s]
Using downloaded and verified file: /root/autodl-tmp/dataset/svhn/train_32x32.mat
Using downloaded and verified file: /root/autodl-tmp/dataset/svhn/test_32x32.mat
0%| | 1/1627 [00:00<01:26, 18.79it/s]
0%|▏ | 1/790 [00:00<01:05, 12.07it/s]
0%|▏ | 1/625 [00:00<00:25, 24.29it/s]
1%|█▏ | 1/118 [00:00<00:30, 3.85it/s]
[tensor([1.0000, 0.2951, 0.2952, 0.2952, 0.3002, 0.2951, 0.2951, 0.2951, 0.2951])]
0%| | 1/1243 [00:01<20:52, 1.01s/it]
0%|▎ | 1/503 [00:00<04:00, 2.09it/s]
0%|▎ | 1/394 [00:00<01:36, 4.09it/s]
1%|▊ | 1/169 [00:00<00:09, 16.81it/s]
Using downloaded and verified file: /root/autodl-tmp/dataset/svhn/train_32x32.mat
Using downloaded and verified file: /root/autodl-tmp/dataset/svhn/test_32x32.mat
0%| | 1/1627 [00:00<01:27, 18.67it/s]
0%|▏ | 1/790 [00:00<01:15, 10.41it/s]
0%|▏ | 1/625 [00:00<00:27, 23.09it/s]
1%|█▏ | 1/118 [00:00<00:31, 3.69it/s]
[tensor([1.0000, 0.2941, 0.2943, 0.2943, 0.3006, 0.2941, 0.2941, 0.2941, 0.2941])]
0%| | 1/1243 [00:00<16:54, 1.22it/s]
0%|▎ | 1/503 [00:00<03:14, 2.58it/s]
0%|▎ | 1/394 [00:00<01:16, 5.11it/s]
1%|▊ | 1/169 [00:00<00:09, 17.05it/s]
Using downloaded and verified file: /root/autodl-tmp/dataset/svhn/train_32x32.mat
Using downloaded and verified file: /root/autodl-tmp/dataset/svhn/test_32x32.mat
0%| | 1/1627 [00:00<01:26, 18.80it/s]
0%|▏ | 1/790 [00:00<01:07, 11.62it/s]
0%|▏ | 1/625 [00:00<00:26, 23.96it/s]
1%|█▏ | 1/118 [00:00<00:25, 4.54it/s]
[tensor([1.0000, 0.2931, 0.2933, 0.2934, 0.3011, 0.2932, 0.2932, 0.2931, 0.2931])]
0%| | 1/1243 [00:00<13:28, 1.54it/s]
0%|▎ | 1/503 [00:00<04:56, 1.69it/s]
0%|▎ | 1/394 [00:00<01:22, 4.78it/s]
1%|▊ | 1/169 [00:00<00:11, 14.97it/s]
Using downloaded and verified file: /root/autodl-tmp/dataset/svhn/train_32x32.mat
Using downloaded and verified file: /root/autodl-tmp/dataset/svhn/test_32x32.mat
0%| | 1/1627 [00:00<01:30, 17.98it/s]
0%|▏ | 1/790 [00:00<01:23, 9.41it/s]
0%|▏ | 1/625 [00:00<00:26, 23.69it/s]
1%|█▏ | 1/118 [00:00<00:29, 3.95it/s]
[tensor([1.0000, 0.2921, 0.2924, 0.2925, 0.3015, 0.2922, 0.2922, 0.2921, 0.2921])]
0%| | 1/1243 [00:00<17:42, 1.17it/s]
0%|▎ | 1/503 [00:00<03:16, 2.56it/s]
0%|▎ | 1/394 [00:00<01:36, 4.06it/s]
1%|▊ | 1/169 [00:00<00:10, 15.96it/s]
Using downloaded and verified file: /root/autodl-tmp/dataset/svhn/train_32x32.mat
Using downloaded and verified file: /root/autodl-tmp/dataset/svhn/test_32x32.mat
0%| | 1/1627 [00:00<01:23, 19.38it/s]
0%|▏ | 1/790 [00:00<01:18, 10.03it/s]
0%|▏ | 1/625 [00:00<00:28, 22.18it/s]
1%|█▏ | 1/118 [00:00<00:28, 4.16it/s]
[tensor([1.0000, 0.2911, 0.2914, 0.2917, 0.3020, 0.2912, 0.2913, 0.2911, 0.2911])]
In short, you can run your code, and if the results are not significantly different, your way seems to be a more efficient implementation. This implementation logic seems consistent to me.
Sincerely,
Thank you for the prompt response.
It seems to be working, but due to resource constraints, I used a batch size of 8 and a learning rate of 4e-5. (I’m trying to reproduce ViT-L-14 & AdaMerging layerwise++ using only 1 step.)
The results are below:
Eval: Epoch: 499 dataset: SUN397 ACC: 0.7684137931034483
Eval: Epoch: 499 dataset: Cars ACC: 0.8722795672180077
Eval: Epoch: 499 dataset: RESISC45 ACC: 0.91
Eval: Epoch: 499 dataset: EuroSAT ACC: 0.9588888888888889
Eval: Epoch: 499 dataset: SVHN ACC: 0.9015826674861709
Eval: Epoch: 499 dataset: GTSRB ACC: 0.9199524940617577
Eval: Epoch: 499 dataset: MNIST ACC: 0.9907
Eval: Epoch: 499 dataset: DTD ACC: 0.7191489361702128
Eval: Epoch: 499 Avg ACC:0.8801207933660609
By the way, it seems that the performance results for ViT-L-14, layerwise++, at 0.1% or 1% were not included in the paper. If possible, could you please share any performance data that the authors obtained for these configurations? I would greatly appreciate it.
Hello,
Cool, your results are an improvement over Task Arithmetic (84.5%) and Ties-Merging (86.0%) on ViT-L-14.
By the way, if you just want to evaluate our method, you can call the trained merge coefficients (merging_cofficient.py) directly.
In the paper, we only used 0.1% or 1% evaluation for ViT-B-32, not for ViT-L-14. The configuration of the ViT-B-32 during evaluation is as follows:
Original version:
class Cars:
    def __init__(self,
                 preprocess,
                 location=os.path.expanduser('~/data'),
                 batch_size=32,
                 num_workers=0):
        # Data loading code
        self.train_dataset = PytorchStanfordCars(location, 'train', preprocess, download=False)
        self.train_loader = torch.utils.data.DataLoader(
            self.train_dataset,
            shuffle=True,
            batch_size=batch_size,
            num_workers=num_workers,
        )
        self.test_dataset = PytorchStanfordCars(location, 'test', preprocess, download=False)
        self.test_loader = torch.utils.data.DataLoader(
            self.test_dataset,
            batch_size=batch_size,
            num_workers=num_workers
        )
        self.test_loader_shuffle = torch.utils.data.DataLoader(
            self.test_dataset,
            shuffle=True,
            batch_size=batch_size,
            num_workers=num_workers
        )
        idx_to_class = dict((v, k) for k, v in self.train_dataset.class_to_idx.items())
        self.classnames = [idx_to_class[i].replace(
            '_', ' ') for i in range(len(idx_to_class))]
Modified version:
class Cars:
    def __init__(self,
                 preprocess,
                 location=os.path.expanduser('~/data'),
                 batch_size=32,
                 num_workers=0,
                 test_data_ratio=1.0):
        # Data loading code
        self.train_dataset = PytorchStanfordCars(location, 'train', preprocess, download=False)
        self.train_loader = torch.utils.data.DataLoader(
            self.train_dataset,
            shuffle=True,
            batch_size=batch_size,
            num_workers=num_workers,
        )
        self.test_dataset = PytorchStanfordCars(location, 'test', preprocess, download=False)
        self.test_loader = torch.utils.data.DataLoader(
            self.test_dataset,
            batch_size=batch_size,
            num_workers=num_workers
        )
        if test_data_ratio < 1.0:
            random.seed(42)
            random_indexs = random.sample(range(len(self.test_dataset)), int(test_data_ratio * len(self.test_dataset)))
            print('cars test_data_ratio:' + str(test_data_ratio))
            print('cars random_indexs:'+str(random_indexs))
            self.test_dataset_training = Subset(self.test_dataset, random_indexs)
        else:
            self.test_dataset_training = self.test_dataset
        self.test_loader_shuffle = torch.utils.data.DataLoader(
            self.test_dataset_training,
            shuffle=True,
            batch_size=batch_size,
            num_workers=num_workers
        )
        idx_to_class = dict((v, k) for k, v in self.train_dataset.class_to_idx.items())
        self.classnames = [idx_to_class[i].replace('_', ' ') for i in range(len(idx_to_class))]
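The subset selection in the modified class can be tried on its own. The following is a small sketch using only the standard library. The helper name sample_subset_indices and the dataset sizes are hypothetical, and unlike the posted code it uses a local random.Random generator instead of reseeding the global one, so it is an illustration of the idea rather than the author's exact implementation.

```python
import random


def sample_subset_indices(num_samples, ratio, seed=42):
    """Pick int(ratio * num_samples) distinct indices, reproducibly."""
    if ratio >= 1.0:
        # Mirror the else-branch above: use the full test set.
        return list(range(num_samples))
    rng = random.Random(seed)  # local generator; leaves global random state alone
    return rng.sample(range(num_samples), int(ratio * num_samples))


# Toy size (hypothetical): 1000 test samples, 1% of them used for adaptation.
indices_a = sample_subset_indices(1000, 0.01)
indices_b = sample_subset_indices(1000, 0.01)
print(len(indices_a), indices_a == indices_b)
```

Because the seed is fixed, every run selects the same indices. Note that int() truncates, so a very small ratio on a small dataset can produce an empty subset.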
Original version:
def get_dataset(dataset_name, preprocess, location, batch_size=128, num_workers=0, val_fraction=0.1, max_val_samples=5000):
    if dataset_name.endswith('Val'):
        # Handle val splits
        if dataset_name in registry:
            dataset_class = registry[dataset_name]
        else:
            base_dataset_name = dataset_name.split('Val')[0]
            base_dataset = get_dataset(base_dataset_name, preprocess, location, batch_size, num_workers)
            dataset = split_train_into_train_val(
                base_dataset, dataset_name, batch_size, num_workers, val_fraction, max_val_samples)
            return dataset
    else:
        assert dataset_name in registry, f'Unsupported dataset: {dataset_name}. Supported datasets: {list(registry.keys())}'
        dataset_class = registry[dataset_name]
    dataset = dataset_class(
        preprocess, location=location, batch_size=batch_size, num_workers=num_workers
    )
    return dataset
Modified version:
def get_dataset(dataset_name, preprocess, location, batch_size=128, num_workers=0, test_data_ratio=1.0, val_fraction=0.1, max_val_samples=5000):
    if dataset_name.endswith('Val'):
        # Handle val splits
        if dataset_name in registry:
            dataset_class = registry[dataset_name]
        else:
            base_dataset_name = dataset_name.split('Val')[0]
            base_dataset = get_dataset(base_dataset_name, preprocess, location, batch_size, num_workers, test_data_ratio)
            dataset = split_train_into_train_val(
                base_dataset, dataset_name, batch_size, num_workers, val_fraction, max_val_samples)
            return dataset
    else:
        assert dataset_name in registry, f'Unsupported dataset: {dataset_name}. Supported datasets: {list(registry.keys())}'
        dataset_class = registry[dataset_name]
    dataset = dataset_class(preprocess, location=location, batch_size=batch_size, num_workers=num_workers, test_data_ratio=test_data_ratio)
    return dataset
Original version:
for epoch in range(epochs):
    losses = 0.
    for dataset_name in exam_datasets:
        dataset = get_dataset(dataset_name, pretrained_model.val_preprocess, location=args.data_location, batch_size=16)
        dataloader = get_dataloader_shuffle(dataset)
        ...
Modified version:
args.test_data_ratio = 0.01 # 0.01 or 0.1
for epoch in range(epochs):
    losses = 0.
    for dataset_name in exam_datasets:
        dataset = get_dataset(dataset_name, pretrained_model.val_preprocess, location=args.data_location, batch_size=16, test_data_ratio=args.test_data_ratio)
        dataloader = get_dataloader_shuffle(dataset)
        ...
I wish you all the best.
Sincerely,
Thank you for the detailed explanation.
I'll refer to the code you provided and run more experiments.
Thank you!
Hello,
for epoch in range(epochs):
    losses = 0.
    adamerging_mtl_model.loading_weights()
    for dataset_name in exam_datasets:
        dataset = get_dataset(dataset_name, pretrained_model.val_preprocess, location=args.data_location, batch_size=16)
        dataloader = get_dataloader_shuffle(dataset)
        for i, data in enumerate(tqdm.tqdm(dataloader)):
            data = maybe_dictionarize(data)
            x = data['images'].to(args.device)
            y = data['labels'].to(args.device)
            with autocast():
                outputs = adamerging_mtl_model(x, dataset_name)
                loss = softmax_entropy(outputs).mean(0)
            losses += loss
            if i > 0: # Execute only one step
                break
I believe this code is actually performing two steps instead of one. The final condition ends the loop when i=1, which means two steps have already been completed by that point. Is that correct?
Thank you.
Hello,
This could be interpreted as either two steps or one step with double the batch size, but the latter is probably more appropriate, because the loss is accumulated over two batches and no parameter update is performed in between. Note that optimizer.zero_grad(), losses.backward(), and optimizer.step() run only after all tasks have calculated their losses.
Thanks.
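The break condition discussed above can be checked in isolation. This is a minimal sketch with a hypothetical helper, batches_consumed, in which a counter stands in for the forward pass and loss accumulation. It compares the posted condition i > 0 with i >= 0.

```python
def batches_consumed(num_batches, should_stop):
    """Count how many batches are processed before the break fires."""
    processed = 0
    for i, _batch in enumerate(range(num_batches)):
        processed += 1  # stands in for the forward pass and `losses += loss`
        if should_stop(i):
            break
    return processed


with_gt = batches_consumed(100, lambda i: i > 0)   # condition in the posted loop
with_ge = batches_consumed(100, lambda i: i >= 0)  # stops after the first batch
print(with_gt, with_ge)  # prints: 2 1
```

With i > 0 the check first succeeds at i = 1, after the second batch has already been added to the loss, so two batches per task contribute to each optimizer update.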
In your paper, you mentioned using a batch size of 16. Should I understand that, in the code, batch_size was set to 8 and two batches were drawn to reach a total batch size of 16, like this?
get_dataset(dataset_name, pretrained_model.val_preprocess, location=args.data_location, batch_size=8)
Thank you
I think my description in the paper is not accurate enough, and I am very sorry. I remembered writing $i>=0$ in the code (that is, only one step is executed), but I later noticed that the condition I actually wrote was $i>0$.
So, it can be understood that I ran 500 steps with batchsize=32, or ran 1000 steps with batchsize=16.
Sorry again for your trouble.
Sincerely
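One way to see why two accumulated batches of 16 behave like a single batch of 32: when both batches have the same size, the sum of the two per-batch mean losses equals twice the mean loss over the combined batch, so the gradient points in the same direction and differs only by a constant factor. The sketch below checks this arithmetic with made-up per-sample loss values; it is an illustration, not output from the actual model.

```python
def mean(values):
    return sum(values) / len(values)


# Hypothetical per-sample entropy values for two equally sized batches.
batch_1 = [0.9, 1.1, 0.4, 0.6]
batch_2 = [0.2, 0.8, 1.0, 1.0]

# What the loop accumulates: one mean per batch, added together.
accumulated = mean(batch_1) + mean(batch_2)

# Mean over the concatenated batch, i.e. one batch of double the size.
combined = mean(batch_1 + batch_2)

print(accumulated, 2 * combined)
```

The identity relies on the two batches having equal size; if the last batch of a loader is smaller, the two quantities are no longer exactly proportional.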
Additional information for @EnnengYang:
I am using an A6000 GPU with 48GB of memory, and I run out of memory while processing the 4th of the 8 datasets with the provided code (batch size 16, 2 steps).
I therefore estimate that roughly 90GB or more of GPU memory would be required, and I suspect the author did not run it this way.
So I am using 16-bit floating-point precision and a single step, as below, which requires about 40GB (batch size 16, 1 step).
for epoch in range(epochs):
    losses = 0.
    adamerging_mtl_model.loading_weights()
    for dataset_name in exam_datasets:
        dataset = get_dataset(dataset_name, pretrained_model.val_preprocess, location=args.data_location, batch_size=args.batch_size)
        dataloader = get_dataloader_shuffle(dataset)
        data = next(iter(dataloader)) # get one batch
        data = maybe_dictionarize(data)
        x = data['images'].to(args.device)
        y = data['labels'].to(args.device)
        with autocast():
            outputs = adamerging_mtl_model(x, dataset_name)
            loss = softmax_entropy(outputs).mean(0)
        losses += loss
    optimizer.zero_grad()
    scaler.scale(losses).backward()
    scaler.step(optimizer)
    scaler.update()
Question: Could you let me know which GPU you used?
Additional information for @EnnengYang and @kasurashan:
I successfully reproduced the results, achieving an average score of 91.0 for layerwise++ / ViT-L-14 with a batch size of 16 and 1 step, using the code above. This was done over 800 epochs with a learning rate of 5e-3. Please take note.
Hello,
Thank you for providing the codes.
The training process in the provided code calculates the loss for each dataset and aggregates it to update the coefficients.
Therefore, the coefficients (lambdas) remain constant throughout the entire data iteration within a single epoch.
However, the original code (below) performs a computation where the parameters are loaded onto the CPU during each forward pass:
In my environment, loading these onto the CPU repeatedly caused memory issues.
Therefore, I modified the code as follows, loading the coefficient parameters into the model at the beginning of each epoch and processing the data accordingly.
In the training process:
Is there any aspect of this approach that differs from the author's intent, or could there be any other issues arising from this modification?
Thank you.