ShoufaChen / AdaptFormer

[NeurIPS 2022] Implementation of "AdaptFormer: Adapting Vision Transformers for Scalable Visual Recognition"
https://arxiv.org/abs/2205.13535
MIT License

What is the difference between AdaptFormer and the baseline "Adapter" in VPT [46]? #13

Closed · jingzhengli closed 2 years ago

jingzhengli commented 2 years ago

Thanks for sharing the nice work. In my opinion, AdaptFormer is the same as the "Adapter" baseline used in VPT [46]. Am I misunderstanding something?

ShoufaChen commented 2 years ago

Thanks for your interest in our work. The designs of both VPT [1] and AdaptFormer are inspired by recent advances in parameter-efficient tuning from the NLP field. VPT focused on prompt tuning and directly used the Adapter from AdapterFusion [2] as a baseline. We, in contrast, focused on the Adapter structure itself and evaluated AdaptFormer on both image and video tasks.
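For readers comparing the two paradigms, here is a minimal, hypothetical sketch of the prompt-tuning side, i.e. what VPT optimizes: a few learnable tokens prepended to the patch-token sequence of a frozen ViT. It is illustrative only and not taken from the VPT codebase; all names and shapes are assumptions.

```python
import torch
import torch.nn as nn

class VisualPromptInput(nn.Module):
    """Illustrative sketch of prompt tuning: learnable prompt tokens are
    prepended to the patch tokens of a frozen ViT, and only the prompt
    embeddings are trained. Names/shapes are assumptions, not VPT code."""

    def __init__(self, num_prompts: int = 10, embed_dim: int = 768):
        super().__init__()
        self.prompts = nn.Parameter(torch.empty(1, num_prompts, embed_dim))
        nn.init.trunc_normal_(self.prompts, std=0.02)

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        # patch_tokens: (B, N, D) from the frozen patch embedding
        b = patch_tokens.shape[0]
        prompts = self.prompts.expand(b, -1, -1)          # (B, P, D)
        return torch.cat([prompts, patch_tokens], dim=1)  # (B, P + N, D)
```

AdaptFormer instead leaves the token sequence untouched and modifies the MLP branch of each block, as sketched after the table below.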

Compared with the vanilla Adapter in VPT [1], our AdaptFormer:

(i) uses a scaling factor s to balance the task-agnostic features (generated by the original frozen branch) and the task-specific features (generated by the tunable bottleneck branch). We evaluated AdaptFormer with multiple values of s; the results are summarized in Table 2c, and a detailed discussion is provided in the main text.

(ii) further compresses the middle dimension of the AdaptMLP module. We aim to strike a trade-off between model capacity (i.e., potential) and adaptation efficiency. The middle dimension largely determines the adapter's parameter count: a higher dimension brings more parameters, at the cost of efficiency and storage. As shown in the table below (and in the sketch that follows it), we evaluated several middle dimensions and found that 64 (a reduction rate of 12) gives the best balance of accuracy, lightweight storage, and efficiency.

| Middle Dim | SSv2 #Params (M) | SSv2 Top-1 Acc (%) | NUS-WIDE #Params (M) | NUS-WIDE mAP (%) |
| --- | --- | --- | --- | --- |
| 1 | 0.16 | 50.03 | 0.09 | 57.51 |
| 4 | 0.22 | 54.70 | 0.15 | 58.14 |
| 16 | 0.44 | 57.62 | 0.37 | 59.00 |
| 32 | 0.73 | 58.27 | 0.66 | 59.09 |
| 64 | 1.32 | 59.02 | 1.25 | 59.07 |
| 128 | 2.51 | 58.95 | 2.43 | 59.49 |
| 256 | 4.87 | 58.87 | 4.79 | 59.62 |
| 512 | 9.59 | 58.98 | 9.51 | 59.82 |
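To make (i) and (ii) concrete, here is a minimal sketch of the AdaptMLP idea: the original (frozen) MLP runs in parallel with a trainable bottleneck branch, and the scaling factor s blends the two. It follows the description in this thread and the paper, not the released code verbatim; init details, norm placement, and hyperparameter values are illustrative.

```python
import torch
import torch.nn as nn

class AdaptMLP(nn.Module):
    """Sketch of an AdaptMLP-style block: a frozen MLP branch plus a
    trainable bottleneck branch (down-projection, nonlinearity,
    up-projection), blended by a scaling factor s. Details may differ
    from the released AdaptFormer code."""

    def __init__(self, frozen_mlp: nn.Module, dim: int = 768,
                 mid_dim: int = 64, scale: float = 0.1):
        super().__init__()
        self.frozen_mlp = frozen_mlp              # original ViT MLP, kept frozen
        for p in self.frozen_mlp.parameters():
            p.requires_grad = False
        self.down = nn.Linear(dim, mid_dim)       # D -> d down-projection
        self.act = nn.ReLU()
        self.up = nn.Linear(mid_dim, dim)         # d -> D up-projection
        self.scale = scale                        # the factor s from (i); value illustrative
        nn.init.zeros_(self.up.weight)            # adapter starts near zero, so the
        nn.init.zeros_(self.up.bias)              # frozen behavior is preserved at init

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        frozen_out = self.frozen_mlp(x)            # task-agnostic features
        adapted = self.up(self.act(self.down(x)))  # task-specific features
        return frozen_out + self.scale * adapted   # blended by s


# Hypothetical usage: wrap the MLP of every block of a frozen ViT-B.
# `vit.blocks` / `block.mlp` are illustrative attribute names.
# for block in vit.blocks:
#     block.mlp = AdaptMLP(block.mlp, dim=768, mid_dim=64, scale=0.1)
```

As a rough check against the table: with dim = 768 and mid_dim = 64, the two linear layers contribute 2·768·64 + 64 + 768 ≈ 0.1M parameters per block, i.e. about 1.2M across the 12 blocks of ViT-B, in line with the 1.32M reported above (the reported count presumably also includes other tunable parameters such as the task head).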

We have added this discussion to our camera-ready version. Please stay tuned.

[1] Jia, Menglin, Luming Tang, Bor-Chun Chen, Claire Cardie, Serge Belongie, Bharath Hariharan, and Ser-Nam Lim. "Visual prompt tuning." ECCV 2022.

[2] Pfeiffer, Jonas, Aishwarya Kamath, Andreas Rücklé, Kyunghyun Cho, and Iryna Gurevych. "AdapterFusion: Non-destructive task composition for transfer learning." EACL 2021.

jingzhengli commented 2 years ago

Thanks for your reply.