adapter-hub / adapters

A Unified Library for Parameter-Efficient and Modular Transfer Learning
https://docs.adapterhub.ml
Apache License 2.0

How come vanilla fine-tuning of BERT, with ~100x more trainable parameters than BERT + adapter, takes only ~2x the time? #360

Closed macabdul9 closed 1 year ago

macabdul9 commented 2 years ago

Fine-tuning the BERT model on the IMDB dataset takes ~20 min/epoch, while fine-tuning BERT with an adapter takes ~12 min/epoch. The first case has 109M trainable parameters, whereas BERT + adapter has fewer than 2M trainable parameters.


calpt commented 2 years ago

Hey @macabdul9, while the number of trainable parameters is much lower when training adapters compared to fine-tuning the full model, every training sample still has to be passed through the full model on each forward pass. Training is faster nonetheless because the backward pass does not need to compute gradients for all parameters, only for the adapter weights, which yields the speed-up you observed. You can find a lot more analysis on training/inference time and efficiency in this paper: https://aclanthology.org/2021.emnlp-main.626.pdf
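To make the asymmetry concrete, here is a minimal PyTorch sketch (not the adapters library's own mechanism; the `adapter_head` bottleneck is a hypothetical stand-in placed after the encoder, whereas real adapters are inserted inside each transformer layer). The frozen forward pass still runs through all ~109M BERT parameters, but gradients are only computed for the small trainable module:

```python
import torch
from transformers import BertModel

# Load BERT and freeze all pretrained weights, as adapter training does.
model = BertModel.from_pretrained("bert-base-uncased")
for param in model.parameters():
    param.requires_grad = False

# Hypothetical stand-in for an adapter: a small trainable bottleneck (~100K params).
adapter_head = torch.nn.Sequential(
    torch.nn.Linear(768, 64), torch.nn.ReLU(), torch.nn.Linear(64, 768)
)

all_params = list(model.parameters()) + list(adapter_head.parameters())
trainable = sum(p.numel() for p in all_params if p.requires_grad)
total = sum(p.numel() for p in all_params)
print(f"trainable params: {trainable:,} / {total:,}")

# The forward pass still touches every frozen BERT parameter ...
input_ids = torch.randint(0, 30522, (8, 128))
hidden = model(input_ids).last_hidden_state
out = adapter_head(hidden)

# ... but backward only computes gradients for the small trainable module.
# (With real adapters inside the layers, the backward pass also propagates
# through the layers above them, still without gradients for frozen weights.)
out.sum().backward()
```

So the compute per step is dominated by the shared forward pass; only the gradient computation shrinks, which is why training time drops by roughly 2x rather than 100x.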

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has been without activity for 90 days. This issue will be closed in 14 days unless you comment or remove the stale label.

adapter-hub-bert commented 1 year ago

This issue was closed because it was stale for 14 days without any activity.