The code you provided seems to be correct. In general, the number of trainable parameters of a method is not necessarily reflected in its required training time. During training, the model still has to propagate gradients down to the first module that requires gradient updates, even if many modules in between don't require gradients. Larger speedups can therefore be obtained by leaving out adapter modules in the earlier model layers, as in the sketch below. There's some analysis of this in this paper: https://aclanthology.org/2021.emnlp-main.626.pdf.
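For instance, a minimal sketch of that idea using the `leave_out` option of adapter configs (the checkpoint, adapter name, and layer indices here are illustrative assumptions, not a definitive recipe):

```python
from transformers import BertModel
from transformers.adapters import PfeifferConfig  # adapter-transformers fork

model = BertModel.from_pretrained("bert-base-uncased")

# Skip adapters in the first six encoder layers, so gradients only need to
# flow back as far as layer 6 -- that is where the speedup comes from.
config = PfeifferConfig(leave_out=list(range(6)))  # leave_out: layer indices without adapters

model.add_adapter("task_adapter", config=config)   # "task_adapter" is a placeholder name
model.train_adapter("task_adapter")                # freeze base weights, train only the adapter
model.set_active_adapters("task_adapter")          # use the adapter in every forward pass
```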
Environment info
adapter-transformers version: 3.1.0

Details
I don't seem to be achieving much speedup with adapters so far, and I'm unsure what I'm doing wrong. I upgraded to 3.1.0 and tried the IA3Config, which trains only a fraction of the parameters of the PfeifferConfig. To my surprise, an epoch over 25,000 samples still takes roughly 10 minutes, about the same time as with the PfeifferConfig.
For my model, I use a custom BERT head with an additional layer and some modifications (just things like mean pooling, nothing particularly intensive), and I follow the Colab notebook in setting up something like the following:
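Roughly, the relevant part of the setup looks like this (a simplified sketch; the checkpoint, adapter name, and label count are placeholders):

```python
import torch.nn as nn
from transformers import BertModel
from transformers.adapters import IA3Config  # adapter-transformers fork

class BertWithCustomHead(nn.Module):
    def __init__(self, checkpoint="bert-base-uncased", num_labels=2):
        super().__init__()
        self.bert = BertModel.from_pretrained(checkpoint)

        # Set up the (IA)^3 adapter and freeze everything else.
        config = IA3Config()
        self.bert.add_adapter("my_task", config=config)
        self.bert.train_adapter("my_task")          # freezes base weights, trains adapter only
        self.bert.set_active_adapters("my_task")    # use the adapter in every forward pass

        # Custom head: mean pooling over token embeddings plus an extra linear layer.
        self.classifier = nn.Linear(self.bert.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        hidden = outputs.last_hidden_state                      # (batch, seq_len, hidden)
        mask = attention_mask.unsqueeze(-1).float()
        pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)   # masked mean pooling
        return self.classifier(pooled)
```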
Am I misunderstanding the parameter-efficiency aspect of adapters in general, or am I implementing something incorrectly?