Open ariharasudhanm opened 5 months ago
Firstly, fine-tuning the entire encoder would degrade the original ViT's capabilities, so we opted for adapter fine-tuning instead. Secondly, fine-tuning with adapters did not reduce efficiency to an intolerable level; for instance, the FPS remained acceptable. Lastly, the adapter layer updates its parameters only during the first iteration of each batch; subsequent iterations do not update them, which maintains training efficiency. If you wish to reduce the number of parameters further, you can increase the down-sampling rate, for example to 0.75.
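To give a rough feel for how the down-sampling rate affects adapter size, here is a minimal sketch. It is not the repository's code: `bottleneck_adapter_params` is a hypothetical helper, and it assumes a plain two-layer bottleneck adapter whose hidden width is `dim * (1 - down_rate)`, so that a larger rate gives a narrower bottleneck. The actual adapter may define the rate differently and contains more layers, so the absolute numbers will not match the 183M figure quoted in this thread.

```python
def bottleneck_adapter_params(dim: int, down_rate: float) -> int:
    """Parameter count of a hypothetical two-layer bottleneck adapter.

    Assumption (not taken from the repository): the hidden width is
    dim * (1 - down_rate), so a larger down_rate means fewer parameters.
    """
    hidden = max(1, int(round(dim * (1.0 - down_rate))))
    down = dim * hidden + hidden  # weights + bias of the down-projection
    up = hidden * dim + dim       # weights + bias of the up-projection
    return down + up


if __name__ == "__main__":
    dim = 768  # embedding width of ViT-B
    for rate in (0.25, 0.5, 0.75):
        print(f"down_rate={rate}: {bottleneck_adapter_params(dim, rate):,} params")
```

Under this assumption, the parameter count shrinks roughly in proportion to `1 - down_rate`, which is why raising the rate is the suggested way to make the adapter smaller.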
If I am not wrong, the proposed adapter contains 183M parameters, whereas the ViT-B encoder has approximately 63M parameters. How can you claim that your adapter is more efficient than fine-tuning the whole encoder itself?