adapter-hub / adapters

A Unified Library for Parameter-Efficient and Modular Transfer Learning
https://docs.adapterhub.ml
Apache License 2.0

Adapter Configuration - Mixture of Experts #626

Closed · simon-lund closed this 8 months ago

simon-lund commented 8 months ago

Hello,

I am a computer science student at LMU Munich, and as part of my master's thesis I am working on fine-tuning vision-language models in a multitask setting. The focus of my work is on Mixture-of-Experts models, which select and combine relevant adapters for different inputs.

During my research, I came across your adapter framework. I have already read some of the documentation, but I am still unsure whether it would be possible to “simply” add such a meta-module. I would like to contribute such a module to this framework, but I would appreciate a little help on where to start.

Beyond that, I would like to ask whether existing models can be extended. Specifically, a combination of CLIP + Llama would allow me to reimplement the code from a paper (Octavius).

calpt commented 8 months ago

Hey @simon-lund!

Thanks for reaching out, this sounds very interesting. We'd be happy to see your work integrated into our library. I haven't studied your topic and the linked paper in detail yet, so these are only preliminary, high-level answers:

> During my research, I came across your adapter framework. I have already read some of the documentation, but I am still unsure whether it would be possible to “simply” add such a meta-module. I would like to contribute such a module to this framework, but I would appreciate a little help on where to start.

Yes, in principle it should be possible to integrate such new modules into the library. I think what you're trying to achieve most closely resembles our composition blocks and could be implemented as a new composition block type, so I'll try to give some context here:
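
For illustration, here is a minimal sketch of how existing composition blocks are activated (assuming the current `adapters` API; the base model and adapter names are just placeholders). A Mixture-of-Experts router that selects adapters per input would be implemented as a new block of this kind:

```python
# Minimal sketch of how existing composition blocks combine adapters.
# The base model and adapter names are placeholders for illustration only.
import adapters
import adapters.composition as ac
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("roberta-base")
adapters.init(model)  # enable adapter support on a plain Transformers model

# Two bottleneck adapters that a composition block can combine.
model.add_adapter("task_a", config="seq_bn")
model.add_adapter("task_b", config="seq_bn")

# Existing blocks such as Average (or Stack, Parallel, Fuse) already define how
# several adapters are combined per forward pass; an MoE block with a learned,
# input-dependent router would be a new block of this kind.
model.set_active_adapters(ac.Average("task_a", "task_b"))
```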

> Beyond that, I would like to ask whether existing models can be extended. Specifically, a combination of CLIP + Llama would allow me to reimplement the code from a paper (Octavius).

In general, both CLIP and Llama are already supported by our library. In principle, it should be possible to compose these models using standard methods provided by Transformers, e.g. by joining vision encoders and text decoders as described here. This might require some additional tweaking to work well with adapters, though.
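
As a rough illustration, joining a vision encoder and a text decoder with standard Transformers functionality could look like the sketch below. It uses the ViT + GPT-2 pairing from the Transformers docs as a stand-in; swapping in CLIP's vision tower and a Llama decoder, and getting adapters to work on top of the composed model, is where the extra tweaking would come in. The adapter name is a placeholder.

```python
# Rough sketch: compose a vision encoder with a text decoder via Transformers.
# ViT + GPT-2 is used as a stand-in; CLIP + Llama would follow the same pattern
# but may need additional adjustments, as mentioned above.
import adapters
from transformers import VisionEncoderDecoderModel

model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    "google/vit-base-patch16-224-in21k",  # vision encoder
    "gpt2",                               # text decoder (cross-attention is added)
)

# Assumption: whether adapters.init handles the composed model out of the box
# may need checking; otherwise, adapters can be added to model.encoder and
# model.decoder individually.
adapters.init(model)
model.add_adapter("multimodal_task", config="seq_bn")
model.train_adapter("multimodal_task")  # freeze the base model, train only the adapter
```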

More generally, our contributing guides on adding new model support and adding new adapter methods might also provide helpful context in this regard.

Hope these general pointers are somewhat helpful. Happy to help with more specific questions/issues related to your concrete use case!