adapter-hub / adapters

A Unified Library for Parameter-Efficient and Modular Transfer Learning
https://docs.adapterhub.ml
Apache License 2.0

New Adapter modules: K-Adapter #274

Open geblanco opened 2 years ago

geblanco commented 2 years ago

🌟 New adapter setup

Hi! It would be great to easily add new Adapter blocks. As far as I know, right now it is only possible to use the standard Houlsby or Pfeiffer adapters, which is great but very limited.

An easier way to add adapter processing would be great, though given the size of the HF/Adapters code base, it seems a bit overwhelming to me. What do you think?

Model description

In K-Adapter, they substitute the feed-forward + non-linearity module of the standard adapter block with another transformer head (or heads). How difficult would that be to implement?
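
Just to sketch the idea (rough code, not tied to this library's internals; names and sizes below are assumptions), the change would amount to something like replacing the bottleneck's non-linearity with a small transformer layer:

```python
# Illustrative sketch of an adapter block where the usual non-linearity between
# the down- and up-projections is replaced by a transformer layer, in the spirit
# of K-Adapter. Not part of adapter-transformers; sizes are placeholder values.
import torch.nn as nn

class KAdapterBlock(nn.Module):
    def __init__(self, hidden_size=768, adapter_size=128, num_heads=4):
        super().__init__()
        # down-project into the adapter dimension, as in standard bottleneck adapters
        self.down = nn.Linear(hidden_size, adapter_size)
        # K-Adapter-style replacement: a transformer layer instead of a plain non-linearity
        self.transformer = nn.TransformerEncoderLayer(
            d_model=adapter_size, nhead=num_heads, batch_first=True
        )
        # project back up and add the residual connection
        self.up = nn.Linear(adapter_size, hidden_size)

    def forward(self, hidden_states):
        residual = hidden_states
        x = self.down(hidden_states)
        x = self.transformer(x)
        x = self.up(x)
        return x + residual
```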

Open source status

ianupright commented 2 years ago

In K-Adapter, it seems like what they are doing is adding a number of Transformer layers to the end of the chain, which get their input from a number of specified layers (hidden states). This might not be that clean to add into AdapterHub because:

geblanco commented 2 years ago

Hi @ianupright,

Yes, effectively that is the case: it is just about inserting a specific layer (other than the feed-forward) after certain given layers. Taking a look at the code led me to similar conclusions.

I solved it in a quick but not very elegant way; you can see the changes here. I took the following steps:

I agree that doing this right would probably mean a major refactor. The main question I see here is: will other adapter architectures become as popular and successful as the core ones? Is it worth it to enable easily pluggable layers?

calpt commented 2 years ago

Hey @geblanco and @ianupright,

Since the release of v3 of this library, we're moving more and more towards integrating alternative adapter / efficient fine-tuning architectures beyond simple bottleneck adapters (e.g. prefix tuning, LoRA, etc.). While we currently haven't planned to add K-Adapter from our side, we'd be happy to have it integrated (and happy to help) in case you're interested in contributing.

I haven't looked into the K-Adapter implementation in detail yet, but from what I've seen in the official implementation, they basically place the adapter layers in a separate model on top of the original Transformer (as opposed to injecting the layers into the forward pass). They then use the hidden states returned by the Transformer model to access the layer outputs at specific locations (the hidden states of all layers can be returned using output_hidden_states). This seems to be the most straightforward approach, since, implementation-wise, it looks similar to adding a prediction head on top of the model.
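
To sketch that idea (not the official K-Adapter code; the tapped layer indices and the simple fusion step below are just assumptions), the backbone would be run with output_hidden_states=True and selected hidden states fed through separate adapter layers sitting on top:

```python
# Minimal sketch of the "separate model on top" approach with a plain Hugging Face
# model: the frozen backbone returns all hidden states, and external adapter layers
# consume the ones at chosen positions. Layer indices and fusion are illustrative.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class ExternalKAdapter(nn.Module):
    def __init__(self, hidden_size=768, tap_layers=(0, 6, 11), num_heads=4):
        super().__init__()
        self.tap_layers = tap_layers
        # one small transformer layer per tapped hidden state
        self.adapters = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model=hidden_size, nhead=num_heads, batch_first=True)
            for _ in tap_layers
        )

    def forward(self, all_hidden_states):
        # all_hidden_states: tuple of (num_layers + 1) tensors from output_hidden_states=True
        outputs = [
            adapter(all_hidden_states[layer_idx])
            for adapter, layer_idx in zip(self.adapters, self.tap_layers)
        ]
        # combine the adapter outputs with the final hidden state (simple sum here)
        return all_hidden_states[-1] + torch.stack(outputs).sum(dim=0)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
kadapter = ExternalKAdapter()

inputs = tokenizer("Adapters are parameter-efficient.", return_tensors="pt")
with torch.no_grad():  # backbone stays frozen; only the adapter on top would be trained
    outputs = model(**inputs, output_hidden_states=True)
fused = kadapter(outputs.hidden_states)
```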

geblanco commented 2 years ago

Hi @calpt,

I see, looks like a major release on your side, cheers!

I agree that the most straightforward way of adding K-Adapter to the mix would be implementing the adapter layers as separate models and then using the outputs of the Transformer at specific locations as input. In that specific scenario, I don't see the necessity of using the adapter-transformers lib, do you?

How would you integrate this new architecture inside the library? As far as I understand, the problem is supporting the injection of arbitrary layers instead of just non-linearities, isn't it? In this sense, I see injecting a BERT layer as a use case similar to the new Compacter architecture (a rough sketch of such a pluggable block is included below).

What are your thoughts on this?
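
For reference, here is a hypothetical sketch (not the adapter-transformers API) of what such a pluggable adapter block could look like: the inner module between the down- and up-projections is injected, so a plain non-linearity, a BERT-style layer (K-Adapter), or some other module could be swapped in.

```python
# Hypothetical pluggable adapter block; names and interface are assumptions,
# not the library's actual implementation.
import torch.nn as nn

class PluggableAdapter(nn.Module):
    def __init__(self, hidden_size, adapter_size, inner_module: nn.Module):
        super().__init__()
        self.down = nn.Linear(hidden_size, adapter_size)
        self.inner = inner_module  # e.g. nn.ReLU(), a transformer layer, a PHM layer, ...
        self.up = nn.Linear(adapter_size, hidden_size)

    def forward(self, hidden_states):
        # bottleneck with residual connection; only the inner module varies
        return self.up(self.inner(self.down(hidden_states))) + hidden_states

# standard bottleneck adapter: inner module is just a non-linearity
bottleneck = PluggableAdapter(768, 64, nn.ReLU())

# K-Adapter-style: inner module is a full transformer layer
k_style = PluggableAdapter(
    768, 128, nn.TransformerEncoderLayer(d_model=128, nhead=4, batch_first=True)
)
```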

calpt commented 2 years ago

Hi @geblanco,

In that specific scenario, I don't see the necessity of using adapter-transformers lib, do you?

Agree on this. As K-Adapter is different from other adapter architectures in this sense and might not compose with other methods as easily, it might not be worth the effort of integrating it into adapter-transformers compared to using the official code base directly.

How would you integrate this new architecture inside the library? As far as I understand, the problem is supporting to inject arbitrary layers instead of just non-linearities, isn't it?

Yes, exactly. I think, in general, an implementation could look similar to the steps you already outlined.