LM-Steer is more powerful gist tokens?

Glaciohound / LM-Steer

Official Code Repository for LM-Steer Paper: "Word Embeddings Are Steers for Language Models" (ACL 2024 Outstanding Paper Award)

https://arxiv.org/abs/2305.12798

MIT License

59 stars 11 forks source link

LM-Steer is more powerful gist tokens? #1

Closed ifsheldon closed 2 months ago

ifsheldon commented 2 months ago

Hi! Great job! I was thinking about a similar ideas for so-called "automatic prompt tuning". Do you know the work done in Learning to Compress Prompts with Gist Tokens ? I didn't see you mention it in your paper, but I have a feeling that gist tokens are another form of LM-Steer, which may be weaker.

In short, gist tokens compress a long prompt, which may be used to control wording style, into a few token embeddings by training on positive and negative samples. They then replace the long token sequence of the original prompt. Suppose the dimension is D, the token length of original prompt L, the gist token number G. Then essentially, the input prompt token "matrix" (stacking token embeddings) is reduced to D*G from D*L.

That D*G matrix to me looks very similar to the steer matrix when G = D, since the attention mechanism is (sort of) linearly transforming feature vectors.

Can you leave a little comment on gist tokens? Thanks!

Glaciohound commented 2 months ago

Hello:

Thanks for your interest in our work and for introducing the gist paper for discussion.

I was not familiar with the gist paper, so thanks for sharing it. After taking a look, I get some (can be wrong) feelings:

If we divide the papers by "input embedding modification" v.s. "output embedding modification," which I feel appropriate, then the gist might fall in the bin of the former one along with prefix tuning, prompt compression, etc. In contrast, the LM-steer falls in the latter category. Input embedding modification changes "how LMs read and perceive the input," while output embedding modification changes "how LMs generate", which are different perspectives of approaching the problem and different working styles.
Regarding the powerfulness, I would say LM-Steer is inherited limited by its $d\times d$ parameter, which is quite a small number while prompting, prompt tuning, and gist have unlimited expressibility and can learn more complex behaviors
Regarding the parameter efficiency, if I understand correctly, gist also learns the intermediate activates in each layer, which is why it can store such powerful prompt information
In summary, LM-Steer and Gist share similarities in operating in the word embeddings, and they also have differences in which embedding to work on (input v.s. output), efficiency ($d\times d$ v.s. $d \times G$+layer activations), interpretability (LM-Steer provides interpretation tools for embeddings and generations), flexibility (LM-Steer allows continuous and compositional control), transferability, expressibility (gist seems promising in storing arbitrary instruction as long as with a large enough $G$)

I hope this provides you with helpful information!

ifsheldon commented 2 months ago

@Glaciohound Thanks a lot for the detailed comment! That helps a lot!

Regarding the powerfulness, I would say LM-Steer is inherited limited by its $d × d$ parameter, which is quite a small number

This reminds me of LoRA. I think a LM-Steer matrix can be a special case of LoRA matrices? If we only apply LoRA to the value matrix of the last layer of a transformer. Then the embedding will be $\vec{ev}=QK^T(M{Value}+M{LoRA})\vec{v}$, in which $QK^TM{LoRA}$ can be $\epsilon W$.

$QK^TM_{LoRA}$ has a predefined rank $r \le d$ while the rank of $\epsilon W$ can be up to $d$.

Glaciohound commented 2 months ago

What you said makes sense and is a meaningful way to understand the data efficiency of LM-Steer and also to relate LM-Steer with LoRA from many perspectives.

From another perspective, even full parameter fine-tuning is a LoRA with the rank increased to full. Right? 😁

ifsheldon commented 2 months ago

Fair enough. LoRA is a very generalizable idea, but your work uncover a lot other perspectives nevertheless. Thanks for all your explanation!