Glaciohound / LM-Steer

Official Code Repository for LM-Steer Paper: "Word Embeddings Are Steers for Language Models" (ACL 2024 Outstanding Paper Award)
https://arxiv.org/abs/2305.12798
MIT License

Is LM-Steer a more powerful form of gist tokens? #1

Closed · ifsheldon closed this issue 2 months ago

ifsheldon commented 2 months ago

Hi! Great job! I was thinking about a similar idea for so-called "automatic prompt tuning". Do you know of the work Learning to Compress Prompts with Gist Tokens? I didn't see it mentioned in your paper, but I have a feeling that gist tokens are another, possibly weaker, form of LM-Steer.

In short, gist tokens compress a long prompt (which may be used, for example, to control wording style) into a few token embeddings by training on positive and negative samples; these embeddings then replace the long token sequence of the original prompt. Suppose the embedding dimension is D, the token length of the original prompt is L, and the number of gist tokens is G. Then the input prompt token "matrix" (stacked token embeddings) is essentially reduced from D×L to D×G.

That D×G matrix looks very similar to me to the steer matrix when G = D, since the attention mechanism (sort of) linearly transforms feature vectors.
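For concreteness, here is a rough shape-only sketch of the comparison I have in mind (the sizes below are made up, and I am assuming the LM-Steer update takes the form $e' = e + \epsilon W e$ on output word embeddings, as I read it from the paper):

```python
import torch

D, L, G = 768, 128, 8  # hypothetical: embedding dim, original prompt length, number of gist tokens

# Gist tokens: the D x L prompt "matrix" is compressed into a D x G matrix
prompt_embeddings = torch.randn(D, L)  # stacked token embeddings of the long prompt
gist_embeddings = torch.randn(D, G)    # learned gist embeddings that replace the prompt

# LM-Steer: a single d x d steer matrix W applied to an output word embedding
d, epsilon = D, 1e-3
W = torch.randn(d, d)
e = torch.randn(d)
e_steered = e + epsilon * W @ e

print(prompt_embeddings.shape, gist_embeddings.shape, W.shape)
# torch.Size([768, 128]) torch.Size([768, 8]) torch.Size([768, 768])
```

So as G grows toward D, the gist "matrix" has roughly as many entries as the steer matrix, which is what makes me suspect the two are related.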

Can you leave a little comment on gist tokens? Thanks!

Glaciohound commented 2 months ago

Hello:

Thanks for your interest in our work and for introducing the gist paper for discussion.

I was not familiar with the gist paper, so thanks for sharing it. After taking a look, I have some (possibly wrong) impressions:

I hope this provides you with helpful information!

ifsheldon commented 2 months ago

@Glaciohound Thanks a lot for the detailed comment! That helps a lot!

> Regarding the expressive power, I would say LM-Steer is inherently limited by its $d \times d$ parameters, which is quite a small number.

This reminds me of LoRA. I think an LM-Steer matrix can be a special case of LoRA matrices, if we only apply LoRA to the value matrix of the last layer of a transformer. Then the output embedding would be $\vec{e}_v = QK^T (M_{\text{Value}} + M_{\text{LoRA}}) \vec{v}$, in which $QK^T M_{\text{LoRA}}$ can play the role of $\epsilon W$.

$QK^T M_{\text{LoRA}}$ has a predefined rank $r \le d$, while the rank of $\epsilon W$ can be up to $d$.
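A quick numerical sanity check of this rank argument (the hidden size and LoRA rank below are made up for illustration):

```python
import torch

d, r = 768, 8  # hypothetical hidden size and LoRA rank

# LoRA-style update on the value projection: M_LoRA = A @ B, with A: d x r and B: r x d
A = torch.randn(d, r)
B = torch.randn(r, d)
M_lora = A @ B
print(torch.linalg.matrix_rank(M_lora))       # at most r, i.e. 8 here

# LM-Steer's epsilon * W is an unconstrained d x d matrix
epsilon = 1e-3
W = torch.randn(d, d)
print(torch.linalg.matrix_rank(epsilon * W))  # almost surely d, i.e. 768 here
```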

Glaciohound commented 2 months ago

What you said makes sense and is a meaningful way to understand the data efficiency of LM-Steer, and also to relate LM-Steer to LoRA from many perspectives.

From another perspective, even full-parameter fine-tuning is just LoRA with the rank increased to full. Right? 😁
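In matrix terms, once the rank reaches $d$, a factorization $AB$ can realize any $d \times d$ update, so the full fine-tuning delta is recoverable. A tiny sanity check (shapes made up):

```python
import torch

d = 64
delta_M = torch.randn(d, d)            # an arbitrary full fine-tuning update

# A rank-d "LoRA" factorization reproduces it exactly: A = delta_M, B = I
A, B = delta_M, torch.eye(d)
print(torch.allclose(A @ B, delta_M))  # True
```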

ifsheldon commented 2 months ago

Fair enough. LoRA is a very generalizable idea, but your work uncovers a lot of other perspectives nevertheless. Thanks for all your explanations!