Hello:
Thanks for your interest in our work and for introducing the gist paper for discussion.
I was not familiar with the gist paper, so thanks for sharing it. After taking a look, here are some impressions (which may well be wrong):
I hope this provides you with helpful information!
@Glaciohound Thanks a lot for the detailed comment! That helps a lot!
Regarding expressive power, I would say LM-Steer is inherently limited by its $d \times d$ parameterization, which is quite a small number of parameters.
This reminds me of LoRA. I think an LM-Steer matrix can be seen as a special case of a LoRA matrix: if we apply LoRA only to the value matrix of the last layer of a transformer, then the embedding becomes $\vec{e_v} = QK^T (M_{Value} + M_{LoRA}) \vec{v}$, in which $QK^T M_{LoRA}$ can play the role of $\epsilon W$.
$QK^T M_{LoRA}$ has rank at most the predefined $r \le d$, while the rank of $\epsilon W$ can be up to $d$.
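To make the rank comparison concrete, here is a tiny PyTorch sketch (my own illustration, not code from either paper; the hidden size `d`, rank `r`, and `eps` are made-up values):

```python
# Toy comparison of the two parameterizations discussed above (illustrative only).
import torch

d, r, eps = 768, 8, 1e-3

# LM-Steer-style update: a full d x d matrix W applied as e' = e + eps * W @ e,
# so the perturbation eps * W can have rank up to d.
W = torch.randn(d, d)
e = torch.randn(d)
e_steered = e + eps * (W @ e)

# LoRA-style update on a weight matrix M (e.g. the last layer's value projection):
# M + B @ A with B of shape (d, r) and A of shape (r, d), so B @ A has rank <= r.
M = torch.randn(d, d)
B = torch.randn(d, r)
A = torch.randn(r, d)
M_lora = M + B @ A

print(torch.linalg.matrix_rank(eps * W))  # up to d for a random W
print(torch.linalg.matrix_rank(B @ A))    # at most r
```

The point is just the rank budget: the low-rank factorization caps the perturbation at $r$, whereas the steer matrix is unconstrained up to $d$.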
What you said makes sense and is a meaningful way both to understand the data efficiency of LM-Steer and to relate LM-Steer to LoRA.
From another perspective, even full-parameter fine-tuning is just LoRA with the rank increased to full. Right? 😁
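To spell that out (my notation, not from the paper): write a LoRA update to a single $d \times d$ weight matrix as

$$
W' = W + BA, \qquad B \in \mathbb{R}^{d \times r},\ A \in \mathbb{R}^{r \times d}.
$$

When $r = d$, the product $BA$ can represent any $d \times d$ matrix, so fully fine-tuning that weight is recovered as the full-rank limit.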
Fair enough. LoRA is a very generalizable idea, but your work uncovers a lot of other perspectives nevertheless. Thanks for all your explanations!
Hi! Great job! I was thinking about a similar idea for so-called "automatic prompt tuning". Do you know the work Learning to Compress Prompts with Gist Tokens? I didn't see you mention it in your paper, but I have a feeling that gist tokens are another form of LM-Steer, which may be weaker.
In short, gist tokens compress a long prompt, which may be used to control wording style, into a few token embeddings by training on positive and negative samples. They then replace the long token sequence of the original prompt. Suppose the embedding dimension is $D$, the token length of the original prompt is $L$, and the number of gist tokens is $G$. Then essentially, the input prompt token "matrix" (stacking token embeddings) is reduced from $D \times L$ to $D \times G$. That $D \times G$ matrix to me looks very similar to the steer matrix when $G = D$, since the attention mechanism is (sort of) linearly transforming feature vectors.

Can you leave a little comment on gist tokens? Thanks!