Hi, thanks for the great work!
Section 7.4 describes an initialization experiment with real words. Does this initialization apply to the prompts in every layer, or only to the prompts in the first layer? And how does it work together with the re-parameterization method, given that the input dimension of the re-parameterization is much smaller?
I also noticed that in your code, instead of directly adding prompts to the input of each layer (as described in your paper), you append vectors to the key/value matrices directly via the `past_key_values` argument. How does the initialization experiment work in this implementation? Do you directly initialize the key/value vectors? If so, it seems that the dimensions don't match.
Thanks!
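To make the dimension question concrete, here is a rough sketch of the mismatch I have in mind. All sizes below are hypothetical and chosen only for illustration; they are not taken from the paper or the repo. I am assuming the legacy tuple layout for `past_key_values`, i.e. one `(key, value)` pair per layer, each of shape `(batch, n_head, seq_len, head_dim)`.

```python
# Hypothetical model sizes, for illustration only (not from the paper or repo).
n_layer, n_head, hidden = 24, 16, 1024
prefix_len = 5
head_dim = hidden // n_head


def past_key_values_shapes(batch_size):
    """Shapes of the per-layer (key, value) prefix tensors.

    Assumes the legacy tuple layout: one (key, value) pair per layer,
    each shaped (batch, n_head, prefix_len, head_dim).
    """
    kv_shape = (batch_size, n_head, prefix_len, head_dim)
    return [(kv_shape, kv_shape) for _ in range(n_layer)]


# Real-word embeddings give one vector of size `hidden` per prefix position.
word_init_numel = prefix_len * hidden

# The key/value prefix needs a key and a value for every layer and position.
kv_numel = 2 * n_layer * prefix_len * hidden

shapes = past_key_values_shapes(batch_size=1)
print("layers:", len(shapes), "key shape:", shapes[0][0])
print("word-embedding values:", word_init_numel)
print("key/value prefix values:", kv_numel)
print("ratio:", kv_numel // word_init_numel)
```

With these assumed sizes, the key/value prefix holds `2 * n_layer` times as many values as the word embeddings for the same prefix length, which is why I am unsure how real-word embeddings can be used to initialize it directly.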