Open jpilaul opened 2 years ago
Hi,
The key point is that this parameter count (2%) corresponds to the number of parameters we actually need to store to disk. The prefix model is an MLP that always takes in a fixed input, so we don't need to store the MLP itself; we only need to store its output after we finish training.
If I remember correctly, the 2% story is when prefix length is 10.
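To make the train-time vs. storage-time distinction concrete, here is a minimal pure-Python sketch. The sizes are my assumptions for illustration (BART-large-like `d_model = 1024`, 12 decoder layers, an MLP hidden size of 800, prefix length 10, and a Linear-Tanh-Linear MLP shape); none of these values are confirmed in this thread:

```python
# Hypothetical sizes, assumed for illustration only
d_model = 1024        # assumed BART-large hidden size
n_layers = 12         # assumed number of decoder layers
mid_dim = 800         # assumed hidden size of the prefix MLP
prefix_len = 10       # prefix length in the "2%" setting

# Each prefix position produces one key and one value vector per layer
out_dim = 2 * n_layers * d_model

# Trainable parameters of an MLP shaped Linear(d, mid) -> Tanh -> Linear(mid, out)
mlp_params = (d_model * mid_dim + mid_dim) + (mid_dim * out_dim + out_dim)

# What needs to be saved to disk: the MLP's output on its fixed input,
# i.e. the prefix activations themselves
stored_values = prefix_len * out_dim

print(mlp_params)     # parameters trained
print(stored_values)  # values stored after training (far fewer)
```

Under these assumptions the trainable MLP is tens of millions of parameters, while the stored output is only `prefix_len * out_dim` values, which is why the trained-parameter count and the stored-parameter count differ so much.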
Hi, thanks for your reply.

> The key point is that this parameter count (2%) corresponds to the number of parameters we actually need to store to disk.

I see. I measured training parameters.

> If I remember correctly, the 2% story is when prefix length is 10.

It's for XSUM in Table 2, which requires a prefix length of 200 to work well (see Figure 4 of your paper). Should it be higher than 2% for a length of 200?
Hi,
I calculated the number of parameters used in the embedding and linear layers of the prefix model from
`self.control_trans`, `self.control_trans_enc`, `self.control_trans2`, `wte`, `wte_enc`, `wte_2`
and I am getting 62.1M. Since BART-large is 406M, we should get 15% added parameters, not 2% as in Table 2 of your paper. I tried the following code:

```python
sum(p.numel() for p in list(self.model.control_trans.parameters()))
```

which gives 20,505,376, or 20.5M, using the hyperparameters to replicate the XSUM results. Here's the prefix model (Embedding is not included):
There is such a model for encoder inputs, decoder inputs, and cross inputs, so you have to multiply the 20.5M by 3 (see here: https://github.com/XiangLi1999/PrefixTuning/blob/cleaned/seq2seq/prefixTuning.py#L260-L279).
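As a sanity check on the 62.1M figure, the arithmetic below reproduces it. The MLP hidden size of 800 is my assumption (it is the value that makes one `control_trans` come out to exactly the 20,505,376 reported above); the other sizes (`d_model = 1024`, 12 layers, prefix length 200) are likewise assumed:

```python
# Assumed hyperparameters (mid_dim = 800 is inferred, not confirmed)
d_model, n_layers, mid_dim, prefix_len = 1024, 12, 800, 200
out_dim = 2 * n_layers * d_model  # one key and one value vector per layer

# Parameters of one MLP shaped Linear(d, mid) -> Tanh -> Linear(mid, out)
one_mlp = (d_model * mid_dim + mid_dim) + (mid_dim * out_dim + out_dim)
# one_mlp == 20505376, matching the count reported above

# Three such modules (encoder, decoder, cross) plus three prefix embeddings
three_mlps = 3 * one_mlp
embeddings = 3 * prefix_len * d_model
total = three_mlps + embeddings

print(total)                                 # training-time parameter count
print(round(100 * total / 406_000_000, 1))   # as a percentage of BART-large
```

Under these assumptions the total comes to about 62.1M trainable parameters, roughly 15% of BART-large's 406M, which is consistent with the measurement above when counting training-time parameters rather than stored ones.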
Thanks