XiangLi1999 / PrefixTuning

Prefix-Tuning: Optimizing Continuous Prompts for Generation

Possible mistake in prefix model parameter count? I am getting 15% not 2% like in the paper #17

Open jpilaul opened 2 years ago

jpilaul commented 2 years ago

Hi,

I calculated the number of parameters in the embedding and linear layers of the prefix model, i.e. in self.control_trans, self.control_trans_enc, self.control_trans2, wte, wte_enc, and wte_2, and I am getting 62.1M. Since BART-large has 406M parameters, that comes out to roughly 15% added parameters, not the 2% reported in Table 2 of your paper.

I tried the following code: sum(p.numel() for p in list(self.model.control_trans.parameters())), which gives 20,505,376 (about 20.5M) with the hyperparameters used to replicate the XSum results.

Here's the prefix model (Embedding is not included):

Sequential(
  (0): Linear(in_features=1024, out_features=800, bias=True)
  (1): Tanh()
  (2): Linear(in_features=800, out_features=24576, bias=True)
)

There is such a model for the encoder inputs, the decoder inputs, and the cross-attention inputs, so the 20.5M has to be multiplied by 3 (see here: https://github.com/XiangLi1999/PrefixTuning/blob/cleaned/seq2seq/prefixTuning.py#L260-L279).
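For reference, here is the arithmetic behind my 62.1M figure (a rough count from the layer sizes printed above; I am assuming the wte embedding is 200 x 1024 for a prefix length of 200):

```python
mlp = (1024 * 800 + 800) + (800 * 24576 + 24576)  # one control_trans MLP: 20,505,376
wte = 200 * 1024                                  # one prefix embedding table: 204,800
total = 3 * (mlp + wte)                           # encoder, decoder and cross prefixes
print(total, total / 406e6)                       # 62,130,528 -> ~15.3% of BART-large
```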

Thanks

XiangLi1999 commented 2 years ago

Hi,

The key point is that this parameter count (2%) corresponds to the number of parameters we actually need to store to disk. The prefix model is an MLP that always takes in a fixed input, so we don't need to store the MLP itself; we only need to store its output once training is finished.

If I remember correctly, the 2% figure is for a prefix length of 10.
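To illustrate the idea, here is a minimal sketch (not the repo's actual checkpointing code; the sizes are the XSum ones quoted above):

```python
import torch

# sizes taken from the XSum config discussed above (for illustration only)
prefix_len, d_model, mid, out_dim = 200, 1024, 800, 24576

wte = torch.nn.Embedding(prefix_len, d_model)
control_trans = torch.nn.Sequential(
    torch.nn.Linear(d_model, mid),
    torch.nn.Tanh(),
    torch.nn.Linear(mid, out_dim),
)

# the MLP only ever sees this fixed index sequence, so its output is constant
idx = torch.arange(prefix_len)
prefix = control_trans(wte(idx))        # shape (prefix_len, out_dim)

# after training, this is the only tensor that needs to be saved;
# wte and control_trans can be thrown away
torch.save(prefix.detach(), "prefix.pt")
print(prefix.numel())                   # 200 * 24576 = 4,915,200 values per prefix
```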

jpilaul commented 2 years ago

Hi, thanks for your reply.

> The key point is that this parameter count (2%) corresponds to the number of parameters we actually need to store to disk.

I see. I was counting the trainable parameters during training, not what gets stored to disk.

> If I remember correctly, the 2% figure is for a prefix length of 10.

The 2% I am referring to is for XSum in Table 2, which requires a prefix length of 200 to work well (see Figure 4 of your paper). Shouldn't the stored count be higher than 2% for a length of 200?
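Back-of-the-envelope, assuming only the MLP outputs are stored, that 24576 = 2 (key/value) x 12 layers x 1024, and that all three prefixes are saved:

```python
stored = 3 * 200 * 24576        # 3 prefixes x prefix_len x (2 * n_layers * d_model)
print(stored, stored / 406e6)   # 14,745,600 -> ~3.6% of BART-large
```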