RUCAIBox / TextBox

TextBox 2.0 is a text generation library with pre-trained language models
https://github.com/RUCAIBox/TextBox
MIT License

PTG using same key #359

Closed minji-o-j closed 1 year ago

minji-o-j commented 1 year ago

In the paper, a cluster key k^c_z and a prompt key k^p_t are defined separately, but the code uses the same key for both.

  1. Are the cluster keys and prompt keys described in the paper actually the same thing?
  2. In what part of the code are clusters considered? Currently the key does not appear to be cluster-specific, since it is updated on the full data.

 prompt_embeds = self.lam * self.MHA(task_query, key, value) + (1 - self.lam) * self.MHA(input_query, key, value)
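
For reference, here is a minimal runnable sketch of this composition; the shapes, head count, and the torch.nn.MultiheadAttention wrapper are illustrative assumptions, not the exact classes in ptg.py:

```python
import torch
import torch.nn as nn

# Hypothetical sizes: 14 source prompts, hidden size 1024 (BART-large)
embed_dim, num_prompts = 1024, 14

mha = nn.MultiheadAttention(embed_dim, num_heads=8, batch_first=True)
lam = 0.5  # stands in for self.lam

task_query = torch.randn(1, 1, embed_dim)       # query derived from the task
input_query = torch.randn(1, 1, embed_dim)      # query derived from the input instance
key = torch.randn(1, num_prompts, embed_dim)    # ONE key tensor, shared by both calls
value = torch.randn(1, num_prompts, embed_dim)  # the source prompts act as values

# The same `key` feeds both attention terms, which is the point of this issue.
prompt_embeds = lam * mha(task_query, key, value)[0] \
    + (1 - lam) * mha(input_query, key, value)[0]
print(prompt_embeds.shape)  # torch.Size([1, 1, 1024])
```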
StevenTang1998 commented 1 year ago

I have replied to your questions in your email. Thanks for your questions!

minji-o-j commented 1 year ago

@StevenTang1998 Hello, I'd like to ask some additional questions here, separately from the email.

What is learned in the PTG training process?

  1. only task query and keys
  2. BART fine-tuning + task query and keys

When I first read the paper, I thought it was option 1, but I'm confused because the code calls self.model.requires_grad_(True) while the paper also reports a learning rate for "BART".

Could you tell me which one is correct?

StevenTang1998 commented 1 year ago

Hi @minji-o-j, during prompt pre-training we only train the query and keys. When fine-tuning on the downstream tasks, we tune the prompts and the BART model. More details can be found on page 6 of our paper.

minji-o-j commented 1 year ago

Then, is training PTG for a specific task a two-stage process?

(1) To obtain tilde p, the query and keys are trained with the frozen BART model. (2) BART is fine-tuned using the tilde p obtained in stage (1).

StevenTang1998 commented 1 year ago

Yes, and step (1) is optional if you use existing trained prompts.

minji-o-j commented 1 year ago

If so, is it correct that the paper's experimental results cannot be obtained immediately by running the following command with the current code, and that reproducing them requires minor modifications?

python run_textbox.py --model=PTG --dataset=cnndm --model_path=facebook/bart-large (I used the command written here)


The reason I think so is that when learning the queries and keys in PTG, self.model.requires_grad_ is set to True (https://github.com/RUCAIBox/TextBox/blob/2.0.0/textbox/model/ptg.py#L43). As it stands, BART training and query/key training happen simultaneously.

After changing this call (self.model.requires_grad_(True)) to False, training the query and keys, and saving tilde p, should I then fine-tune BART on the same target-task training set in a second run (setting self.model.requires_grad_(True) and using the fixed tilde p value instead of the prompt embedding matrix)?
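
For concreteness, a minimal sketch of the two regimes I have in mind; the attribute and variable names are illustrative, not the exact ones in ptg.py:

```python
import torch
from transformers import BartForConditionalGeneration

# Illustrative only: a single tilde-p-sized prompt parameter, shape [200, 1024]
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large")
prompt_params = [torch.nn.Parameter(torch.randn(200, 1024))]

# Stage (1): prompt training -- freeze BART, learn only the queries/keys/prompts
model.requires_grad_(False)
stage1_optimizer = torch.optim.AdamW(prompt_params, lr=1e-3)

# Stage (2): downstream fine-tuning -- unfreeze BART and train it with tilde p fixed
model.requires_grad_(True)
stage2_optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)
```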

Please let me know if anything is wrong

StevenTang1998 commented 1 year ago

You can obtain the paper's experimental results immediately by executing the following command:

python run_textbox.py --model=PTG --dataset=cnndm --model_path=facebook/bart-large

We have provided the pre-trained prompt source.

minji-o-j commented 1 year ago

Then, is training PTG for a specific task a two-stage process?

(1) To obtain tilde p, the query and keys are trained with the frozen BART model. (2) BART is fine-tuned using the tilde p obtained in stage (1).

If so, is only (2) executed when this command is used?

StevenTang1998 commented 1 year ago

yeah

minji-o-j commented 1 year ago

Then, is the provided prompt source not the source prompts for the source prompt pool, but rather the tilde p for each of the 14 tasks, each already trained on the other 13 tasks (excluding itself)?

However, looking at the code, the provided prompts appear to be used as the source-task prompts.

My understanding was that the source tasks are used in the process of obtaining tilde p.

Please let me know if there is anything wrong with my understanding!!

StevenTang1998 commented 1 year ago

You can download it and utilize torch to load it. It contains the learned prompt for each task (i.e., 14 tensors of shape [200, 1024]).
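
For example, you could inspect it like this; the filename and container type are assumptions, so check what the download actually provides:

```python
import torch

# Placeholder path for the downloaded prompt source file
prompts = torch.load("ptg_prompts.pt", map_location="cpu")

# Per the reply above: one learned prompt per task, 14 tensors of shape [200, 1024].
# This assumes a dict keyed by task name; adjust if it is a list instead.
for task, tensor in prompts.items():
    print(task, tuple(tensor.shape))  # e.g. cnndm (200, 1024)
```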

minji-o-j commented 1 year ago

Taking the pc dataset as an example, the source prompts for the same target task (pc) differ between the cross-task and cross-dataset experiments.

In the case of the 14 prompts provided, are the tilde p produced by the procedure presented in the paper over all 13 source tasks separate from the tilde p used in the experiments?

StevenTang1998 commented 1 year ago

Sorry, I may not understand your question. Maybe you can find the solution here. We have provided different options for the source tasks.

minji-o-j commented 1 year ago

Oh, if so:

The source prompts are derived using the frozen BART model (the multi-key memory network is not used), and isn't tilde p then obtained by applying the adaptive attention mechanism to these "source prompts"?

StevenTang1998 commented 1 year ago

Yes, the source prompts are derived using the frozen BART model (the multi-key memory network is not used), and tilde p is obtained by applying the adaptive attention mechanism to the "source prompts".

And it was my mistake: the prompt source we provided is the pool P = {p_1, ..., p_t, ..., p_T}.

minji-o-j commented 1 year ago

Then, is training PTG for a specific task a two-stage process?

(1) To obtain tilde p, the query and keys are trained with the frozen BART model. (2) BART is fine-tuned using the tilde p obtained in stage (1).

If so, I guess I need to start from step (1) to train PTG, since the provided prompt source contains the source prompts.

python run_textbox.py --model=PTG --dataset=cnndm --model_path=facebook/bart-large

However, using this command seems to train both BART and the prompts (query and keys) at the same time.

If so, is it correct that the paper's experimental results cannot be obtained immediately by running the following command with the current code, and that reproducing them requires minor modifications?

python run_textbox.py --model=PTG --dataset=cnndm --model_path=facebook/bart-large (I used the command written here)

The reason I think so is that when learning the queries and keys in PTG, self.model.requires_grad_ is set to True (https://github.com/RUCAIBox/TextBox/blob/2.0.0/textbox/model/ptg.py#L43). As it stands, BART training and query/key training happen simultaneously.

After changing this call (self.model.requires_grad_(True)) to False, training the query and keys, and saving tilde p, should I then fine-tune BART on the same target-task training set in a second run (setting self.model.requires_grad_(True) and using the fixed tilde p value instead of the prompt embedding matrix)?

Please let me know if anything is wrong

That was the reasoning behind my earlier question. Is it right to proceed with training as I described?

Any help would be appreciated.

StevenTang1998 commented 1 year ago

If you want to conduct step (1), our provided code does not support that yet; you may need to modify the existing code to achieve your goal.

minji-o-j commented 1 year ago

Thank you for your answer.

Also, in the current code, when an instance is fed in, "task information" (e.g., summarization) is also included in the model input (prompt + task description + input sentence; see the sketch after the questions below).

  1. Did you give the input like this in the experiments in the actual paper? Or is the code just part of an unreported experiment?
  2. Did you use the same format when training the source prompts?
  3. Why is the input shaped this way for the BART model?
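
As a hypothetical illustration of the input layout I mean (the exact delimiter and ordering in TextBox may differ):

```python
# The soft prompt (tilde p) is prepended at the embedding level; the text part
# of the input would look roughly like this:
task_description = "summarization"          # the "task information"
input_sentence = "A man was sentenced on Tuesday after ..."

text_input = f"{task_description}: {input_sentence}"
print(text_input)  # summarization: A man was sentenced on Tuesday after ...
```
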
StevenTang1998 commented 1 year ago
  1. We do not use the task description in the paper; it is a default setting of TextBox. You can remove it, but it has little impact on the results.
  2. We do not use the task description during prompt training.
  3. I may not understand this question.
minji-o-j commented 1 year ago

In the paper, a "cluster" key and a "prompt" key are used, but in the current code the same key is passed to both MHA calls. (link)

 prompt_embeds = self.lam * self.MHA(task_query, key, value) + (1 - self.lam) * self.MHA(input_query, key, value)
  1. Using the current formula, can we run PTG's second ablation study, "PTG without prompt cluster"?
  2. Is it correct that the paper's experiments used (1) a key learned at the cluster level (i.e., learned over the multiple tasks in the same cluster) and (2) a key learned on a single task, rather than reusing the same key as in the current code? (See the sketch below.)
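
To make the two variants concrete, a self-contained sketch (all names, shapes, and the torch.nn.MultiheadAttention wrapper are assumptions for illustration):

```python
import torch
import torch.nn as nn

embed_dim, num_prompts = 1024, 14
mha = nn.MultiheadAttention(embed_dim, num_heads=8, batch_first=True)
lam = 0.5

task_query = torch.randn(1, 1, embed_dim)
input_query = torch.randn(1, 1, embed_dim)
value = torch.randn(1, num_prompts, embed_dim)   # source prompts as values

# (a) Current code: one shared key for both attention terms
key = torch.randn(1, num_prompts, embed_dim)
shared_key_embeds = lam * mha(task_query, key, value)[0] \
    + (1 - lam) * mha(input_query, key, value)[0]

# (b) Paper's notation: a cluster key (k^c_z) and a prompt key (k^p_t)
cluster_key = torch.randn(1, num_prompts, embed_dim)
prompt_key = torch.randn(1, num_prompts, embed_dim)
split_key_embeds = lam * mha(task_query, cluster_key, value)[0] \
    + (1 - lam) * mha(input_query, prompt_key, value)[0]
```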
StevenTang1998 commented 1 year ago

Sorry for the late response; we use the same key in practice.