Closed: JunhyunB closed this issue 4 years ago.
@ngoyal2707
@JunhyunB Only inference with a longer document won't work, because the summarization model was finetuned with a seq_len of 1024.
What you can do is finetune the model with a longer seq_len on your custom training data. In fact, that is similar to what we do: we pretrained BART with a seq_len of 512 and used a seq_len of 1024 during finetuning. You can raise it further (say, 2048) and finetune.
For the above, you would need to adjust the positional embeddings by either:
1) learning them from scratch, or
2) copying the 512 pretrained BART positional embeddings into the first 512 positions of your 2048-position embedding.
I would recommend option 2, but that might require slight code changes (let me know if you need some help with that).
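A minimal sketch of option 2, assuming the checkpoint layout shown later in this thread (an encoder.embed_positions.weight of shape [1026, 1024], where the two extra rows come from the padding/offset convention); the file paths here are hypothetical:

```python
import torch

# Load the pretrained BART checkpoint (hypothetical path).
state = torch.load('bart.large/model.pt', map_location='cpu')

old_pos = state['model']['encoder.embed_positions.weight']   # shape [1026, 1024]
embed_dim = old_pos.size(1)
new_max_positions = 2048

# Build a bigger table (2 offset rows + 2048 positions), randomly initialized.
new_pos = torch.empty(new_max_positions + 2, embed_dim).normal_(mean=0.0, std=0.02).to(old_pos.dtype)

# Copy the pretrained rows into the front; the remaining rows stay random
# (or copy the pretrained positions again, as discussed further down the thread).
new_pos[:old_pos.size(0)] = old_pos

state['model']['encoder.embed_positions.weight'] = new_pos
torch.save(state, 'bart.large/model_2048.pt')
```

You would then finetune with max_source_positions set to the new length, as discussed further down the thread.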
In the readme for CNN/DM, it says to use MAX_TOKENS=2048, but @ngoyal2707, you say it is 1024, and so does https://github.com/pytorch/fairseq/issues/1474. Is the readme incorrect?
You can reset the positional embedding to a new length (e.g. 2048) and copy the 1024 positions from the model (the second half will be randomly initialized, while the first half is trained). This is a common trick in summarization.
Thanks for the response, yinhanliu. I wanted to know, though, which hyperparameter setting was used to get the best results when fine-tuning on CNN/DM: was it 1024 or 2048?
@loganlebanoff we only used 1024. Never tried 2048.
Thanks! I've created a pull request to fix the CNN/DM fine-tuning readme.
max_tokens, max_sentences, and tokens_per_sample are different args: max_sentences is the batch size (in sentences), max_tokens is the maximum number of tokens allowed in a batch, and tokens_per_sample is the maximum sequence length of a single instance. The current readme instructions are correct.
Ok thanks, I understand the difference now
copying the 512 pretrained BART positional embeddings into the first 512 positions of your 2048-position embedding.
@ngoyal2707 I want to increase the max sequence length to be 2048 as you said. Can you give some hint as to how to do this? I see that the size of the positional embedding matrix is 1026 (rather than 1024) in the pretrained BART.
state['model']['encoder.embed_positions.weight'].shape
Out[37]: torch.Size([1026, 1024])
state['model']['encoder.embed_positions.weight']
Out[38]:
tensor([[-0.0043, -0.0042, 0.0029, ..., 0.0149, 0.0098, 0.0102],
[ 0.0000, 0.0000, 0.0000, ..., 0.0000, 0.0000, 0.0000],
[ 0.0497, -0.2086, -0.1076, ..., -0.1564, -0.0135, 0.0566],
...,
[ 0.0027, 0.0022, -0.0051, ..., 0.0007, 0.0089, -0.0124],
[ 0.0046, -0.0024, 0.0026, ..., -0.0050, -0.0112, -0.0063],
[-0.0056, -0.0084, 0.0082, ..., -0.0017, -0.0039, 0.0105]],
dtype=torch.float16)
and similarly, the size is 2050 for the model I will be finetuning.
self.get_model().state_dict()['encoder.embed_positions.weight'].shape
Out[46]: torch.Size([2050, 1024])
Would I copy over the parameters from [2 : 1026] to the second half, [1026 : 2050]?
I only tried this once, and I kept [1026:2050] random.
The size is 1026 because of bos + source (1024) + eos.
Thanks. I saw that the positions always start at 2, so I copied [2 : 1026] to [1026 : 2050]. I compared it to random initialization of the second half, and I got better scores on my specific application when copying vs random. Thanks again!
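For readers following along, here is a minimal sketch of the copy described above; the shapes match the earlier snippets ([1026, 1024] pretrained table, [2050, 1024] target table) and the checkpoint path is hypothetical:

```python
import torch

state = torch.load('bart.large/model.pt', map_location='cpu')  # hypothetical checkpoint path
old_pos = state['model']['encoder.embed_positions.weight']     # [1026, 1024]; rows 0-1 are the padding/offset rows

# New table for max_source_positions=2048, randomly initialized.
new_pos = torch.empty(2050, old_pos.size(1)).normal_(mean=0.0, std=0.02).to(old_pos.dtype)

new_pos[:2] = old_pos[:2]              # keep the padding/offset rows
new_pos[2:1026] = old_pos[2:1026]      # first 1024 positions: pretrained weights
new_pos[1026:2050] = old_pos[2:1026]   # second 1024 positions: copied (the alternative is to leave them random)

state['model']['encoder.embed_positions.weight'] = new_pos
```

Whether copying or leaving the second half random works better seems to be application-dependent, as reported in this thread.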
What you can do is finetune the model with a longer seq_len on your custom training data. In fact, that is similar to what we do: we pretrained BART with a seq_len of 512 and used a seq_len of 1024 during finetuning. You can raise it further (say, 2048) and finetune.
@ngoyal2707 I'd like to finetune BART on quite a different domain, where the average length of the input documents is about 8000 tokens. Does BART support lengths of this order? If not, is there a workaround to handle these cases?
Are the state['model']['encoder.embed_positions.weight'] weights the only ones I would have to resize and copy when trying to finetune with max_source_positions=2048?
With the modification below I can start training, but I'm not convinced it makes sense:
state['model']['encoder.embed_positions.weight'] = torch.cat([
    state['model']['encoder.embed_positions.weight'][:1025].clone(),  # rows 0-1024 of the original [1026, 1024] table
    state['model']['encoder.embed_positions.weight'][1:].clone()      # rows 1-1025 appended, giving [2050, 1024]
], 0)
Is this at all related to setting --encoder-embed-dim?
@JunhyunB Only inference with a longer document won't work, because the summarization model was finetuned with a seq_len of 1024. What you can do is finetune the model with a longer seq_len on your custom training data. In fact, that is similar to what we do: we pretrained BART with a seq_len of 512 and used a seq_len of 1024 during finetuning. You can raise it further (say, 2048) and finetune. For the above, you would need to adjust the positional embeddings by either:
1. learning them from scratch, or
2. copying the `512` pretrained BART positional embeddings into the first `512` positions of your `2048` positional embedding.
I would recommend option 2, but that might require slight code changes (let me know if you need some help with that).
@ngoyal2707 Hi, would you please point me to where in your code for finetuning BART you copy the 512 positions from pretrained BART?
I need to finetune BART for a task similar to abstractive summarization, but with longer sequences. Thanks.
Thanks. I saw that the positions always start at 2, so I copied [2 : 1026] to [1026 : 2050]. I compared it to random initialization of the second half, and I got better scores on my specific application when copying vs random. Thanks again!
@loganlebanoff Would you please share the exact changes you made to finetune this model on a new dataset with longer sequences? I would appreciate that.
After this line: https://github.com/pytorch/fairseq/blob/411531734df8c7294e82c68e9d42177382f362ef/fairseq/trainer.py#L202
I added the following code:
encoder_pos = state['model']['encoder.embed_positions.weight']  # pretrained table, shape [1026, 1024]
to_append = encoder_pos[2:]                                      # the 1024 learned positions (rows 0-1 are padding/offset)
new_encoder_pos = torch.cat((encoder_pos, to_append))            # new table, shape [2050, 1024]
state['model']['encoder.embed_positions.weight'] = new_encoder_pos
Thanks for the reply, @loganlebanoff. And you also changed max_source_positions to 2048, right? Did you see a gain from this trick for modeling long sequences? And just a quick question: why not use encoder_pos[1:-1]?
Right, yes, I changed max_source_positions to 2048. I still used it on CNN/DM, but with a different setup than regular summarization. For my setup, I got slightly better performance by copying the positional embeddings into the last 1024 positions rather than randomizing them (for both settings, I used max_source_positions=2048).
I took a look at https://github.com/pytorch/fairseq/blob/7a6519f84fed06947bbf161c7b66c9099bc4ce53/fairseq/utils.py#L191, which says positions start at padding_idx+1, and when debugging, padding_idx was 1. So I assume positions start at 2. This was confirmed when I looked at the positions variable that gets created: it starts at 2, and padding is 1. I'm not sure what index 0 is for...
In[3]: positions
Out[3]:
tensor([[ 2, 3, 4, ..., 769, 770, 771],
[ 1, 1, 1, ..., 697, 698, 699],
[ 1, 1, 1, ..., 689, 690, 691],
...,
[ 1, 1, 1, ..., 467, 468, 469],
[ 1, 1, 1, ..., 463, 464, 465],
[ 1, 1, 1, ..., 354, 355, 356]], device='cuda:0')
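For anyone puzzling over the offset, here is a rough standalone sketch of that numbering convention (not the exact fairseq code): non-padding tokens are numbered from padding_idx + 1, and padding tokens keep padding_idx.

```python
import torch

def make_positions(tokens, padding_idx):
    # Non-padding tokens get positions padding_idx+1, padding_idx+2, ...;
    # padding tokens keep position padding_idx.
    mask = tokens.ne(padding_idx).long()
    return torch.cumsum(mask, dim=1) * mask + padding_idx

tokens = torch.tensor([[5, 8, 9, 2],    # a full 4-token sequence
                       [1, 1, 7, 2]])   # left-padded with padding_idx=1
print(make_positions(tokens, padding_idx=1))
# tensor([[2, 3, 4, 5],
#         [1, 1, 2, 3]])
```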
I would recommend option 2, but that might require slight code changes (let me know if you need some help with that).
Can you please show how we can increase it to take more than 1024 input tokens?
❓ Questions and Help
Does BART support more than 1024 tokens at inference time for the summarization task? For long text like a novel, does BART use all of the input to generate the summary, or does it just use the first 1024 tokens and ignore the rest?