Convert Roberta to Longformer does not keep token_type_embeddings shape and weights

dysby commented 1 year ago

Hello, while researching Longformer conversion, I came across your repo.

In convert_roberta_to_longformer you do not copy the token_type_embeddings from Roberta source model to the new Longformer.

Following the example notebook I checked the new Longformer "token_type_embeddings" will be of size ('token_type_embeddings.weight', torch.Size([2, 768])), instead of my source model size ('token_type_embeddings.weight', torch.Size([1, 768])),

I'm using transformers==4.26.1 and when you construct a new Longformer, the token_type_embeddings will be initialized with a Embedding(2, 768) size instead of Embedding(1, 768). The source weights will not be copied, and the embedding will not be the same.

I'm not sure why new transformers Longformer have this embedding size, but I'm thinking that it will cause problems with sentence similarity (in my use case I'm converting a SentenceTransformer built on XLMRoberta to Longformer).

Also, You do not copy the position_ids, and the lasts' notebook cells with comparisons between source and new Longformer will not be the same. The new model does not have position_ids, I think this does not cause any problem because the forward pass of the model will compute absolute position IDs if their empty.

You have any insight on this token_type_embedding issue?

LennartKeller commented 1 year ago

Hi @dysby, Easy things first: Since I created this repo, the buffered position_ids have been removed from the LongformerEmbeddings-module (Commit: https://github.com/huggingface/transformers/commit/eb1493b15db2019c93e365219b517fb44e313aaf), and now they are just created on the fly. So there is no need to worry about them anymore. I just ran the example notebook with the latest transformer version (4.27.4), and it worked fine, now showing the token-embeddings weights as the first entry in the embeddings.state_dict.

To your second question: You are right; my code does explicitly ignore the token_type_ids. The reason for that is that, as you observed, they can have different shapes, which makes copying them a little bit more complex, and - I think - back then, I felt that they would only be used if you pass different token_type_ids to the model, which did not apply for my use case. Since, in any case, you'll need to continue masked language modeling after transfer to (re-)learn the extended positional embeddings and adopt the whole model to the sparse self-attention pattern, I would guess that ignoring the token_type_ids might not be this big of an issue in terms of downstream performance. But I never inspected it explicitly.

dysby commented 1 year ago

Thanks, for quick reply.

I'm looking at longformer conversion and every little bit of change in parameters because I'm having trouble with sentence similarity performance. I'm replicating results for keyphrase extraction, and the algorithm does a better job while using document and kp candidate embeddings from a vanilla sentence transformer (max 128 tokens) than from a longformer (max 4096 tokens). This is unrelated to your repo, thanks for your time.

LennartKeller / roberta2longformer

Convert Roberta to Longformer does not keep token_type_embeddings shape and weights #1