allenai / longformer

Longformer: The Long-Document Transformer
https://arxiv.org/abs/2004.05150
Apache License 2.0

How to compare text similarity? #114

Open thesby opened 4 years ago

thesby commented 4 years ago

Should I fine-tune the model to compare texts? The model generates a matrix for a long text, not a vector, and texts of different lengths produce outputs of different shapes. Must I fine-tune the model to obtain a single text vector?

thesby commented 4 years ago

Any hints or suggestions?

armancohan commented 4 years ago

You can take the representation corresponding to the first token in the sequence (<s>) as the aggregate representation of the entire sequence. Note that this requires fine-tuning (e.g., on a textual-similarity or classification dataset) so that the model learns to represent the entire sequence through this single vector.
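In code, this amounts to slicing out the first position of the model's hidden-state output. The sketch below uses a random numpy array as a stand-in for the output of a Longformer forward pass (in the HuggingFace `transformers` library that would be `model(input_ids).last_hidden_state`, a `(batch, seq_len, hidden)` tensor); the shapes are the only part that matters here.

```python
import numpy as np

# Hypothetical stand-in for a Longformer forward pass: a real run would
# produce a (batch, seq_len, hidden) tensor of contextual embeddings.
def fake_longformer_output(seq_len, hidden=768, seed=0):
    rng = np.random.default_rng(seed)
    return rng.standard_normal((1, seq_len, hidden))

hidden_states = fake_longformer_output(seq_len=512)

# Aggregate sequence representation: the vector at the first token (<s>).
# As noted above, this vector is only meaningful after fine-tuning.
doc_vector = hidden_states[:, 0, :]  # shape (1, hidden)
```

The result is one fixed-size vector per document, regardless of sequence length, which resolves the different-shapes problem raised in the question.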

thesby commented 4 years ago

Is there a maximum sequence length?

armancohan commented 4 years ago

Currently, the maximum sequence length in the released models is 4096.

thesby commented 4 years ago

Can it be modified? And how can I turn an article of more than 4096 tokens into a vector?

armancohan commented 4 years ago

Some possible approaches to deal with longer documents:

1. Simply truncate the document.
2. Break the document into pieces of up to 4096 tokens, encode each one individually, and pool the resulting vectors (e.g., average them) to create a single vector.
3. Extend Longformer's position embeddings to support longer sequences, using the position-embedding copying trick described in Section 5 of our paper.
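Approach 2 (chunk, encode, pool) can be sketched in a few lines. The `encode_chunk` function below is a hypothetical placeholder for running Longformer on one chunk and taking its aggregate vector; any encoder that maps a chunk to a fixed-size vector would slot in the same way.

```python
import numpy as np

MAX_LEN = 4096  # per-chunk limit of the released Longformer models
HIDDEN = 768

# Hypothetical encoder: stands in for a Longformer forward pass on one
# chunk followed by taking the <s> representation.
def encode_chunk(token_ids, seed=0):
    rng = np.random.default_rng(len(token_ids) + seed)
    return rng.standard_normal(HIDDEN)

def encode_long_document(token_ids):
    # Break the document into pieces of at most MAX_LEN tokens,
    # encode each piece, then average the chunk vectors (approach 2).
    chunks = [token_ids[i:i + MAX_LEN]
              for i in range(0, len(token_ids), MAX_LEN)]
    vectors = [encode_chunk(c) for c in chunks]
    return np.mean(vectors, axis=0)

doc = list(range(10_000))        # a document longer than 4096 tokens
vec = encode_long_document(doc)  # one fixed-size vector for the document
```

Mean pooling is the simplest choice; max pooling or a length-weighted average are common variants with the same structure.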

ibeltagy commented 4 years ago

About 3), we have a version of it implemented here: https://github.com/allenai/longformer/blob/master/scripts/convert_model_to_long.ipynb

Finnyhudson commented 1 year ago

Did you find a way to compare texts? I tried a siamese architecture to compute the similarity of the text-pair embeddings, but it didn't go well :(
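For reference, the scoring step in a siamese setup is typically cosine similarity between the two pooled document vectors. A minimal sketch (numpy only; the vectors here would come from whichever pooling strategy is used above):

```python
import numpy as np

def cosine_similarity(u, v):
    # Cosine of the angle between two embedding vectors, in [-1, 1].
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

a = np.array([1.0, 2.0, 3.0])  # placeholder document embeddings
b = np.array([1.0, 2.0, 2.5])
score = cosine_similarity(a, b)
```

If untrained embeddings give poor similarity scores, that matches the maintainer's note above: without fine-tuning (e.g., with a contrastive or regression objective on sentence pairs), the pooled vectors are not guaranteed to encode semantic similarity.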