UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0

Can you give me some suggestions about long document Information Retrieval task #630

Open svjack opened 3 years ago

svjack commented 3 years ago

As the title suggests: if the candidate documents are so long that they overflow the max_length of the tokenizer or model setting in CrossEncoder, how can I perform this kind of task? One straightforward idea is to use attention over a list of sentence ranges to combine the information across the entire document (some people do this with RNNs). Alternatively, split the document into small parts and apply a reduce_max operation as the aggregation op to combine the final scores. Or do you have other suggestions? Can you share some related projects or papers about this kind of problem?
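The reduce_max idea above can be sketched as follows. This is only an illustration: `score_pair` is a toy stand-in for a real relevance model (in practice it would be a CrossEncoder prediction), and the chunking is assumed to have happened already.

```python
# Sketch of the reduce_max aggregation idea: score each chunk of a long
# document separately, then take the maximum chunk score as the document
# score. `score_pair` is a toy stand-in for a CrossEncoder prediction.
def score_pair(query, chunk):
    # Toy relevance score: fraction of query words present in the chunk.
    q = set(query.lower().split())
    c = set(chunk.lower().split())
    return len(q & c) / max(len(q), 1)

def score_long_document(query, chunks):
    # reduce_max over per-chunk scores: the document is as relevant
    # as its most relevant chunk.
    return max(score_pair(query, chunk) for chunk in chunks)

chunks = ["the cat sat on the mat", "dogs chase cats", "quantum physics intro"]
print(score_long_document("cat mat", chunks))  # 1.0: first chunk has both words
```

With a real CrossEncoder you would batch all (query, chunk) pairs through `model.predict` and then take the max, which keeps each input under the model's length limit.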

svjack commented 3 years ago

https://github.com/huggingface/transformers/issues/876#issuecomment-515355840 discusses this shortcoming.

svjack commented 3 years ago

Or do you think this process should be done in the "Semantic Search" stage, using unsupervised methods to split the documents and then sort and filter them?

nreimers commented 3 years ago

For Cross-Encoders, you can use Longformer (and similar models), which can process longer documents.

For Bi-Encoders: so far there is no good approach for representing longer documents in a vector space. The most common way is to break the document down into smaller sections. You can either split by paragraphs, or use a fixed-size window and e.g. always encode 100 tokens to a vector.
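The fixed-size window splitting can be sketched like this. Note the whitespace split is a simplification for illustration; in practice you would count tokens with the model's own tokenizer.

```python
# Sketch of fixed-size window chunking (hypothetical helper; real code
# would measure length with the model's tokenizer, not whitespace splitting).
def chunk_document(text, window=100, stride=100):
    """Split a document into consecutive windows of up to `window` tokens."""
    tokens = text.split()
    chunks = []
    for start in range(0, len(tokens), stride):
        chunk = tokens[start:start + window]
        if chunk:
            chunks.append(" ".join(chunk))
    return chunks

doc = " ".join(f"tok{i}" for i in range(250))
passages = chunk_document(doc, window=100)
print(len(passages))  # 250 tokens -> 3 windows (100, 100, 50 tokens)
```

Each resulting passage is then encoded to its own vector with the bi-encoder; a `stride` smaller than `window` would give overlapping windows, which can help avoid cutting relevant context in half.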

For retrieval, you retrieve passages. Then you can e.g. count which documents they belong to. For example, you retrieve 100 passages, count how many retrieved passages each document contributed, and return the top 10 documents with the most retrieved passages.

bhavsarpratik commented 3 years ago

@svjack You can find some ideas here

tin9580 commented 1 year ago

Hello! Any updates regarding training Bi-Encoders for long documents?