Open svjack opened 3 years ago
https://github.com/huggingface/transformers/issues/876#issuecomment-515355840 it is about this shortcoming
Or do you think this process should be done at the "Semantic Search" stage, using unsupervised methods to split the documents and to sort/filter them?
For Cross-Encoders, you can use Longformer (and similar) models, which allow you to process longer documents.
For Bi-Encoders: So far there is no good approach for representing longer documents in a vector space. The most common way is to break the document down into smaller sections. You can either split by paragraphs, or use a fixed-size window and e.g. always encode 100 tokens to a vector.
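A minimal sketch of the fixed-size-window idea, assuming you already have a tokenized document (e.g. the output of a tokenizer's `tokenize` step); `window` and `stride` are illustrative parameters, not anything from a specific library:

```python
def chunk_tokens(tokens, window=100, stride=100):
    """Split a token list into fixed-size windows.

    stride == window gives non-overlapping chunks; stride < window
    gives overlapping windows, which can help avoid cutting relevant
    spans at chunk boundaries. Each chunk would then be encoded to
    one vector by the Bi-Encoder.
    """
    chunks = []
    for start in range(0, len(tokens), stride):
        chunk = tokens[start:start + window]
        if chunk:
            chunks.append(chunk)
    return chunks
```

Each resulting chunk is encoded independently, so one long document is represented by several vectors in the index.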
For retrieval, you retrieve passages. Then you can e.g. count which documents they belong to. So for example you retrieve 100 passages, count how many belong to each document, and return the top 10 documents from which the most passages were retrieved.
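The counting step described above can be sketched like this; the `(passage_id, doc_id)` hit format is an assumption for illustration, standing in for whatever your passage index returns:

```python
from collections import Counter

def top_docs_by_passage_votes(retrieved_hits, k=10):
    """Aggregate passage-level retrieval into document-level results.

    retrieved_hits: list of (passage_id, doc_id) pairs, e.g. the top-100
    passages from the Bi-Encoder search. Counts how many retrieved
    passages belong to each document and returns the k document ids
    with the most passage "votes".
    """
    votes = Counter(doc_id for _, doc_id in retrieved_hits)
    return [doc_id for doc_id, _ in votes.most_common(k)]
```

A common refinement is to weight each vote by the passage's similarity score instead of counting every hit equally.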
@svjack You can find some ideas here
Hello! any updates regarding the training Bi-Encoders for long documents?
As the title suggests: if the candidate documents are too long and overflow the max_length setting of the tokenizer or the model in CrossEncoder, how can I perform this kind of task? One straightforward idea is to use attention over a list of sentence ranges to combine the information across the entire document (some people do this with RNNs). Another is to split the document into small parts and apply a reduce_max operation as the aggregation when combining the final loss. Or do you have other suggestions? Can you share some related projects or papers about this kind of problem?
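The split-and-reduce_max idea from the question can be sketched as follows; `score_fn` is a hypothetical stand-in for a (query, passage) scorer such as a wrapper around `CrossEncoder.predict`:

```python
def score_long_document(query, doc_chunks, score_fn):
    """Score a long document against a query via max-aggregation.

    doc_chunks: the document split into pieces that each fit within
    the model's max_length. Each (query, chunk) pair is scored
    independently, and the document score is the maximum chunk score
    (the reduce_max aggregation described above).
    """
    return max(score_fn(query, chunk) for chunk in doc_chunks)
```

Max-aggregation assumes a document is relevant if any one of its chunks is; mean-aggregation is the usual alternative when relevance should be spread across the whole document.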