UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0

How to encode very large text ( >9k words)? #147

Closed cabhijith closed 4 years ago

cabhijith commented 4 years ago

Hi @nreimers, First of all, I would like to thank you for this great project. Your sentence transformers were much easier to implement than those from Hugging Face. I would also like to thank you for being so active in resolving issues.

Here's my problem: I would like to get embeddings for some documents I have. These documents have over 9,000 words. As I understand, BERT has a hard limit of 510 words. From other issues I have read, I also understand that the runtime will increase quadratically with an increase in the input length.

Use Case: I am building an Elasticsearch engine, specifically for legal judgements. I was planning to use the dense vectors feature. What I want to achieve is as follows: say someone searches for "cases of person robbed". This query should return judgements in which a man/woman/person was robbed/mugged, had belongings taken away, etc.

I have tried a synonym-based approach, but it has yielded poor results. Here are a couple of ways I think I could solve the problem:

1) Summarize the text using a deep learning algorithm or something simple like TF-IDF, and then encode the summary. This can be done for each judgement.

2) Split the judgements into smaller pieces. For example, a case with 8,000 words gets split into 16 parts; each part is then encoded and indexed into Elasticsearch separately (a rough sketch of this is shown below).
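For concreteness, here is a minimal sketch of option 2, assuming a naive whitespace-based splitter and a placeholder chunk size; a sentence- or paragraph-aware splitter would probably suit judgements better:

from sentence_transformers import SentenceTransformer

# Naive splitter: cut the judgement into chunks of ~300 words so each chunk
# stays under BERT's input limit. The chunk size is just a placeholder.
def split_into_chunks(text, words_per_chunk=300):
    words = text.split()
    return [" ".join(words[i:i + words_per_chunk]) for i in range(0, len(words), words_per_chunk)]

embedder = SentenceTransformer('bert-base-nli-mean-tokens')
judgement = "full text of one judgement ..."  # placeholder
chunks = split_into_chunks(judgement)
chunk_embeddings = embedder.encode(chunks)  # one vector per chunk
# Each (chunk, vector) pair would then be indexed as its own Elasticsearch
# document, e.g. in a dense_vector field, so a query can match any part of the judgement.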

I would like your thoughts on these methods, and whether there is something else that might work. I have also tried query expansion, but the results were not as good as I expected.

Thanks,

nreimers commented 4 years ago

Hi @cabhijith I am happy to help.

You are right, BERT has a limit of 510 word pieces (about 300-400 words). For longer texts it is not practical, as runtime and memory requirements grow quadratically with the number of tokens.

There are some approaches that use "sentence embeddings" for information retrieval:
https://openreview.net/forum?id=rkg-mA4FDr
https://kentonl.com/pub/gltpc.2020.pdf

In the REALM paper they break down Wikipedia into smaller passages, compute an embedding for each passage and index them.

However, it is Google and they use a lot of GPU/TPU power to make it feasible. If you don't have a big GPU/TPU cluster at hand, these approaches are sadly not yet that feasible. Also, they trained their approaches with quite a lot of training data. It is unclear whether they work if you have little training data for IR.

Other approaches that scale with the sentence length, like average word embeddings or InferSent, also have issues with longer documents. They often only work for sentences or very short paragraphs. For longer texts, they do not yield sensible information.

If you have a lot of training data, you can maybe encode sentences (or short paragraphs) individually and use a sentence embeddings approach.

If you don't have a lot of training data, I still think that Elasticsearch BM25 is your best option. If you get a poor recall, I would look into query expansion.

Expanding the query is a lot easier than performing abstractive summarization and indexing the summary. So instead of "doc => summary" I would try to perform query => [alt. query1, alt. query2, alt. query3, ...]

You could combine different query expansion mechanisms, for example: 1) replacing words with synonyms, 2) analyzing the retrieved documents to find relevant words, 3) training some seq2seq approach that generates your alternative queries.
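As a toy illustration of mechanism 1), synonym-based expansion might look like the following; the synonym table is a made-up placeholder, and a real system could draw on a legal thesaurus, WordNet, or nearest neighbours in an embedding space:

# Hypothetical synonym table; replace with a real thesaurus or WordNet lookup.
SYNONYMS = {
    'robbed': ['mugged', 'burgled'],
    'person': ['man', 'woman', 'individual'],
}

def expand_query(query):
    alternatives = [query]
    for word, synonyms in SYNONYMS.items():
        if word in query.split():
            for synonym in synonyms:
                alternatives.append(query.replace(word, synonym))
    return alternatives

print(expand_query('cases of person robbed'))
# ['cases of person robbed', 'cases of person mugged', 'cases of person burgled',
#  'cases of man robbed', 'cases of woman robbed', 'cases of individual robbed']

Each alternative query can then be sent to Elasticsearch BM25 and the result lists merged.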

Best Nils Reimers

cabhijith commented 4 years ago

Hi @nreimers, You are correct, we don't have many GPU resources since we are a very early-stage startup.

Sorry, but I couldn't understand the need for training the BERT model again. My understanding is that fine-tuning the BERT model on your own training data will improve accuracy, right? I tried the out-of-the-box solution and it is working pretty well. Anyway, we have about 6.5 billion words in our training data, so that should suffice.

I'll try encoding individual paragraphs, index them and see the results.

I'll also try out the different query expansion techniques. Synonyms don't work as they are not flexible/context-sensitive. Lastly, if you don't mind, can you please send me some papers/resources that implement models (seq2seq or otherwise) for generating alternative queries?

cabhijith commented 4 years ago

Also, I tried encoding 1000 words and it worked (as in I got a vector back). What am I doing wrong here? This is the code I used:

from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer('bert-base-nli-mean-tokens')
corpus = ["1000 words"]  # placeholder for a single ~1000-word document
corpus_embeddings = embedder.encode(corpus)  # returns one vector per entry in corpus

I am really new to this, please excuse the silliness.

nreimers commented 4 years ago

Input is truncated at 128 tokens (word pieces), so only the start of your input is considered.
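As a quick sketch (assuming a recent sentence-transformers version, which exposes the max_seq_length property), you can inspect and raise this limit, but only up to BERT's 512-word-piece maximum; anything longer is still silently cut off:

from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer('bert-base-nli-mean-tokens')
print(embedder.max_seq_length)  # truncation limit in word pieces, 128 for this model

# Raising the limit is possible up to BERT's positional-embedding maximum of
# 512 word pieces (510 plus the special tokens); longer inputs are still truncated.
embedder.max_seq_length = 510
embedding = embedder.encode("a very long judgement ...")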

Sadly I'm not aware of any papers on query expansion (not my main field of expertise)

BERT is pre-trained, right. But to get good results, you still need to fine-tune it for your application. For information retrieval, you need quite a lot of data for fine-tuning before you achieve better results than with BM25 (from my experience; of course, it always varies based on your application).
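For what it's worth, a minimal sketch of what such fine-tuning could look like with this library's classic training API; the query/passage pairs below are made-up placeholders, and newer releases also offer a Trainer-based API:

from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer('bert-base-nli-mean-tokens')

# Hypothetical (query, relevant passage) pairs taken from your own data.
train_examples = [
    InputExample(texts=['cases of person robbed',
                        'The appellant was convicted of mugging the victim ...']),
    InputExample(texts=['breach of contract damages',
                        'The court awarded damages for the broken agreement ...']),
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.MultipleNegativesRankingLoss(model)  # uses in-batch negatives

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=100)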

cabhijith commented 4 years ago

Oh! Makes sense. Thanks for the help :)

akshaydhok07 commented 2 years ago

Hi @nreimers, Is there any sentence transformer with max_seq_length = 4096 (Longformer-based)? That would help to capture a longer context. Thank you!

nreimers commented 2 years ago

@akshaydhok07 Sadly not

sademakn commented 1 year ago

Hi, I am checking model = SentenceTransformer('naver/splade-cocondenser-ensembledistil', device='cuda'). It takes 410 tokens, and the inference time is also good.

Removing stopwords sometimes shrinks the text to half the token count. I know it may change the context, but I want to know your opinion: is it acceptable to remove stopwords for large texts in semantic search?

I need to build a semantic search engine over about 10 million large documents. Do I have to fine-tune the model on my data, or is model.encode() enough? If I do need to fine-tune the model, please point me to a resource to read.
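For reference, a minimal sketch of the preprocessing being asked about, using NLTK's English stopword list (whether dropping stopwords hurts retrieval quality for this model is exactly the open question):

import nltk
from nltk.corpus import stopwords
from sentence_transformers import SentenceTransformer

nltk.download('stopwords')  # one-time download of the stopword lists
STOPWORDS = set(stopwords.words('english'))

def drop_stopwords(text):
    return ' '.join(word for word in text.split() if word.lower() not in STOPWORDS)

model = SentenceTransformer('naver/splade-cocondenser-ensembledistil')
shortened = drop_stopwords('The defendant was found guilty of the robbery ...')
embedding = model.encode(shortened)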

Jmallone commented 1 year ago


I have the same doubt: is it preferable to remove the stop words?