UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0

Seeking Solutions for Encoding Long Texts with Sentence Transformer Without Truncation #2876

Open NoahAi25 opened 1 month ago

NoahAi25 commented 1 month ago

Hello everyone,

I am seeking solutions to a problem I am facing:

I would like to encode a text that is longer than the model's input limit without truncating it. One solution could be to split the text into chunks that the model can accept, encode each chunk, and then average the embeddings. I understand that this might have a negative impact on performance, but in my case it is necessary, as all of the information is crucial for the context. If anyone has another idea, or knows of code that already performs this process, I would be very interested.
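Concretely, I imagine something along these lines (just a sketch; the model name and the word-based chunk size are placeholders, not recommendations):

```python
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("intfloat/multilingual-e5-base")  # example model

def encode_long_text(text: str, max_words: int = 200) -> np.ndarray:
    # Naive word-based chunking; a tokenizer-aware splitter would respect the
    # model's real token limit more precisely.
    words = text.split()
    chunks = [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]
    # Encode each chunk and average into a single document embedding.
    chunk_embeddings = model.encode(chunks, normalize_embeddings=True)
    return chunk_embeddings.mean(axis=0)
```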

I would like this feature to be implemented with sentence-transformers, as the package offers many advantages.

Thank you in advance for your help!

ir2718 commented 1 month ago

Hi,

have a look at this issue: https://github.com/UKPLab/sentence-transformers/issues/2596. If you want more ideas, you can have a look at papers published at the relevant conferences (e.g. SIGIR, ECIR, and the ACL conferences).

NoahAi25 commented 1 month ago

Thank you for your response. As you mentioned in that post, I am working on articles that exceed the maximum accepted length. The solution of averaging the embeddings and training with them is the one I want to implement. I just want to know whether it is possible to adapt the sentence-transformers code while retaining all the features available in the library. I also wanted to know if any code already exists that does this.

tomaarsen commented 1 month ago

Hello!

To my knowledge this is still a bit of an unsolved problem. A common solution is to chunk the documents and not combine the embeddings from those chunks at all: instead, you return the larger document whenever one (or more) of its chunk embeddings ranks highly.

Another common (but very risky and potentially flawed!) solution is embedding averaging. One of the big problems is that there is no weighting whatsoever: the first chunk is usually the most important for describing the document, but all later chunks are averaged in with equal weight. Weighting by position (e.g. higher priority for the first chunk) is also not ideal, as it leans too heavily on the assumption that the first chunk is the most important.

Another big problem with averaging is the law of large numbers: the more chunks per document, the more similar the averaged embeddings become. For example, if you roll a die 2 times and average the results, and then do it again, the two averages can reasonably be far apart, e.g. differ by more than 1. But if you roll the die 30 times, average the results, and then do it again, the two averages are likely to be extremely similar, e.g. differ by less than 0.1.
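A small simulation makes this concrete (not from the original discussion, just a sketch of the dice analogy):

```python
import random

def average_of_rolls(n: int) -> float:
    # Roll a fair six-sided die n times and return the average result.
    return sum(random.randint(1, 6) for _ in range(n)) / n

random.seed(0)  # for reproducibility
print(abs(average_of_rolls(2) - average_of_rolls(2)))    # gap is often large, e.g. > 1
print(abs(average_of_rolls(30) - average_of_rolls(30)))  # gap is usually much smaller
```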

I'm unaware of research on the best solution, but my intuition is to not combine embeddings naively. Instead, chunk the documents, perform retrieval (or whatever task you're doing) on the chunks rather than on the documents, and then recover each document from its retrieved chunks.
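A minimal sketch of that chunk-level retrieval idea (the model name, chunking strategy, and helper name are assumptions, not an official API):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("intfloat/multilingual-e5-base")  # example model

def retrieve_documents(query: str, documents: list[str], top_k: int = 5) -> list[int]:
    # Chunk every document and remember which document each chunk came from.
    chunks, chunk_to_doc = [], []
    for doc_idx, doc in enumerate(documents):
        words = doc.split()
        for i in range(0, len(words), 200):  # naive word-based chunking
            chunks.append(" ".join(words[i:i + 200]))
            chunk_to_doc.append(doc_idx)

    query_emb = model.encode(query, normalize_embeddings=True)
    chunk_embs = model.encode(chunks, normalize_embeddings=True)
    scores = util.cos_sim(query_emb, chunk_embs)[0]

    # Rank chunks by similarity, then keep the first occurrence of each parent document.
    ranked_docs = []
    for chunk_idx in scores.argsort(descending=True).tolist():
        doc_idx = chunk_to_doc[chunk_idx]
        if doc_idx not in ranked_docs:
            ranked_docs.append(doc_idx)
        if len(ranked_docs) == top_k:
            break
    return ranked_docs
```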

NoahAi25 commented 1 month ago

Alright, thank you for your help and intuition. Indeed, this remains an unresolved issue. Nevertheless, I want to refine my request. My concrete problem is that I want to train a model (multilingual-e5) on a set of articles, so I have questions that each link to an article. However, the article generally lacks context, so I add supplementary text (the texts cited in the base article). The resulting article is therefore structured as [cited article 1, cited article 2, ..., base article]. The problem is that this new article exceeds 512 tokens (the maximum), so the current sentence-transformers training truncates it, leading to a loss of information. I would like to know whether this can be solved with sentence-transformers, or by modifying part of the code.

tomaarsen commented 1 month ago

Thank you for providing some extra context! I think I would consider one of two approaches:

Base article first, context after

You can add your supplementary material after the base article, e.g. [base article, cited article 1, cited article 2, ...]. Here, the (less important but still meaningful) extra context is truncated rather than the crucial base article. This is a rather simple fix, and training can then proceed as normal.
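In code, that is nothing more than reordering the concatenation before tokenization (a sketch; the function name is made up):

```python
def build_training_text(base_article: str, cited_articles: list[str]) -> str:
    # Put the base article first so that, when the tokenizer truncates the input
    # at the model's maximum sequence length (512 tokens here), only the extra
    # context is cut off, never the base article itself.
    return " ".join([base_article, *cited_articles])
```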

Per-article embeddings with modified retrieval

Rather than updating each article with supplementary materials, it might make sense to get embeddings for each article separately, and then combine them: not via combining embeddings, but by combining the similarity scores.

For training, you could create these training pairs:

("question_A", "answer_A_base_article")
("question_A", "answer_A_context_one")
("question_A", "answer_A_context_two")
("question_A", "answer_A_context_three")
...
("question_B", "answer_B_base_article")
...

rather than:

("question_A", "answer_A_context_one + answer_A_context_two + ... + answer_A_base_article")
("question_B", "answer_B_context_one + answer_B_context_two + ... + answer_B_base_article")

This will (hopefully) allow you to 1) create a lot of training pairs, which 2) gives higher scores when the context is even vaguely related. You can use MultipleNegativesRankingLoss for this. I also recommend batch_sampler=BatchSamplers.NO_DUPLICATES (docs), which is advisable whenever you have duplicate questions in your training data; see the training sketch below.
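Here is a rough training sketch along those lines (the dataset columns and model name are illustrative assumptions):

```python
from datasets import Dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.losses import MultipleNegativesRankingLoss
from sentence_transformers.training_args import BatchSamplers

model = SentenceTransformer("intfloat/multilingual-e5-base")  # example model

# One row per (question, article part); the same question appears multiple times.
train_dataset = Dataset.from_dict({
    "anchor": ["question_A", "question_A", "question_A", "question_B"],
    "positive": ["answer_A_base_article", "answer_A_context_one",
                 "answer_A_context_two", "answer_B_base_article"],
})

loss = MultipleNegativesRankingLoss(model)
args = SentenceTransformerTrainingArguments(
    output_dir="models/mnrl-example",
    batch_sampler=BatchSamplers.NO_DUPLICATES,  # avoid duplicate questions within a batch
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    loss=loss,
)
trainer.train()
```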

After training, you can retrieve by embedding the query and computing the similarity between the query embedding and the embedding of each individual article (the dot product between normalized embeddings gives you cosine similarity scores). Then, rather than just taking the similarity score of each embedding as the sorting criterion, you can use the similarity scores to create a new score, e.g. with:

article_score = base_article_score * alpha + mean(context_article_scores * beta) * charlie

This includes the context in the overall article score but avoids the problem with averaging embeddings: we average similarity scores instead. In this formula, alpha determines the importance of the base article, beta is optional and can weight the individual context articles (e.g. if you somehow know that some context articles are less important), and charlie determines the overall importance of the context articles.
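In code, the scoring could look like this (a sketch; the parameter defaults are placeholders and the embeddings are assumed to be L2-normalized):

```python
import numpy as np

def combined_article_score(
    query_emb: np.ndarray,     # embedding of the question
    base_emb: np.ndarray,      # embedding of the base article
    context_embs: np.ndarray,  # embeddings of the cited articles, shape (n_contexts, dim)
    alpha: float = 1.0,
    beta: float = 1.0,         # could also be an array with one weight per context article
    charlie: float = 0.5,
) -> float:
    # With normalized embeddings, dot products are cosine similarity scores.
    base_article_score = float(query_emb @ base_emb)
    context_article_scores = context_embs @ query_emb  # one similarity per context article
    return base_article_score * alpha + float(np.mean(context_article_scores * beta)) * charlie
```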

I do want to caveat that the training objective (MNRL, for example) does not perfectly align with the article_score formula above, which is a risk. I can't guarantee that this will actually do better than just training with ("question", "base_article") pairs.

NoahAi25 commented 1 month ago

Thank you very much for your two suggestions; they are quite interesting, and I will definitely take some time to consider them. I'm not sure if this is possible, but do you have a contact, such as an email address or another channel, that I could use to reach out to you privately for feedback on my process of creating a specialized embedding model?