UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0

Best method to find similarities in large documents. #1673

Open bxff opened 2 years ago

bxff commented 2 years ago

I am fairly new to ML/AI, so I apologise beforehand if I have misunderstood things.

Q: Given multiple large documents, find the similarity of a given document to all the other documents.

To answer this question I have searched a bit, but haven't found a good answer yet. I found that vector embeddings can be a good way to measure semantic similarity, which is exactly what I am looking for, but most models have a max sequence length of 512 tokens, which is not enough for my use case.
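For example, the limit is easy to check on a SentenceTransformer model, and `encode()` simply truncates inputs longer than it (the model name below is just an illustrative choice):

```python
from sentence_transformers import SentenceTransformer

# Example model only; any SBERT model exposes the same attribute.
model = SentenceTransformer("all-MiniLM-L6-v2")

# encode() silently truncates inputs longer than this many word pieces.
print(model.max_seq_length)  # e.g. 256 for this model, 512 for many BERT-based ones
```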

In general, I have come across four solutions to the sequence-length problem:

  1. Splitting the documents into sentences and averaging the resulting sentence vectors.
  2. Using Longformer.
  3. Using BM25 / TF-IDF on the large documents instead of SBERT.
  4. A BERTScore-style approach that finds similarities through many-to-many sentence comparison.

The first solution has the flaw that the more you average, the worse the results get; it works fine for similar sentences, but in my case it is not optimal at all. The second solution is also not viable, as my documents cannot fit even the 16k max sequence size of the Longformer linked.
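To make the first solution concrete, this is roughly what I mean by averaging (a sketch only; the nltk sentence splitter and the model name are arbitrary choices):

```python
import numpy as np
from nltk.tokenize import sent_tokenize  # nltk.download("punkt") may be needed once
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # example model only

def document_embedding(document: str) -> np.ndarray:
    """Embed each sentence separately and mean-pool into a single document vector."""
    sentences = sent_tokenize(document)
    sentence_embeddings = model.encode(sentences)  # shape: (num_sentences, dim)
    return sentence_embeddings.mean(axis=0)
```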

The third and the fourth solutions seem viable. I haven't looked into BM25 / TF-IDF as a replacement for semantic similarity, primarily because I thought they would yield poor results, but it's something to at least consider. Lastly, a model similar to BERTScore seems like the right way to go for me: if I understood it correctly, we would split the documents into sentences, find the similarity of each sentence to the sentences of the other document, and then average the maximum similarity of each sentence, optimally using the average IDF to weight the importance of the tokens as in this issue.
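A rough sketch of that BERTScore-style many-to-many comparison at the sentence level (IDF weighting left out; the model choice and the symmetric F1-style combination are my own assumptions):

```python
from nltk.tokenize import sent_tokenize
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # example model only

def greedy_similarity(doc_a: str, doc_b: str) -> float:
    sents_a, sents_b = sent_tokenize(doc_a), sent_tokenize(doc_b)
    emb_a = model.encode(sents_a, convert_to_tensor=True)
    emb_b = model.encode(sents_b, convert_to_tensor=True)
    sim = util.cos_sim(emb_a, emb_b)          # (len_a, len_b) cosine matrix
    recall = sim.max(dim=1).values.mean()     # best match in doc_b for each sentence of doc_a
    precision = sim.max(dim=0).values.mean()  # and the other way round
    # Combine symmetrically, the way BERTScore computes an F1 over tokens
    return (2 * precision * recall / (precision + recall)).item()
```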

Though I am still uncertain, all of these solutions account for something. What do you think would be the best solution, and are there resources I can follow?

As far as I know, Mem X has a feature where they find similar documents by getting embeddings from OpenAI's embedding models, computing cosine similarities, and finally using their own clustering and length-normalization algorithms to re-rank the list of similar documents. Maybe this could be another way to find the best similarities, which hasn't been considered yet.
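Something in that spirit, minus the clustering and length normalization, and with SBERT embeddings standing in for the OpenAI ones, could be as simple as ranking by cosine similarity:

```python
import torch
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # example model only

documents = ["first document ...", "second document ...", "third document ..."]
doc_embeddings = model.encode(documents, convert_to_tensor=True)

query_embedding = model.encode("the document to compare ...", convert_to_tensor=True)
scores = util.cos_sim(query_embedding, doc_embeddings)[0]  # one score per document
top = torch.topk(scores, k=min(3, len(documents)))
for score, idx in zip(top.values, top.indices):
    print(f"{score.item():.3f}  {documents[int(idx)][:60]}")
```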

Thanks a lot!

ChildishChange commented 2 years ago

Maybe you can try these solutions:

  1. Get the top-n TF-IDF words and calculate bag-of-words vector similarity.
  2. LDA.
  3. Get the top-n TF-IDF words and their TF-IDF scores, then calculate the weighted average of the word embeddings as the document embedding (see the sketch after this list).
  4. Split the large document into several paragraphs and pool (mean or max) the SBERT outputs of the paragraphs as the document embedding.
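A rough sketch of option 3 (scikit-learn for the TF-IDF part; embedding the selected words with an SBERT model and `top_n = 20` are just example choices):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # example model only

def tfidf_weighted_embeddings(documents, top_n=20):
    vectorizer = TfidfVectorizer()
    tfidf = vectorizer.fit_transform(documents)      # (num_docs, vocab_size)
    vocab = np.array(vectorizer.get_feature_names_out())

    doc_embeddings = []
    for row in tfidf:
        scores = row.toarray().ravel()
        top = scores.argsort()[::-1][:top_n]         # indices of the top-n TF-IDF words
        words, weights = vocab[top], scores[top]
        word_embeddings = model.encode(list(words))  # (top_n, dim)
        doc_embeddings.append(np.average(word_embeddings, axis=0, weights=weights))
    return np.stack(doc_embeddings)
```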