UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0

Best method to find similarities in large documents. #1673

Open · bxff opened this issue 2 years ago

bxff commented 2 years ago

I am fairly new to ML/AI, so I apologise beforehand if I have misunderstood things.

Q: Given multiple large documents, find the similarity of a given document to all the other documents.

To answer this question I have searched around a bit, but haven't found a satisfying answer yet. I found that vector embeddings can be a good way to measure semantic similarity, which is exactly what I am looking for, but most models have a max sequence length of 512 tokens, which is not enough for my use case.
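
For context, the truncation limit of a given model can be checked directly on a loaded model; a minimal sketch, where the model name is just an example:

```python
# Minimal check of a model's maximum sequence length; inputs longer than this
# are truncated when encoding. The model name here is just an example.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
print(model.max_seq_length)
```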

In general, there are four solutions to the sequence-length problem that I have come across:

  1. Splitting the documents into sentences and averaging the resulting sentence vectors.
  2. Using Longformer-style models.
  3. Using BM25 / TF-IDF to find similarities on large documents instead of SBERT.
  4. BERTScore's way of finding similarities through many-to-many comparison.

The first solution has the flaw that the more averaging is done, the worse the results get; it works fine for similar sentences, but in my case this is not optimal at all. The second solution is also not viable, as my documents cannot fit within the 16k max sequence length of the Longformer linked.
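
For concreteness, a minimal sketch of solution 1 (sentence splitting plus mean pooling of SBERT sentence embeddings); the model name and the naive sentence splitter are placeholder choices:

```python
# Sketch of solution 1: split a document into sentences, embed each sentence with SBERT,
# and mean-pool the sentence embeddings into a single document vector.
# The model name and the naive sentence splitter are placeholder choices.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def document_embedding(document: str) -> np.ndarray:
    sentences = [s.strip() for s in document.split(".") if s.strip()]
    embeddings = model.encode(sentences)   # shape: (num_sentences, dim)
    return embeddings.mean(axis=0)         # mean pooling over all sentences

vec_a = document_embedding("First large document text ...")
vec_b = document_embedding("Second large document text ...")
cosine = np.dot(vec_a, vec_b) / (np.linalg.norm(vec_a) * np.linalg.norm(vec_b))
```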

The third and the fourth solutions seem viable. I haven't looked into BM25 / TF-IDF as a replacement for semantic similarity, primarily because I thought they would yield bad results, but it's something to at least consider. Lastly, an approach similar to BERTScore seems like the right way to go for me: if I understood it correctly, we would split the documents into sentences, compute the similarity of each sentence to the sentences of the other document, and then average the maximum similarity found for each sentence, ideally using the average IDF to weight the importance of the tokens as in this issue.
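
A minimal sketch of that BERTScore-style approach (greedy sentence matching between the two documents; IDF weighting omitted, and the model name is just an example):

```python
# Sketch of the BERTScore-style approach: for each sentence of document A, take the
# maximum cosine similarity against the sentences of document B, then average.
# Doing it in both directions gives recall/precision, combined here as an F1 score.
# IDF weighting of sentences/tokens is omitted; the model name is just an example.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def doc_similarity(sentences_a, sentences_b):
    emb_a = model.encode(sentences_a, convert_to_tensor=True)
    emb_b = model.encode(sentences_b, convert_to_tensor=True)
    sim = util.cos_sim(emb_a, emb_b)          # (len(sentences_a), len(sentences_b))
    recall = sim.max(dim=1).values.mean()     # best match in B for each sentence of A
    precision = sim.max(dim=0).values.mean()  # best match in A for each sentence of B
    return (2 * precision * recall / (precision + recall)).item()
```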

I am still uncertain; all of these solutions have something going for them, but what do you think would be the best approach, and are there any resources I can follow?

As far as I know, Mem X has a feature where they find similar documents by computing embeddings with OpenAI's embedding models, then retrieving similar documents using cosine similarity, and finally applying their own clustering and length-normalization algorithms to re-rank the list of similar documents. Maybe this could be another way to find the best similarities that hasn't been considered yet.
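
A rough sketch of that kind of embed-then-rank pipeline (the clustering and length-normalization re-ranking step is not shown, the model name and document texts are placeholders, and documents longer than the model's max sequence length get truncated):

```python
# Sketch of the embed-then-rank idea: embed every document, then rank all documents
# by cosine similarity to a query document. Documents beyond the model's max sequence
# length are truncated. The model name and document texts are placeholders.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

documents = ["text of document 1 ...", "text of document 2 ...", "text of document 3 ..."]
doc_embeddings = model.encode(documents, convert_to_tensor=True)

query_embedding = model.encode("text of the query document ...", convert_to_tensor=True)
scores = util.cos_sim(query_embedding, doc_embeddings)[0]  # similarity to every document
ranking = scores.argsort(descending=True)                  # indices, most similar first
```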

Thanks a lot!

ChildishChange commented 2 years ago

Maybe you can try these solutions:

  1. get the top-n TF-IDF words and calculate bag-of-words vector similarity
  2. LDA (topic modelling)
  3. get the top-n TF-IDF words and their TF-IDF scores, then calculate the weighted average of the word embeddings as the document embedding
  4. split the large document into several paragraphs and pool (mean or max) the SBERT embeddings of the paragraphs to form the document embedding (a minimal sketch of this option follows below)
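
A minimal sketch of option 4, assuming paragraphs are separated by blank lines and using a placeholder model name:

```python
# Sketch of option 4: split the document into paragraphs, encode each paragraph with
# SBERT, and pool (mean by default, max as the alternative) into one document embedding.
# Paragraphs are assumed to be separated by blank lines; the model name is a placeholder.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def document_embedding(document: str, pooling: str = "mean"):
    paragraphs = [p.strip() for p in document.split("\n\n") if p.strip()]
    embeddings = model.encode(paragraphs, convert_to_tensor=True)  # (num_paragraphs, dim)
    if pooling == "max":
        return embeddings.max(dim=0).values  # element-wise max over paragraphs
    return embeddings.mean(dim=0)            # element-wise mean over paragraphs
```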