UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0

SBERT for using semantic search using sentence_transformers #2251

Open alexaaronruban opened 1 year ago

alexaaronruban commented 1 year ago

Hi everyone, I wanted to know if I can pre-train an SBERT model for asymmetric semantic search over a large number of domain-specific documents.

My documents are as follows: 1) the documents run into the hundreds, and 2) each document runs to a few thousand words (more than 512 tokens for SBERT).

As I am new to Transformer models, I wanted to know whether I can pre-train an SBERT model and also work around the token limit. I am open to hearing new approaches.

Thank you in advance

carlesoctav commented 1 year ago

You can perform fine-tuning on your domain dataset. Choose the loss function based on the structure of your data.
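A minimal sketch of such fine-tuning with the sentence-transformers training API, assuming you can construct (query, relevant passage) pairs from your domain data; the model name and the two example pairs are placeholders:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Placeholder base model; pick one suited to asymmetric (query -> passage) search.
model = SentenceTransformer("multi-qa-mpnet-base-dot-v1")

# Placeholder training data: each example pairs a query with a relevant passage.
train_examples = [
    InputExample(texts=["what is the warranty period?", "The warranty covers 24 months from purchase ..."]),
    InputExample(texts=["how do I reset the device?", "To reset, hold the power button for ten seconds ..."]),
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)

# MultipleNegativesRankingLoss works well for query-passage pairs:
# other in-batch passages serve as negatives.
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=10,
)
model.save("output/domain-finetuned-model")
```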

Unfortunately, the token limit cannot be raised without consequences, since it is fixed by the underlying pretrained model. You will see a significant drop in performance if you simply increase it.

If you want to increase the maximum number of tokens the network can handle, you need to pretrain a base model again with MLM (masked language modeling) - which essentially requires web-scale training data if you want to match state-of-the-art pretrained models - and then fine-tune it on your dataset for semantic search.
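For reference, the current limit of a loaded model can be inspected (and lowered) via its max_seq_length attribute; anything beyond it is truncated at encoding time. A small sketch with a placeholder model name:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder model

# The effective token limit of the underlying Transformer module.
print(model.max_seq_length)

# You can lower it (e.g. to speed up encoding), but raising it beyond the
# base model's position-embedding size is not meaningful without re-pretraining.
model.max_seq_length = 128
```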

The best way to handle documents with more than 512 words or tokens is to chunk them into smaller parts. This way, we can perform semantic search on each chunk individually.
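A minimal sketch of this chunking approach, using a hypothetical chunk_document helper and placeholder documents; retrieval uses util.semantic_search over the chunk embeddings and maps each hit back to its parent document:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("multi-qa-mpnet-base-dot-v1")  # placeholder model

def chunk_document(text, chunk_size=200, overlap=50):
    """Split a document into overlapping word-level chunks that stay under the token limit."""
    words = text.split()
    chunks = []
    for start in range(0, len(words), chunk_size - overlap):
        chunks.append(" ".join(words[start:start + chunk_size]))
    return chunks

# Placeholder corpus: doc_id -> full document text.
documents = {
    "doc1": "first long document text ...",
    "doc2": "second long document text ...",
}

# Flatten every document into (doc_id, chunk) pairs and embed each chunk once.
corpus_ids, corpus_chunks = [], []
for doc_id, text in documents.items():
    for chunk in chunk_document(text):
        corpus_ids.append(doc_id)
        corpus_chunks.append(chunk)

corpus_embeddings = model.encode(corpus_chunks, convert_to_tensor=True)

query = "what does the warranty cover?"
query_embedding = model.encode(query, convert_to_tensor=True)

# Retrieve the top-scoring chunks and report the document each one came from.
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=5)[0]
for hit in hits:
    idx = hit["corpus_id"]
    print(corpus_ids[idx], round(hit["score"], 3), corpus_chunks[idx][:80])
```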

FahadEbrahim commented 9 months ago

@carlesoctav Thank you for the reply. Is it possible to have an example of your last paragraph?

The best way to handle documents with more than 512 words or tokens is to chunk them into smaller parts. This way, we can perform semantic search on each chunk individually.