UKPLab / sentence-transformers

Multilingual Sentence & Image Embeddings with BERT
https://www.SBERT.net
Apache License 2.0

Fine-Tuning STS and semantic search multilingual HF transformer #1196

Closed · Matthieu-Tinycoaching closed this issue 2 years ago

Matthieu-Tinycoaching commented 2 years ago

Hi,

Since I need to export the transformer model to ONNX format for inference, I use a multilingual sentence-transformers model through the HF transformers library (separate tokenizer and model + a mean pooling layer) for semantic textual similarity and semantic search.

Is it possible to fine-tune this model with https://github.com/UKPLab/sentence-transformers/tree/master/examples/training/sts/training_stsbenchmark_continue_training.py despite not using the sentence-transformers library directly?

If not, is there a workaround? (e.g. fine-tune the sentence-transformers model, then split it into tokenizer/model before the ONNX export)

Thanks!

nreimers commented 2 years ago

I would first fine-tune the model and then export it to ONNX.
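
Roughly, that first step with the sentence-transformers API follows the continue-training example linked in the issue; a minimal sketch, where the model name, data, and hyperparameters are illustrative placeholders:

    from torch.utils.data import DataLoader
    from sentence_transformers import SentenceTransformer, InputExample, losses

    # Start from a pre-trained multilingual model (name is illustrative)
    model = SentenceTransformer("paraphrase-multilingual-mpnet-base-v2")

    # STS-style training data: sentence pairs with a similarity score in [0, 1]
    train_examples = [
        InputExample(texts=["A man is playing guitar.", "Someone plays an instrument."], label=0.8),
        InputExample(texts=["A man is playing guitar.", "A chef is cooking pasta."], label=0.1),
    ]
    train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
    train_loss = losses.CosineSimilarityLoss(model)

    # Fine-tune and save; output_path then contains regular transformer + tokenizer files
    model.fit(
        train_objectives=[(train_dataloader, train_loss)],
        epochs=1,
        warmup_steps=100,
        output_path="output/finetuned-model",
    )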

Matthieu-Tinycoaching commented 2 years ago

Thanks @nreimers, that is exactly the order in which I planned to do the steps.

The problem is that fine-tuning the sentence-transformers model seems clear following the steps at https://github.com/UKPLab/sentence-transformers/tree/master/examples/training/sts/training_stsbenchmark_continue_training.py

But for exporting the model to ONNX format I use the following command:

    from pathlib import Path
    from transformers import convert_graph_to_onnx

    # model_name: model id or local path of the model to export
    # model_pth: Path to the output .onnx file, e.g. Path("onnx/model.onnx")
    # pipeline_name: e.g. "feature-extraction"
    convert_graph_to_onnx.convert(
        framework="pt",                   # export the PyTorch graph
        model=model_name,
        output=model_pth,
        opset=12,
        tokenizer="xlm-roberta-base",
        use_external_format=False,
        pipeline_name=pipeline_name,
    )

which is based on the HF implementation of the sentence-transformers model, with the tokenizer separated from the model and the model itself not containing the mean pooling step. How could I then convert the fine-tuned sentence-transformers model into the tokenizer and model expected by HF?

Thanks!

nreimers commented 2 years ago

When the model is saved, you get a regular transformers model and tokenizer which you can then convert.
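
Concretely, the directory written by model.save() (or the output_path passed to model.fit()) can be loaded with the regular transformers classes and then fed to the converter; a minimal sketch, assuming the illustrative path from above:

    from transformers import AutoModel, AutoTokenizer

    # Directory produced by SentenceTransformer.save() (illustrative path); the
    # transformer weights and tokenizer files sit at its root, while the pooling
    # configuration lives in the 1_Pooling/ sub-folder and is not part of the graph
    model = AutoModel.from_pretrained("output/finetuned-model")
    tokenizer = AutoTokenizer.from_pretrained("output/finetuned-model")

    # The same path can then be passed as model_name to convert_graph_to_onnx.convert()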

Matthieu-Tinycoaching commented 2 years ago

That sounds great @nreimers! So, a regular transformers model without mean pooling?

Could you give me an idea of the minimum number of training examples needed to benefit from fine-tuning a pre-trained multilingual sentence-transformers model for STS and semantic search?

nreimers commented 2 years ago

Correct.
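
Since neither the regular transformers model nor the exported ONNX graph contains the pooling step, mean pooling has to be applied on top of the token embeddings at inference time; a minimal sketch (variable names are illustrative):

    import torch

    def mean_pooling(token_embeddings, attention_mask):
        # Mask out padding tokens, then average the remaining token embeddings
        mask = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
        return torch.sum(token_embeddings * mask, 1) / torch.clamp(mask.sum(1), min=1e-9)

    # encoded = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    # token_embeddings = model(**encoded).last_hidden_state
    # sentence_embeddings = mean_pooling(token_embeddings, encoded["attention_mask"])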

The number of examples depends on how complex your domain is. For a simple, narrow domain, 1k examples are already helpful.

For a broad domain that spans basically all topics (physics, math, sports, gaming, dating, programming, ...), you need a lot more examples. If you don't have examples for e.g. math, the model will not work that well for math queries.