UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0

sentence bert model in onnx format #46

Closed: rachel2011 closed this issue 3 weeks ago

rachel2011 commented 5 years ago

I would like to convert a Sentence-BERT model from PyTorch to TensorFlow using ONNX, and I tried to follow the standard ONNX procedure for converting a PyTorch model. But I'm having difficulty determining the ONNX input arguments for the Sentence-BERT model; I encounter TypeError: forward() takes 2 positional arguments but 4 were given. Suggestions appreciated!

model = SentenceTransformer('output/continue_training_model')
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
dummy_input0 = torch.LongTensor(batch_size, max_seq_length).to(device)
dummy_input1 = torch.LongTensor(batch_size, max_seq_length).to(device)
dummy_input2 = torch.LongTensor(batch_size, max_seq_length).to(device)
torch.onnx.export(model, (dummy_input0, dummy_input1, dummy_input2), onnx_file_name, verbose=True)

nreimers commented 5 years ago

Sadly, I have never worked with ONNX.

In SentenceTransformer, the forward function takes one argument: features (the second one in Python is self).

features is a dictionary that contains the different features, for example token ids, word weights, attention values, token_type_ids.

For the BERT model, I think your input must look like this:

input_features = {'input_ids': dummy_input0, 'token_type_ids': dummy_input1, 'input_mask': dummy_input2}

And then:

torch.onnx.export(model, input_features, onnx_file_name, verbose=True)
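
For reference, a minimal end-to-end sketch of this suggestion (a sketch, not a confirmed recipe: it assumes a BERT-based model and a torch version where a bare dict argument is passed through to forward as a single positional dict, which a later comment in this thread reports working with torch 1.4; zero-valued dummies are used so no id exceeds an embedding table, and the key names follow the later comments that use 'attention_mask'):

import torch
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('output/continue_training_model')
model.eval()

batch_size, max_seq_length = 1, 128
zeros = torch.zeros(batch_size, max_seq_length, dtype=torch.long)
# Key names depend on the underlying tokenizer/model.
input_features = {'input_ids': zeros, 'token_type_ids': zeros.clone(), 'attention_mask': zeros.clone()}

torch.onnx.export(model, input_features, "sbert.onnx", verbose=True)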
ycgui commented 4 years ago

@rachel2011 Did you find a solution to successfully convert the Sentence-BERT model to ONNX format?

I'm also wondering how to feed text inputs into a converted ONNX model. Could we do something similar to

sentences = ['This framework generates embeddings for each input sentence',
    'Sentences are passed as a list of string.', 
    'The quick brown fox jumps over the lazy dog.']
sentence_embeddings = model.encode(sentences)

by replacing the model with the converted ONNX model? Any ideas?

nreimers commented 4 years ago

Hi @ycgui, I started adding the models to the Hugging Face Model Hub: https://huggingface.co/sentence-transformers

Hugging Face also provides methods/scripts to convert models to ONNX.

I hope this helps.

ycgui commented 4 years ago

Thanks @nreimers. This is awesome!

SidJain1412 commented 4 years ago

Hey @ycgui, I would be really thankful if you could share the code you used to convert the models to ONNX, and how you encode sentences using that model. Thanks in advance!

SidJain1412 commented 4 years ago


As per my understanding, shouldn't the pooled output from the ONNX model match the output of encode from SentenceTransformers? That doesn't seem to be the case in my testing (using the same model and tokenizer in both cases).

@nreimers, I would greatly appreciate your help.

nreimers commented 4 years ago

Sadly, I am not familiar with the ONNX format.

Here you can see an example of how to load the models with native transformers code and how to apply mean pooling correctly (watch out for padding tokens): https://huggingface.co/sentence-transformers/bert-base-nli-mean-tokens
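
For reference, the mean pooling from that model card looks roughly like this (the same masking applies to token embeddings coming out of an ONNX session):

import torch
from transformers import AutoTokenizer, AutoModel

def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # first element holds all token embeddings
    mask = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    # Average only over real tokens; the clamp avoids division by zero.
    return torch.sum(token_embeddings * mask, 1) / torch.clamp(mask.sum(1), min=1e-9)

tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/bert-base-nli-mean-tokens')
model = AutoModel.from_pretrained('sentence-transformers/bert-base-nli-mean-tokens')

encoded = tokenizer(['This is a test.'], padding=True, truncation=True, return_tensors='pt')
with torch.no_grad():
    output = model(**encoded)
embeddings = mean_pooling(output, encoded['attention_mask'])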

SidJain1412 commented 4 years ago

Thanks a lot @nreimers! After applying the mean pooling correctly, I was able to get sentence embeddings as expected!

I am thinking about making a small tutorial notebook on how to use sentence transformers with ONNX, plus benchmarking it against the traditional PyTorch model. Would that be useful in Examples?

nreimers commented 4 years ago

Yes, it would be great to have such a tutorial.

SidJain1412 commented 4 years ago

Awesome, I'll make a PR regarding that soon :+1: Thanks for the awesome library!

SidJain1412 commented 4 years ago

Created a PR regarding this: https://github.com/UKPLab/sentence-transformers/pull/386

nreimers commented 4 years ago

Great, I will have a look.

cantwbr commented 3 years ago

@rachel2011: My response might be a bit late. I think the keys in your dictionary are wrong. For sentence-transformers version 0.3.7.2, I downloaded the models (like bert-base-nli-mean-tokens) from here. Then I used

input_features = {'input_ids': input_ids, 'token_type_ids': input_type_ids, 'attention_mask': input_mask}
torch.onnx.export(model, input_features, onnx_file_name, verbose=True)

to export the Sentence-BERT model.

SidJain1412 commented 3 years ago

@cantwbr A simpler way is to use from transformers.convert_graph_to_onnx import convert, which converts the model to ONNX. Refer to this PR.
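
For example, a sketch of that approach (convert_graph_to_onnx shipped with transformers 3.x/4.x and has since been removed in favour of optimum; argument names as in those versions):

from pathlib import Path
from transformers.convert_graph_to_onnx import convert

# Exports the underlying transformer via the feature-extraction pipeline;
# pooling still has to be applied on top of the ONNX outputs.
convert(
    framework="pt",
    model="sentence-transformers/bert-base-nli-mean-tokens",
    output=Path("onnx/model.onnx"),
    opset=11,
)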

codingliuyg commented 3 years ago


Hello, how do you define input_ids, input_type_ids and input_mask? Can you show your demo? My code, shown below, does not pass.

batch_size = 1
max_seq_length = 128
device = torch.device("cuda")
model.to(device)
dummy_input0 = torch.LongTensor(batch_size, max_seq_length).to(device)
dummy_input1 = torch.LongTensor(batch_size, max_seq_length).to(device)
dummy_input2 = torch.LongTensor(batch_size, max_seq_length).to(device)
input_features = {'input_ids': dummy_input0, 'token_type_ids': dummy_input1, 'attention_mask': dummy_input2}

@cantwbr Looking forward to your reply, thanks.

cantwbr commented 3 years ago

@codingliuyg: I didn't use LongTensors. Instead, I generated tensors of ones:

input_ids = torch.ones(batch_size, max_seq_length, dtype=torch.long).to(device)
input_type_ids = torch.ones(batch_size, max_seq_length, dtype=torch.long).to(device)
input_mask = torch.ones(batch_size, max_seq_length, dtype=torch.long).to(device)
input_features = {'input_ids': input_ids, 'token_type_ids': input_type_ids, 'attention_mask': input_mask}
torch.onnx.export(model, input_features, onnx_file_name, verbose=True)
codingliuyg commented 3 years ago


@cantwbr Thank you for your reply. After changing my code as follows, an error occurs. Do you know what happened? Is there anything wrong with my code? Thank you.

model = SentenceTransformer('roberta-base-nli-stsb-mean-tokens', device='cpu')
batch_size = 1
max_seq_length = 128
device = torch.device("cpu")
model.to(device)
input_ids = torch.ones(batch_size, max_seq_length, dtype=torch.long).to(device)
input_type_ids = torch.ones(batch_size, max_seq_length, dtype=torch.long).to(device)
input_mask = torch.ones(batch_size, max_seq_length, dtype=torch.long).to(device)
input_features = {'input_ids': input_ids, 'token_type_ids': input_type_ids, 'attention_mask': input_mask}
onnx_path = "onnx_model_name.onnx"
torch.onnx.export(model, input_features, onnx_path)

error (traceback abbreviated):

2020-12-17 21:24:17 - Load pretrained SentenceTransformer: roberta-base-nli-stsb-mean-tokens
2020-12-17 21:24:17 - Load SentenceTransformer from folder: roberta-base-nli-stsb-mean-tokens
Traceback (most recent call last):
  File "embedding_reduce_cp.py", line 45, in <module>
    torch.onnx.export(model, input_features, onnx_path)
  ...
  File "/opt/conda/lib/python3.6/site-packages/sentence_transformers/models/Transformer.py", line 36, in forward
    output_states = self.auto_model(**features)
  ...
  File "/opt/conda/lib/python3.6/site-packages/transformers/modeling_roberta.py", line 119, in forward
    token_type_embeddings = self.token_type_embeddings(token_type_ids)
  ...
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/functional.py", line 1837, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
IndexError: index out of range in self

cantwbr commented 3 years ago

@codingliuyg: I tried exporting roberta-base-nli-stsb-mean-tokens and got a similar error. I resolved it by changing torch.ones to torch.zeros: RoBERTa's token type embedding table has size 1, so token_type_ids filled with ones index out of range, while zeros are always valid. I used torch 1.4.0 and sentence-transformers 0.3.7.2 on Python 3.8.
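
An alternative that sidesteps hand-made dummies entirely is to let the tokenizer produce valid inputs (a sketch; the folder path and sentence are illustrative, and the returned keys depend on the model, e.g. RoBERTa tokenizers return no usable token_type_ids):

import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base-nli-stsb-mean-tokens")  # local model folder
# Real token ids are always within the vocabulary, unlike arbitrary constants.
dummy = tokenizer("a dummy sentence", padding="max_length", max_length=128, return_tensors="pt")
torch.onnx.export(model, dict(dummy), onnx_file_name, verbose=True)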

shar999 commented 3 years ago

@cantwbr Can you share the code that you used for inference?

cantwbr commented 3 years ago

@shar999: For inference with the ONNX file, I reuse the featurizer of the original SentenceTransformers package:

from sentence_transformers.datasets import EncodeDataset

import numpy as np
import onnxruntime as ort
import torch.nn as nn
from sentence_transformers import SentenceTransformer
from torch.utils.data import DataLoader

The inference:

model = SentenceTransformer(model_name_or_path)
fused_bert_session = ort.InferenceSession(onnx_file_name)

batch_size = 1
is_pretokenized = False
num_workers = 0
pad_to = 128

sentences = ['This framework generates embeddings for each input sentence']  # example input

all_embeddings = []
length_sorted_idx = np.argsort([model._text_length(sen) for sen in sentences])
sentences_sorted = [sentences[idx] for idx in length_sorted_idx]
inp_dataset = EncodeDataset(sentences_sorted, model=model, is_tokenized=is_pretokenized)
inp_dataloader = DataLoader(inp_dataset, batch_size=batch_size, collate_fn=model.smart_batching_collate_text_only,
                            num_workers=num_workers, shuffle=False)

iterator = inp_dataloader
for features in iterator:
    # Pad every feature tensor to a fixed length so it matches the exported graph.
    for feature_name in features:
        pad_amt = features[feature_name].size(-1)
        if pad_amt != 0:
            features[feature_name] = nn.functional.pad(features[feature_name], (0, pad_to - pad_amt), value=0)

    e_input_ids = features["input_ids"].cpu().numpy()
    e_input_type_ids = features["token_type_ids"].cpu().numpy()
    e_input_mask = features["attention_mask"].cpu().numpy()

    results = fused_bert_session.run(None, {"input_ids": e_input_ids, "token_type_ids": e_input_type_ids, "attention_mask": e_input_mask})
    # results[5] contains the sentence_embeddings
    all_embeddings.append(results[5])

The embeddings are located in all_embeddings.
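
One detail to keep in mind after this loop: the sentences were sorted by length before batching, so you may want to restore the original input order, e.g.:

import numpy as np

all_embeddings = np.concatenate(all_embeddings, axis=0)
all_embeddings = all_embeddings[np.argsort(length_sorted_idx)]  # undo the length-based sort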

yuanzhoulvpi2017 commented 2 years ago

I had the same problem of needing to speed up encoding with sentence transformers, so I converted the sentence-transformers model to an ONNX model and a TensorRT model; it is about 4 times faster.

You can use my tutorial: quick_sentence_transformers

The tutorial shows how to convert a SentenceTransformer model to ONNX and a TensorRT plan file.
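
On the ONNX Runtime side, TensorRT can also be used through an execution provider instead of a separate plan file (an assumption: this requires onnxruntime-gpu built with TensorRT support, and the model path is illustrative):

import onnxruntime as ort

# Providers form a priority list; ORT falls back to the next entry if one is unavailable.
session = ort.InferenceSession(
    "model.onnx",
    providers=["TensorrtExecutionProvider", "CUDAExecutionProvider", "CPUExecutionProvider"],
)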


TalhaUsuf commented 2 years ago


Thanks, it really helped me understand the conversion process.

raphaelsty commented 2 years ago

Hi,

I had some trouble converting the sentence-transformers/all-mpnet-base-v2 model to the ONNX format, so I'll share a class and a function that I made with @yuanzhoulvpi2017's tutorial (it was helpful, thank you).

I've done some tests and I tend to measure a 4x speedup using the ONNX format. I'm not sure my code is fully optimised.

import torch
import transformers
from sentence_transformers import SentenceTransformer, models

class OnnxEncoder:
    """OnxEncoder dedicated to run SentenceTransformer under OnnxRuntime."""

    def __init__(self, session, tokenizer, pooling, normalization):
        self.session = session
        self.tokenizer = tokenizer
        self.max_length = tokenizer.__dict__["model_max_length"]
        self.pooling = pooling
        self.normalization = normalization

    def encode(self, sentences: list):

        sentences = [sentences] if isinstance(sentences, str) else sentences

        inputs = {
            k: v.numpy()
            for k, v in self.tokenizer(
                sentences,
                padding=True,
                truncation=True,
                max_length=self.max_length,
                return_tensors="pt",
            ).items()
        }

        hidden_state = self.session.run(None, inputs)
        sentence_embedding = self.pooling.forward(
            features={
                "token_embeddings": torch.Tensor(hidden_state[0]),
                "attention_mask": torch.Tensor(inputs.get("attention_mask")),
            },
        )

        if self.normalization is not None:
            sentence_embedding = self.normalization.forward(features=sentence_embedding)

        sentence_embedding = sentence_embedding["sentence_embedding"]

        if sentence_embedding.shape[0] == 1:
            sentence_embedding = sentence_embedding[0]

        return sentence_embedding.numpy()

def sentence_transformers_onnx(
    model,
    path,
    do_lower_case=True,
    input_names=["input_ids", "attention_mask", "segment_ids"],
    providers=["CPUExecutionProvider"],
):
    """OnxRuntime for sentence transformers.

    Parameters
    ----------
    model
        SentenceTransformer model.
    path
        Model file dedicated to session inference.
    do_lower_case
        Whether or not the tokenizer lowercases the input.
    input_names
        Fields needed by the Transformer.
    providers
        Either run the model on CPU or GPU: ["CPUExecutionProvider", "CUDAExecutionProvider"].

    """
    try:
        import onnxruntime
    except ImportError:
        raise ValueError("You need to install onnxruntime.")

    model.save(path)

    configuration = transformers.AutoConfig.from_pretrained(
        path, from_tf=False, local_files_only=True
    )

    tokenizer = transformers.AutoTokenizer.from_pretrained(
        path, do_lower_case=do_lower_case, from_tf=False, local_files_only=True
    )

    encoder = transformers.AutoModel.from_pretrained(
        path, from_tf=False, config=configuration, local_files_only=True
    )

    st = ["cherche"]

    inputs = tokenizer(
        st,
        padding=True,
        truncation=True,
        max_length=tokenizer.__dict__["model_max_length"],
        return_tensors="pt",
    )

    model.eval()

    with torch.no_grad():

        symbolic_names = {0: "batch_size", 1: "max_seq_len"}

        torch.onnx.export(
            encoder,
            args=tuple(inputs.values()),
            f=f"{path}.onx",
            opset_version=13,  # ONX version needs to be >= 13 for sentence transformers.
            do_constant_folding=True,
            input_names=input_names,
            output_names=["start", "end"],
            dynamic_axes={
                "input_ids": symbolic_names,
                "attention_mask": symbolic_names,
                "segment_ids": symbolic_names,
                "start": symbolic_names,
                "end": symbolic_names,
            },
        )

        normalization = None
        for modules in model.modules():
            for idx, module in enumerate(modules):
                if idx == 1:
                    pooling = module
                if idx == 2:
                    normalization = module
            break

        return OnnxEncoder(
            session=onnxruntime.InferenceSession(f"{path}.onnx", providers=providers),
            tokenizer=tokenizer,
            pooling=pooling,
            normalization=normalization,
        )

The sentence_transformers_onnx function returns a model with an encode method that behaves like SentenceTransformer models.

model = sentence_transformers_onnx(
    model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2"),
    path = "onnx_model",
)
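
A quick sanity check that the ONNX path matches the original model (the tolerance is an arbitrary choice):

import numpy as np
from sentence_transformers import SentenceTransformer

reference = SentenceTransformer("sentence-transformers/all-mpnet-base-v2").encode(["cherche"])
candidate = model.encode(["cherche"])
print(np.allclose(reference, candidate, atol=1e-5))  # should print True up to numerical tolerance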

Raphaël

lkluo commented 2 years ago

Thanks for the nice work. On the other hand, it took longer using ONNX than the normal SentenceTransformer on GPU. Any thoughts?

raphaelsty commented 2 years ago

It may be due to the pooling operation, which we keep in plain PyTorch?
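
One thing worth checking first: the sentence_transformers_onnx helper above defaults to providers=["CPUExecutionProvider"], so unless the CUDA provider is passed explicitly, the ONNX model runs on CPU while the SentenceTransformer baseline runs on GPU. For example:

import onnxruntime as ort

session = ort.InferenceSession("onnx_model.onnx", providers=["CUDAExecutionProvider", "CPUExecutionProvider"])
print(session.get_providers())  # confirm CUDAExecutionProvider is actually active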

petrLorenc commented 1 year ago

Hi, I use SentenceTransformer as an ONNX model this way; it may be useful for someone:

import os
from pathlib import Path
from dataclasses import dataclass
from typing import Optional, Union, Mapping, OrderedDict

import torch
from transformers.onnx import export
from transformers.onnx import OnnxConfig
from transformers.utils import ModelOutput
from sentence_transformers.models import Dense
from transformers import AutoTokenizer, AutoModel, DistilBertModel

# get with SentenceTransformer('sentence-transformers/distiluse-base-multilingual-cased-v2', cache_folder=".")
model_ckpt = "./sentence-transformers_distiluse-base-multilingual-cased-v2"

class SBertOnnxConfig(OnnxConfig):
    @property
    def inputs(self) -> Mapping[str, Mapping[int, str]]:
        return OrderedDict([
            ("input_ids", {0: "batch", 1: "sequence"}),
            ("attention_mask", {0: "batch", 1: "sequence"})
        ])
    @property
    def outputs(self) -> Mapping[str, Mapping[int, str]]:
        return OrderedDict([
                ("last_hidden_state", {0: "batch", 1: "sequence"})
        ])

@dataclass
class EmbeddingOutput(ModelOutput):
    last_hidden_state: Optional[torch.FloatTensor] = None

class OwnSBert(DistilBertModel):
    @classmethod
    def from_pretrained(cls, pretrained_model_name_or_path: Optional[Union[str, os.PathLike]], *model_args, **kwargs):
        # Pop the custom argument first so it is not forwarded to the transformers loader.
        path_to_additional_layer = kwargs.pop("path_to_additional_layer")
        _model = super().from_pretrained(pretrained_model_name_or_path, *model_args, **kwargs)
        additional_layer = Dense.load(path_to_additional_layer)
        _model.additional_layer_linear = additional_layer.linear
        _model.additional_layer_activation = additional_layer.activation_function
        return _model

    def forward(
        self,
        input_ids: Optional[torch.Tensor] = None,
        attention_mask: Optional[torch.Tensor] = None,
        head_mask: Optional[torch.Tensor] = None,
        inputs_embeds: Optional[torch.Tensor] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
    ):
        embeddings = super().forward(input_ids=input_ids,
                               attention_mask=attention_mask,
                               head_mask=head_mask,
                               inputs_embeds=inputs_embeds,
                               output_attentions=True,
                               output_hidden_states=True,
                               return_dict=True)

        mean_embedding = embeddings.last_hidden_state.mean(dim=1)
        last_hidden_state = self.additional_layer_activation(self.additional_layer_linear(mean_embedding))
        return EmbeddingOutput(last_hidden_state=last_hidden_state)

tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
base_model = OwnSBert.from_pretrained(model_ckpt, path_to_additional_layer="./sentence-transformers_distiluse-base-multilingual-cased-v2/2_Dense")

# print(base_model(**tokenizer([sentences[0], sentences[1]], padding="longest", truncation=True, return_tensors="pt")))

onnx_path = Path("exported_model/model.onnx")
onnx_config = SBertOnnxConfig.from_model_config(base_model.config)
onnx_inputs, onnx_outputs = export(tokenizer, base_model, onnx_config, onnx_config.default_onnx_opset, onnx_path)
base_model.config.save_pretrained("./exported_model/")

And when I compare the output of the original implementation with that of the loaded ONNX model, it is the same.

from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForFeatureExtraction
sentences = ["This is an example sentence", "Each sentence is converted"]

tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/distiluse-base-multilingual-cased-v2")
onnx_model = ORTModelForFeatureExtraction.from_pretrained("exported_model/")
inputs_2 = tokenizer([sentences[0], sentences[1]], padding="longest", truncation=True, return_tensors="pt")
outputs_2 = onnx_model(**inputs_2)
print(outputs_2)
# BaseModelOutput(last_hidden_state=tensor([[-0.0348,  0.0264, -0.0443,  ...,
from sentence_transformers import SentenceTransformer
sentences = ["This is an example sentence", "Each sentence is converted"]

model = SentenceTransformer('sentence-transformers/distiluse-base-multilingual-cased-v2', cache_folder=".")
embeddings = model.encode(sentences)
print(embeddings)
# [[-0.03479306  0.02635195 -0.04427201 ... 

There is some rounding behavior which changes the output of the ONNX model to lower precision, but this does not happen when running on the server (NVIDIA Triton), so I assume it is somehow related to the Jupyter Notebook where I was doing the experiments.

The reason why I override the from_pretrained function is that I was not able to load the proper weights in __init__ of the DistilBertModel; the weights were overwritten each time with random values, but this hack uses the right values. To run it on the server (NVIDIA Triton), I use an ensemble of models and join it with the tokenizer on the server (but that is not fully related to exporting).

loretoparisi commented 4 months ago

I get a

TypeError: __init__() got an unexpected keyword argument 'path_to_additional_layer'

error in that case

loretoparisi commented 4 months ago

If I do

self.model = sentence_transformers_onnx(
    model=SentenceTransformer(model_name),
    path="onnx_model",
)

with torch.no_grad():
    model_output = self.model(**encoded_input)
    sentence_embeddings = model_output[0][:, 0]

I'm then getting the error:

model_output = self.model(**encoded_input)
TypeError: 'OnnxEncoder' object is not callable
tomaarsen commented 3 weeks ago

Hello!

I've added native ONNX support in Sentence Transformers, so users can now look at the Speeding up Inference documentation, under the ONNX section:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2", backend="onnx")

sentences = ["This is an example sentence", "Each sentence is converted"]
embeddings = model.encode(sentences)
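
Note that the ONNX backend needs the optional dependencies described in that documentation (e.g. pip install sentence-transformers[onnx], or sentence-transformers[onnx-gpu] for CUDA).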