UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0

Convert model file to ONNX or pt #631

Open codingliuyg opened 3 years ago

codingliuyg commented 3 years ago

Hello, I have a fine-tuned model file and have reduced its dimension with PCA. Now I want to convert it to the ONNX or .pt format, as shown in the figure below.

[image]

Is there any way to get the input and output tensor names of the Sentence-BERT model?

@nreimers Looking forward to your reply, thank you in advance!

nreimers commented 3 years ago

@codingliuyg I sadly never worked with ONNX, so I cannot help here as I don't know what is required.

oborchers commented 3 years ago

@nreimers: Thanks a bunch for this library. Do you think it is possible to modify the library so that the forward argument is not a dictionary, exactly as it is done in the transformers library? That would fix the ONNX incompatibility. ONNX can speed up the models by a factor of 2 to 8 on a V100.

While the ONNX export of the model works, specifying dynamic axes (which are required for variable-length inputs) does not work with a dict input. Thus, one can only export for a fixed-size input, which is useless.

To give a working example for distilroberta-base:

  1. Exporting the model:
import torch
import transformers

from transformers import convert_graph_to_onnx

from sentence_transformers import SentenceTransformer, util
sent_roberta = SentenceTransformer('msmarco-distilroberta-base-v2', device="cuda")

base_roberta = convert_graph_to_onnx.load_graph_from_args("feature-extraction", "pt", "distilroberta-base", None)
with torch.no_grad():
    input_names, output_names, dynamic_axes, tokens = convert_graph_to_onnx.infer_shapes(base_roberta, "pt")
    ordered_input_names, model_args = convert_graph_to_onnx.ensure_valid_input(base_roberta.model, tokens, input_names)

# Result of the above assignments:
input_names = ['input_ids', 'attention_mask']
output_names = ['output_0', 'output_1']
dynamic_axes = {
    'input_ids': {0: 'batch', 1: 'sequence'},
    'attention_mask': {0: 'batch', 1: 'sequence'},
    'output_0': {0: 'batch', 1: 'sequence'},
    'output_1': {0: 'batch'}
}
tokens = {
    'input_ids': torch.tensor([[   0,  713,   16,   10, 7728, 4195,    2]]).long(), 
    'attention_mask': torch.tensor([[1, 1, 1, 1, 1, 1, 1]]).long()
}

ordered_input_names = ['input_ids', 'attention_mask']
model_args = (
    torch.Tensor([[   0,  713,   16,   10, 7728, 4195,    2]]).long(),
    torch.Tensor([[1, 1, 1, 1, 1, 1, 1]]).long()
)

# Regular export by torch
torch.onnx.export(
    base_roberta.model,
    model_args,
    f="roberta.onnx",
    input_names=input_names,
    output_names=output_names,
    dynamic_axes=dynamic_axes,
    do_constant_folding=True,
    use_external_data_format=False,
    enable_onnx_checker=True,
    opset_version=11,
)
  2. Loading the model and running inference
import onnxruntime as rt
import numpy as np

opt = rt.SessionOptions()
opt.graph_optimization_level = rt.GraphOptimizationLevel.ORT_ENABLE_EXTENDED
opt.log_severity_level = 3
opt.execution_mode = rt.ExecutionMode.ORT_SEQUENTIAL

sess = rt.InferenceSession("roberta.onnx", opt)

span = 'How big is London'
model_input = base_roberta.tokenizer.encode_plus(span)
model_input = {name : np.atleast_2d(value) for name, value in model_input.items()}
output = sess.run(None, model_input)

np.allclose(
    base_roberta(span)[0][0],
    output[0][0][0].tolist(),
    atol=1e-6,
)
  3. Benchmarking (on a V100)
%%timeit
base_roberta(span)[0][0]
> 15.9 ms ± 65.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
sess.run(None, model_input)[0][0][0]
> 998 µs ± 3.61 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%%timeit
sent_roberta.encode(span)
> 8.93 ms ± 157 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

So, effectively, this could result in a speedup by a factor of 8-9.
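
One caveat worth noting: the exported graph above returns the backbone's token-level embeddings, not pooled sentence embeddings. A rough numpy sketch (not from this thread, untested) of mean pooling the ONNX output with the attention mask, reusing the output and model_input variables from the inference step above and assuming a mean-pooling model such as msmarco-distilroberta-base-v2:

import numpy as np

# token-level output of the ONNX session: shape (batch, sequence, hidden)
token_embeddings = output[0]
# attention mask from the tokenizer: (batch, sequence) -> (batch, sequence, 1)
mask = np.expand_dims(model_input["attention_mask"], axis=-1)

# mean pooling over the real (unmasked) tokens only
summed = (token_embeddings * mask).sum(axis=1)
counts = np.clip(mask.sum(axis=1), 1e-9, None)
sentence_embedding = summed / counts   # shape (batch, hidden)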

nreimers commented 3 years ago

@oborchers That is interesting, thanks for posting. I sadly don't have any experience with ONNX yet. Do you know a good tutorial that covers the basics of what the format has to look like so that it can be used with transformers models?

oborchers commented 3 years ago

@nreimers: Actually I have to excuse myself. I wasn't aware that the sentence-transformers models have been pushed to the model hub, as described in #46. This makes them capable of ONNX export. For example: sentence-transformers/bert-large-nli-mean-tokens

sbert = transformers.FeatureExtractionPipeline(
    model=transformers.AutoModel.from_pretrained("sentence-transformers/bert-large-nli-mean-tokens"),
    tokenizer=transformers.AutoTokenizer.from_pretrained("sentence-transformers/bert-large-nli-mean-tokens"),
    framework="pt",
    device=0
)
%%timeit
sbert(span)[0]
> 25.4 ms ± 127 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

and under ONNX:

from pathlib import Path  # needed for the output path below
from transformers.convert_graph_to_onnx import convert

convert(
    framework="pt", 
    model="sentence-transformers/bert-large-nli-mean-tokens", 
    output=Path("onnx_bert_large/bert-large-nli-mean-tokens-onnx.onnx"), 
    opset=11
)
import onnxruntime as rt
import numpy as np

opt = rt.SessionOptions()
sess = rt.InferenceSession("onnx_bert_large/bert-large-nli-mean-tokens-onnx.onnx", opt)
%%timeit
model_input = sbert.tokenizer.encode_plus(span)
model_input = {name : np.atleast_2d(value) for name, value in model_input.items()}
output = sess.run(None, model_input)
> 4.61 ms ± 3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
assert np.allclose(
    np.array(sbert(span)[0]),
    output[0][0],
    atol=1e-4,
)

There already seems to be a PR with an ONNX example. However, the PR in #386 has a weakness: it does not include the pooling layer in the final ONNX model, which should require only some minor modifications. I'm going to investigate this further.
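
Not from this thread, but as a rough, untested sketch of what including the pooling in the exported graph could look like: wrap the Hugging Face model and mean pooling in one torch.nn.Module and export that, so the ONNX model outputs sentence embeddings directly (the class name SentenceEmbeddingModel and the output file name are made up):

import torch
from transformers import AutoModel, AutoTokenizer

class SentenceEmbeddingModel(torch.nn.Module):
    """Hypothetical wrapper: backbone + mean pooling in a single exportable graph."""
    def __init__(self, model_name):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(model_name)

    def forward(self, input_ids, attention_mask):
        token_embeddings = self.backbone(input_ids=input_ids, attention_mask=attention_mask)[0]
        mask = attention_mask.unsqueeze(-1).float()
        return (token_embeddings * mask).sum(1) / mask.sum(1).clamp(min=1e-9)

model_name = "sentence-transformers/bert-large-nli-mean-tokens"
wrapper = SentenceEmbeddingModel(model_name).eval()
tokenizer = AutoTokenizer.from_pretrained(model_name)
dummy = tokenizer("How big is London", return_tensors="pt")

torch.onnx.export(
    wrapper,
    (dummy["input_ids"], dummy["attention_mask"]),
    "bert-large-nli-mean-tokens-pooled.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["sentence_embedding"],
    dynamic_axes={
        "input_ids": {0: "batch", 1: "sequence"},
        "attention_mask": {0: "batch", 1: "sequence"},
        "sentence_embedding": {0: "batch"},
    },
    opset_version=11,
)

An ONNX session over this file should then return one vector per input sentence directly, without the pooling having to be re-implemented on the numpy side.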

nreimers commented 3 years ago

@oborchers Yes, some models are on the model hub. The rest will follow soon.

Looking forward to seeing how to integrate the mean pooling into ONNX, and happy to learn more about ONNX.

oborchers commented 3 years ago

@nreimers Great! If they are on the model hub, this will work.

I've got a running and working version with integrated mean pooling in the ONNX runtime on GPU. I'm going to create a PR for this as soon as the clean-up is done.

For now, for "bert-base-nli-stsb-mean-tokens" the results look like this:

Base SentenceTransformer model:

14.9 ms ± 32.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

And under ONNX:

2.19 ms ± 12.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Validation is also correct:

np.allclose(model_raw.encode(span), onnx_result, atol=1e-6)
> True

For more resources on ONNX:

Matthieu-Tinycoaching commented 2 years ago

Hi @oborchers, that looks super interesting, but how do you use the sentence-transformers library to first fine-tune a pre-trained model and then export it to ONNX format?

Thanks!

chintanckg commented 2 years ago

Thanks for the walkthrough! I have a follow-up question:

Say we are fine-tuning a SentenceTransformer using the approach below:

from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

#Define the model. Either from scratch or by loading a pre-trained model
model = SentenceTransformer('distilbert-base-nli-mean-tokens')

#Define your train examples. You need more than just two examples...
train_examples = [InputExample(texts=['My first sentence', 'My second sentence'], label=0.8),
    InputExample(texts=['Another pair', 'Unrelated sentence'], label=0.3)]

#Define your train dataset, the dataloader and the train loss
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.CosineSimilarityLoss(model)

#Tune the model
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=100)

Now I want to export this model to ONNX format in such a way that sess.run returns the embedding vector (equivalent to the output of model.encode).

Is this possible, and if yes, how?

chintanckg commented 2 years ago

Hi @oborchers, I was able to learn from your notebook and successfully exported the model in ONNX format. However, we need to compute model_input separately (the tokenizer.encode_plus step); is there a way this can be integrated into the ONNX model?

Graduo commented 2 years ago

(quoting @oborchers' bert-large-nli-mean-tokens ONNX export example from earlier in this thread)

Hi, thanks for your work on the ONNX export example. I notice that some models, like "distiluse-base-multilingual-cased-v1", include not only the pooling layer but also an additional dense layer. Could I get any idea of how to deal with the dense layer? Thanks.
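
Not an answer from the thread, but for context: in sentence-transformers the extra Dense module is just a torch Linear plus an activation (exposed as .linear and .activation_function), so as a rough, untested sketch it could either be applied to the pooled ONNX output afterwards or baked into an export wrapper like the ones sketched above (the module index below is an assumption about this model's layout):

import torch
from sentence_transformers import SentenceTransformer

st_model = SentenceTransformer("distiluse-base-multilingual-cased-v1")
dense = st_model[2]   # assumed module order: 0 = Transformer, 1 = Pooling, 2 = Dense

with torch.no_grad():
    pooled = torch.randn(1, dense.linear.in_features)                  # stand-in for a mean-pooled ONNX output
    final_embedding = dense.activation_function(dense.linear(pooled))  # e.g. 768 -> 512 for this model
print(final_embedding.shape)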

ArEnSc commented 2 years ago

Hey @oborchers, I followed your notebook, but it appears that the ONNX model output doesn't actually match within the error delta you used in your assert. Is anyone experiencing the same problem? I made sure to put the model in eval mode when exporting. Top is the hidden output of the model and bottom is the ONNX output of the model. I also compared it to sentence_transformer_model = SentenceTransformer(model_name), which the ONNX model failed against, but the model above didn't.

torch.Size([1, 17, 768])
(1, 17, 768)
[[[ 0.06331332  0.01103058  0.0281221  ... -0.05699859 -0.05175947
   -0.03224416]
  [ 0.01755954  0.13200025  0.00444082 ... -0.1643722  -0.05018917
    0.06951934]
  [-0.0188169   0.13540238  0.07517624 ... -0.13206749 -0.13428675
    0.06266046]
  ...
  [ 0.0776599   0.17890096 -0.01615191 ... -0.0398508   0.02992048
   -0.06202405]
  [ 0.04454883  0.02786158 -0.01897008 ... -0.02922158 -0.00326436
   -0.04401062]
  [ 0.05926578  0.01135097  0.0350358  ... -0.02414411 -0.07540198
   -0.05405766]]]
tensor([[[ 0.0633,  0.0110,  0.0281,  ..., -0.0570, -0.0518, -0.0322],
         [ 0.0176,  0.1320,  0.0044,  ..., -0.1644, -0.0502,  0.0695],
         [-0.0188,  0.1354,  0.0752,  ..., -0.1321, -0.1343,  0.0627],
         ...,
         [ 0.0777,  0.1789, -0.0162,  ..., -0.0399,  0.0299, -0.0620],
         [ 0.0445,  0.0279, -0.0190,  ..., -0.0292, -0.0033, -0.0440],
         [ 0.0593,  0.0114,  0.0350,  ..., -0.0241, -0.0754, -0.0541]]])

Ywandung-Lyou commented 6 months ago

(quoting @oborchers' distilroberta-base ONNX export walkthrough and benchmarks from earlier in this thread)

There is a drawback to this method: only the backbone model is accelerated. Sometimes we may build a model such as the following:

from sentence_transformers import SentenceTransformer, models

base_model = models.Transformer('WangZeJun/simcse-tiny-chinese-wiki')
pooling_layer = models.Pooling(base_model.get_word_embedding_dimension())
normalize = models.Normalize()
mymodel = SentenceTransformer(modules=[base_model, pooling_layer, normalize])

And I cannot find a method to convert mymodel to ONNX format.
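
Not from the thread either, but a rough, untested sketch of one way around this: put the backbone, mean pooling, and L2 normalization into a single torch.nn.Module and export that, so the whole pipeline ends up in one ONNX graph (the class name FullPipeline and the output file name are invented; this assumes the Pooling module is in its default mean-tokens mode):

import torch
import torch.nn.functional as F

class FullPipeline(torch.nn.Module):
    """Hypothetical wrapper: transformer + mean pooling + L2 normalization."""
    def __init__(self, st_model):
        super().__init__()
        self.backbone = st_model._first_module().auto_model   # underlying HF transformer

    def forward(self, input_ids, attention_mask):
        token_embeddings = self.backbone(input_ids=input_ids, attention_mask=attention_mask)[0]
        mask = attention_mask.unsqueeze(-1).float()
        pooled = (token_embeddings * mask).sum(1) / mask.sum(1).clamp(min=1e-9)
        return F.normalize(pooled, p=2, dim=1)                 # same effect as models.Normalize()

wrapper = FullPipeline(mymodel).eval()
dummy = mymodel.tokenizer("一个测试句子", return_tensors="pt")

torch.onnx.export(
    wrapper,
    (dummy["input_ids"], dummy["attention_mask"]),
    "simcse-tiny-chinese-wiki-full.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["sentence_embedding"],
    dynamic_axes={
        "input_ids": {0: "batch", 1: "sequence"},
        "attention_mask": {0: "batch", 1: "sequence"},
        "sentence_embedding": {0: "batch"},
    },
    opset_version=11,
)

The resulting embeddings would still need to be checked against mymodel.encode() with np.allclose, as done earlier in the thread.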