Open codingliuyg opened 3 years ago
@codingliuyg Sadly, I have never worked with ONNX, so I cannot help here, as I don't know what is happening.
@nreimers: Thanks a bunch for this library. Do you think it would be possible to modify the library so that the forward argument is not a dictionary, exactly as it is done in the transformers library? That would fix the ONNX incompatibility. ONNX can speed up the models by a factor of 2 to 8 on a V100.
While the ONNX export of the model works, specifying dynamic axes (which are required for inputs) does not work with a dict input. Thus, one can only export for a fixed-size input, which is useless.
To give a working example for distilroberta-base:
1. Exporting the model:
import torch
import transformers
from transformers import convert_graph_to_onnx
from sentence_transformers import SentenceTransformer, util
sent_roberta = SentenceTransformer('msmarco-distilroberta-base-v2', device="cuda")
base_roberta = convert_graph_to_onnx.load_graph_from_args("feature-extraction", "pt", "distilroberta-base", None)
with torch.no_grad():
    input_names, output_names, dynamic_axes, tokens = convert_graph_to_onnx.infer_shapes(base_roberta, "pt")
    ordered_input_names, model_args = convert_graph_to_onnx.ensure_valid_input(base_roberta.model, tokens, input_names)
# Result of the upper assignments:
input_names = ['input_ids', 'attention_mask']
output_names = ['output_0', 'output_1']
dynamic_axes = {
    'input_ids': {0: 'batch', 1: 'sequence'},
    'attention_mask': {0: 'batch', 1: 'sequence'},
    'output_0': {0: 'batch', 1: 'sequence'},
    'output_1': {0: 'batch'}
}
tokens = {
    'input_ids': torch.tensor([[0, 713, 16, 10, 7728, 4195, 2]]).long(),
    'attention_mask': torch.tensor([[1, 1, 1, 1, 1, 1, 1]]).long()
}
ordered_input_names = ['input_ids', 'attention_mask']
model_args = (
    torch.Tensor([[0, 713, 16, 10, 7728, 4195, 2]]).long(),
    torch.Tensor([[1, 1, 1, 1, 1, 1, 1]]).long()
)
# Regular export by torch
torch.onnx.export(
    base_roberta.model,
    model_args,
    f="roberta.onnx",
    input_names=input_names,
    output_names=output_names,
    dynamic_axes=dynamic_axes,
    do_constant_folding=True,
    use_external_data_format=False,
    enable_onnx_checker=True,
    opset_version=11,
)
2. Loading the model and providing inference
import onnxruntime as rt
import numpy as np
opt = rt.SessionOptions()
opt.graph_optimization_level = rt.GraphOptimizationLevel.ORT_ENABLE_EXTENDED
opt.log_severity_level = 3
opt.execution_mode = rt.ExecutionMode.ORT_SEQUENTIAL
sess = rt.InferenceSession("roberta.onnx", opt)
span = 'How big is London'
model_input = base_roberta.tokenizer.encode_plus(span)
model_input = {name : np.atleast_2d(value) for name, value in model_input.items()}
output = sess.run(None, model_input)
np.allclose(
    base_roberta(span)[0][0],
    output[0][0][0].tolist(),
    atol=1e-6,
)
3. Benchmarking (on a V100)
%%timeit
base_roberta(span)[0][0]
> 15.9 ms ± 65.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
sess.run(None, model_input)[0][0][0]
> 998 µs ± 3.61 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%%timeit
sent_roberta.encode(span)
> 8.93 ms ± 157 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
So, effectively, this could result in a speedup by a factor of 8-9.
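One caveat when reading these numbers: sess.run is timed on an input that is already tokenized, while sent_roberta.encode also tokenizes and pools. A rough way to time both paths on equal footing outside a notebook might look like the sketch below. The helper name time_call and the stand-in workload are made up for illustration; the idea is to pass in one function per path, each covering tokenization through to the final vector.

```python
import statistics
import time


def time_call(fn, repeats=7, loops=100):
    """Return (mean, stdev) in seconds per call over `repeats` runs of `loops` calls."""
    per_call = []
    for _ in range(repeats):
        start = time.perf_counter()
        for _ in range(loops):
            fn()
        per_call.append((time.perf_counter() - start) / loops)
    return statistics.mean(per_call), statistics.stdev(per_call)


# Stand-in workload so the sketch runs on its own; in practice pass a function
# that tokenizes the text and calls sess.run, and another that calls encode.
mean_s, std_s = time_call(lambda: sum(range(1000)), repeats=3, loops=50)
print(f"{mean_s * 1e6:.1f} µs ± {std_s * 1e6:.1f} µs per call")
```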
@oborchers That is interesting, thanks for posting. Sadly, I don't have any experience with ONNX yet. Do you know a good tutorial that covers the basics of what the format has to look like so that it can be used with transformers models?
@nreimers: Actually, I have to apologize. I wasn't aware that the sentence-transformers models have been pushed to the model hub, as described in #46. This makes them capable of ONNX export. For example: sentence-transformers/bert-large-nli-mean-tokens
sbert = transformers.FeatureExtractionPipeline(
    model=transformers.AutoModel.from_pretrained("sentence-transformers/bert-large-nli-mean-tokens"),
    tokenizer=transformers.AutoTokenizer.from_pretrained("sentence-transformers/bert-large-nli-mean-tokens"),
    framework="pt",
    device=0
)
%%timeit
sbert(span)[0]
> 25.4 ms ± 127 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
and under ONNX:
from pathlib import Path
from transformers.convert_graph_to_onnx import convert
convert(
    framework="pt",
    model="sentence-transformers/bert-large-nli-mean-tokens",
    output=Path("onnx_bert_large/bert-large-nli-mean-tokens-onnx.onnx"),
    opset=11
)
import onnxruntime as rt
import numpy as np
opt = rt.SessionOptions()
sess = rt.InferenceSession("onnx_bert_large/bert-large-nli-mean-tokens-onnx.onnx", opt)
%%timeit
model_input = sbert.tokenizer.encode_plus(span)
model_input = {name : np.atleast_2d(value) for name, value in model_input.items()}
output = sess.run(None, model_input)
> 4.61 ms ± 3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
assert np.allclose(
    np.array(sbert(span)[0]),
    output[0][0],
    atol=1e-4,
)
There already seems to be a pull request with an ONNX example. However, the PR in #386 has a weakness: it does not include the pooling layer in the final ONNX model. Fixing that should only require some minor modifications. I'm going to investigate this further.
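Until pooling is part of the exported graph, one workaround might be to apply it to the token-level ONNX output afterwards. The sketch below is a plain NumPy mean pooling over the attention mask, intended to mirror what a mean-tokens model does. The function name mean_pool and the toy data are made up for illustration.

```python
import numpy as np


def mean_pool(token_embeddings, attention_mask):
    """Average token vectors over the positions where attention_mask == 1."""
    mask = attention_mask[..., None].astype(token_embeddings.dtype)  # (batch, seq, 1)
    summed = (token_embeddings * mask).sum(axis=1)                   # (batch, hidden)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)                   # avoid division by zero
    return summed / counts


# Toy data: batch of 1, three tokens (the last one is padding), hidden size 2
tokens = np.array([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
mask = np.array([[1, 1, 0]])
print(mean_pool(tokens, mask))  # [[2. 3.]]
```

With the session from this thread, the call would presumably be mean_pool(output[0], model_input["attention_mask"]).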
@oborchers Yes, some models are on model hub. The rest will follow soon.
Looking forward to seeing how to integrate the mean pooling into ONNX, and happy to learn more about ONNX.
@nreimers Great! If they are on the model hub this will work.
I've got a working version with integrated mean pooling running in ONNX Runtime on GPU. I'm going to create a pull request for this as soon as the clean-up is done.
For now, the results for "bert-base-nli-stsb-mean-tokens" look like this:
Base Sentence Transformer model:
14.9 ms ± 32.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
And under ONNX:
2.19 ms ± 12.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Validation also passes:
np.allclose(model_raw.encode(span), onnx_result, atol=1e-6)
> True
For more resources on ONNX:
Hi @oborchers, that looks super interesting, but how can I use the sentence-transformers library to first fine-tune a pre-trained model and then export it to ONNX format?
Thanks!
Thanks for the walkthrough! I have a follow-up question:
Say we are fine-tuning a Sentence Transformer using the approach below:
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader
#Define the model. Either from scratch or by loading a pre-trained model
model = SentenceTransformer('distilbert-base-nli-mean-tokens')
#Define your train examples. You need more than just two examples...
train_examples = [InputExample(texts=['My first sentence', 'My second sentence'], label=0.8),
                  InputExample(texts=['Another pair', 'Unrelated sentence'], label=0.3)]
#Define your train dataset, the dataloader and the train loss
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.CosineSimilarityLoss(model)
#Tune the model
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=100)
Now I want to export this model to ONNX format in such a way that sess.run returns the embedding vector (the one equivalent to model.encode).
Is this possible? If yes, how?
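One possible approach, sketched under assumptions: wrap the Hugging Face backbone in a small module that takes positional tensors (not a dict) and does the mean pooling inside forward, then export that wrapper. MeanPooledEncoder, export_encoder and ToyBackbone are hypothetical names; the toy backbone only stands in for a real transformer so the sketch runs without downloading a model, and the export function is defined but not called here.

```python
import torch


class MeanPooledEncoder(torch.nn.Module):
    """Hypothetical wrapper: positional tensor inputs in, one sentence embedding out."""

    def __init__(self, backbone):
        super().__init__()
        self.backbone = backbone

    def forward(self, input_ids, attention_mask):
        # Assumes the backbone returns token embeddings as its first output
        token_embeddings = self.backbone(input_ids=input_ids, attention_mask=attention_mask)[0]
        mask = attention_mask.unsqueeze(-1).to(token_embeddings.dtype)
        summed = (token_embeddings * mask).sum(dim=1)
        counts = mask.sum(dim=1).clamp(min=1e-9)
        return summed / counts


def export_encoder(backbone, example_ids, example_mask, path):
    """Export the wrapper with dynamic batch and sequence axes (not called below)."""
    wrapper = MeanPooledEncoder(backbone).eval()
    with torch.no_grad():
        torch.onnx.export(
            wrapper,
            (example_ids, example_mask),
            path,
            input_names=["input_ids", "attention_mask"],
            output_names=["sentence_embedding"],
            dynamic_axes={
                "input_ids": {0: "batch", 1: "sequence"},
                "attention_mask": {0: "batch", 1: "sequence"},
                "sentence_embedding": {0: "batch"},
            },
            opset_version=11,
        )


class ToyBackbone(torch.nn.Module):
    """Stand-in for a transformer: embeds token ids and returns a 1-tuple."""

    def __init__(self):
        super().__init__()
        self.embed = torch.nn.Embedding(10, 4)

    def forward(self, input_ids, attention_mask):
        return (self.embed(input_ids),)


ids = torch.tensor([[1, 2, 3, 0]])
mask = torch.tensor([[1, 1, 1, 0]])  # last position is padding
embedding = MeanPooledEncoder(ToyBackbone())(ids, mask)
print(embedding.shape)  # torch.Size([1, 4])
```

For a fine-tuned SentenceTransformer, the backbone should, as far as I know, be reachable as model[0].auto_model, but treat that as an assumption to verify against the installed version.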
Hi @oborchers, I was able to learn from your notebook and successfully exported the model in ONNX format. However, we still need to compute model_input separately (the tokenizer.encode_plus step). Is there a way this can be integrated into the ONNX model?
Hi, thanks for your work on the example that exports the model to ONNX. I noticed that some models, like "distiluse-base-multilingual-cased-v1", include not only the pooling layer but also an additional dense layer. Could I get any ideas on how to deal with the dense layer? Thanks.
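One option that might work without touching the export is to apply the dense layer after pooling, outside the ONNX graph. The sketch below assumes the layer is a plain linear projection followed by an activation. The tanh default is an assumption, and apply_dense is a made-up name. The weight and bias would have to be read from the model's dense module, which is not shown here.

```python
import numpy as np


def apply_dense(pooled, weight, bias, activation=np.tanh):
    """Project pooled embeddings: activation(pooled @ weight.T + bias)."""
    return activation(pooled @ weight.T + bias)


# Toy data: one pooled vector of size 3 projected down to size 2
pooled = np.array([[1.0, -1.0, 0.5]])
weight = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 2.0]])  # shape (out=2, in=3), as in torch.nn.Linear
bias = np.zeros(2)
projected = apply_dense(pooled, weight, bias)
print(projected.shape)  # (1, 2)
```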
Hey @oborchers, I followed your notebook, but the ONNX model output does not appear to match the PyTorch output within the error tolerance you used in your assert. Is anyone else experiencing the same problem? I made sure to put the model in eval mode when exporting. Top is the hidden output of the model and bottom is the ONNX output of the model. I also compared it to sentence_tranformer_model = SentenceTransformer(model_name): the ONNX model failed the comparison against it, but the model above did not.
torch.Size([1, 17, 768])
(1, 17, 768)
[[[ 0.06331332 0.01103058 0.0281221 ... -0.05699859 -0.05175947
-0.03224416]
[ 0.01755954 0.13200025 0.00444082 ... -0.1643722 -0.05018917
0.06951934]
[-0.0188169 0.13540238 0.07517624 ... -0.13206749 -0.13428675
0.06266046]
...
[ 0.0776599 0.17890096 -0.01615191 ... -0.0398508 0.02992048
-0.06202405]
[ 0.04454883 0.02786158 -0.01897008 ... -0.02922158 -0.00326436
-0.04401062]
[ 0.05926578 0.01135097 0.0350358 ... -0.02414411 -0.07540198
-0.05405766]]]
tensor([[[ 0.0633, 0.0110, 0.0281, ..., -0.0570, -0.0518, -0.0322],
[ 0.0176, 0.1320, 0.0044, ..., -0.1644, -0.0502, 0.0695],
[-0.0188, 0.1354, 0.0752, ..., -0.1321, -0.1343, 0.0627],
...,
[ 0.0777, 0.1789, -0.0162, ..., -0.0399, 0.0299, -0.0620],
[ 0.0445, 0.0279, -0.0190, ..., -0.0292, -0.0033, -0.0440],
[ 0.0593, 0.0114, 0.0350, ..., -0.0241, -0.0754, -0.0541]]])
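The two printouts agree to the four decimals shown, so any mismatch is smaller than that. Before concluding that the export is wrong, it may help to look at the largest absolute difference and at which tolerance the comparison starts to pass. Earlier in this thread both atol=1e-6 and atol=1e-4 were used. A minimal sketch, with compare_outputs as a made-up helper and synthetic data:

```python
import numpy as np


def compare_outputs(reference, candidate, tolerances=(1e-6, 1e-5, 1e-4, 1e-3)):
    """Report the largest absolute difference and allclose results per tolerance."""
    reference = np.asarray(reference, dtype=np.float64)
    candidate = np.asarray(candidate, dtype=np.float64)
    diff = np.abs(reference - candidate)
    report = {"max_abs_diff": float(diff.max())}
    for atol in tolerances:
        # rtol=0 so that only the absolute tolerance decides the outcome
        report[f"allclose@{atol:g}"] = bool(np.allclose(reference, candidate, rtol=0.0, atol=atol))
    return report


# Synthetic example: a small constant offset added to a few values
reference = np.array([0.06331332, 0.01103058, 0.0281221])
candidate = reference + 5e-6
print(compare_outputs(reference, candidate))
```

With real outputs, reference would be the PyTorch tensor converted with .detach().cpu().numpy() and candidate the array returned by sess.run.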
This method has a drawback: only the backbone model is accelerated. Sometimes we build a model such as the following:
from sentence_transformers import SentenceTransformer, models
base_model = models.Transformer('WangZeJun/simcse-tiny-chinese-wiki')
pooling_layer = models.Pooling(base_model.get_word_embedding_dimension())
normalize = models.Normalize()
mymodel = SentenceTransformer(modules=[base_model, pooling_layer, normalize])
And I could not find a method to convert mymodel to ONNX format.
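If only the backbone ends up in the ONNX graph, the remaining modules could be applied to the ONNX output by hand. For the normalization step, a minimal NumPy sketch is below. It assumes the layer is a plain L2 normalization per row, and l2_normalize is a made-up name.

```python
import numpy as np


def l2_normalize(embeddings, eps=1e-12):
    """Scale each row to unit L2 norm; eps guards against all-zero rows."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    return embeddings / np.maximum(norms, eps)


vectors = np.array([[3.0, 4.0], [0.0, 2.0]])
normalized = l2_normalize(vectors)
print(np.linalg.norm(normalized, axis=1))  # each row now has unit length
```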
Hello, I have a fine-tuned model file and have reduced its dimension with PCA. Now I want to convert it to ONNX or pt format, as shown in the figure below.
Is there any way to get the input and output tensor names of the sentence BERT model?
@nreimers Looking forward to your reply, thank you in advance!
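For a model that has already been exported, the tensor names can be read from the ONNX Runtime session through get_inputs() and get_outputs(). A small sketch follows. describe_io is a made-up helper, and the fake session is only a stand-in so the snippet runs without a model file; its names and shapes are illustrative.

```python
from types import SimpleNamespace


def describe_io(session):
    """List (name, shape, type) for every input and output of an onnxruntime session."""
    inputs = [(arg.name, arg.shape, arg.type) for arg in session.get_inputs()]
    outputs = [(arg.name, arg.shape, arg.type) for arg in session.get_outputs()]
    return inputs, outputs


# Stand-in object; a real onnxruntime.InferenceSession exposes the same two methods.
fake_session = SimpleNamespace(
    get_inputs=lambda: [
        SimpleNamespace(name="input_ids", shape=["batch", "sequence"], type="tensor(int64)")
    ],
    get_outputs=lambda: [
        SimpleNamespace(name="output_0", shape=["batch", "sequence", 768], type="tensor(float)")
    ],
)
inputs, outputs = describe_io(fake_session)
print(inputs)
print(outputs)
```

With the sessions from this thread, the call would be describe_io(sess).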