HenryDashwood opened this issue 3 years ago
Thank you! @tobigue was able to export mBART to ONNX, he might be able to help.
Hey, I also came across this when trying to adapt the fastT5 code for converting (m)BART to ONNX, and I think it is due to the fact that the encoder_hidden_states are passed as key_value_states into MBartAttention down the line and are not used when past_key_value is given. So my guess was that it is not included in the exported graph for that reason. I'll try to share my progress in a Colab notebook tomorrow.
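In the meantime, a quick way to confirm that the input really was pruned from the graph is to list the graph inputs with the onnx package. This is just a sketch; the path is a placeholder for wherever your decoder was exported:

```python
import onnx

# Placeholder path -- point this at your exported decoder.
model = onnx.load("models/mbart-large-en-ro/decoder/model.onnx")

# If encoder_hidden_states is unused once past_key_value is provided,
# it will simply be missing from this list of graph inputs.
print([inp.name for inp in model.graph.input])
```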
Ah that's really good to know. Thought I was going mad for a bit! Please do keep us updated!
@HenryDashwood this is my current progress when converting mbart to onnx: https://colab.research.google.com/drive/1b4gCQJbfdr2nEKyb5xvbjeBrTvkelG9L?usp=sharing
Currently the ONNX model gives a (slightly) different output than the PyTorch model for the same input, and I'm not sure why. You can see this at the very end of the notebook.
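If it helps with debugging, a minimal sketch for quantifying that difference could look like the following, assuming torch_logits and onnx_logits already hold the two sets of final logits for the same input:

```python
import numpy as np

# torch_logits: torch.Tensor returned by the PyTorch model
# onnx_logits:  numpy array returned by the ONNX Runtime session
torch_np = torch_logits.detach().cpu().numpy()

print("max abs diff:", np.abs(torch_np - onnx_logits).max())

# Exact equality is not expected -- operator implementations differ slightly --
# so a loose tolerance is the more useful check.
np.testing.assert_allclose(torch_np, onnx_logits, atol=1e-4)
```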
Also, as I had to use the external data format (the mBART model export failed when I did not, because apparently the export is >2GB for some reason, even though T5-large fits into <2GB with fastT5), quantizing the model is not straightforward: ONNXQuantizer uses infer_shapes, but the external data format is only supported by infer_shapes_path. There is also a problem when loading models that are not saved in ./, because f.name is used instead of f.
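For context, the two shape-inference entry points behave differently: infer_shapes works on an in-memory ModelProto (and so hits the 2GB protobuf limit), while infer_shapes_path reads and writes model files and is the variant that supports external data. Roughly, with placeholder paths:

```python
import onnx
from onnx import shape_inference

# In-memory shape inference -- fine for small models, but it cannot handle
# weights stored in the external data format / models over the 2GB limit.
model = onnx.load("models/small-model.onnx")
inferred = shape_inference.infer_shapes(model)

# File-based shape inference -- this is the one that works with external data.
shape_inference.infer_shapes_path(
    "models/mbart-large-en-ro/decoder/model.onnx",
    "models/mbart-large-en-ro/decoder/model-inferred.onnx",
)
```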
I changed those places in the onnx/onnxruntime code and was able to run the quantization; however, when trying to load the quantized models I get this:
[ONNXRuntimeError] : 1 : FAIL : Load model from models/mbart-large-en-ro/decoder-quantized/model.onnx failed:This is an invalid model. Error: Duplicate definition of name (decoder.embed_tokens.weight).
If you have any insights on the differing outputs, or how to make the quantization work, or manage to export and quantize a BART model in another way, sharing is much appreciated. :)
Cheers!
The reason the init decoder was so large was that it exported the token embeddings twice. With do_constant_folding=False it only exports them once, and then the model can be exported without using the external data format. I also looked at the mBART implementation again and saw that they do not do the rescaling of the output (no * (self.model_dim ** -0.5)), but do add a final_logits_bias. I adjusted the notebook accordingly and will try quantization tomorrow.
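For anyone following along, the relevant change is just one flag on the export call. A rough sketch, assuming a decoder_with_lm_head wrapper module and example inputs like the ones built in the notebook (names, opset and paths here are illustrative, not the notebook's exact code):

```python
import torch

# decoder_with_lm_head: assumed wrapper that takes
# (input_ids, attention_mask, encoder_hidden_states) and returns logits.
torch.onnx.export(
    decoder_with_lm_head,
    (input_ids, attention_mask, encoder_hidden_states),
    "models/mbart-large-en-ro/init-decoder.onnx",
    export_params=True,
    opset_version=12,
    do_constant_folding=False,  # avoids duplicating the token embeddings
    input_names=["input_ids", "attention_mask", "encoder_hidden_states"],
    output_names=["logits"],
    dynamic_axes={
        "input_ids": {0: "batch", 1: "sequence"},
        "attention_mask": {0: "batch", 1: "sequence"},
        "encoder_hidden_states": {0: "batch", 1: "sequence"},
        "logits": {0: "batch", 1: "sequence"},
    },
)
```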
I got the quantization for the exported model to work, but only using onnxruntime==1.6.0. See this issue.
cool!
I also had some issues with 1.7.0. While using onnxruntime==1.7.0 for quantizing, it created extra models (here's the issue). Applying optimize_model=False was able to fix it.
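Roughly, the quantization call then looks something like this. The paths are placeholders, and the optimize_model flag is specific to the onnxruntime versions discussed in this thread:

```python
from onnxruntime.quantization import quantize_dynamic, QuantType

# Placeholder paths -- adjust to wherever the exported model lives.
quantize_dynamic(
    "models/mbart-large-en-ro/decoder/model.onnx",
    "models/mbart-large-en-ro/decoder/model-quantized.onnx",
    weight_type=QuantType.QInt8,
    optimize_model=False,  # skip the pre-quantization optimization pass that produced the extra models
)
```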
Constant folding replaces some of the operations that have all-constant inputs; it's not clear why it creates the embedding twice in BART. In T5 I did not face any issue with using do_constant_folding=True.
It only happens for the init_decoder and I saw that in fastT5 you do not do constant folding for the init decoder (only for encoder and decoder). https://github.com/Ki6an/fastT5/blob/master/fastT5/onnx_exporter.py#L196
If you also export the init decoder with constant folding and use the external data format (or inspect the model with netron), we might see the same behaviour.
Also, I noticed that in the notebook:

```python
input_names = [x.name for x in self.decoder.get_inputs()]
inputs = [
    input_ids.cpu().numpy(),
    attention_mask.cpu().numpy(),
] + [
    tensor.cpu().numpy() for tensor in flat_past_key_values
]
decoder_inputs = dict(zip(input_names, inputs))
decoder_outputs = self.decoder.run(None, decoder_inputs)
```
In the above code you are using two for loops & zip; it might affect the model speed during inference.
fastT5 code for comparison:

```python
past_key_values = {
    f"pkv_{i}": pkv.cpu().numpy() for i, pkv in enumerate(flat_past_key_values)
}
decoder_outputs = self.decoder.run(None, {**decoder_inputs, **past_key_values})
```
I've not tested how much effect it might have.
> It only happens for the init_decoder and I saw that in fastT5 you do not do constant folding for the init decoder (only for encoder and decoder). https://github.com/Ki6an/fastT5/blob/master/fastT5/onnx_exporter.py#L196
You're right, I did not notice it before, but after testing I got these results:

```python
# t5-large =>
# init_decoder = 1.75G with do_constant_folding=True
# init_decoder = 1.62G with do_constant_folding=False
```

Just a small difference, and for smaller models it's negligible :)
> Just a small difference, and for smaller models it's negligible :)
Yeah, the English models have quite a small vocabulary compared to multilingual models like mT5 and mBART.
You are right about the loops + zip; I would suspect that this should be quite fast though, because we are only iterating over ~50 items. When everything works as expected, maybe just naming the inputs from 1 to X is the easier and faster way. :+1:
Nice work! Interestingly, I have been able to quantize BART with onnxruntime 1.7.0 on my Mac without setting optimize_model to False.
Interesting that you don't get the Duplicate definition of name (decoder.embed_tokens.weight) error with 1.7.0. Did you do something different when building the simplified models?
I think optimization works in your case, because the vocabulary of BART is much smaller than mBART, so duplicating the embeddings will probably not result in a model >2GB.
Yeah, I suspect that's right. I haven't had to use external_data_format or anything like that. How are you doing in terms of speed? I'm not really seeing any speedups in my code, which is probably a bug I've introduced, because when I run your notebook, I do. Here's mine: https://colab.research.google.com/drive/1e3b9_a6UvNjJWemqubYA7YKxC7vYrvAX?usp=sharing
@HenryDashwood Hi, I'm trying to export the bart-large-cnn model for summarization to ONNX and tried to use your Colab, but got stuck with an error on the export step:
RuntimeError: Unable to cast from non-held to held instance (T& to Holder<T>) (compile in debug mode for type information)
Did you face it and how did you fix it?
Very odd error that I also get sometimes and my only answer at this point is... have you tried running the cell again? That usually sorts it
@HenryDashwood thanks, I was able to run it, but I have a question. Is there a difference between running the model through the generate function and running it directly like this?

```python
ort_session = onnxruntime.InferenceSession("super_resolution.onnx")
ort_outs = ort_session.run(None, ort_inputs)
```
If I understand your answer correctly, we do both. ort_session.run() gets called inside generate to get predictions from the model. But we need more than just a single set of predictions to do something like beam search, so generate() takes care of that as well.
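To make that concrete, here is a heavily simplified greedy-decoding sketch of the kind of loop generate() runs on top of session.run(). The input names, decoder_session and token ids are placeholders rather than the notebook's exact ones, and the real generate() also handles caching, beams, length penalties, etc.:

```python
import numpy as np

def greedy_generate(decoder_session, encoder_hidden_states, attention_mask,
                    start_token_id, eos_token_id, max_length=64):
    # Start from the decoder start token and repeatedly call the ONNX session,
    # appending the argmax token until EOS or max_length is reached.
    decoder_input_ids = np.array([[start_token_id]], dtype=np.int64)
    for _ in range(max_length):
        # Assumes the first session output holds the logits [batch, seq, vocab].
        logits = decoder_session.run(
            None,
            {
                "input_ids": decoder_input_ids,
                "attention_mask": attention_mask,
                "encoder_hidden_states": encoder_hidden_states,
            },
        )[0]
        next_token = int(logits[0, -1].argmax())
        decoder_input_ids = np.concatenate(
            [decoder_input_ids, np.array([[next_token]], dtype=np.int64)], axis=1
        )
        if next_token == eos_token_id:
            break
    return decoder_input_ids
```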
@HenryDashwood great work, I have been struggling with BART summarization to ONNX and this gives me a ray of hope... [My earlier tries at ONNX never gave final inference summaries similar to the non-ONNX versions. I am pretty sure I made some errors refactoring...]
I downloaded the Colab and, running it locally as-is without any modifications, it's throwing the error stated below. Curious if there are any particular versions of libraries I should be on?
RuntimeError: THPVariable_Check(tuple_elem) INTERNAL ASSERT FAILED at "/opt/conda/conda-bld/pytorch_1614378098133/work/torch/csrc/jit/passes/onnx/shape_type_inference.cpp":676, please report a bug to PyTorch.
On a side note, can something similar possibly be achieved via a JIT export? I tried that as well and, after getting the model export and Triton Server to work, I am failing at inference ... https://github.com/triton-inference-server/server/issues/2846
Resolved " RuntimeError: THPVariable_Check(tuple_elem) INTERNAL ASSERT FAILED " by upgrading to torch 1.8.1
@HenryDashwood et al., am I reading it right: the outcome of your analysis is that ONNX and the quantized model are slower than PyTorch??
If so, what is the best way to get better performance? I am clocking 900 ms to 1.4 s per call on a T4. I am working on a use case that requires this capability in real time, which is why I am doing this study to optimize, with the goal of getting to 500 ms or less. Any suggestions are much appreciated.
I know very little about ONNX and JIT exports, but my goal was to export these models, host them on a Triton Server with T4s, and hopefully get a speed boost as the end goal. Not sure if I am on the right path?
Hi @anshoomehra, sorry it took a while to get back. I haven't done much ONNX work in the last couple of weeks, so I'm not really sure why I didn't get a speedup. I'll get back to you if I work it out though!
@HenryDashwood appreciate the acknowledgment! Yes, please do continue (if time permits); yours is the only implementation I have seen working without losing final output quality, and if ONNX brings down the inference time it will be a significant value addition.
I am dedicating the next few days to it, and if I have any breakthroughs I shall share back ... Thanks much!!
@HenryDashwood I have been able to make some progress with ONNX; however, shifting to GPUs seems challenging.
I got ONNX Runtime to engage the GPU using providers=['CUDAExecutionProvider'] (this step itself was an issue because of conflicting libraries; we had to install onnxruntime-gpu and uninstall onnx and onnxruntime to make it work).
Now it seems that, with the way the graph is exported, some variables are still on the CPU, causing the error below. I tried moving the model and tokenizer to the cuda device, along with the exported version using CUDAExecutionProvider, but it still fails. Would you have any immediate thoughts on what code fix we may need to resolve this??
Did you try explicitly setting the device of token["input_ids"] and token["attention_mask"]? That's the only thing I can think of.
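For what it's worth, a minimal sketch of the GPU setup might look like the following. It requires the onnxruntime-gpu package; the path and input names are placeholders, and token is assumed to be the tokenizer output mentioned above:

```python
import onnxruntime

# Placeholder path; CPUExecutionProvider is listed as a fallback.
session = onnxruntime.InferenceSession(
    "models/bart-large-cnn/encoder/model.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)

# Even with the CUDA provider, session.run() takes plain numpy arrays,
# so tensors that live on a CUDA device must come back to host memory first.
inputs = {
    "input_ids": token["input_ids"].cpu().numpy(),
    "attention_mask": token["attention_mask"].cpu().numpy(),
}
outputs = session.run(None, inputs)
```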
You might be interested in following this work as it makes progress https://github.com/huggingface/transformers/pull/11786
@HenryDashwood @tobigue I am looking at your implementation of mBART on ONNX and wanted to see if you had any feedback on how to make this work for mBART-50 many-to-many: https://huggingface.co/facebook/mbart-large-50-many-to-many-mmt
Specifically, in the generate method it passes forced_bos_token_id as the first generated token to translate into the target language. Could you provide any pointers on where this parameter can be passed in your implementation: https://colab.research.google.com/drive/1b4gCQJbfdr2nEKyb5xvbjeBrTvkelG9L?usp=sharing#scrollTo=z5j0jM4CRJHj
Presumably it would be an extra argument passed to the decoder, and would be added to the list of decoder inputs?
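As a rough sketch of that idea: with the stock transformers generate() it is just an argument, and in a hand-rolled loop over the ONNX decoder the equivalent is to seed the decoder input with the target-language token right after the start token. The config and tokenizer objects below are placeholders:

```python
import numpy as np

# With the stock Hugging Face generate(), e.g.:
#   model.generate(**batch, forced_bos_token_id=tokenizer.lang_code_to_id["fr_XX"])
#
# In a custom decoding loop over the ONNX decoder, the equivalent is to start
# from [decoder_start_token_id, forced_bos_token_id] instead of the start token alone.
decoder_start_token_id = config.decoder_start_token_id    # placeholder config object
forced_bos_token_id = tokenizer.lang_code_to_id["fr_XX"]  # target-language code

decoder_input_ids = np.array(
    [[decoder_start_token_id, forced_bos_token_id]], dtype=np.int64
)
# ...continue the usual step-by-step decoding from here.
```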
Hello @anshoomehra did you manage to solve the issues with the GPU shifting?
@HenryDashwood @Ki6an when replicating your Colab notebook I am facing the following error:
TypeError: forward() got an unexpected keyword argument 'cross_attn_head_mask'
This seems like an issue similar to the following one: https://github.com/Ki6an/fastT5/issues/18
These are the library versions I am using:
torch - 1.10.0
transformers - 4.11.3
onnx - 1.10.2
onnxruntime - 1.7.0
@sidsharma72 we had the same issue with T5 and were able to resolve it in this commit. Make sure to add the cross_attn_head_mask=None argument to the forward method of OnnxMBart in the notebook, and try again.
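For illustration only (the notebook's actual OnnxMBart class has more arguments), the shape of the fix is to let the wrapper's forward accept the extra keyword:

```python
import torch

class OnnxMBartSketch(torch.nn.Module):
    """Illustrative sketch, not the notebook's actual class."""

    def __init__(self, decoder, lm_head):
        super().__init__()
        self.decoder = decoder
        self.lm_head = lm_head

    def forward(self, input_ids, attention_mask, encoder_hidden_states,
                cross_attn_head_mask=None):
        # Newer transformers versions pass cross_attn_head_mask along, so the
        # wrapper's forward() has to accept it; passing None through is fine.
        outputs = self.decoder(
            input_ids=input_ids,
            attention_mask=attention_mask,
            encoder_hidden_states=encoder_hidden_states,
            cross_attn_head_mask=cross_attn_head_mask,
        )
        return self.lm_head(outputs[0])
```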
Thanks for the response - I have been able to fix the issue.
Hey @HenryDashwood @tobigue, I was able to use the Colab for the model lidiya/bart-large-xsum-samsum. ONNX was considerably better on CPU, i.e. 4 s whereas PyTorch was 11 s, but on GPU ONNX was taking ~5 s whereas PyTorch was ~500 ms. I have moved all the tensors to GPU and converted all the models to GPU. Can you please let me know if this is expected?
Hey @mohanvamsibu-kore, I don't have experience with ONNX on GPU, but from what I understand the most efficient way to run this stuff on GPU is by using TensorRT. They have an example for T5, which is pretty similar to BART, here: https://github.com/NVIDIA/TensorRT/tree/main/demo/HuggingFace
@tobigue @HenryDashwood @Ki6an Thank you very much for your work, very helpful! I can now convert the BART model to an ONNX model and the outputs of the two are consistent, but I would like to ask whether you have tried deploying the ONNX model to TensorRT. So far I have been able to run on TensorRT, but the TensorRT results are not consistent with the ONNX model :(
Hello there. This is a really fantastic project. I'm trying to extend your work to BART but I've run into some strange behaviour.
I've made a Colab notebook to illustrate the problem. Specifically, when converting BART to ONNX, the encoder_hidden_states input does not get included in the ONNX model's graph. As you can see from the notebook though, it works perfectly for T5. I realise this is out of scope for the fastT5 project, but thought someone who comes across this issue might have experienced a similar problem and be able to help. This may also be useful to know in case you have plans to expand this project to include models like BART in the future.