Ki6an / fastT5

⚡ boost inference speed of T5 models by 5x & reduce the model size by 3x.

Different behaviour when extending this project to Bart #7

Open HenryDashwood opened 3 years ago

HenryDashwood commented 3 years ago

Hello there. This is a really fantastic project. I'm trying to extend your work to Bart, but I've run into some strange behaviour.

I've made a Colab notebook to illustrate the problem. Specifically, when converting Bart to ONNX, the encoder_hidden_states input does not get included in the ONNX model's graph. As you can see from the notebook, though, it works perfectly for T5.

I realise this is out of scope for the fastT5 project, but someone who comes across this issue might have experienced a similar problem and be able to help. This may also be useful to know in case you have plans to expand this project to include models like Bart in the future.

Ki6an commented 3 years ago

thank you!

@tobigue was able to export mbart to onnx; he might be able to help.

tobigue commented 3 years ago

Hey, I also came across this when trying to adapt the fastT5 code for converting (m)BART to ONNX. I think it is because the encoder_hidden_states are passed as key_value_states into MBartAttention down the line and are not used when past_key_value is given.

So my guess was that it is not included in the exported graph for that reason. I'll try to share my progress in a Colab notebook tomorrow.
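
For illustration, a minimal, self-contained sketch (the toy module and file name are made up, not the actual transformers code) of the pattern described above: when tracing the branch that receives a cached past_key_value, key_value_states never influences the output, so torch.onnx.export prunes it from the exported graph's inputs.

    import onnx
    import torch

    class ToyCrossAttention(torch.nn.Module):
        # Mimics the (M)BartAttention pattern: when a cached cross-attention
        # past_key_value is supplied, key_value_states (the encoder hidden
        # states) is never read.
        def forward(self, query, key_value_states, past_key_value=None):
            key = past_key_value if past_key_value is not None else key_value_states
            return query @ key.transpose(-1, -2)

    query = torch.randn(1, 4, 8)
    encoder_hidden_states = torch.randn(1, 6, 8)
    cached_key = torch.randn(1, 6, 8)

    # Trace the "with past" branch: key_value_states does not affect the
    # output here, so it is pruned from the exported graph.
    torch.onnx.export(
        ToyCrossAttention(),
        (query, encoder_hidden_states, cached_key),
        "toy_cross_attention.onnx",
        opset_version=12,
    )

    exported = onnx.load("toy_cross_attention.onnx")
    print([i.name for i in exported.graph.input])  # only two inputs remain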

HenryDashwood commented 3 years ago

Ah that's really good to know. Thought I was going mad for a bit! Please do keep us updated!

tobigue commented 3 years ago

@HenryDashwood this is my current progress when converting mbart to onnx: https://colab.research.google.com/drive/1b4gCQJbfdr2nEKyb5xvbjeBrTvkelG9L?usp=sharing

Currently the ONNX model gives a (slightly) different output than the PyTorch model for the same input, and I'm not sure why. You can see that at the very end of the notebook.

Also, I had to use the external data format: the mBART export failed without it because the exported model is apparently >2GB for some reason, even though T5-large fits into <2GB with fastT5. That makes quantizing the model not straightforward, because ONNXQuantizer uses infer_shapes, but the external data format is only supported by infer_shapes_path. There is also a problem when loading models that are not saved in ./, because f.name is used instead of f. I changed those places in the onnx/onnxruntime code and was able to run the quantization; however, when trying to load the quantized models I get this:

[ONNXRuntimeError] : 1 : FAIL : Load model from models/mbart-large-en-ro/decoder-quantized/model.onnx failed:This is an invalid model. Error: Duplicate definition of name (decoder.embed_tokens.weight).
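
For reference, a minimal sketch of the path-based shape inference mentioned above, which is the variant that works with models saved in the external data format (the file paths are placeholders):

    from onnx import shape_inference

    # For models >2GB stored with external data, the in-memory
    # shape_inference.infer_shapes call used by ONNXQuantizer fails;
    # the path-based variant reads and writes the model on disk instead.
    shape_inference.infer_shapes_path(
        "models/mbart-large-en-ro/decoder/model.onnx",
        "models/mbart-large-en-ro/decoder/model-with-shapes.onnx",
    )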

If you have any insights on the differing outputs or how to make the quantization work, or if you manage to export and quantize a BART model in another way, sharing is much appreciated. :)

Cheers!

tobigue commented 3 years ago

The reason the init decoder was so large was that it exported the token embeddings twice. With do_constant_folding=False it only exports them once, and then the model can be exported without using the external data format. I also looked at the mBART implementation again and saw that it does not do the rescaling of the output (no * (self.model_dim ** -0.5)), but does add a final_logits_bias. I adjusted the notebook accordingly and will try quantization tomorrow.
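
As a rough sketch of those two adjustments (the wrapper below is illustrative and heavily simplified, not the notebook's exact code; a real init decoder would also return the past key values):

    import torch
    from transformers import MBartForConditionalGeneration

    model = MBartForConditionalGeneration.from_pretrained(
        "facebook/mbart-large-en-ro"
    ).eval()

    class InitDecoderWrapper(torch.nn.Module):
        def __init__(self, model):
            super().__init__()
            self.decoder = model.model.decoder
            self.lm_head = model.lm_head
            self.final_logits_bias = model.final_logits_bias

        def forward(self, decoder_input_ids, encoder_hidden_states):
            hidden_states = self.decoder(
                input_ids=decoder_input_ids,
                encoder_hidden_states=encoder_hidden_states,
                return_dict=False,
            )[0]
            # mBART: no "* (model_dim ** -0.5)" rescaling as in T5, but the
            # final_logits_bias is added to the lm_head output.
            return self.lm_head(hidden_states) + self.final_logits_bias

    decoder_input_ids = torch.tensor([[model.config.eos_token_id]])
    encoder_hidden_states = torch.randn(1, 8, model.config.d_model)

    torch.onnx.export(
        InitDecoderWrapper(model),
        (decoder_input_ids, encoder_hidden_states),
        "mbart-init-decoder.onnx",
        # with do_constant_folding=True the tied token embeddings ended up in
        # the exported graph twice, which for mBART pushed it past the 2GB limit
        do_constant_folding=False,
        opset_version=12,
        input_names=["input_ids", "encoder_hidden_states"],
        output_names=["logits"],
        dynamic_axes={
            "input_ids": {0: "batch", 1: "sequence"},
            "encoder_hidden_states": {0: "batch", 1: "encoder_sequence"},
            "logits": {0: "batch", 1: "sequence"},
        },
    )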

tobigue commented 3 years ago

I got the quantization for the exported model to work, but only using onnxruntime==1.6.0. See this issue.

Ki6an commented 3 years ago

Cool! I also had some issues with 1.7.0: while using onnxruntime==1.7.0 for quantizing, it created extra models (here's the issue). Applying optimize_model=False was able to fix it.
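
For reference, a sketch of that workaround with the quantization API of that onnxruntime era (the paths are placeholders, and the optimize_model flag belongs to the 1.7-era API):

    from onnxruntime.quantization import QuantType, quantize_dynamic

    # Dynamic (weight-only) int8 quantization of an exported decoder.
    # optimize_model=False skips the extra optimization pass that, with
    # onnxruntime 1.7.0, produced the additional model files mentioned above.
    quantize_dynamic(
        model_input="mbart-decoder.onnx",
        model_output="mbart-decoder-quantized.onnx",
        weight_type=QuantType.QInt8,
        optimize_model=False,
    )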

Ki6an commented 3 years ago

Constant folding replaces some of the operations that have all-constant inputs; it's not clear why it creates the embeddings twice in BART. With T5 I did not face any issue using do_constant_folding=True.

tobigue commented 3 years ago

It only happens for the init_decoder, and I saw that in fastT5 you do not do constant folding for the init decoder (only for the encoder and decoder): https://github.com/Ki6an/fastT5/blob/master/fastT5/onnx_exporter.py#L196

If you also export the init decoder with constant folding and use the external data format (or inspect the model with netron), we might see the same behaviour.

Ki6an commented 3 years ago

Also, I noticed this in the notebook:

        input_names = [x.name for x in self.decoder.get_inputs()]
        inputs = [
            input_ids.cpu().numpy(),
            attention_mask.cpu().numpy(),
        ] + [
            tensor.cpu().numpy() for tensor in flat_past_key_values
        ]

        decoder_inputs = dict(zip(input_names, inputs))
        decoder_outputs = self.decoder.run(None, decoder_inputs)

In the above code you are using two for loops and zip, which might affect the model speed during inference. fastT5 code for comparison:

    past_key_values = {
        f"pkv_{i}": pkv.cpu().numpy() for i, pkv in enumerate(flat_past_key_values)
    }

    decoder_outputs = self.decoder.run(None, {**decoder_inputs, **past_key_values})

I've not tested how much effect it might have.

Ki6an commented 3 years ago

It only happens for the init_decoder, and I saw that in fastT5 you do not do constant folding for the init decoder (only for the encoder and decoder): https://github.com/Ki6an/fastT5/blob/master/fastT5/onnx_exporter.py#L196

You're right, I did not notice it before, but after testing I got these results:

# t5-large => 
# init_decoder = 1.75G with do_constant_folding=True
# init_decoder = 1.62G with do_constant_folding=False

Just a small difference, and for smaller models it's negligible :)

tobigue commented 3 years ago

Just a small difference, and for smaller models it's negligible :)

Yeah, the English models have quite a small vocabulary compared to multilingual models like mT5 and mBART.

You are right about the loops + zip; I would suspect that this should be quite fast though, because we are only iterating over ~50 items. When everything works as expected, maybe just naming the inputs from 1 to X is the easier and faster way. :+1:

HenryDashwood commented 3 years ago

Nice work! Interestingly, I have been able to quantize Bart with onnxruntime 1.7.0 on my Mac without setting optimize_model to False.

tobigue commented 3 years ago

Interesting that you don't get the Duplicate definition of name (decoder.embed_tokens.weight) error with 1.7.0. Did you do something different when building the simplified models?

I think optimization works in your case because the vocabulary of BART is much smaller than mBART's, so duplicating the embeddings will probably not result in a model >2GB.

HenryDashwood commented 3 years ago

Yeah, I suspect that's right. I haven't had to use external_data_format or anything like that. How are you doing in terms of speed? I'm not really seeing any speedups in my code, which is probably a bug I've introduced, because when I run your notebook I do see them. Here's mine: https://colab.research.google.com/drive/1e3b9_a6UvNjJWemqubYA7YKxC7vYrvAX?usp=sharing

Worien commented 3 years ago

@HenryDashwood Hi, I'm trying to export the bart-large-cnn model for summarization to ONNX and tried to use your Colab, but I'm stuck with an error on the export step: RuntimeError: Unable to cast from non-held to held instance (T& to Holder<T>) (compile in debug mode for type information)

Did you face it, and how did you fix it?

HenryDashwood commented 3 years ago

A very odd error that I also get sometimes, and my only answer at this point is... have you tried running the cell again? That usually sorts it.

Worien commented 3 years ago

@HenryDashwood thanks, I was able to run it, but I have a question. Is there a difference between running the model through the generate function and running it with

ort_session = onnxruntime.InferenceSession("super_resolution.onnx")
ort_outs = ort_session.run(None, ort_inputs)

?

HenryDashwood commented 3 years ago

If I understand your question correctly, we do both. ort_session.run() gets called inside generate() to get predictions from the model. But we need more than just a single set of predictions to do something like beam search, so generate() takes care of that as well.
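
To make that relationship concrete, here is a rough greedy-decoding sketch (the model file, input/output names, and the lack of past-key-value caching are all simplifications), showing session.run being called once per generated token inside a generate()-style loop:

    import numpy as np
    import onnxruntime

    session = onnxruntime.InferenceSession("decoder.onnx")

    def greedy_generate(decoder_input_ids, encoder_hidden_states,
                        eos_token_id, max_length=50):
        generated = decoder_input_ids
        for _ in range(max_length):
            # one session.run call per decoding step
            logits = session.run(
                None,
                {
                    "input_ids": generated,
                    "encoder_hidden_states": encoder_hidden_states,
                },
            )[0]
            next_token = logits[:, -1, :].argmax(axis=-1)[:, None]
            generated = np.concatenate([generated, next_token], axis=-1)
            if (next_token == eos_token_id).all():
                break
        return generated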

anshoomehra commented 3 years ago

@HenryDashwood great work, I have been struggling with BART summarization to ONNX and this gives me a ray of hope... [My earlier tries at ONNX never gave final inference summaries similar to the non-ONNX versions. I am pretty sure I made some errors refactoring...]

I downloaded the Colab and am running it locally as-is without any modifications, and it's throwing the error stated below. Curious whether there are particular versions of the libraries I should be on?

RuntimeError: THPVariable_Check(tuple_elem) INTERNAL ASSERT FAILED at "/opt/conda/conda-bld/pytorch_1614378098133/work/torch/csrc/jit/passes/onnx/shape_type_inference.cpp":676, please report a bug to PyTorch.

On a side note, can something similar possibly be achieved with a JIT export? I tried that as well, and after getting the model export and Triton Server to work, I am failing at inference... https://github.com/triton-inference-server/server/issues/2846

anshoomehra commented 3 years ago

Resolved " RuntimeError: THPVariable_Check(tuple_elem) INTERNAL ASSERT FAILED " by upgrading to torch 1.8.1

anshoomehra commented 3 years ago

@HenryDashwood et al., am I reading this right:

  1. The outcome of your analysis is that ONNX and the quantized model are slower than PyTorch?

  2. If so, what is the best way to get better performance? I am clocking 900 ms to 1.4 s per call on a T4. I am working on a use case that requires this capability in real time, which is why I am trying to optimize it with the goal of 500 ms or less. Any suggestions are much appreciated.

  3. I know very little about ONNX and JIT exports, but my goal was to export these models, host them on a Triton Server with T4s, and hopefully get the speed boost as the end goal. Not sure if I am on the right path?

HenryDashwood commented 3 years ago

Hi @anshoomehra, sorry it took a while to get back. I haven't done much ONNX work in the last couple of weeks, so I'm not really sure why I didn't get a speedup. I'll get back to you if I work it out though!

anshoomehra commented 3 years ago

@HenryDashwood appreciate the acknowledgment! Yes, please do continue (if time permits); yours is the only implementation that I have seen working without losing the final output quality, and if ONNX brings down the inference time it will be a significant value addition.

I am dedicating the next few days to it, and if I have any breakthroughs I shall share back... Thanks much!!

anshoomehra commented 3 years ago

@HenryDashwood I have been able to make some progress with ONNX; however, shifting to GPUs seems challenging.

I got ONNX Runtime to engage the GPUs using providers=['CUDAExecutionProvider']. (This step itself was an issue because of conflicting libraries; we had to install onnxruntime-gpu and uninstall onnx and onnxruntime to make this work.)

Now it seems that, because of the way the graph is exported, some variables are still on the CPU, causing the error below. I tried moving the model and tokenizer to the cuda device, along with running the exported version with CUDAExecutionProvider, but it still fails. Would you have any immediate thoughts on what code fix we may need to resolve this?

[screenshots of the error]

HenryDashwood commented 3 years ago

Did you try explicitly setting the device of token["input_ids"] and token["attention_mask"]? That's the only thing I can think of.

You might be interested in following this work as it makes progress https://github.com/huggingface/transformers/pull/11786
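
For what it's worth, a sketch of that explicit device handling (the model path and input names are placeholders): ONNX Runtime's Python run() API takes numpy arrays, which live on the CPU, so torch tensors have to be moved back to the CPU when building the feed dict even when the session itself runs on the CUDAExecutionProvider.

    import onnxruntime
    from transformers import BartTokenizer

    tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-cnn")
    session = onnxruntime.InferenceSession(
        "bart-encoder.onnx", providers=["CUDAExecutionProvider"]
    )

    tokens = tokenizer("Some text to summarize.", return_tensors="pt").to("cuda")
    feed = {
        # explicit .cpu() before .numpy(); onnxruntime copies the data to the
        # GPU itself when running with the CUDAExecutionProvider
        "input_ids": tokens["input_ids"].cpu().numpy(),
        "attention_mask": tokens["attention_mask"].cpu().numpy(),
    }
    encoder_outputs = session.run(None, feed)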

ayushch3 commented 3 years ago

@HenryDashwood @tobigue I am looking at your implementation of mbart on onnx and wanted to see if you had any feedback on how to make this work for mBart-50 many to many: https://huggingface.co/facebook/mbart-large-50-many-to-many-mmt

Specifically, the generate method passes forced_bos_token_id as the first generated token to translate into the target language. Could you provide any pointers on where this parameter can be passed in your implementation? https://colab.research.google.com/drive/1b4gCQJbfdr2nEKyb5xvbjeBrTvkelG9L?usp=sharing#scrollTo=z5j0jM4CRJHj

HenryDashwood commented 3 years ago

Presumably it would be an extra argument passed to the decoder, and would be added to the list of decoder inputs?
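
For reference, a sketch of where forced_bos_token_id usually comes from with the Hugging Face API (the language codes are just examples); since it only constrains the very first generated token, it can also be handled inside the generation loop rather than as a new graph input:

    from transformers import MBart50TokenizerFast

    tokenizer = MBart50TokenizerFast.from_pretrained(
        "facebook/mbart-large-50-many-to-many-mmt", src_lang="en_XX"
    )

    # id of the target-language token that must be generated first
    forced_bos_token_id = tokenizer.lang_code_to_id["ro_RO"]

    # with the stock model this is simply passed to generate(), e.g.
    #   model.generate(**inputs, forced_bos_token_id=forced_bos_token_id)
    # in a hand-rolled ONNX decoding loop, the equivalent is to force the
    # first decoded token to this id before continuing normally.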

sgiorgis-atypon commented 3 years ago

Hello @anshoomehra, did you manage to solve the issues with the GPU shifting?

sidsharma72 commented 2 years ago

@HenryDashwood @Ki6an when replicating your Colab notebook, I am facing the following error:

TypeError: forward() got an unexpected keyword argument 'cross_attn_head_mask'

This seems like an issue similar to the following one: https://github.com/Ki6an/fastT5/issues/18

These are the library versions I am using:

torch - 1.10.0
transformers - 4.11.3
onnx - 1.10.2
onnxruntime - 1.7.0

Ki6an commented 2 years ago

@sidsharma72 we had the same issue with T5 and were able to resolve it in this commit.

Make sure to add the cross_attn_head_mask=None argument to the forward method of OnnxMBart in the notebook, and try again.
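
A sketch of what that change looks like (the base class and the rest of the argument list are assumptions based on the fastT5-style wrapper; only the added argument matters):

    from transformers import MBartForConditionalGeneration

    class OnnxMBart(MBartForConditionalGeneration):
        def forward(
            self,
            input_ids=None,
            attention_mask=None,
            decoder_input_ids=None,
            encoder_outputs=None,
            past_key_values=None,
            head_mask=None,
            decoder_head_mask=None,
            cross_attn_head_mask=None,  # added: newer transformers versions pass this during generate()
            **kwargs,
        ):
            # ... existing ONNX encoder/decoder logic unchanged ...
            ...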

sidsharma72 commented 2 years ago

Thanks for the response - I have been able to fix the issue.

mohanvamsibu-kore commented 2 years ago

Hey @HenryDashwood @tobigue, I was able to use the Colab for the model lidiya/bart-large-xsum-samsum. ONNX was considerably better on CPU (about 4 seconds vs. 11 seconds for PyTorch), but on GPU ONNX was taking ~5 seconds whereas PyTorch was ~500 ms. I have moved all the tensors to the GPU and converted all the models to GPU. Can you please let me know if this is expected?

tobigue commented 2 years ago

Hey @mohanvamsibu-kore, I don't have experience with ONNX on GPU, but from what I understand the most efficient way to run stuff on GPU is by using TensorRT. They have an example for T5, which is pretty similar to BART, here: https://github.com/NVIDIA/TensorRT/tree/main/demo/HuggingFace

will-wiki commented 2 years ago

@tobigue @HenryDashwood @Ki6an Thank you very much for your work, very helpful! I can now convert a BART model to an ONNX model, and the outputs of the two are consistent, but I would like to ask whether you have tried to deploy the ONNX model to TensorRT. So far I have been able to run it on TensorRT, but the TensorRT results are not consistent with the ONNX model :(