ELS-RD / transformer-deploy

Efficient, scalable and enterprise-grade CPU/GPU inference server for 🤗 Hugging Face transformer models 🚀
https://els-rd.github.io/transformer-deploy/
Apache License 2.0

Zero copy may lead to wrong text generation results #145

Open brevity2021 opened 1 year ago

brevity2021 commented 1 year ago

Hi,

I was trying the "zero copy" method from the T5 notebook on a seq2seq transformer model. When I set clone_tensor to True, everything looks fine, just with less speedup than I expected.

When I set clone_tensor to False, text generation gives wrong results (some outputs are repetitive ids). I debugged a bit and found that although the binding inputs are the same as when clone_tensor is True, the results differ after run_with_io_binding. It seems it can somehow be fixed by not reusing the IOBinding and instead creating a new one at some steps, but I really have no clue why.

I'm still playing with it and can post some code snippets later, but I wonder if you have encountered something like this (different results when switching clone_tensor) and if you have any suggestions. Thanks!
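For context, here is a minimal sketch of the IOBinding pattern being discussed, using the generic onnxruntime API rather than the actual transformer-deploy code; the model path, input/output names, and shapes are placeholders. With cloning, the output buffer is copied into a fresh tensor after each run; with zero copy, the caller keeps a reference to the same buffer, so the next run with the reused binding overwrites results that are still being read.

```python
# Generic IOBinding usage with onnxruntime-gpu and torch; names and shapes are placeholders.
import numpy as np
import torch
import onnxruntime as ort

session = ort.InferenceSession("decoder.onnx", providers=["CUDAExecutionProvider"])
binding = session.io_binding()

input_ids = torch.ones((1, 8), dtype=torch.int64, device="cuda")
logits = torch.empty((1, 8, 32128), dtype=torch.float32, device="cuda")  # pre-allocated output buffer

binding.bind_input(
    name="input_ids", device_type="cuda", device_id=0,
    element_type=np.int64, shape=tuple(input_ids.shape), buffer_ptr=input_ids.data_ptr(),
)
binding.bind_output(
    name="logits", device_type="cuda", device_id=0,
    element_type=np.float32, shape=tuple(logits.shape), buffer_ptr=logits.data_ptr(),
)

session.run_with_iobinding(binding)

safe_copy = logits.clone()  # "clone_tensor=True" behaviour: detached from the shared buffer
zero_copy = logits          # "clone_tensor=False" behaviour: a view of the buffer that the
                            # next run_with_iobinding call will overwrite in place
```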

pommedeterresautee commented 1 year ago

I tested clone_tensor set to False a lot, for obvious reasons, and had no issue at the time. Did you test with ORT 1.11?

brevity2021 commented 1 year ago

Thanks for the reply! I was testing with ORT 1.12. It might be due to the model implementation (I was using Pegasus instead of T5 and had to modify the code a bit to make it work). It's quite a bit of code, so I need some time to put it together and showcase the problem. I will update this thread.

brevity2021 commented 1 year ago

@pommedeterresautee Here are the notebooks to replicate the error. To make things easier, I use a T5 model for illustration.

I first run make docker_build from a clean transformer-deploy directory, then start the Docker container and notebook server with:

docker run -p 8686:8686 -v $PWD/demo/generative-model:/docker_folder ghcr.io/els-rd/transformer-deploy:latest \
    bash -c "cd /docker_folder && jupyter notebook --ip 0.0.0.0 --port 8686 --no-browser --allow-root"

The first notebook exports the ONNX model (I am using fp32 instead of trying fp16). The second notebook runs the inference. When clone_tensor is set to False, the result contains a lot of 0s; when clone_tensor is set to True, the inference result matches the PyTorch result.
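For reference, the mismatch can be checked against plain PyTorch generation with something like the sketch below. The onnx_ids argument stands for whatever ids the notebook's ONNX/IOBinding path produced, and t5-small is a placeholder checkpoint.

```python
# Compare ids produced by the ONNX pipeline against plain transformers generation.
# `onnx_ids` is a placeholder for the output of the notebook's ONNX/IOBinding path.
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

def compare_with_pytorch(onnx_ids: torch.Tensor, prompt: str, checkpoint: str = "t5-small") -> None:
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint).eval()
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.inference_mode():
        torch_ids = model.generate(**inputs, max_new_tokens=32)
    print("pytorch:", tokenizer.batch_decode(torch_ids, skip_special_tokens=True))
    print("onnx   :", tokenizer.batch_decode(onnx_ids, skip_special_tokens=True))
```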

I am using a g5.2xlarge aws instance.

Can you please help take a look?

c-schumacher commented 1 year ago

@brevity2021 It may not be relevant anymore since your question was a while ago, but I noticed the same thing when adapting the T5 approach to another model. Downgrading onnxruntime-gpu to 1.11 fixed the issue for me.

c-schumacher commented 1 year ago

Strangely though, even though the outputs with the cache are correct after downgrading ORT, that approach is almost 4x slower than using a decoder without cache support, and ~2x slower than the vanilla PyTorch implementation. Do you have any idea why that might be, @pommedeterresautee? I'm using a similar seq2seq model, so not a lot has changed from the T5 build.
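For what it's worth, GPU timings like these are only meaningful with warm-up runs and CUDA synchronization; a rough helper such as the one below (generate_fn being any callable that wraps the with-cache or without-cache pipeline, a placeholder) can make the comparison fairer.

```python
# Rough GPU benchmark helper: warm up, synchronize, then average over several runs.
# `generate_fn` is a placeholder for whichever generation pipeline is being timed.
import time
import torch

def benchmark(generate_fn, n_warmup: int = 3, n_runs: int = 10) -> float:
    for _ in range(n_warmup):
        generate_fn()
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(n_runs):
        generate_fn()
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / n_runs  # mean seconds per call
```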

brevity2021 commented 1 year ago

@c-schumacher This [notebook] mentions "Version 1.11.1 of ONNX Runtime and older have a bug which makes them much slower when most inputs are used by subgraphs of an If node. Use a version >= 1.12.0 instead." This might be the reason for your slow speed. Although downgrading to 1.11 may make the non-copy path work, it introduces other problems; in my case, setting clone_tensor to False does not work for ORT >= 1.12.
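While the root cause is unclear, one possible stopgap is to gate the zero-copy path on the ONNX Runtime version and fall back to cloning otherwise. The sketch below encodes the cut-off reported in this thread; it is an assumption drawn from these comments, not an official compatibility matrix.

```python
# Pick the clone_tensor value based on the installed onnxruntime version.
# The 1.12 cut-off is taken from this thread, not from official documentation.
import onnxruntime

def safe_clone_tensor_flag() -> bool:
    """Return the clone_tensor value to use: True means outputs are copied."""
    major, minor = (int(x) for x in onnxruntime.__version__.split(".")[:2])
    zero_copy_ok = (major, minor) < (1, 12)  # zero copy reportedly misbehaves on >= 1.12
    return not zero_copy_ok
```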