152334H / tortoise-tts-fast

Fast TorToiSe inference (5x or your money back!)
GNU Affero General Public License v3.0

GPT optimisation with TensorRT/FasterTransformer/Triton/??? #3

Open 152334H opened 1 year ago

152334H commented 1 year ago

This issue is likely to take a substantial amount of effort to solve.

Primary problem: GPT2InferenceModel is a custom subclass of 🤗's transformers.GPT2Model. Its architecture differs significantly from a vanilla GPT-2, so a substantial amount of new code would be needed to make an optimized version of the model work with the usual Hugging Face generation function.
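To illustrate why this is non-trivial, here is a minimal sketch of the general pattern such a custom inference model has to follow to stay compatible with Hugging Face's `generate()` (this is illustrative only, not the actual tortoise code; `CustomInferenceModel`, `custom_embeddings`, and `lm_head` are assumed names):

```python
import torch
from transformers import GPT2Config, GPT2Model, GPT2PreTrainedModel
from transformers.modeling_outputs import CausalLMOutputWithCrossAttentions

class CustomInferenceModel(GPT2PreTrainedModel):
    """Sketch: a GPT-2 stack with non-standard input embeddings and output head
    that still works with HuggingFace's generate() loop."""
    def __init__(self, config: GPT2Config, custom_embeddings: torch.nn.Module,
                 lm_head: torch.nn.Module):
        super().__init__(config)
        self.transformer = GPT2Model(config)  # the stock GPT-2 transformer stack
        self.embeddings = custom_embeddings   # non-standard input embeddings
        self.lm_head = lm_head                # non-standard output head

    def prepare_inputs_for_generation(self, input_ids, past_key_values=None, **kwargs):
        # generate() calls this every step; any TRT/FasterTransformer port has to
        # replicate this KV-cache plumbing as well, not just the matmuls.
        if past_key_values is not None:
            input_ids = input_ids[:, -1:]
        return {"input_ids": input_ids, "past_key_values": past_key_values}

    def forward(self, input_ids=None, past_key_values=None, **kwargs):
        hidden = self.embeddings(input_ids)
        out = self.transformer(inputs_embeds=hidden, past_key_values=past_key_values,
                               use_cache=True, return_dict=True)
        return CausalLMOutputWithCrossAttentions(
            logits=self.lm_head(out.last_hidden_state),
            past_key_values=out.past_key_values,
        )
```

Every piece of this glue (custom embeddings, custom head, KV-cache handling) is something an optimized backend would need to reproduce.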

pepinu commented 1 year ago

I've tried using https://github.com/NVIDIA-AI-IOT/torch2trt, but to no avail; too many ops aren't implemented there:

Warning: Encountered known unsupported method torch.arange
Warning: Encountered known unsupported method torch.zeros
Warning: Encountered known unsupported method torch.addmm
[02/06/2023-19:18:56] [TRT] [E] 4: model.h.0.attn:1:SLICE:GPU: mismatch in number of dimensions for start.
[02/06/2023-19:18:56] [TRT] [E] 4: model.h.0.attn:1:SLICE:GPU: mismatch in number of dimensions for start.
[02/06/2023-19:18:56] [TRT] [E] 4: model.h.0.attn:1:SLICE:GPU: mismatch in number of dimensions for start.
[02/06/2023-19:18:56] [TRT] [E] 4: model.h.0.attn:1:SLICE:GPU: mismatch in number of dimensions for start.

I have, however, had some success with JIT TorchScript, but there is a big discrepancy in the results:

/opt/conda/lib/python3.8/site-packages/torch/jit/_trace.py:1001: TracerWarning: Output nr 1. of the traced function does not match the corresponding output of the Python function. Detailed error:
Tensor-likes are not close!

Mismatched elements: 524072 / 547840 (95.7%)
Greatest absolute difference: 0.011121749877929688 at index (0, 502, 440) (up to 1e-05 allowed)
Greatest relative difference: 416.0444345892299 at index (0, 78, 657) (up to 1e-05 allowed)
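For reference, a warning like the one above comes from the trace-checking step of a call along these lines (a minimal sketch, assuming `gpt` is the eval-mode autoregressive module being converted and the example input is a dummy token tensor; this is not the exact conversion script used here):

```python
import torch

# Sketch: trace the module and let torch.jit re-run the trace to compare
# traced vs. eager outputs, which is what emits "Tensor-likes are not close!".
gpt = gpt.eval().cuda()
example_tokens = torch.randint(0, 255, (1, 512), device="cuda")

traced = torch.jit.trace(
    gpt,
    example_tokens,
    check_trace=True,      # re-run the trace against the Python function
    check_tolerance=1e-5,  # the tolerance referenced in the warning above
)
traced.save("gpt_traced.pt")
```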

We won't be able to convert the whole UnifiedVoice class to TensorRT, however, because the cross-entropy loss calculation depends on long (int64) tensors, which is non-negotiable within PyTorch, while TensorRT does not support that dtype.

The ONNX conversion worked like a charm, and I've looked through the Hugging Face link you referred to somewhere. I'll hack something up to maybe get the GPT-2 part of the model onto TRT and see how it can be connected.
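The ONNX export in question would look roughly like this (a minimal sketch: `gpt`, the dummy input shape, the output file name, and the opset version are assumptions, not the exact command that was run):

```python
import torch

# Sketch: export only the GPT-2 part to ONNX with dynamic batch/sequence axes,
# so the resulting graph can later be handed to TensorRT or ONNX Runtime.
dummy_tokens = torch.randint(0, 255, (1, 512), dtype=torch.long)

torch.onnx.export(
    gpt,                       # assumed: the GPT-2 submodule of UnifiedVoice
    (dummy_tokens,),
    "gpt2_part.onnx",
    input_names=["input_ids"],
    output_names=["logits"],
    dynamic_axes={"input_ids": {0: "batch", 1: "seq"},
                  "logits": {0: "batch", 1: "seq"}},
    opset_version=17,
)
```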

Ryu1845 commented 1 year ago

@pepinu have you made any progress on this?

Ryu1845 commented 1 year ago

This might be interesting: https://onnxruntime.ai/docs/performance/tune-performance.html
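The kind of tuning that page describes boils down to choosing an execution provider and enabling graph optimizations when the session is created. A minimal sketch (the model path is an assumption carried over from the export above):

```python
import onnxruntime as ort

# Sketch: enable full graph optimizations and prefer the CUDA provider,
# falling back to CPU if it is unavailable.
opts = ort.SessionOptions()
opts.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL

sess = ort.InferenceSession(
    "gpt2_part.onnx",
    sess_options=opts,
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
```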

pepinu commented 1 year ago

https://github.com/152334H/tortoise-tts-fast/issues/3#issuecomment-1425568785 sorry, I got sick last week; I'll get back to working on this tomorrow 😞

slightly off topic:

I also have an idea to use a pretrained distilled GPT-2 instead of the one we have now; however, I'm not sure how to fine-tune it. I know that not all of the training code for tortoise was open-sourced, but maybe the part for the autoregressive model is?

link to distilled gpt2: https://huggingface.co/distilgpt2

here's the issue: https://github.com/neonbjb/tortoise-tts/issues/318

I ran a test, and the distilled checkpoint runs twice as fast on some random sequence (for test generation):

[Screenshot: timing comparison, 2023-02-14 21:45]
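A test like the one described above can be reproduced with a generic side-by-side timing of the stock checkpoints (a minimal sketch; it says nothing about output quality after fine-tuning on tortoise's data, and the prompt and token budget are arbitrary assumptions):

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Sketch: time greedy-ish generation from gpt2 vs. distilgpt2 on a throwaway prompt.
tok = AutoTokenizer.from_pretrained("gpt2")
ids = tok("some random sequence for test generation", return_tensors="pt").input_ids

for name in ["gpt2", "distilgpt2"]:
    model = AutoModelForCausalLM.from_pretrained(name).eval()
    start = time.time()
    with torch.no_grad():
        model.generate(ids, max_new_tokens=256, do_sample=True)
    print(name, f"{time.time() - start:.2f}s")
```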
DrewScatterday commented 1 year ago

I'm a little late to this thread, @pepinu, but did you ever make any more progress on this?

I was able to use DeepSpeed on Linux WSL to get some slight speedups with inference on fast tortoise. My next goal is figuring out whether it's possible to split tortoise inference across multiple GPUs (with something like Triton) for even faster speedups.

152334H commented 1 year ago

That seems unlikely to work... are you sure that FLOPs are the bottleneck rather than memory bandwidth, and that splitting wouldn't reduce speed via communication overhead?

DrewScatterday commented 1 year ago

Yeah, that's a good point. I was talking with someone else about it, and they said tortoise might not be large enough for multiple GPUs to really help.

DrewScatterday commented 1 year ago

Do you have any other recommendations for things to try, @152334H?

With really low-quality inference parameters and DeepSpeed, I'm able to generate 23 seconds of audio (345 characters) in roughly 14 seconds on a 3070 Ti with 8 GB of VRAM, which I'm pretty happy with. I was thinking that on something like a 3090 or V100 I could get this even lower, hopefully around 10 seconds for 23 seconds of audio.
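For context, "really low quality inference parameters" roughly corresponds to the fastest preset exposed by tortoise's Python API. A minimal sketch, based on upstream tortoise's documented usage (tortoise-tts-fast layers extra knobs and optimizations on top of this; the voice name and output path are assumptions):

```python
import torchaudio
from tortoise.api import TextToSpeech
from tortoise.utils.audio import load_voice

# Sketch: generate with the fastest preset, which uses the fewest
# autoregressive samples and diffusion iterations.
tts = TextToSpeech()
voice_samples, conditioning_latents = load_voice("tom")  # assumed bundled voice

gen = tts.tts_with_preset(
    "Some 345-character test passage goes here.",
    voice_samples=voice_samples,
    conditioning_latents=conditioning_latents,
    preset="ultra_fast",
)
torchaudio.save("out.wav", gen.squeeze(0).cpu(), 24000)
```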

DrewScatterday commented 1 year ago

DeepSpeed's init_inference function (which I've added to autoregressive.py) seems to allow tensor parallelism, but I'll have to do some tinkering to see whether that actually helps tortoise or not.
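A minimal sketch of what that wrapping looks like (the `gpt` module is an assumption, and newer DeepSpeed releases express the parallel degree via a tensor_parallel config rather than `mp_size`; the script would also need to be launched with the deepspeed launcher for more than one GPU):

```python
import deepspeed
import torch

# Sketch: wrap the autoregressive GPT with DeepSpeed inference and request
# tensor parallelism across two GPUs.
ds_engine = deepspeed.init_inference(
    gpt,                             # assumed: the autoregressive GPT module
    mp_size=2,                       # tensor-parallel degree
    dtype=torch.float16,
    replace_with_kernel_inject=True, # swap in fused inference kernels where supported
)
gpt = ds_engine.module
```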

152334H commented 1 year ago

I can't help much. Anything that increases the generation speed of a GPT-2-M-sized model in general will speed up this situation, but doing that is honestly pretty hard.

(With multi-GPU, couldn't you just do batch parallelism, though?)
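Batch parallelism here just means running an independent model replica per GPU, each handling its own slice of the requests, with no cross-GPU communication. A minimal sketch of the idea (`build_model` and `prompts` are assumed placeholders):

```python
from concurrent.futures import ThreadPoolExecutor
import torch

# Sketch: one full model copy per GPU, each generating for its own chunk of prompts.
def worker(device, prompt_chunk):
    model = build_model().to(device).eval()  # assumed constructor for the model
    with torch.no_grad():
        return [model.generate(p.to(device)) for p in prompt_chunk]

devices = [f"cuda:{i}" for i in range(torch.cuda.device_count())]
chunks = [prompts[i::len(devices)] for i in range(len(devices))]

with ThreadPoolExecutor(max_workers=len(devices)) as pool:
    results = list(pool.map(worker, devices, chunks))
```

This improves throughput for batches of requests, but does nothing for the latency of a single generation, which is the point of the FLOPs-vs-bandwidth question above.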

DrewScatterday commented 1 year ago

I'll report back and let you know what I figure out with multiple GPUs.