Closed: claudiosv closed this 1 year ago
I wanted to add some goals and motivations for this PR as well.

1. I would like to run large models on the Mariana/Houston machine for bigcalibrate experiments. To this end, I would like to support the ORT CUDA and/or ORT TensorRT backends, and some models may even require 8/4-bit quantization.
2. We told SRI we would try the CodeGen2 and CodeT5+ models. This also means that any model-specific hacks, such as prepending a token without the user knowing, should be avoided.
3. I would like a model instantiated by lmwrapper to be ready for use in bigcalibrate while relegating as few model-specific flags/kwargs/etc. to bigcalibrate as possible. Ditto for the TensorRT warmup, if I can get that to work.

As for warmup, it is only needed for TensorRT, so I have removed it from the other backends. I originally left it in because it was a handy debugging tool, but now that I have written tests I have a way to catch failing models. See the documentation regarding warmup: my intention is to make the warmup function run one sequence of 1 token and then one sequence with the model's max_length number of tokens, which follows the documentation's guidance for ensuring the model is ready for use:

[...] first build the TensorRT engine with an input of small shape, and then with an input of large shape to have an engine valid for all shapes in between. This allows to avoid rebuilding the engine for new small and large shapes, which is unwanted once the model is deployed for inference.

If we don't warm up, the unexpected result for users will be that they pass in sequences of different lengths, say 10, 20, and 30, and get abysmal performance: each sequence is slightly longer than the previous one, so each triggers a full TensorRT engine rebuild, which is very, very slow.
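A minimal sketch of what that warmup could look like, assuming a hypothetical predictor interface (`warmup_trt`, `predictor.predict`, and `predictor.max_length` are illustrative names, not lmwrapper's actual API):

```python
def warmup_trt(predictor) -> None:
    """Warm up a TensorRT-backed predictor by covering the smallest and
    largest input shapes, so that every length in between can be served
    without triggering a full engine rebuild at inference time.

    `predictor.predict` and `predictor.max_length` are hypothetical names
    used only for illustration.
    """
    # Smallest shape: a single-token prompt.
    predictor.predict("a", max_new_tokens=1)

    # Largest shape: a prompt of roughly max_length tokens. Repeating a
    # short word is an approximation; an exact-length prompt would need
    # the model's tokenizer.
    long_prompt = " ".join(["a"] * predictor.max_length)
    predictor.predict(long_prompt, max_new_tokens=1)
```

If it helps, ONNX Runtime's TensorRT execution provider also supports an engine cache (the `trt_engine_cache_enable` provider option), which would let engines built during warmup be reused across process restarts.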
Got it, makes sense. Thanks for explaining. I still suggest we use the existing codepath in HuggingfacePredictor for the warmup, as we discussed above.