Closed: claudiosv closed this 1 year ago
I wanted to add some goals and motivations for this PR as well.

1. I would like to run large models on the Mariana/Houston machine for bigcalibrate experiments. To this end, I would like to support the ORT CUDA and/or ORT TensorRT backends, and some models may even require 8/4-bit quantization.
2. We told SRI we would try the CodeGen2 and CodeT5+ models. This also means that any model-specific hacks, such as prepending a token without the user knowing, should be avoided.
3. I would like a model instantiated by lmwrapper to be ready for use in bigcalibrate while relegating as few model-specific flags/kwargs/etc. to bigcalibrate as possible. Ditto for the TensorRT warmup, if I can get that to work.

As for warmup, it is only needed for TensorRT, so I have removed it from the other backends. I originally left it in because it was a handy debugging tool, but now that I have written tests I have a way to catch failing models. See the documentation regarding warmup: my intention is to make the warmup function run one sequence of 1 token and then one sequence with the model's max_length number of tokens, which follows the documentation's guidance for ensuring the model is ready for use:

[...] first build the TensorRT engine with an input of small shape, and then with an input of large shape to have an engine valid for all shapes in between. This allows to avoid rebuilding the engine for new small and large shapes, which is unwanted once the model is deployed for inference.

If we don't warm up, the unexpected result for users will be that they pass in sequences of different lengths, say 10, 20, and 30, and get abysmal performance: each sequence is slightly longer than the previous one, so each triggers a full TensorRT engine rebuild, which is very, very slow.
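A minimal sketch of what that warmup could look like, assuming a hypothetical predictor interface (`warmup_trt`, `predictor.predict`, and `predictor.max_length` are illustrative names, not lmwrapper's actual API):

```python
def warmup_trt(predictor) -> None:
    """Warm up a TensorRT-backed predictor by covering the smallest and
    largest input shapes, so that every length in between can be served
    without triggering a full engine rebuild at inference time.

    `predictor.predict` and `predictor.max_length` are hypothetical names
    used only for illustration.
    """
    # Smallest shape: a single-token prompt.
    predictor.predict("a", max_new_tokens=1)

    # Largest shape: a prompt of roughly max_length tokens. Repeating a
    # short word is an approximation; an exact-length prompt would need
    # the model's tokenizer.
    long_prompt = " ".join(["a"] * predictor.max_length)
    predictor.predict(long_prompt, max_new_tokens=1)
```

If it helps, ONNX Runtime's TensorRT execution provider also supports an engine cache (the `trt_engine_cache_enable` provider option), which would let engines built during warmup be reused across process restarts.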
Got it, makes sense. Thanks for explaining. I still suggest we use the existing codepath in HuggingfacePredictor for the warmup, as we discussed above.