AminRezaei0x443 / PyTorch-LIT

Lite Inference Toolkit (LIT) for PyTorch

gpt-j generation speed very low #4

Open ghost opened 2 years ago

ghost commented 2 years ago

Generation with gpt-j is very slow: producing 200 output tokens takes about 20 minutes, and 2048 tokens takes more than an hour. This significantly limits any experimentation with the model.

I checked GPU utilization during inference and it stays around 1 to 4 percent, with GPU memory usage below 4 GB; my system has 8 GB of GPU memory. If the GPU were fully utilized, inference speed might increase significantly.
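For reference, this is roughly how the memory numbers can be checked from PyTorch (the helper below is just illustrative; utilization itself I read from nvidia-smi):

```python
# Illustrative helper: report GPU memory currently held by PyTorch.
# memory_allocated()/memory_reserved() are standard torch.cuda calls.
import torch

def report_gpu_memory():
    allocated_gb = torch.cuda.memory_allocated() / 1024**3
    reserved_gb = torch.cuda.memory_reserved() / 1024**3
    print(f"allocated: {allocated_gb:.2f} GB, reserved: {reserved_gb:.2f} GB")

report_gpu_memory()
# GPU utilization is easiest to watch live with: nvidia-smi -l 1
```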

Are there simple hacks to speed up inference?

AminRezaei0x443 commented 2 years ago

This is because weight off-loading is done per module rather than in batches, so the GPU sits idle most of the time. Using the GPU fully would boost the speed significantly, but implementing that is not simple; it requires working out which weights can stay resident and how to batch the transfers. I'll try to implement and finish it in the future, but I'm really busy right now. Hugging Face Accelerate was just released with the same goal. Take a look; it could address the issue, and I'd appreciate hearing whether it works or not. (https://huggingface.co/docs/accelerate/index)
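I haven't tested it myself, but going through Accelerate via transformers could look roughly like the sketch below. The model id, fp16 setting, and `device_map="auto"` are my assumptions here (and need a recent transformers + accelerate install), not something PyTorch-LIT provides:

```python
# Rough sketch, untested: load GPT-J with Accelerate's big-model dispatch via transformers.
# device_map="auto" lets Accelerate place layers on the 8 GB GPU and spill the rest to CPU RAM,
# instead of off-loading every module one by one per forward pass.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/gpt-j-6B",
    device_map="auto",          # Accelerate decides GPU/CPU placement per layer
    torch_dtype=torch.float16,  # halves memory so more layers fit on the GPU
)

inputs = tokenizer("The meaning of life is", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

With `device_map="auto"`, the layers that fit stay on the GPU and only the remainder is kept in CPU RAM, which should cut down the constant per-module transfers you're hitting now.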

ghost commented 2 years ago

I will check out the Accelerate library, thanks for the timely feedback. PyTorch-LIT worked very easily and smoothly; I highly appreciate your work.

AminRezaei0x443 commented 2 years ago

Thanks! I am delighted to hear that. Please do not hesitate to ask questions and feel free to contribute.