ghost opened this issue 2 years ago
This is because weight off-loading is done per module rather than in batches, so the GPU spends most of its time waiting for each module's weights to be copied over. Keeping the GPU fully utilized would significantly boost the speed, but implementing that is not simple and some calculations need to be worked out first. I'll try to implement and finish it in the future, but I'm really busy right now. Huggingface Accelerate was just released with the same goal. Take a look; it could address the issue, and I'd appreciate hearing whether it works or not. (https://huggingface.co/docs/accelerate/index)
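As a rough, untested sketch (assuming versions of transformers and Accelerate that support `device_map="auto"`; the model id and prompt are just examples), loading GPT-J like this lets Accelerate split the layers between the GPU and CPU automatically instead of off-loading every module:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/gpt-j-6B"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# device_map="auto" (backed by Accelerate) keeps as many layers as fit on the
# GPU and off-loads the rest to CPU RAM; fp16 halves the size of the weights.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype=torch.float16,
)

inputs = tokenizer("The meaning of life is", return_tensors="pt").to(0)
output_ids = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

If that works on your 8 GB card, it should already be much faster than moving every module on and off the GPU for each forward pass.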
I will check out the Accelerate library, thanks for the timely feedback. PyTorch-lit worked very easily and smoothly; I highly appreciate your work.
Thanks! I am delighted to hear that. Please do not hesitate to ask questions and feel free to contribute.
The output of GPT-J is very slow: generating 200 output tokens takes about 20 minutes, and 2048 tokens takes more than an hour, which significantly limits any experimentation with the model.
I checked GPU utilization during inference and it sits at about 1 to 4 percent, with GPU memory usage below 4 GB; my system has 8 GB of GPU memory, so if the GPU were fully utilized, inference speed might increase significantly.
Are there simple hacks to speed up inference time?
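For reference, GPU utilization and memory can be sampled from inside Python while generation is running, roughly like this (a sketch with a hypothetical helper; `torch.cuda.utilization()` needs the `pynvml` package installed):

```python
import torch

# Hypothetical helper: call periodically (e.g. from another thread) while generate() runs.
def report_gpu_usage(device=0):
    util = torch.cuda.utilization(device)                       # utilization in %, via pynvml
    used_gb = torch.cuda.memory_allocated(device) / 1024 ** 3   # GB allocated by PyTorch tensors
    total_gb = torch.cuda.get_device_properties(device).total_memory / 1024 ** 3
    print(f"GPU util: {util}% | memory: {used_gb:.2f} / {total_gb:.2f} GB")
```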