I benchmarked Llama 2 7B chat (int8) and got ~600 tokens in about 12s on an A100 GPU, whereas the HF pipeline takes about 25s for the same input and params.
However, when I try the Llama 2 70B chat model (int8), it's extremely slow: ~90s for 500 tokens vs. the HF pipeline, which takes ~32s (although the pipeline uses multiple GPUs, so it's not a fair comparison?). Is this expected, or am I doing something wrong?
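For reference, those numbers work out to roughly 50 tok/s vs ~24 tok/s for the 7B case, and ~5.6 tok/s vs ~15.6 tok/s for the 70B case. Below is a minimal sketch of how the HF-pipeline side of such a timing can be measured; the model id, prompt, and generation params are placeholders, not my exact settings:

```python
import time
from transformers import pipeline

# Placeholder setup -- model id and params are assumptions, not the exact
# configuration benchmarked above.
pipe = pipeline(
    "text-generation",
    model="meta-llama/Llama-2-7b-chat-hf",
    device_map="auto",
)

prompt = "Explain quantization in one paragraph."
start = time.perf_counter()
result = pipe(prompt, max_new_tokens=600, do_sample=False, return_full_text=False)
elapsed = time.perf_counter() - start

# Rough throughput: newly generated tokens / wall-clock seconds
n_new = len(pipe.tokenizer(result[0]["generated_text"])["input_ids"])
print(f"{n_new} tokens in {elapsed:.1f}s -> {n_new / elapsed:.1f} tok/s")
```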
Here's my code: