PygmalionAI / aphrodite-engine

PygmalionAI's large-scale inference engine
https://pygmalion.chat
GNU Affero General Public License v3.0

fix: restore backwards compatibility with sm_60 (P100 and GP100) #444

Closed AlpinDale closed 2 weeks ago

AlpinDale commented 2 weeks ago

Looks like sm_60 can't do a four-element dot product. There's a better solution, but this will work for now. I may implement an sm_60-only dot product function here later.

resolves #413

dirkson commented 2 weeks ago

I was able to build and run this on 1x and 4x P100s more or less flawlessly with the provided runtime. I tested turboderp's Llama 3 8B EXL quants and MaziyarPanahi's 8B and 70B GPTQ models.

There was an issue with --context-shift, which is apparently due to Triton currently requiring compute capability 7.0. The GPTQ models generated endlessly, but that's a known issue with the models tested, so I suspect it was unrelated.

Otherwise I really wasn't able to find any issues. Everything I tried either just worked in my setup or gave a reasonable error message about requiring a newer CUDA compute capability.

Performance, once I adjusted to the way Aphrodite reports it, was pretty solid: the 70B model with -tp 4 ran at 12 t/s, up from 3.5 t/s under tabbyAPI. The 8B model ran at 50 t/s on a single P100, or 25 t/s across multiple; it ran at around 40 t/s on tabbyAPI.