[Feature]: tensor parallelism support for bnb quantization (via IBM's fork)

PygmalionAI / aphrodite-engine

Large-scale LLM inference engine

https://aphrodite.pygmalion.chat

GNU Affero General Public License v3.0

1.14k stars, 125 forks

Issue #767 (Open), opened by BlairSadewitz 1 month ago

BlairSadewitz commented 1 month ago

🚀 The feature, motivation and pitch

I don't know if it's feasible or worthwhile to merge this, as maybe the trees are too divergent, etc., but cherry-picking commits for projects I don't fully understand is somehow a pastime for me, so ...

Alternatives

I could always use one of the other 8.4234234*10^23 quantization methods, but, hey, variety is the spice of life--or something.

Additional context

The fork's implementation doesn't work for pre-quantized models. 🎉

AlpinDale commented 1 month ago

Perhaps. I'll have to look into it. bnb hasn't been a priority.

BlairSadewitz commented 1 month ago

Yeah, I hear you. I'm gonna file a better PR in a second, though, so ... ;-)

AlpinDale commented 1 month ago

FYI, I'm working on new kernels that massively speed up bnb quants and add TP support for them. You might want to hold off for now, or help out with that upcoming PR if you're comfortable with CUDA.