Open catid opened 5 months ago
GEMVFast is not implemented in vLLM yet
I'm planning a PR to implement this functionality in vLLM
Is there an alternative way to implement continuous batching with GEMVFast? I'd really like to start generating a new, separate request while the old batch is still generating, without waiting for it to finish
Currently, there is no option for it. You will have to wait until other software packages support it.
@catid How much RAM do you use for that? My 31 GB gets overfilled when quantizing the model
I tried these two quantization approaches:
Both result in the same error in vLLM:
The GEMM version works fine though
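For reference, a minimal sketch of what the two quantization configs might look like with AutoAWQ, where only the `version` field selects the kernel. This is an assumption about the setup being discussed, not the reporter's actual script; the exact kernel identifier strings can differ between AutoAWQ releases, so check the installed version.

```python
# Hedged sketch (assumed setup): AutoAWQ quantization configs differing only
# in the kernel "version" field. "GEMM" is the variant reported to work in
# vLLM; the GEMV-family kernels are the ones at issue in this thread.
gemm_config = {
    "zero_point": True,   # asymmetric quantization with zero points
    "q_group_size": 128,  # weights quantized in groups of 128
    "w_bit": 4,           # 4-bit weights
    "version": "GEMM",    # kernel selection; this variant loads in vLLM
}
# Same quantization parameters, different kernel layout:
gemv_config = {**gemm_config, "version": "GEMV"}

# The quantization run itself would then look roughly like
# (requires a GPU and the autoawq package; shown as comments only):
#   from awq import AutoAWQForCausalLM
#   from transformers import AutoTokenizer
#   model = AutoAWQForCausalLM.from_pretrained(model_path)
#   tokenizer = AutoTokenizer.from_pretrained(model_path)
#   model.quantize(tokenizer, quant_config=gemm_config)
```

Since the two configs share every field except `version`, a checkpoint that loads with one kernel but fails with the other points at the kernel/layout support in vLLM rather than at the quantization parameters.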