ChuanhongLi opened 1 week ago
The Hadamard transformation performs x -> HSx, so applying it to a sharded input would require a matmul involving all the shards. That means additional communication overhead and code complexity, which is probably why that repo doesn't support it; the extra communication would likely slow down inference as well. One way around this is to apply the Hadamard transform per shard, which gives slightly worse theoretical guarantees (since the matrix dimension is now 1/(# shards) of the original dimension) but would probably be fine in practice.
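To make the trade-off concrete, here is a small numpy sketch (not from either repo) contrasting the full transform, where every output entry mixes all input entries and thus needs an all-gather across shards, with the per-shard workaround, which is a block-diagonal transform needing no communication. The `fwht` helper is an illustrative fast Walsh-Hadamard transform, not the actual kernel used by QuIP#.

```python
import numpy as np

def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform of a length-2^k vector."""
    x = x.astype(float).copy()
    n, h = len(x), 1
    while h < n:
        for i in range(0, n, h * 2):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2
    return x / np.sqrt(n)  # scale so the transform is orthonormal

n, shards = 16, 4
x = np.random.randn(n)

# Full transform: each output coordinate depends on *all* n inputs,
# so a tensor-parallel implementation must gather every shard first.
full = fwht(x)

# Per-shard workaround: a smaller (n/shards)-dim Hadamard applied
# independently to each shard. Block-diagonal, so no cross-shard
# communication, but it only mixes within n/shards dimensions.
per_shard = np.concatenate([fwht(s) for s in np.split(x, shards)])
```

Both variants are orthonormal (they preserve the norm of `x`), so the per-shard version is still a valid rotation; the cost is only the weaker incoherence guarantee from the smaller transform dimension.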
Chu has merged inference code for models quantized by QuIP# into vLLM (https://github.com/chu-tianxiang/vllm-gptq), but the inference code currently only supports tensor_parallel_size=1. The stated reason is that the "Hadamard transform cannot be done for sharded input" (https://github.com/chu-tianxiang/QuIP-for-all).
Do you have any idea why the Hadamard transform cannot be done on sharded input? Or is there a way to support tensor parallelism for inference?
Thanks!