Cornell-RelaxML / quip-sharp


Is there a way to support tensor parallelism for inference? #58

Open ChuanhongLi opened 1 week ago

ChuanhongLi commented 1 week ago

@chu-tianxiang has merged inference code for models quantized by QuIP# into vLLM (https://github.com/chu-tianxiang/vllm-gptq), but the inference code currently only supports tensor_parallel_size=1. The stated reason is that the "Hadamard transform cannot be done for sharded input" (https://github.com/chu-tianxiang/QuIP-for-all).


Do you have any idea why the Hadamard transform cannot be applied to sharded input? Or is there a way to support tensor parallelism for inference?

Thanks!

tsengalb99 commented 1 week ago

The Hadamard transformation computes `x -> HSx`, so applying it to sharded inputs would require a matmul involving all the shards. That means additional communication overhead and code complexity, which is probably why that repo doesn't support it; the extra communication would likely slow down inference as well. One way to get around this is to apply the Hadamard transform per shard. This gives slightly worse theoretical guarantees (since the transform now mixes over only 1/(# shards) of the original dimension), but would probably be fine in practice.
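Here is a minimal sketch of the difference between the two approaches, in plain PyTorch rather than the repo's fused Hadamard kernels, with hypothetical names (`hadamard`, `num_shards`) for illustration only:

```python
import torch

def hadamard(n: int, dtype=torch.float32) -> torch.Tensor:
    """Orthonormal Hadamard matrix of size n (n must be a power of 2),
    built via the Sylvester construction."""
    H = torch.ones(1, 1, dtype=dtype)
    while H.shape[0] < n:
        H = torch.cat([torch.cat([H, H], dim=1),
                       torch.cat([H, -H], dim=1)], dim=0)
    return H / torch.sqrt(torch.tensor(float(n)))

n, num_shards = 1024, 4
x = torch.randn(n)              # full activation vector
S = torch.sign(torch.randn(n))  # random sign vector (the S in HSx)

# Full transform: every output coordinate mixes ALL input coordinates,
# so sharded inputs would need an all-gather before this matmul.
y_full = hadamard(n) @ (S * x)

# Per-shard transform: each rank transforms only its own slice, so no
# cross-shard communication is needed. The mixing now spans only
# n / num_shards dimensions, hence the weaker theoretical guarantee.
shard_n = n // num_shards
H_shard = hadamard(shard_n)
y_sharded = torch.cat([
    H_shard @ (S[i * shard_n:(i + 1) * shard_n]
               * x[i * shard_n:(i + 1) * shard_n])
    for i in range(num_shards)
])
```

In a tensor-parallel setting, each of the `num_shards` slices would live on a different rank, and the per-shard variant lets every rank apply its local transform independently before the sharded matmul.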