The GPTQ quantization process is currently implemented to run on a single GPU only and may require substantial amounts of memory for larger models (we support multi-GPU execution only for our inference benchmarks). While the GPTQ algorithm could in theory be sharded effectively across GPUs, doing so would be quite tricky due to the matrix-inverse and Cholesky decomposition operations.
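For context, here is a minimal sketch of the kind of dense linear algebra involved, assuming a layer with input dimension `d`; the shapes, the calibration setup, and the damping factor are illustrative assumptions, not the repo's exact code:

```python
import torch

# Hypothetical setup: H is the d x d Hessian proxy for one layer,
# accumulated from calibration activations X as 2 * X @ X.T.
d = 4096
X = torch.randn(d, 1024, device="cuda")
H = 2 * X @ X.T
# Dampen the diagonal so the Cholesky factorization stays numerically
# stable (the 1% factor is an assumed value for this sketch).
H += 0.01 * torch.diag(H).mean() * torch.eye(d, device="cuda")
# GPTQ-style quantization works with the upper Cholesky factor of H^-1.
# This dense d x d inversion/factorization is the step that is hard to
# shard across GPUs, since it couples all d dimensions at once.
L = torch.linalg.cholesky(H)
Hinv = torch.cholesky_inverse(L)
Hinv_upper = torch.linalg.cholesky(Hinv, upper=True)
```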
@efrantar I am not sure I understand why GPTQ models can't be distributed across devices. And what dimension would the matrix inversion have?
I'm trying to run OPT-30B on 4×2080Ti; however, the following error message appears when loading the parameters.
How can I make it work?
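In case it helps, one common way to spread a large checkpoint over several GPUs for inference is Hugging Face Accelerate's `device_map="auto"` loading. This is a general sketch, not this repo's own loader, and note that 4× 11 GB cards still cannot hold the roughly 60 GB of FP16 OPT-30B weights without CPU/disk offloading or quantized weights:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Shard the FP16 checkpoint across all visible GPUs; layers that don't
# fit are offloaded to CPU/disk (requires `accelerate` to be installed).
model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-30b",
    torch_dtype=torch.float16,
    device_map="auto",          # place layers across GPUs automatically
    offload_folder="offload",   # spill overflow weights to disk
)
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-30b")
```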