IST-DASLab / gptq

Code for the ICLR 2023 paper "GPTQ: Accurate Post-training Quantization of Generative Pretrained Transformers".
https://arxiv.org/abs/2210.17323
Apache License 2.0

How to run on multi GPUs? #7

Closed TitanSneaker closed 1 year ago

TitanSneaker commented 1 year ago

I'm trying to run opt-30b on 4×2080Ti; however, the following error message appears when loading the parameters.

Starting ...
Ready.
Traceback (most recent call last):
  File "opt.py", line 424, in <module>
    quantizers = opt_sequential(model, dataloader, DEV)
  File "/home/cciip/miniconda3/envs/int/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
    return func(*args, **kwargs)
  File "opt.py", line 83, in opt_sequential
    gptq[name] = GPTQ(subset[name])
  File "/home/cciip/private/tianjie/gptq/gptq.py", line 29, in __init__
    self.H = torch.zeros((self.columns, self.columns), device=self.dev)
RuntimeError: CUDA out of memory. Tried to allocate 196.00 MiB (GPU 0; 10.75 GiB total capacity; 9.30 GiB already allocated; 77.62 MiB free; 9.32 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

How can I make it work?
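
For context, the 196 MiB that fails to allocate matches exactly the size of the per-layer Hessian buffer `H` from the traceback, assuming `columns` equals OPT-30B's hidden size of 7168 (the names below are illustrative, not from the repo):

```python
# Back-of-the-envelope check (assumption: columns == 7168, OPT-30B's hidden size).
columns = 7168
bytes_per_fp32 = 4  # torch.zeros allocates fp32 by default
hessian_bytes = columns * columns * bytes_per_fp32
print(hessian_bytes / 2**20)  # -> 196.0 MiB, matching the failed allocation
```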

efrantar commented 1 year ago

The GPTQ quantization process is currently implemented to run on a single GPU only and can require substantial memory for larger models (we support multi-GPU execution only for our inference benchmarks). While the GPTQ algorithm could in theory be effectively sharded across GPUs, doing so would be quite tricky due to the matrix-inverse and Cholesky decomposition operations.
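
For intuition on why this is hard to shard: the inverse is computed from the full `columns × columns` Hessian in one go. A minimal sketch of that step, using a placeholder `H` rather than the repo's calibration-derived one, and assuming `columns = 7168` for OPT-30B:

```python
import torch

columns = 7168  # assumption: OPT-30B's hidden size
# Placeholder Hessian; in gptq.py, H is accumulated from calibration inputs.
H = torch.eye(columns, device="cuda")

# Both operations below need the entire matrix resident on a single device,
# which is what makes distributing the quantization step across GPUs tricky.
L = torch.linalg.cholesky(H)      # Cholesky factor of H
Hinv = torch.cholesky_inverse(L)  # inverse of H from its Cholesky factor
```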

yhyu13 commented 1 year ago

@efrantar I am not sure I understand why the GPTQ quantization process can't be distributed across devices. And what are the dimensions of the matrix being inverted?
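
For reference, the traceback above already shows the shape in question: `H` is square with side `self.columns`, which in `gptq.py` is the layer's input dimension. A hypothetical illustration for a single square projection in OPT-30B (sizes assumed, not measured):

```python
import torch.nn as nn

# Hypothetical example: one square projection in OPT-30B (hidden size 7168).
layer = nn.Linear(7168, 7168)
columns = layer.weight.shape[1]  # the layer's input dimension
print((columns, columns))        # -> (7168, 7168), the shape of H being inverted
```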