ModelTC / llmc

[EMNLP 2024 Industry Track] This is the official PyTorch implementation of "LLMC: Benchmarking Large Language Model Quantization with a Versatile Compression Toolkit".
https://arxiv.org/abs/2405.06001
Apache License 2.0

Raising exception: CUDA out of memory when quantizing Mistral-Large-2 (123B), using export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 on H100s #28

Closed: BinFuPKU closed this issue 2 months ago

BinFuPKU commented 2 months ago

2024-08-14 18:22:56.727 | INFO | llmc.eval.eval_ppl:__init__:14 - eval_cfg : {'eval_pos': ['pretrain', 'transformed', 'fake_quant'], 'name': 'wikitext2', 'download': False, 'path': '/home/xiaoi/dq/fubin/alignment/quantization/data/evaluation/wikitext2', 'bs': 1, 'seq_len': 2048}

rank0: Traceback (most recent call last):
rank0:   File "/home/xiaoi/dq/fubin/alignment/quantization/llmc-main/llmc/main.py", line 160, in <module>
rank0:   File "/home/xiaoi/dq/fubin/alignment/quantization/llmc-main/llmc/main.py", line 50, in main
rank0:     ppl = ppl_eval.eval(model)
rank0:   File "/opt/nlp/anaconda3/envs/dq_env_h100_llmc/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
rank0:     return func(*args, **kwargs)
rank0:   File "/home/xiaoi/dq/fubin/alignment/quantization/llmc-main/llmc/eval/eval_ppl.py", line 74, in eval
rank0:   File "/opt/nlp/anaconda3/envs/dq_env_h100_llmc/lib/python3.9/site-packages/transformers/modeling_utils.py", line 2694, in cuda
rank0:     return super().cuda(*args, **kwargs)
rank0:   File "/opt/nlp/anaconda3/envs/dq_env_h100_llmc/lib/python3.9/site-packages/torch/nn/modules/module.py", line 915, in cuda
rank0:     return self._apply(lambda t: t.cuda(device))
rank0:   File "/opt/nlp/anaconda3/envs/dq_env_h100_llmc/lib/python3.9/site-packages/torch/nn/modules/module.py", line 779, in _apply
rank0:   File "/opt/nlp/anaconda3/envs/dq_env_h100_llmc/lib/python3.9/site-packages/torch/nn/modules/module.py", line 779, in _apply
rank0:   File "/opt/nlp/anaconda3/envs/dq_env_h100_llmc/lib/python3.9/site-packages/torch/nn/modules/module.py", line 779, in _apply
rank0:   [Previous line repeated 2 more times]
rank0:   File "/opt/nlp/anaconda3/envs/dq_env_h100_llmc/lib/python3.9/site-packages/torch/nn/modules/module.py", line 804, in _apply
rank0:     param_applied = fn(param)
rank0:   File "/opt/nlp/anaconda3/envs/dq_env_h100_llmc/lib/python3.9/site-packages/torch/nn/modules/module.py", line 915, in <lambda>
rank0:     return self._apply(lambda t: t.cuda(device))
rank0: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 672.00 MiB. GPU

E0814 18:23:17.575959 140250494482240 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 1291815) of binary: /opt/nlp/anaconda3/envs/dq_env_h100_llmc/bin/python
Traceback (most recent call last):
  File "/opt/nlp/anaconda3/envs/dq_env_h100_llmc/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/opt/nlp/anaconda3/envs/dq_env_h100_llmc/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/opt/nlp/anaconda3/envs/dq_env_h100_llmc/lib/python3.9/site-packages/torch/distributed/run.py", line 879, in main
    run(args)
  File "/opt/nlp/anaconda3/envs/dq_env_h100_llmc/lib/python3.9/site-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/opt/nlp/anaconda3/envs/dq_env_h100_llmc/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/nlp/anaconda3/envs/dq_env_h100_llmc/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
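(Context: the failure occurs in the perplexity evaluation, where transformers' model.cuda() tries to place the entire model on one device. A back-of-envelope check, assuming fp16/bf16 weights: 123B parameters × 2 bytes ≈ 246 GB, roughly three times the 80 GB of a single H100, so this call cannot succeed regardless of how many GPUs CUDA_VISIBLE_DEVICES lists.)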

Harahan commented 2 months ago

We only support multi-GPU quantization for AWQ, and a single GPU is enough. Additionally, please use the following eval settings (note the added inference_per_block: True):


eval:
    eval_pos: [pretrain, transformed, fake_quant]
    name: wikitext2
    download: False
    path: eval data path
    bs: 1
    inference_per_block: True
    seq_len: 2048
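
Roughly speaking, inference_per_block streams the model through the GPU one block at a time instead of calling .cuda() on the whole model, so peak GPU memory is bounded by a single block. A minimal sketch of the idea (illustrative only, not llmc's actual implementation; the blocks here are generic nn.Modules, whereas real decoder layers would also need attention masks and position ids):

    import torch
    import torch.nn as nn

    @torch.no_grad()
    def forward_per_block(blocks, x, device="cuda"):
        # `blocks`: an iterable of nn.Module layers, resident on CPU between uses.
        x = x.to(device)
        for block in blocks:
            block.to(device)          # only this block occupies GPU memory
            x = block(x)
            block.to("cpu")           # evict it before loading the next block
            torch.cuda.empty_cache()  # release the cached allocation
        return x

The trade-off is extra host-to-device transfer per block, which is why this mode is slower than keeping the whole model on the GPU.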
BinFuPKU commented 2 months ago

Nice, it works well now, but it takes too much time to quantize Mistral-Large-2 (123B).

Harahan commented 2 months ago

I think it is the evaluation that costs most of the time; you can remove the eval_pos entries from your config to speed things up.
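
For example (a sketch of the trimmed config; I'm assuming the loader skips all three evaluation passes when eval_pos is empty, equivalently you can delete the entries):

    eval:
        eval_pos: []    # skip pretrain/transformed/fake_quant evaluation
        name: wikitext2
        download: False
        path: eval data path
        bs: 1
        inference_per_block: True
        seq_len: 2048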