This may have been fixed recently in the transformers library: https://github.com/huggingface/transformers/issues/27985. Try installing transformers from source and trying again.
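For reference, installing from source is just the usual pip-from-git install (pin a specific commit if you need reproducibility):

pip install git+https://github.com/huggingface/transformers.git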
I solved this bug by replacing the loader call at https://github.com/EleutherAI/lm-evaluation-harness/blob/ecb1df28f6de2495da560c21b891a00133372337/lm_eval/models/huggingface.py#L492 with self._model = transformers.AutoModelForCausalLM.from_pretrained
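For context, that edit amounts to swapping the AutoGPTQ loader for the plain transformers one. A rough sketch, not the exact harness code; the keyword arguments below are illustrative, the real call forwards whatever model_args the harness collected:

# lm_eval/models/huggingface.py, in the branch that used to call the AutoGPTQ loader
self._model = transformers.AutoModelForCausalLM.from_pretrained(
    pretrained,                          # model name or path from --model_args
    revision=revision,
    torch_dtype=torch_dtype,             # illustrative; the harness resolves dtype itself
    trust_remote_code=trust_remote_code,
    **model_kwargs,                      # any remaining loader kwargs
)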
I think this is equivalent to autogptq=False?
I tried to run without autogptq=True and found that --device cuda:0 does not work; I needed to add device_map=cuda:0 to model_args. Otherwise, the model is not loaded onto the GPU correctly.
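For anyone hitting the same thing, the invocation that worked looked roughly like this (model and task taken from the command later in this thread; the harness appears to forward unrecognized model_args keys to from_pretrained, which is how device_map gets through):

lm_eval --model hf \
    --model_args pretrained=TheBloke/Llama-2-7b-GPTQ,device_map=cuda:0 \
    --tasks hellaswag \
    --batch_size 8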
I suspect some default behavior has changed to make device_map=auto the default when using quantized models. We can patch around this, but I want to find the root cause, as I'm fairly certain the behavior was not previously like this.
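One quick way to see where the weights actually end up is to load the model the same way and inspect it (a small sketch; hf_device_map is only set when accelerate dispatched the model via device_map):

import transformers

model = transformers.AutoModelForCausalLM.from_pretrained(
    "TheBloke/Llama-2-7b-GPTQ",   # checkpoint from this thread; any model works
    device_map="cuda:0",
)
# Set by accelerate when device_map is used; fall back to a parameter's device otherwise.
print(getattr(model, "hf_device_map", next(model.parameters()).device))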
I am encountering the same error with llama-2-7b-hf. Still looking for a fix
I should have a fix pushed tomorrow for this!
I am still encountering this error when testing Llama-2-7b-hf quantized with GPTQ.
File "/home/user/AutoGPTQ/harness_test.py", line 29, in main
results = evaluator.simple_evaluate(
File "/home/user/lm-evaluation-harness/lm_eval/utils.py", line 415, in _wrapper
return fn(*args, **kwargs)
File "/home/user/lm-evaluation-harness/lm_eval/evaluator.py", line 151, in simple_evaluate
results = evaluate(
File "/home/user/lm-evaluation-harness/lm_eval/utils.py", line 415, in _wrapper
return fn(*args, **kwargs)
File "/home/user/lm-evaluation-harness/lm_eval/evaluator.py", line 326, in evaluate
resps = getattr(lm, reqtype)(cloned_reqs)
File "/home/user/lm-evaluation-harness/lm_eval/models/huggingface.py", line 1122, in generate_until
cont = self._model_generate(
File "/home/user/lm-evaluation-harness/lm_eval/models/huggingface.py", line 716, in _model_generate
return self.model.generate(
File "/home/user/AutoGPTQ/auto_gptq/modeling/_base.py", line 448, in generate
return self.model.generate(**kwargs)
File "/home/user/.local/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/user/.local/lib/python3.10/site-packages/transformers/generation/utils.py", line 1718, in generate
return self.greedy_search(
File "/home/user/.local/lib/python3.10/site-packages/transformers/generation/utils.py", line 2579, in greedy_search
outputs = self(
File "/home/user/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/user/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/user/.local/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 1181, in forward
outputs = self.model(
File "/home/user/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/user/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/user/.local/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 1068, in forward
layer_outputs = decoder_layer(
File "/home/user/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/user/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/user/.local/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 796, in forward
hidden_states, self_attn_weights, present_key_value = self.self_attn(
File "/home/user/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/user/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/user/AutoGPTQ/auto_gptq/nn_modules/fused_llama_attn.py", line 62, in forward
kv_seq_len += past_key_value[0].shape[-2]
File "/home/user/.local/lib/python3.10/site-packages/transformers/cache_utils.py", line 78, in __getitem__
raise KeyError(f"Cache only has {len(self)} layers, attempted to access layer with index {layer_idx}")
KeyError: 'Cache only has 0 layers, attempted to access layer with index 0'
0%|
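For what it's worth, the trace points at AutoGPTQ's fused attention indexing the new transformers Cache object as if it were the old tuple of key/value tensors. A minimal sketch of the mismatch (assuming transformers >= 4.36, where the llama model builds a DynamicCache during generation):

from transformers.cache_utils import DynamicCache

# On the first generation step the cache object exists but holds no layers yet.
cache = DynamicCache()
print(len(cache))   # 0

# auto_gptq/nn_modules/fused_llama_attn.py still uses the legacy tuple access:
#     kv_seq_len += past_key_value[0].shape[-2]
# Indexing the empty cache reproduces the error above:
cache[0]            # KeyError: 'Cache only has 0 layers, attempted to access layer with index 0'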
@DavidePaglieri Thanks for reporting this! Will look into it again.
I managed to solve this by downgrading transformers from 4.35 to 4.34
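In case it saves anyone a lookup, the downgrade is just a pinned install (any 4.34.x release should behave the same here):

pip install "transformers>=4.34,<4.35"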
@DavidePaglieri does it also work if you go to transformers version 4.36.2, or any time after https://github.com/huggingface/transformers/issues/27985 was closed?
I haven't tried, but I suppose not, since it doesn't work for some of the people on that thread with 4.36.2.
accelerate launch lm_eval --model hf --model_args pretrained=TheBloke/Llama-2-7b-GPTQ,autogptq=True --tasks hellaswag --device cuda:0 --batch_size 8
runs for me when I tested on 2 GPUs.
My environment:
Package Version Editable project location
----------------------------- ---------------- ------------------------------------------
absl-py 2.0.0
accelerate 0.26.1
aiohttp 3.9.1
aioprometheus 23.12.0
aiosignal 1.3.1
anyio 4.2.0
asttokens 2.4.1
async-timeout 4.0.3
attrs 23.2.0
auto-gptq 0.7.0.dev0+cu121
bitsandbytes 0.42.0
certifi 2023.11.17
cfgv 3.4.0
chardet 5.2.0
charset-normalizer 3.3.2
click 8.1.7
cmake 3.28.1
colorama 0.4.6
comm 0.2.1
DataProperty 1.0.1
datasets 2.16.1
debugpy 1.8.0
decorator 5.1.1
dill 0.3.7
distlib 0.3.8
einops 0.7.0
evaluate 0.4.1
exceptiongroup 1.2.0
executing 2.0.1
fastapi 0.109.0
filelock 3.9.0
flash-attn 2.4.2
frozenlist 1.4.1
fsspec 2023.10.0
gekko 1.0.6
h11 0.14.0
httptools 0.6.1
huggingface-hub 0.20.2
identify 2.5.33
idna 3.6
importlib-metadata 7.0.1
ipykernel 6.29.0
ipython 8.18.1
jedi 0.19.1
Jinja2 3.1.2
joblib 1.3.2
jsonlines 4.0.0
jsonschema 4.20.0
jsonschema-specifications 2023.12.1
jupyter_client 8.6.0
jupyter_core 5.7.1
lit 17.0.6
lm_eval 0.4.0 /weka/hailey/lm-eval/lm-evaluation-harness
lxml 5.1.0
MarkupSafe 2.1.3
matplotlib-inline 0.1.6
mbstrdecoder 1.1.3
mpmath 1.3.0
msgpack 1.0.7
multidict 6.0.4
multiprocess 0.70.15
nest-asyncio 1.6.0
networkx 3.0
ninja 1.11.1.1
nltk 3.8.1
nodeenv 1.8.0
numexpr 2.8.8
numpy 1.26.3
nvidia-cublas-cu11 11.10.3.66
nvidia-cublas-cu12 12.1.3.1
nvidia-cuda-cupti-cu11 11.7.101
nvidia-cuda-cupti-cu12 12.1.105
nvidia-cuda-nvrtc-cu11 11.7.99
nvidia-cuda-nvrtc-cu12 12.1.105
nvidia-cuda-runtime-cu11 11.7.99
nvidia-cuda-runtime-cu12 12.1.105
nvidia-cudnn-cu11 8.5.0.96
nvidia-cudnn-cu12 8.9.2.26
nvidia-cufft-cu11 10.9.0.58
nvidia-cufft-cu12 11.0.2.54
nvidia-curand-cu11 10.2.10.91
nvidia-curand-cu12 10.3.2.106
nvidia-cusolver-cu11 11.4.0.1
nvidia-cusolver-cu12 11.4.5.107
nvidia-cusparse-cu11 11.7.4.91
nvidia-cusparse-cu12 12.1.0.106
nvidia-nccl-cu11 2.14.3
nvidia-nccl-cu12 2.18.1
nvidia-nvjitlink-cu12 12.3.101
nvidia-nvtx-cu11 11.7.91
nvidia-nvtx-cu12 12.1.105
orjson 3.9.10
packaging 23.2
pandas 2.1.4
parso 0.8.3
pathvalidate 3.2.0
peft 0.7.1
pexpect 4.9.0
pip 23.3.1
platformdirs 4.1.0
portalocker 2.8.2
pre-commit 3.6.0
prompt-toolkit 3.0.43
protobuf 4.25.2
psutil 5.9.7
ptyprocess 0.7.0
pure-eval 0.2.2
pyarrow 14.0.2
pyarrow-hotfix 0.6
pybind11 2.11.1
pydantic 1.10.13
Pygments 2.17.2
pytablewriter 1.2.0
python-dateutil 2.8.2
python-dotenv 1.0.0
pytz 2023.3.post1
PyYAML 6.0.1
pyzmq 25.1.2
quantile-python 1.1
ray 2.9.0
referencing 0.32.1
regex 2023.12.25
requests 2.31.0
responses 0.18.0
rouge 1.0.1
rouge-score 0.1.2
rpds-py 0.16.2
sacrebleu 2.4.0
safetensors 0.4.1
scikit-learn 1.3.2
scipy 1.11.4
sentencepiece 0.1.99
setuptools 68.2.2
six 1.16.0
sniffio 1.3.0
sqlitedict 2.1.0
stack-data 0.6.3
starlette 0.35.1
sympy 1.12
tabledata 1.3.3
tabulate 0.9.0
tcolorpy 0.1.4
threadpoolctl 3.2.0
tiktoken 0.5.2
tokenizers 0.15.0
torch 2.1.2+cu118
tornado 6.4
tqdm 4.66.1
tqdm-multiprocess 0.0.11
traitlets 5.14.1
transformers 4.36.2
transformers-stream-generator 0.0.4
triton 2.1.0
typepy 1.3.2
typing_extensions 4.9.0
tzdata 2023.4
urllib3 2.1.0
uvicorn 0.25.0
uvloop 0.19.0
virtualenv 20.25.0
vllm 0.2.5
watchfiles 0.21.0
wcwidth 0.2.13
websockets 12.0
wheel 0.41.2
xformers 0.0.23.post1
xxhash 3.4.1
yarl 1.9.4
zipp 3.17.0
zstandard 0.22.0
I tried to test the performance of an AutoGPTQ-quantized LLaVA model, but got this error.
Since LLaVA is a VLM, I manually changed the model_type in its config to llama, which allowed the model to be loaded successfully and work fine in other applications, but it still fails with this error here.
command
error logs