EleutherAI / lm-evaluation-harness

A framework for few-shot evaluation of language models.
https://www.eleuther.ai
MIT License

KeyError: 'Cache only has 0 layers, attempted to access layer with index 0' #1250

Closed kirayomato closed 9 months ago

kirayomato commented 10 months ago

I tried to test the performance of an AutoGPTQ-quantized LLaVA model, but got this error.

Since LLaVA is a VLM, I manually changed the model_type in its config to llama, which allowed the model to load successfully and work fine in other applications, but here it fails with the error below.

command

lm_eval --model hf --model_args pretrained=TheBloke/llava-v1.5-13B-GPTQ,autogptq=True --tasks hellaswag --device cuda:0 --batch_size 8

error logs

2024-01-06:22:04:21,895 INFO     [__main__.py:156] Verbosity set to INFO
2024-01-06:22:04:25,208 WARNING  [__init__.py:178] Some tasks could not be loaded due to missing dependencies. Run with `--verbosity DEBUG` for full details.
2024-01-06:22:04:28,279 WARNING  [__init__.py:178] Some tasks could not be loaded due to missing dependencies. Run with `--verbosity DEBUG` for full details.
2024-01-06:22:04:28,280 INFO     [__main__.py:229] Selected Tasks: ['hellaswag']
2024-01-06:22:04:28,283 INFO     [huggingface.py:146] Using device 'cuda:0'
2024-01-06:22:04:34,461 INFO     [_base.py:888] lm_head not been quantized, will be ignored when make_quant.
2024-01-06:22:04:35,435 INFO     [modeling.py:835] We will use 90% of the memory on device 0 for storing the model, and 10% for the buffer to avoid OOM. You can set `max_memory` in to a higher value to use more memory (at your own risk).
2024-01-06:22:04:39,022 WARNING  [fused_llama_mlp.py:280] Skipping module injection for FusedLlamaMLPForQuantizedModel as currently not supported with use_triton=False.
2024-01-06:22:05:34,664 INFO     [task.py:337] Building contexts for task on rank 0...
2024-01-06:22:05:38,170 INFO     [evaluator.py:314] Running loglikelihood requests
  0%|                                                                                | 0/40168 [00:00<?, ?it/s]Traceback (most recent call last):
  File "C:\Users\Nusri\.conda\envs\gptq\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\Nusri\.conda\envs\gptq\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "C:\Users\Nusri\.conda\envs\gptq\Scripts\lm_eval.exe\__main__.py", line 7, in <module>
  File "H:\lm-evaluation-harness\lm_eval\__main__.py", line 231, in cli_evaluate
    results = evaluator.simple_evaluate(
  File "H:\lm-evaluation-harness\lm_eval\utils.py", line 415, in _wrapper
    return fn(*args, **kwargs)
  File "H:\lm-evaluation-harness\lm_eval\evaluator.py", line 150, in simple_evaluate
    results = evaluate(
  File "H:\lm-evaluation-harness\lm_eval\utils.py", line 415, in _wrapper
    return fn(*args, **kwargs)
  File "H:\lm-evaluation-harness\lm_eval\evaluator.py", line 325, in evaluate
    resps = getattr(lm, reqtype)(cloned_reqs)
  File "H:\lm-evaluation-harness\lm_eval\models\huggingface.py", line 759, in loglikelihood
    return self._loglikelihood_tokens(new_reqs)
  File "H:\lm-evaluation-harness\lm_eval\models\huggingface.py", line 973, in _loglikelihood_tokens
    self._model_call(batched_inps, **call_kwargs), dim=-1
  File "H:\lm-evaluation-harness\lm_eval\models\huggingface.py", line 690, in _model_call
    return self.model(inps).logits
  File "C:\Users\Nusri\.conda\envs\gptq\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\Users\Nusri\.conda\envs\gptq\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\Nusri\.conda\envs\gptq\lib\site-packages\auto_gptq\modeling\_base.py", line 442, in forward
    return self.model(*args, **kwargs)
  File "C:\Users\Nusri\.conda\envs\gptq\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\Users\Nusri\.conda\envs\gptq\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\Nusri\.conda\envs\gptq\lib\site-packages\transformers\models\llama\modeling_llama.py", line 1181, in forward
    outputs = self.model(
  File "C:\Users\Nusri\.conda\envs\gptq\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\Users\Nusri\.conda\envs\gptq\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\Nusri\.conda\envs\gptq\lib\site-packages\transformers\models\llama\modeling_llama.py", line 1068, in forward
    layer_outputs = decoder_layer(
  File "C:\Users\Nusri\.conda\envs\gptq\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\Users\Nusri\.conda\envs\gptq\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\Nusri\.conda\envs\gptq\lib\site-packages\transformers\models\llama\modeling_llama.py", line 796, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "C:\Users\Nusri\.conda\envs\gptq\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\Users\Nusri\.conda\envs\gptq\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\Nusri\.conda\envs\gptq\lib\site-packages\auto_gptq\nn_modules\fused_llama_attn.py", line 62, in forward
    kv_seq_len += past_key_value[0].shape[-2]
  File "C:\Users\Nusri\.conda\envs\gptq\lib\site-packages\transformers\cache_utils.py", line 78, in __getitem__
    raise KeyError(f"Cache only has {len(self)} layers, attempted to access layer with index {layer_idx}")
KeyError: 'Cache only has 0 layers, attempted to access layer with index 0'
  0%|                                                                                | 0/40168 [00:00<?, ?it/s]
Press any key to continue . . .
StellaAthena commented 10 months ago

This may have been fixed recently in the transformers library: https://github.com/huggingface/transformers/issues/27985. Try installing transformers from source and running again.
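
If it helps, installing from source is typically just a pip install against the GitHub repo, e.g.:

# install the current development version of transformers from GitHub
pip install --upgrade git+https://github.com/huggingface/transformers.git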

kirayomato commented 10 months ago

I solved this bug by replacing https://github.com/EleutherAI/lm-evaluation-harness/blob/ecb1df28f6de2495da560c21b891a00133372337/lm_eval/models/huggingface.py#L492 with self._model = transformers.AutoModelForCausalLM.from_pretrained
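
For anyone who wants to try the same idea without patching the harness, here is a minimal standalone sketch (the model id and device placement are just the ones from the command above; loading GPTQ weights through transformers this way assumes optimum and auto-gptq are installed):

import transformers

# Load the GPTQ checkpoint through transformers' auto class instead of
# AutoGPTQ's from_quantized loader. transformers dispatches the quantized
# layers via optimum/auto-gptq but does not inject the fused llama attention
# module that fails against the newer Cache API.
model = transformers.AutoModelForCausalLM.from_pretrained(
    "TheBloke/llava-v1.5-13B-GPTQ",
    device_map="cuda:0",  # place the whole model on the first GPU
)
tokenizer = transformers.AutoTokenizer.from_pretrained("TheBloke/llava-v1.5-13B-GPTQ")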

baberabb commented 10 months ago

I solved this bug by replacing

https://github.com/EleutherAI/lm-evaluation-harness/blob/ecb1df28f6de2495da560c21b891a00133372337/lm_eval/models/huggingface.py#L492

with self._model = transformers.AutoModelForCausalLM.from_pretrained

I think this is equivalent to autogptq=False?

kirayomato commented 10 months ago

I solved this bug by replacing https://github.com/EleutherAI/lm-evaluation-harness/blob/ecb1df28f6de2495da560c21b891a00133372337/lm_eval/models/huggingface.py#L492

with self._model = transformers.AutoModelForCausalLM.from_pretrained

I think this is equivalent to autogptq=False?

I tried to run without autogptq=True and found that --device cuda:0 does not take effect; I had to add device_map=cuda:0 to model_args. Otherwise, the model is not loaded onto the GPU correctly.
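
Concretely, the adjusted command looked roughly like this (same task and batch size as before; the device now comes from model_args rather than --device):

lm_eval --model hf --model_args pretrained=TheBloke/llava-v1.5-13B-GPTQ,device_map=cuda:0 --tasks hellaswag --batch_size 8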

haileyschoelkopf commented 10 months ago

I tried to run without autogptq=True and found that --device cuda:0 does not take effect; I had to add device_map=cuda:0 to model_args. Otherwise, the model is not loaded onto the GPU correctly.

I suspect some default behavior has changed to make device_map=auto the default when using quantized models. We can patch around this, but I want to find the root cause, as I'm fairly certain the behavior was not like this previously.

DavidePaglieri commented 10 months ago

I am encountering the same error with llama-2-7b-hf. Still looking for a fix.

haileyschoelkopf commented 10 months ago

I should have a fix pushed tomorrow for this!

DavidePaglieri commented 9 months ago

I am still encountering this error when testing Llama-2-7b-hf quantized with GPTQ.

  File "/home/user/AutoGPTQ/harness_test.py", line 29, in main
    results = evaluator.simple_evaluate(
  File "/home/user/lm-evaluation-harness/lm_eval/utils.py", line 415, in _wrapper
    return fn(*args, **kwargs)
  File "/home/user/lm-evaluation-harness/lm_eval/evaluator.py", line 151, in simple_evaluate
    results = evaluate(
  File "/home/user/lm-evaluation-harness/lm_eval/utils.py", line 415, in _wrapper
    return fn(*args, **kwargs)
  File "/home/user/lm-evaluation-harness/lm_eval/evaluator.py", line 326, in evaluate
    resps = getattr(lm, reqtype)(cloned_reqs)
  File "/home/user/lm-evaluation-harness/lm_eval/models/huggingface.py", line 1122, in generate_until
    cont = self._model_generate(
  File "/home/user/lm-evaluation-harness/lm_eval/models/huggingface.py", line 716, in _model_generate
    return self.model.generate(
  File "/home/user/AutoGPTQ/auto_gptq/modeling/_base.py", line 448, in generate
    return self.model.generate(**kwargs)
  File "/home/user/.local/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/user/.local/lib/python3.10/site-packages/transformers/generation/utils.py", line 1718, in generate
    return self.greedy_search(
  File "/home/user/.local/lib/python3.10/site-packages/transformers/generation/utils.py", line 2579, in greedy_search
    outputs = self(
  File "/home/user/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/user/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/user/.local/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 1181, in forward
    outputs = self.model(
  File "/home/user/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/user/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/user/.local/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 1068, in forward
    layer_outputs = decoder_layer(
  File "/home/user/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/user/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/user/.local/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 796, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/home/user/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/user/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/user/AutoGPTQ/auto_gptq/nn_modules/fused_llama_attn.py", line 62, in forward
    kv_seq_len += past_key_value[0].shape[-2]
  File "/home/user/.local/lib/python3.10/site-packages/transformers/cache_utils.py", line 78, in __getitem__
    raise KeyError(f"Cache only has {len(self)} layers, attempted to access layer with index {layer_idx}")
KeyError: 'Cache only has 0 layers, attempted to access layer with index 0'
  0%|                                                                                                                        
haileyschoelkopf commented 9 months ago

@DavidePaglieri Thanks for reporting this! Will look into it again.

DavidePaglieri commented 9 months ago

I managed to solve this by downgrading transformers from 4.35 to 4.34.
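
In case it helps anyone else, that just means pinning the package, e.g.:

# pin transformers below 4.35 (resolves to the latest 4.34.x release)
pip install "transformers<4.35"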

haileyschoelkopf commented 9 months ago

@DavidePaglieri does it also work if you go to transformers version 4.36.2, or any release after https://github.com/huggingface/transformers/issues/27985 was closed?

DavidePaglieri commented 9 months ago

I haven't tried, but I suppose not, since it doesn't work for some of the people on that thread with 4.36.2.

haileyschoelkopf commented 9 months ago

accelerate launch lm_eval --model hf --model_args pretrained=TheBloke/Llama-2-7b-GPTQ,autogptq=True --tasks hellaswag --device cuda:0 --batch_size 8

runs for me when I tested on 2 GPUs.

My environment:

Package                       Version          Editable project location
----------------------------- ---------------- ------------------------------------------
absl-py                       2.0.0
accelerate                    0.26.1
aiohttp                       3.9.1
aioprometheus                 23.12.0
aiosignal                     1.3.1
anyio                         4.2.0
asttokens                     2.4.1
async-timeout                 4.0.3
attrs                         23.2.0
auto-gptq                     0.7.0.dev0+cu121
bitsandbytes                  0.42.0
certifi                       2023.11.17
cfgv                          3.4.0
chardet                       5.2.0
charset-normalizer            3.3.2
click                         8.1.7
cmake                         3.28.1
colorama                      0.4.6
comm                          0.2.1
DataProperty                  1.0.1
datasets                      2.16.1
debugpy                       1.8.0
decorator                     5.1.1
dill                          0.3.7
distlib                       0.3.8
einops                        0.7.0
evaluate                      0.4.1
exceptiongroup                1.2.0
executing                     2.0.1
fastapi                       0.109.0
filelock                      3.9.0
flash-attn                    2.4.2
frozenlist                    1.4.1
fsspec                        2023.10.0
gekko                         1.0.6
h11                           0.14.0
httptools                     0.6.1
huggingface-hub               0.20.2
identify                      2.5.33
idna                          3.6
importlib-metadata            7.0.1
ipykernel                     6.29.0
ipython                       8.18.1
jedi                          0.19.1
Jinja2                        3.1.2
joblib                        1.3.2
jsonlines                     4.0.0
jsonschema                    4.20.0
jsonschema-specifications     2023.12.1
jupyter_client                8.6.0
jupyter_core                  5.7.1
lit                           17.0.6
lm_eval                       0.4.0            /weka/hailey/lm-eval/lm-evaluation-harness
lxml                          5.1.0
MarkupSafe                    2.1.3
matplotlib-inline             0.1.6
mbstrdecoder                  1.1.3
mpmath                        1.3.0
msgpack                       1.0.7
multidict                     6.0.4
multiprocess                  0.70.15
nest-asyncio                  1.6.0
networkx                      3.0
ninja                         1.11.1.1
nltk                          3.8.1
nodeenv                       1.8.0
numexpr                       2.8.8
numpy                         1.26.3
nvidia-cublas-cu11            11.10.3.66
nvidia-cublas-cu12            12.1.3.1
nvidia-cuda-cupti-cu11        11.7.101
nvidia-cuda-cupti-cu12        12.1.105
nvidia-cuda-nvrtc-cu11        11.7.99
nvidia-cuda-nvrtc-cu12        12.1.105
nvidia-cuda-runtime-cu11      11.7.99
nvidia-cuda-runtime-cu12      12.1.105
nvidia-cudnn-cu11             8.5.0.96
nvidia-cudnn-cu12             8.9.2.26
nvidia-cufft-cu11             10.9.0.58
nvidia-cufft-cu12             11.0.2.54
nvidia-curand-cu11            10.2.10.91
nvidia-curand-cu12            10.3.2.106
nvidia-cusolver-cu11          11.4.0.1
nvidia-cusolver-cu12          11.4.5.107
nvidia-cusparse-cu11          11.7.4.91
nvidia-cusparse-cu12          12.1.0.106
nvidia-nccl-cu11              2.14.3
nvidia-nccl-cu12              2.18.1
nvidia-nvjitlink-cu12         12.3.101
nvidia-nvtx-cu11              11.7.91
nvidia-nvtx-cu12              12.1.105
orjson                        3.9.10
packaging                     23.2
pandas                        2.1.4
parso                         0.8.3
pathvalidate                  3.2.0
peft                          0.7.1
pexpect                       4.9.0
pip                           23.3.1
platformdirs                  4.1.0
portalocker                   2.8.2
pre-commit                    3.6.0
prompt-toolkit                3.0.43
protobuf                      4.25.2
psutil                        5.9.7
ptyprocess                    0.7.0
pure-eval                     0.2.2
pyarrow                       14.0.2
pyarrow-hotfix                0.6
pybind11                      2.11.1
pydantic                      1.10.13
Pygments                      2.17.2
pytablewriter                 1.2.0
python-dateutil               2.8.2
python-dotenv                 1.0.0
pytz                          2023.3.post1
PyYAML                        6.0.1
pyzmq                         25.1.2
quantile-python               1.1
ray                           2.9.0
referencing                   0.32.1
regex                         2023.12.25
requests                      2.31.0
responses                     0.18.0
rouge                         1.0.1
rouge-score                   0.1.2
rpds-py                       0.16.2
sacrebleu                     2.4.0
safetensors                   0.4.1
scikit-learn                  1.3.2
scipy                         1.11.4
sentencepiece                 0.1.99
setuptools                    68.2.2
six                           1.16.0
sniffio                       1.3.0
sqlitedict                    2.1.0
stack-data                    0.6.3
starlette                     0.35.1
sympy                         1.12
tabledata                     1.3.3
tabulate                      0.9.0
tcolorpy                      0.1.4
threadpoolctl                 3.2.0
tiktoken                      0.5.2
tokenizers                    0.15.0
torch                         2.1.2+cu118
tornado                       6.4
tqdm                          4.66.1
tqdm-multiprocess             0.0.11
traitlets                     5.14.1
transformers                  4.36.2
transformers-stream-generator 0.0.4
triton                        2.1.0
typepy                        1.3.2
typing_extensions             4.9.0
tzdata                        2023.4
urllib3                       2.1.0
uvicorn                       0.25.0
uvloop                        0.19.0
virtualenv                    20.25.0
vllm                          0.2.5
watchfiles                    0.21.0
wcwidth                       0.2.13
websockets                    12.0
wheel                         0.41.2
xformers                      0.0.23.post1
xxhash                        3.4.1
yarl                          1.9.4
zipp                          3.17.0
zstandard                     0.22.0