CUDA error: an illegal memory access was encountered

Thank you for your excellent work!

Currently, I am trying to reproduce KVQaunt but have encountered some errors. Your assistance with this matter would be appreciated.

1. Reproduce the bug

I followed the provided instructions and set up the environment for gradient/quant/deployment. The gradient and quantization processes performed well; I successfully computed the gradient and built the quantizer. However, when I tested the deployment code using the following instructions, I encountered the error message "CUDA error: an illegal memory access was encountered."

cp ../quant/quantizers.pickle .

CUDA_VISIBLE_DEVICES=1 python llama.py JackFram/llama-160m wikitext2 \
    --abits 4 \
    --include_sparse \
    --sparsity-threshold 0.99 \
    --quantizer-path quantizers.pickle \
    --benchmark 128 \
    --check

2. Error logs

The detailed error logs are shown as follows:

/root/anaconda3/envs/deploy/lib/python3.9/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
The model was loaded with use_flash_attention_2=True, which is deprecated and may be removed in a future release. Please use `attn_implementation="flash_attention_2"` instead.
/root/anaconda3/envs/deploy/lib/python3.9/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
splitting into 1 GPUs
/root/anaconda3/envs/deploy/lib/python3.9/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
Load quantizers.
k:  model.layers.0.self_attn.k_proj
/root/KVQuant/deployment/transformers/src/transformers/models/llama/modeling_llama.py:449: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
  self.outlier_threshold_upper = torch.tensor(quantizer[0]).cuda().half().flatten()
/root/KVQuant/deployment/transformers/src/transformers/models/llama/modeling_llama.py:450: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
  self.outlier_threshold_lower = torch.tensor(quantizer[1]).cuda().half().flatten()
/root/KVQuant/deployment/transformers/src/transformers/models/llama/modeling_llama.py:484: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
  lut_tmp = torch.tensor(self.lut)
k:  model.layers.0.self_attn.v_proj
k:  model.layers.1.self_attn.k_proj
k:  model.layers.1.self_attn.v_proj
k:  model.layers.2.self_attn.k_proj
k:  model.layers.2.self_attn.v_proj
k:  model.layers.3.self_attn.k_proj
k:  model.layers.3.self_attn.v_proj
k:  model.layers.4.self_attn.k_proj
k:  model.layers.4.self_attn.v_proj
k:  model.layers.5.self_attn.k_proj
k:  model.layers.5.self_attn.v_proj
k:  model.layers.6.self_attn.k_proj
k:  model.layers.6.self_attn.v_proj
k:  model.layers.7.self_attn.k_proj
k:  model.layers.7.self_attn.v_proj
k:  model.layers.8.self_attn.k_proj
k:  model.layers.8.self_attn.v_proj
k:  model.layers.9.self_attn.k_proj
k:  model.layers.9.self_attn.v_proj
k:  model.layers.10.self_attn.k_proj
k:  model.layers.10.self_attn.v_proj
k:  model.layers.11.self_attn.k_proj
k:  model.layers.11.self_attn.v_proj
Model type : llama
Benchmarking ...
Traceback (most recent call last):
  File "/root/KVQuant/deployment/llama.py", line 224, in <module>
    benchmark(model, input_ids, check=args.check)
  File "/root/KVQuant/deployment/llama.py", line 82, in benchmark
    out = model(
  File "/root/anaconda3/envs/deploy/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/anaconda3/envs/deploy/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/KVQuant/deployment/transformers/src/transformers/models/llama/modeling_llama.py", line 2683, in forward
    outputs = self.model(
  File "/root/anaconda3/envs/deploy/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/anaconda3/envs/deploy/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/KVQuant/deployment/transformers/src/transformers/models/llama/modeling_llama.py", line 2565, in forward
    layer_outputs = decoder_layer(
  File "/root/anaconda3/envs/deploy/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/anaconda3/envs/deploy/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1582, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/root/KVQuant/deployment/transformers/src/transformers/models/llama/modeling_llama.py", line 2250, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/root/anaconda3/envs/deploy/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/anaconda3/envs/deploy/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/KVQuant/deployment/transformers/src/transformers/models/llama/modeling_llama.py", line 1965, in forward
    attn_weights = self.kcache.forward_fused_sparse(query_states, key_states)
  File "/root/KVQuant/deployment/transformers/src/transformers/models/llama/modeling_llama.py", line 710, in forward_fused_sparse
    outliers_rescaled = outliers_rescaled.cpu()
RuntimeError: CUDA error: an illegal memory access was encountered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

According to my understanding, it appears that the error is somehow related to CUDA kernel implementation "vecquant4appendvecKsparse," which modifies the variable "outliers_rescaled".

3. Environment

OS: Ubuntu 20.04 LTS
GPU: Tesla P100-PCIE-16GB
Packages (pip list):

Package                  Version     Editable project location
------------------------ ----------- -------------------------------------
accelerate               0.29.3
aiohttp                  3.9.5
aiosignal                1.3.1
async-timeout            4.0.3
attrs                    23.2.0
certifi                  2024.2.2
charset-normalizer       3.3.2
datasets                 2.19.0
dill                     0.3.8
einops                   0.8.0
filelock                 3.14.0
flash-attn               2.5.8
frozenlist               1.4.1
fsspec                   2024.3.1
huggingface-hub          0.23.0
idna                     3.7
Jinja2                   3.1.3
kvquant                  0.1.0       /root/KVQuant/deployment
MarkupSafe               2.1.5
mpmath                   1.3.0
multidict                6.0.5
multiprocess             0.70.16
networkx                 3.2.1
ninja                    1.11.1.1
numpy                    1.26.4
nvidia-cublas-cu12       12.1.3.1
nvidia-cuda-cupti-cu12   12.1.105
nvidia-cuda-nvrtc-cu12   12.1.105
nvidia-cuda-runtime-cu12 12.1.105
nvidia-cudnn-cu12        8.9.2.26
nvidia-cufft-cu12        11.0.2.54
nvidia-curand-cu12       10.3.2.106
nvidia-cusolver-cu12     11.4.5.107
nvidia-cusparse-cu12     12.1.0.106
nvidia-nccl-cu12         2.20.5
nvidia-nvjitlink-cu12    12.4.127
nvidia-nvtx-cu12         12.1.105
packaging                24.0
pandas                   2.2.2
pip                      23.3.1
protobuf                 5.26.1
psutil                   5.9.8
pyarrow                  16.0.0
pyarrow-hotfix           0.6
python-dateutil          2.9.0.post0
pytz                     2024.1
PyYAML                   6.0.1
quant-cuda               0.0.0
regex                    2024.4.28
requests                 2.31.0
safetensors              0.4.3
sentencepiece            0.2.0
setuptools               68.2.2
six                      1.16.0
sympy                    1.12
tokenizers               0.15.2
torch                    2.3.0
tqdm                     4.66.4
transformers             4.38.0.dev0 /root/KVQuant/deployment/transformers
triton                   2.3.0
typing_extensions        4.11.0
tzdata                   2024.1
urllib3                  2.2.1
wheel                    0.43.0
xxhash                   3.4.1
yarl                     1.9.4

LLM weights: https://huggingface.co/JackFram/llama-160m

Due to hardware constraints, I intend to perform a quick test on the smaller model weights as indicated above. KVQuant is expected to work properly, as the smaller model differs from Llama-7B only in terms of weight size while sharing a similar architecture.

4、Related solutions that I have tried

As suggested in the discussion related to this CUDA error on https://github.com/pytorch/pytorch/issues/21819 , I have updated CUDA, torch, and other relevant components to the latest versions. However, I am still encountering the same error.

What's the potential problem of this error and how could I solve it?

Thanks in advance!

SqueezeAILab / KVQuant