NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

Error running Mistral with ModelRunnerCpp on T4 #1182

Open Aktsvigun opened 5 months ago

Aktsvigun commented 5 months ago

System Info

(1) nvidia-smi

Wed Feb 28 09:25:27 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08              Driver Version: 545.23.08    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla T4                       On  | 00000000:00:04.0 Off |                    0 |
| N/A   41C    P8              10W /  70W |      2MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

(2) nvcc

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Wed_Nov_22_10:17:15_PST_2023
Cuda compilation tools, release 12.3, V12.3.107
Build cuda_12.3.r12.3/compiler.33567101_0

(3) pip list

Package                   Version
------------------------- -------------------
accelerate                0.25.0
aiohttp                   3.9.4rc0
aiosignal                 1.3.1
anyio                     4.3.0
argon2-cffi               23.1.0
argon2-cffi-bindings      21.2.0
arrow                     1.3.0
asttokens                 2.4.1
async-lru                 2.0.4
async-timeout             4.0.3
attrs                     23.2.0
Babel                     2.14.0
beautifulsoup4            4.12.3
bleach                    6.1.0
Brotli                    1.1.0
build                     1.0.3
cached-property           1.5.2
certifi                   2024.2.2
cffi                      1.16.0
charset-normalizer        3.3.2
colored                   2.2.4
coloredlogs               15.0.1
comm                      0.2.1
cuda-python               12.3.0
datasets                  2.17.1
debugpy                   1.8.1
decorator                 5.1.1
defusedxml                0.7.1
diffusers                 0.15.0
dill                      0.3.8
einops                    0.7.0
entrypoints               0.4
evaluate                  0.4.1
exceptiongroup            1.2.0
executing                 2.0.1
fastjsonschema            2.19.1
filelock                  3.13.1
flash-attn                2.5.0
flatbuffers               23.5.26
fqdn                      1.5.1
frozenlist                1.4.1
fsspec                    2023.10.0
h11                       0.14.0
h2                        4.1.0
hpack                     4.0.0
httpcore                  1.0.4
httpx                     0.27.0
huggingface-hub           0.20.3
humanfriendly             10.0
hyperframe                6.0.1
idna                      3.6
importlib-metadata        7.0.1
importlib_resources       6.1.2
ipykernel                 6.29.3
ipython                   8.22.1
ipywidgets                8.1.2
isoduration               20.11.0
janus                     1.0.0
jedi                      0.19.1
Jinja2                    3.1.3
json5                     0.9.17
jsonpointer               2.4
jsonschema                4.21.1
jsonschema-specifications 2023.12.1
jupyter                   1.0.0
jupyter_client            8.6.0
jupyter-console           6.6.3
jupyter_core              5.7.1
jupyter-events            0.9.0
jupyter-lsp               2.2.3
jupyter_server            2.12.5
jupyter_server_terminals  0.5.2
jupyterlab                4.1.2
jupyterlab_pygments       0.3.0
jupyterlab_server         2.25.3
jupyterlab_widgets        3.0.10
lark                      1.1.9
MarkupSafe                2.1.5
matplotlib-inline         0.1.6
mistune                   3.0.2
mpi4py                    3.1.5
mpmath                    1.3.0
multidict                 6.0.5
multiprocess              0.70.16
nb_conda_kernels          2.3.1
nbclient                  0.8.0
nbconvert                 7.16.1
nbformat                  5.9.2
nest_asyncio              1.6.0
networkx                  3.2.1
ninja                     1.11.1.1
notebook                  7.1.1
notebook_shim             0.2.4
numpy                     1.26.4
nvidia-ammo               0.7.3
nvidia-cublas-cu12        12.1.3.1
nvidia-cuda-cupti-cu12    12.1.105
nvidia-cuda-nvrtc-cu12    12.1.105
nvidia-cuda-runtime-cu12  12.1.105
nvidia-cudnn-cu12         8.9.2.26
nvidia-cufft-cu12         11.0.2.54
nvidia-curand-cu12        10.3.2.106
nvidia-cusolver-cu12      11.4.5.107
nvidia-cusparse-cu12      12.1.0.106
nvidia-nccl-cu12          2.18.1
nvidia-nvjitlink-cu12     12.3.101
nvidia-nvtx-cu12          12.1.105
onnx                      1.15.0
onnx-graphsurgeon         0.3.25
onnxruntime               1.16.3
optimum                   1.17.1
overrides                 7.7.0
packaging                 23.2
pandas                    2.2.1
pandocfilters             1.5.0
parso                     0.8.3
pexpect                   4.9.0
pickleshare               0.7.5
pillow                    10.2.0
pip                       24.0
pkgutil_resolve_name      1.3.10
platformdirs              4.2.0
polygraphy                0.49.0
prometheus_client         0.20.0
prompt-toolkit            3.0.42
protobuf                  5.26.0rc2
psutil                    5.9.8
ptyprocess                0.7.0
pure-eval                 0.2.2
pyarrow                   15.0.0
pyarrow-hotfix            0.6
pycparser                 2.21
Pygments                  2.17.2
pynvml                    11.5.0
pyproject_hooks           1.0.0
PySocks                   1.7.1
python-dateutil           2.8.2
python-json-logger        2.0.7
pytz                      2024.1
PyYAML                    6.0.1
pyzmq                     25.1.2
qtconsole                 5.5.1
QtPy                      2.4.1
referencing               0.33.0
regex                     2023.12.25
requests                  2.31.0
responses                 0.18.0
rfc3339-validator         0.1.4
rfc3986-validator         0.1.1
rpds-py                   0.18.0
safetensors               0.4.2
scipy                     1.12.0
Send2Trash                1.8.2
sentencepiece             0.2.0
setuptools                69.1.1
six                       1.16.0
sniffio                   1.3.1
soupsieve                 2.5
stack-data                0.6.2
sympy                     1.12
tensorrt                  9.2.0.post12.dev5
tensorrt-bindings         9.2.0.post12.dev5
tensorrt-libs             9.2.0.post12.dev5
tensorrt-llm              0.9.0.dev2024022000
terminado                 0.18.0
texttable                 1.7.0
tinycss2                  1.2.1
tokenizers                0.15.2
toml                      0.10.2
tomli                     2.0.1
torch                     2.1.2
torchaudio                2.1.2+cu118
torchprofile              0.0.4
torchvision               0.16.2
tornado                   6.4
tqdm                      4.66.2
traitlets                 5.14.1
transformers              4.36.1
triton                    2.1.0
types-python-dateutil     2.8.19.20240106
typing_extensions         4.10.0
typing-utils              0.1.0
tzdata                    2024.1
uri-template              1.3.0
urllib3                   2.2.1
wcwidth                   0.2.13
webcolors                 1.13
webencodings              0.5.1
websocket-client          1.7.0
wheel                     0.42.0
widgetsnbextension        4.0.10
xxhash                    3.4.1
yarl                      1.9.4
zipp                      3.17.0

Who can help?

@byshiue

Reproduction

Note: the first 3 steps are taken from official examples.

(1) Quantize with GPTQ: python llama.py PATH_TO_SAVED_HF_MODEL c4 --wbits 4 --true-sequential --groupsize 128 --save_safetensors PATH_TO_SAVE_QUANTIZED_MODEL/MODEL_NAME.safetensors

(2) Convert checkpoint: python convert_checkpoint.py --model_dir PATH_TO_SAVED_HF_MODEL --output_dir TLLM_OUTPUT_DIR --dtype float16 --ammo_quant_ckpt_path PATH_TO_SAVE_QUANTIZED_MODEL/MODEL_NAME.safetensors --use_weight_only --weight_only_precision int4_gptq --per_group

(3) Build: trtllm-build --checkpoint_dir TLLM_OUTPUT_DIR --output_dir OUTPUT_DIR --gemm_plugin float16

(4) Run with ModelRunnerCpp:

import torch
from tensorrt_llm.runtime import PYTHON_BINDINGS, ModelRunner

# ModelRunnerCpp is only exposed when the C++ bindings are available
if PYTHON_BINDINGS:
    from tensorrt_llm.runtime import ModelRunnerCpp

# 5-token prompt
input_ids = [[1, 26307, 272, 2996, 13]]

# Load the engine built in step (3)
runner = ModelRunnerCpp.from_dir(
    engine_dir='./t4_tllm_mistral_v2/',
    lora_dir=None,
    rank=0,
    lora_ckpt_source='hf',
    max_batch_size=1,
    max_input_len=1024,
    max_output_len=32,
    max_beam_width=1,
    max_attention_window_size=4096,
    sink_token_length=None
)

with torch.no_grad():
    outputs = runner.generate(
        torch.LongTensor(input_ids),
        max_new_tokens=32,
        end_id=2,
        pad_id=2,
        do_sample=False,
        streaming=False,
        output_sequence_lengths=True,
        return_dict=True,
    )
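
For reference, a minimal sketch of reading the result back (assuming the output_ids / sequence_lengths keys used in examples/run.py):

output_ids = outputs['output_ids']           # [batch_size, num_beams, seq_len]
seq_len = outputs['sequence_lengths'][0][0]  # valid length of batch 0, beam 0
print(output_ids[0][0][:seq_len].tolist())   # token ids; decode with the HF tokenizer for text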

Expected behavior

When running this pipeline on an A100, generation works normally.

Actual behavior

When running this on the T4 described in the system info above, the kernel dies in a Jupyter notebook; running it as a Python script results in:

terminate called after throwing an instance of 'tensorrt_llm::common::TllmException'
  what():  [TensorRT-LLM][ERROR] Assertion failed: No valid weight only groupwise GEMM tactic(It is usually caused by the failure to execute all candidate configurations of the CUTLASS kernel, please pay attention to the warning information when building the engine.) (/home/jenkins/agent/workspace/LLM/main/L0_MergeRequest/tensorrt_llm/cpp/tensorrt_llm/plugins/weightOnlyGroupwiseQuantMatmulPlugin/weightOnlyGroupwiseQuantMatmulPlugin.cpp:511)
1       0x7f4e18713864 /opt/conda/envs/llm/lib/python3.10/site-packages/tensorrt_llm/libs/libnvinfer_plugin_tensorrt_llm.so(+0x38864) [0x7f4e18713864]
2       0x7f4e1874fdb0 tensorrt_llm::plugins::WeightOnlyGroupwiseQuantMatmulPlugin::enqueue(nvinfer1::PluginTensorDesc const*, nvinfer1::PluginTensorDesc const*, void const* const*, void* const*, void*, CUstream_st*) + 1632
3       0x7f4cc938eba9 /opt/conda/envs/llm/lib/python3.10/site-packages/tensorrt_libs/libnvinfer.so.9(+0x10cdba9) [0x7f4cc938eba9]
4       0x7f4cc93646af /opt/conda/envs/llm/lib/python3.10/site-packages/tensorrt_libs/libnvinfer.so.9(+0x10a36af) [0x7f4cc93646af]
5       0x7f4cc9366320 /opt/conda/envs/llm/lib/python3.10/site-packages/tensorrt_libs/libnvinfer.so.9(+0x10a5320) [0x7f4cc9366320]
6       0x7f4bf9982177 tensorrt_llm::runtime::GptSession::executeContextStep(std::vector<tensorrt_llm::runtime::GenerationInput, std::allocator<tensorrt_llm::runtime::GenerationInput> > const&, std::vector<tensorrt_llm::runtime::GenerationOutput, std::allocator<tensorrt_llm::runtime::GenerationOutput> >&, std::vector<int, std::allocator<int> > const&, tensorrt_llm::batch_manager::kv_cache_manager::KVCacheManager const*) + 887
7       0x7f4bf9983465 tensorrt_llm::runtime::GptSession::generateBatched(std::vector<tensorrt_llm::runtime::GenerationOutput, std::allocator<tensorrt_llm::runtime::GenerationOutput> >&, std::vector<tensorrt_llm::runtime::GenerationInput, std::allocator<tensorrt_llm::runtime::GenerationInput> > const&, tensorrt_llm::runtime::SamplingConfig const&, std::function<void (int, bool)> const&) + 3189
8       0x7f4bf9984a60 tensorrt_llm::runtime::GptSession::generate(tensorrt_llm::runtime::GenerationOutput&, tensorrt_llm::runtime::GenerationInput const&, tensorrt_llm::runtime::SamplingConfig const&) + 2096
9       0x7f4c740f5999 /opt/conda/envs/llm/lib/python3.10/site-packages/tensorrt_llm/bindings.cpython-310-x86_64-linux-gnu.so(+0x43999) [0x7f4c740f5999]
10      0x7f4c740df7b0 /opt/conda/envs/llm/lib/python3.10/site-packages/tensorrt_llm/bindings.cpython-310-x86_64-linux-gnu.so(+0x2d7b0) [0x7f4c740df7b0]
11      0x55a890cd8b15 python(+0x14bb15) [0x55a890cd8b15]
12      0x55a890ccfa62 _PyObject_MakeTpCall + 338
13      0x55a890c7146c python(+0xe446c) [0x55a890c7146c]
14      0x55a890d75439 _PyEval_EvalFrameDefault + 18889
15      0x55a890d543a5 python(+0x1c73a5) [0x55a890d543a5]
16      0x55a890d71cca _PyEval_EvalFrameDefault + 4698
17      0x55a890d521b9 python(+0x1c51b9) [0x55a890d521b9]
18      0x55a890e05f67 PyEval_EvalCode + 135
19      0x55a890e06029 python(+0x279029) [0x55a890e06029]
20      0x55a890e2bc94 python(+0x29ec94) [0x55a890e2bc94]
21      0x55a890e32149 python(+0x2a5149) [0x55a890e32149]
22      0x55a890e322ff _PyRun_SimpleFileObject + 431
23      0x55a890e32403 _PyRun_AnyFileObject + 67
24      0x55a890e33318 Py_RunMain + 920
25      0x55a890e33469 Py_BytesMain + 57
26      0x7f4e1c39709b __libc_start_main + 235
27      0x55a890d9e2d1 python(+0x2112d1) [0x55a890d9e2d1]
[akim-t4-1:16418] *** Process received signal ***
[akim-t4-1:16418] Signal: Aborted (6)
[akim-t4-1:16418] Signal code:  (-6)
[akim-t4-1:16418] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x12730)[0x7f4e1c6dc730]
[akim-t4-1:16418] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x10b)[0x7f4e1c3aa8eb]
[akim-t4-1:16418] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x121)[0x7f4e1c395535]
[akim-t4-1:16418] [ 3] /lib/x86_64-linux-gnu/libstdc++.so.6(+0x8c983)[0x7f4dd904d983]
[akim-t4-1:16418] [ 4] /lib/x86_64-linux-gnu/libstdc++.so.6(+0x928c6)[0x7f4dd90538c6]
[akim-t4-1:16418] [ 5] /lib/x86_64-linux-gnu/libstdc++.so.6(+0x919d9)[0x7f4dd90529d9]
[akim-t4-1:16418] [ 6] /lib/x86_64-linux-gnu/libstdc++.so.6(__gxx_personality_v0+0x275)[0x7f4dd90532d5]
[akim-t4-1:16418] [ 7] /opt/conda/envs/llm/lib/python3.10/site-packages/numpy/core/../../../../libgcc_s.so.1(+0x12743)[0x7f4e19850743]
[akim-t4-1:16418] [ 8] /opt/conda/envs/llm/lib/python3.10/site-packages/numpy/core/../../../../libgcc_s.so.1(_Unwind_RaiseException+0xf1)[0x7f4e19850ae5]
[akim-t4-1:16418] [ 9] /lib/x86_64-linux-gnu/libstdc++.so.6(__cxa_throw+0x37)[0x7f4dd9053b27]
[akim-t4-1:16418] [10] /opt/conda/envs/llm/lib/python3.10/site-packages/tensorrt_llm/libs/libnvinfer_plugin_tensorrt_llm.so(+0x38892)[0x7f4e18713892]
[akim-t4-1:16418] [11] /opt/conda/envs/llm/lib/python3.10/site-packages/tensorrt_llm/libs/libnvinfer_plugin_tensorrt_llm.so(_ZN12tensorrt_llm7plugins36WeightOnlyGroupwiseQuantMatmulPlugin7enqueueEPKN8nvinfer116PluginTensorDescES5_PKPKvPKPvSA_P11CUstream_st+0x660)[0x7f4e1874fdb0]
[akim-t4-1:16418] [12] /opt/conda/envs/llm/lib/python3.10/site-packages/tensorrt_libs/libnvinfer.so.9(+0x10cdba9)[0x7f4cc938eba9]
[akim-t4-1:16418] [13] /opt/conda/envs/llm/lib/python3.10/site-packages/tensorrt_libs/libnvinfer.so.9(+0x10a36af)[0x7f4cc93646af]
[akim-t4-1:16418] [14] /opt/conda/envs/llm/lib/python3.10/site-packages/tensorrt_libs/libnvinfer.so.9(+0x10a5320)[0x7f4cc9366320]
[akim-t4-1:16418] [15] /opt/conda/envs/llm/lib/python3.10/site-packages/tensorrt_llm/libs/libtensorrt_llm.so(_ZN12tensorrt_llm7runtime10GptSession18executeContextStepERKSt6vectorINS0_15GenerationInputESaIS3_EERS2_INS0_16GenerationOutputESaIS8_EERKS2_IiSaIiEEPKNS_13batch_manager16kv_cache_manager14KVCacheManagerE+0x377)[0x7f4bf9982177]
[akim-t4-1:16418] [16] /opt/conda/envs/llm/lib/python3.10/site-packages/tensorrt_llm/libs/libtensorrt_llm.so(_ZN12tensorrt_llm7runtime10GptSession15generateBatchedERSt6vectorINS0_16GenerationOutputESaIS3_EERKS2_INS0_15GenerationInputESaIS7_EERKNS0_14SamplingConfigERKSt8functionIFvibEE+0xc75)[0x7f4bf9983465]
[akim-t4-1:16418] [17] /opt/conda/envs/llm/lib/python3.10/site-packages/tensorrt_llm/libs/libtensorrt_llm.so(_ZN12tensorrt_llm7runtime10GptSession8generateERNS0_16GenerationOutputERKNS0_15GenerationInputERKNS0_14SamplingConfigE+0x830)[0x7f4bf9984a60]
[akim-t4-1:16418] [18] /opt/conda/envs/llm/lib/python3.10/site-packages/tensorrt_llm/bindings.cpython-310-x86_64-linux-gnu.so(+0x43999)[0x7f4c740f5999]
[akim-t4-1:16418] [19] /opt/conda/envs/llm/lib/python3.10/site-packages/tensorrt_llm/bindings.cpython-310-x86_64-linux-gnu.so(+0x2d7b0)[0x7f4c740df7b0]
[akim-t4-1:16418] [20] python(+0x14bb15)[0x55a890cd8b15]
[akim-t4-1:16418] [21] python(_PyObject_MakeTpCall+0x152)[0x55a890ccfa62]
[akim-t4-1:16418] [22] python(+0xe446c)[0x55a890c7146c]
[akim-t4-1:16418] [23] python(_PyEval_EvalFrameDefault+0x49c9)[0x55a890d75439]
[akim-t4-1:16418] [24] python(+0x1c73a5)[0x55a890d543a5]
[akim-t4-1:16418] [25] python(_PyEval_EvalFrameDefault+0x125a)[0x55a890d71cca]
[akim-t4-1:16418] [26] python(+0x1c51b9)[0x55a890d521b9]
[akim-t4-1:16418] [27] python(PyEval_EvalCode+0x87)[0x55a890e05f67]
[akim-t4-1:16418] [28] python(+0x279029)[0x55a890e06029]
[akim-t4-1:16418] [29] python(+0x29ec94)[0x55a890e2bc94]
[akim-t4-1:16418] *** End of error message ***
Aborted

Additional notes

I have seen many problems related to running on the T4 in other threads, and I suspect something in my example is incompatible with the T4 as well. Yet I cannot figure out which part needs to be modified to fix this (or whether it is possible at all).

Another interesting observation is that generation works on shorter sequences (i.e., when the input ids contain <= 4 tokens). Still, this error is very unlikely to be an OOM issue (I monitored the CPU memory and it did not explode).
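
For concreteness, a sketch of that boundary, reusing the runner from the script above (the shorter prompt is just the same ids truncated):

ok_ids = torch.LongTensor([[1, 26307, 272, 2996]])       # 4 tokens: generates fine on the T4
bad_ids = torch.LongTensor([[1, 26307, 272, 2996, 13]])  # 5 tokens: aborts with the GEMM-tactic error above
outputs = runner.generate(ok_ids, max_new_tokens=32, end_id=2, pad_id=2)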

byshiue commented 5 months ago

For tokens <= 4, we use a CUDA kernel, which should work on all GPUs. For tokens > 4, we use CUTLASS to leverage the Tensor Cores. Since that path uses some new features only supported when SM >= 80, the kernel cannot be used on the T4 (SM 75).
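
A quick way to check this locally (a sketch using PyTorch, which is already in the environment):

import torch

# A T4 reports compute capability (7, 5); an A100 reports (8, 0)
major, minor = torch.cuda.get_device_capability(0)
print(f'SM {major}{minor}: CUTLASS groupwise GEMM needs SM >= 80 -> '
      f"{'supported' if major >= 8 else 'not supported'}")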

Aktsvigun commented 5 months ago

Thank you! Let me reformulate the question then: are there any options to run ModelRunnerCpp on a T4? Maybe with a different quantization scheme, etc.?

byshiue commented 5 months ago

For GPTQ with batch_size * max_input_length > 4, we use CUTLASS for acceleration, and some of the features it relies on are only supported when SM >= 80. So TensorRT-LLM does not support GPTQ on the T4 for now.

Aktsvigun commented 5 months ago

Hi @byshiue, do I understand correctly that the T4 can't handle AWQ either? Trying to quantize with AWQ produces the following error (I'm unsure whether I should create a separate topic for this):

Traceback (most recent call last):
  File "/srv/storage/al/repo/TensorRT-LLM/examples/llama/../quantization/quantize.py", line 376, in <module>
    main(args)
  File "/srv/storage/al/repo/TensorRT-LLM/examples/llama/../quantization/quantize.py", line 283, in main
    model = quantize_model(model, quant_cfg, calib_dataloader)
  File "/srv/storage/al/repo/TensorRT-LLM/examples/llama/../quantization/quantize.py", line 220, in quantize_model
    atq.quantize(model, quant_cfg, forward_loop=calibrate_loop)
  File "/opt/conda/envs/llm/lib/python3.10/site-packages/ammo/torch/quantization/model_quant.py", line 112, in quantize
    calibrate(model, config["algorithm"], forward_loop=forward_loop)
  File "ammo/torch/quantization/model_calib.py", line 59, in ammo.torch.quantization.model_calib.calibrate
  File "ammo/torch/quantization/model_calib.py", line 185, in ammo.torch.quantization.model_calib.awq
  File "ammo/torch/quantization/model_calib.py", line 187, in ammo.torch.quantization.model_calib.awq
  File "/opt/conda/envs/llm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "ammo/torch/quantization/model_calib.py", line 330, in ammo.torch.quantization.model_calib.awq_lite
  File "/srv/storage/al/repo/TensorRT-LLM/examples/llama/../quantization/quantize.py", line 216, in calibrate_loop
    model(data)
  File "/opt/conda/envs/llm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/envs/llm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/envs/llm/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/opt/conda/envs/llm/lib/python3.10/site-packages/transformers/models/mistral/modeling_mistral.py", line 1044, in forward
    outputs = self.model(
  File "/opt/conda/envs/llm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/envs/llm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/envs/llm/lib/python3.10/site-packages/transformers/models/mistral/modeling_mistral.py", line 929, in forward
    layer_outputs = decoder_layer(
  File "/opt/conda/envs/llm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/envs/llm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/envs/llm/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/opt/conda/envs/llm/lib/python3.10/site-packages/transformers/models/mistral/modeling_mistral.py", line 654, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/opt/conda/envs/llm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/envs/llm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/envs/llm/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/opt/conda/envs/llm/lib/python3.10/site-packages/transformers/models/mistral/modeling_mistral.py", line 255, in forward
    query_states = self.q_proj(hidden_states)
  File "/opt/conda/envs/llm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/envs/llm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "ammo/torch/quantization/model_calib.py", line 294, in ammo.torch.quantization.model_calib.awq_lite.forward
NotImplementedError: Cannot copy out of meta tensor; no data!

All the system parameters are the same.

byshiue commented 5 months ago

@Aktsvigun The T4 does not support AWQ. However, the error you report should not be related to the T4. If you still have the issue, please create another issue describing what you encounter.

Aktsvigun commented 5 months ago

Thanks @byshiue! Are there any quantization schemes that the T4 does support?

byshiue commented 5 months ago

Since the T4 is not officially supported by TensorRT-LLM at the moment, we cannot guarantee which cases it can handle. You could try INT8 weight-only quantization or SmoothQuant (SQ).
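
For example, the INT8 weight-only path only changes steps (2)-(3) of the reproduction above (a sketch; the flags are assumed from the LLaMA example's convert_checkpoint.py, and SmoothQuant would use --smoothquant 0.5 instead of the weight-only flags):

python convert_checkpoint.py --model_dir PATH_TO_SAVED_HF_MODEL --output_dir TLLM_OUTPUT_DIR --dtype float16 --use_weight_only --weight_only_precision int8

trtllm-build --checkpoint_dir TLLM_OUTPUT_DIR --output_dir OUTPUT_DIR --gemm_plugin float16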