NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

Compile with `TORCH_USE_CUDA_DSA` error #162

Open ovcharenkoo opened 1 year ago

ovcharenkoo commented 1 year ago

Hi all,

I am trying to follow the instructions for INT8 weight-only + INT8 KV cache for Llama-2-13b.

Following the README, I run the conversion script from inside the container:

python3 hf_llama_convert.py \
-i /code/tensorrt_llm/llms/models--meta-llama--Llama-2-13b-chat-hf \
-o /code/tensorrt_llm/llms/llama2_13B_trt_engines/int8_kv_cache/ \
--calibrate-kv-cache \
-t fp16

and get the following error:

=============== Argument ===============
out_dir: /code/tensorrt_llm/llms/llama2_13B_trt_engines/int8_kv_cache/
in_file: /code/tensorrt_llm/llms/models--meta-llama--Llama-2-13b-chat-hf
tensor_parallelism: 1
processes: 4
calibrate_kv_cache: True
smoothquant: None
storage_type: fp16
multi_query_mode: False
========================================
The model weights are not tied. Please use the `tie_weights` method before using the `infer_auto_device` function.
Loading checkpoint shards: 100%|██████████| 6/6 [00:51<00:00,  8.61s/it]
calibrating model:   0%|          | 0/512 [00:00<?, ?it/s]
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
calibrating model:   0%|▎         | 1/512 [00:05<46:39,  5.48s/it]
Traceback (most recent call last):
  File "/code/tensorrt_llm/examples/llama/hf_llama_convert.py", line 335, in <module>
    hf_gpt_converter(args)
  File "/code/tensorrt_llm/examples/llama/hf_llama_convert.py", line 184, in hf_gpt_converter
    act_range = capture_activation_range(
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/code/tensorrt_llm/examples/llama/smoothquant.py", line 199, in capture_activation_range
    model(line_encoded)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1505, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1514, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 806, in forward
    outputs = self.model(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1505, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1514, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 693, in forward
    layer_outputs = decoder_layer(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1505, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1514, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 408, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1505, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1514, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 305, in forward
    query_states = self.q_proj(hidden_states)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1505, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1560, in _call_impl
    hook_result = hook(self, args, result)
  File "/code/tensorrt_llm/examples/llama/smoothquant.py", line 170, in stat_input_hook
    stat_tensor(name, x, act_scales, "x")
  File "/code/tensorrt_llm/examples/llama/smoothquant.py", line 164, in stat_tensor
    act_scales[name][key] = torch.max(act_scales[name][key],
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
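
If it helps, I can rerun with synchronous kernel launches so the assert is reported at the failing call; a sketch of that run (the same invocation as above, with the standard `CUDA_LAUNCH_BLOCKING=1` environment variable set):

# Synchronous launches: the device-side assert surfaces at the call that triggers it
CUDA_LAUNCH_BLOCKING=1 python3 hf_llama_convert.py \
    -i /code/tensorrt_llm/llms/models--meta-llama--Llama-2-13b-chat-hf \
    -o /code/tensorrt_llm/llms/llama2_13B_trt_engines/int8_kv_cache/ \
    --calibrate-kv-cache \
    -t fp16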

The same error occurs when trying to do the SmoothQuant optimization:

python3 hf_llama_convert.py \
-i /code/tensorrt_llm/llms/models--meta-llama--Llama-2-13b-chat-hf \
-o /code/tensorrt_llm/llms/sq-llama2-13b/sq0.8/ \
-sq 0.8 \
--tensor-parallelism 1 \
--storage-type fp16

What should I recompile and how?

Thanks

juney-nvidia commented 1 year ago

@ovcharenkoo Thanks for reporting this. Could you first share the following information with us?

Based on that concrete information, we will try to reproduce it.
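
For example (a rough sketch; I'm assuming branch, GPU model, CUDA version, and driver version are the relevant fields, matching what is usually asked for):

# Collect basic environment details from inside the container
git -C /code/tensorrt_llm rev-parse --abbrev-ref HEAD      # TensorRT-LLM branch
nvidia-smi --query-gpu=name,driver_version --format=csv    # GPU model and driver version
nvcc --version                                             # CUDA toolkit version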

June

byshiue commented 11 months ago

Closing this bug because the issue is inactive. Feel free to ask here if you still have a question or issue, and we will reopen it.

ovcharenkoo commented 11 months ago

Hi June,

Branch: release/0.5.0
GPU: H100
CUDA: 12.2
Driver: 525.147.05

The same error remains when trying to do AWQ quantization on HF Llama-2-7b-chat:

python quantize.py --model_dir /code/tensorrt_llm/llms/models--meta-llama--Llama-2-7b-chat-hf \
                   --dtype float16 \
                   --qformat int4_awq \
                   --export_path /code/tensorrt_llm/llms/llama2-7b-4bit-gs128-awq.pt \
                   --calib_size 32

I build the container with `make -C docker release_build CUDA_ARCHS="86-real;90-real"`.

byshiue commented 11 months ago

Could you try the latest main branch with the new command?
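
For reference, a rough sketch of refreshing the checkout to main and rebuilding the container, reusing the build command quoted earlier (assuming the source lives at /code/tensorrt_llm; the updated conversion command itself is described in the current README):

# Update to the latest main branch and rebuild the release container
cd /code/tensorrt_llm
git fetch origin
git checkout main
git pull
make -C docker release_build CUDA_ARCHS="86-real;90-real"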