NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

Phi-3-mini-128k error #2313

Closed: scuizhibin closed this issue 1 week ago

scuizhibin commented 2 weeks ago

Environment:
- Hardware: RTX 4090
- Driver Version: 550.107.02
- Software: CUDA release 12.4, V12.4.131

Installed Python packages:
absl-py 2.1.0 accelerate 0.31.0 aenum 3.1.15 aiofiles 23.2.1 aiohappyeyeballs 2.4.0 aiohttp 3.10.5 aiohttp-sse-client 0.2.1 aiosignal 1.3.1 altair 5.4.1 annotated-types 0.7.0 anyio 4.4.0 async-timeout 4.0.3 attrs 24.2.0 build 1.2.1 certifi 2024.8.30 charset-normalizer 3.3.2 click 8.1.7 click-option-group 0.5.6 cloudpickle 3.0.0 colored 2.2.4 coloredlogs 15.0.1 contourpy 1.3.0 cuda-python 12.6.0 cycler 0.12.1 datasets 2.14.5 diffusers 0.30.2 dill 0.3.7 distro 1.9.0 einops 0.7.0 evaluate 0.4.1 exceptiongroup 1.2.2 fastapi 0.112.2 ffmpy 0.4.0 filelock 3.15.4 flash-attn 2.5.8 flatbuffers 24.3.25 fonttools 4.53.1 frozenlist 1.4.1 fsspec 2023.6.0 gradio 4.36.0 gradio_client 1.0.1 h11 0.14.0 h5py 3.10.0 httpcore 1.0.5 httpx 0.27.2 huggingface-hub 0.24.6 humanfriendly 10.0 idna 3.8 importlib_metadata 8.4.0 importlib_resources 6.4.4 janus 1.0.0 Jinja2 3.1.4 jiter 0.5.0 joblib 1.4.2 jsonschema 4.23.0 jsonschema-specifications 2023.12.1 kiwisolver 1.4.5 lark 1.2.2 latex2mathml 3.77.0 Markdown 3.7 markdown-it-py 3.0.0 MarkupSafe 2.1.5 matplotlib 3.9.2 mdtex2html 1.3.0 mdurl 0.1.2 mpi4py 4.0.0 mpmath 1.3.0 multidict 6.0.5 multiprocess 0.70.15 narwhals 1.6.0 networkx 3.3 ninja 1.11.1.1 nltk 3.9.1 numpy 1.26.4 nvidia-cublas-cu12 12.1.3.1 nvidia-cuda-cupti-cu12 12.1.105 nvidia-cuda-nvrtc-cu12 12.1.105 nvidia-cuda-runtime-cu12 12.1.105 nvidia-cudnn-cu12 9.1.0.70 nvidia-cufft-cu12 11.0.2.54 nvidia-curand-cu12 10.3.2.106 nvidia-cusolver-cu12 11.4.5.107 nvidia-cusparse-cu12 12.1.0.106 nvidia-modelopt 0.15.1 nvidia-nccl-cu12 2.20.5 nvidia-nvjitlink-cu12 12.6.68 nvidia-nvtx-cu12 12.1.105 onnx 1.16.2 onnx-simplifier 0.4.36 onnxruntime-gpu 1.19.2 openai 1.39.0 optimum 1.22.0 orjson 3.10.7 packaging 24.1 pandas 2.2.2 pillow 10.3.0 pip 22.0.2 polygraphy 0.49.9 protobuf 5.28.0 psutil 6.0.0 PuLP 2.9.0 pyarrow 17.0.0 pyarrow-hotfix 0.6 pydantic 2.9.0b2 pydantic_core 2.23.1 pydub 0.25.1 Pygments 2.18.0 pynvml 11.5.3 pyparsing 3.1.4 pyproject_hooks 1.1.0 python-dateutil 2.9.0.post0 python-multipart 0.0.9 pytz 2024.1 PyYAML 6.0.2 referencing 0.35.1 regex 2024.7.24 requests 2.32.3 responses 0.18.0 rich 13.8.0 rouge-score 0.1.2 rpds-py 0.20.0 ruff 0.6.3 safetensors 0.4.4 scipy 1.14.1 semantic-version 2.10.0 sentencepiece 0.1.99 setuptools 59.6.0 shellingham 1.5.4 six 1.16.0 sniffio 1.3.1 sse-starlette 2.1.3 starlette 0.38.4 StrEnum 0.4.15 sympy 1.13.2 tensorrt 10.3.0 tensorrt-cu12 10.3.0 tensorrt-cu12-bindings 10.3.0 tensorrt-cu12-libs 10.3.0 tensorrt-llm 0.13.0.dev2024081300 tiktoken 0.6.0 timm 1.0.9 tokenizers 0.19.1 tomli 2.0.1 tomlkit 0.12.0 torch 2.4.0 torchao 0.5.0 torchvision 0.19.0 tqdm 4.66.5 transformers 4.41.2 transformers-stream-generator 0.0.5 triton 3.0.0 typer 0.12.5 typing_extensions 4.12.2 tzdata 2024.1 urllib3 2.2.2 uvicorn 0.30.6 websockets 11.0.3 wheel 0.37.1 xxhash 3.5.0 yarl 1.9.7 zipp 3.20.1

When I quantize the Phi-3-mini-128k model, I run two commands.

Command 1:

```
python3 ../TensorRT-LLM/examples/quantization/quantize.py --model_dir ./Phi-3-mini-128k-instruct/ --output_dir ./phi_out/ --dtype float16 --qformat fp8 --kv_cache_dtype fp8
```

Terminal output:

```
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
/usr/local/lib/python3.10/dist-packages/datasets/table.py:1421: FutureWarning: promote has been superseded by promote_options='default'.
  table = cls._concat_blocks(blocks, axis=0)
Inserted 387 quantizers
/usr/local/lib/python3.10/dist-packages/modelopt/torch/quantization/model_quant.py:131: DeprecationWarning: forward_loop should take model as argument, but got forward_loop without any arguments. This usage will be deprecated in future versions.
  return calibrate(model, config["algorithm"], forward_loop=forward_loop)
[10/10/2024-10:11:33] You are not running the flash-attention implementation, expect numerical differences.
current rank: 0, tp rank: 0, pp rank: 0
/usr/lib/python3.10/tempfile.py:1008: ResourceWarning: Implicitly cleaning up <TemporaryDirectory '/tmp/tmp481ehvj0'>
  _warnings.warn(warn_message, ResourceWarning)
```

Command 2:

```
trtllm-build --checkpoint_dir ./phi_out/ --output_dir ./phi_engine/ --gemm_plugin auto --max_batch_size 8 --max_input_len 1024 --max_seq_len 2048
```

Terminal output:

```
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/functional.py", line 1223, in slice
    input_ndim = input.ndim()
AttributeError: 'NoneType' object has no attribute 'ndim'
```
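For reference, this AttributeError is the generic Python failure of calling a method on a value that is None; a minimal, self-contained illustration with a hypothetical variable (this is not TensorRT-LLM's actual code path):

```python
# Minimal illustration of the failure mode: somewhere during the build,
# slice() receives None instead of a tensor, and calling .ndim() on None
# raises exactly the error shown above.
tensor = None  # hypothetical stand-in for the missing slice input
try:
    tensor.ndim()
except AttributeError as err:
    print(err)  # 'NoneType' object has no attribute 'ndim'
```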

How can I solve this error?

nv-guomingz commented 2 weeks ago

Thanks @scuizhibin for reporting this issue. I can reproduce it on my side.

Here is a quick workaround for fixing this issue: update your ./phi_out/config.json by changing the position_embedding_type field value from rope_gpt_neox to long_rope.

[image attachment]
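A minimal sketch of applying that edit programmatically, using only the Python standard library; it assumes position_embedding_type is a top-level key of ./phi_out/config.json (the checkpoint config written by quantize.py):

```python
import json

config_path = "./phi_out/config.json"  # checkpoint config produced by Command 1

with open(config_path) as f:
    config = json.load(f)

# Workaround from this thread: quantize.py records "rope_gpt_neox" for
# Phi-3-mini-128k, but the engine build needs "long_rope".
config["position_embedding_type"] = "long_rope"

with open(config_path, "w") as f:
    json.dump(config, f, indent=4)
```

After rewriting the config, re-run Command 2 (trtllm-build) against the same --checkpoint_dir.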

Superjomn commented 1 week ago

Closing since there has been no recent update; please feel free to reopen it later.