NVIDIA / TensorRT-Model-Optimizer

TensorRT Model Optimizer is a unified library of state-of-the-art model optimization techniques such as quantization, pruning, distillation, etc. It compresses deep learning models for downstream deployment frameworks like TensorRT-LLM or TensorRT to optimize inference speed on NVIDIA GPUs.
https://nvidia.github.io/TensorRT-Model-Optimizer

Is Starcoder2 supported? #9

Closed by wxsms 6 months ago

wxsms commented 6 months ago

I tried the latest version of modelopt with a Starcoder2 model to perform FP8 quantization:

cd llm_ptq

python3 hf_ptq.py \
  --pyt_ckpt_path /mnt/models/source \
  --dtype bf16 \
  --qformat fp8 \
  --export_path /tmp/quant \
  --inference_tensor_parallel 2

which failed with the following output:

723 TensorQuantizers found in model
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Loading extension modelopt_cuda_ext...
[NeMo W 2024-05-16 07:27:17 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/torch/utils/cpp_extension.py:1967: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation. 
    If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
      warnings.warn(

Loading extension modelopt_cuda_ext_fp8...
[NeMo W 2024-05-16 07:27:55 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/torch/utils/cpp_extension.py:1967: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation. 
    If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
      warnings.warn(

--------
example test input: ['LONDON, England (Reuters) -- Harry Potter star Daniel Radcliffe gains access to a reported £20 million ($41.1 million) fortune as he turns 18 on Monday, but he insists the money won\'t cast a spell on him. Daniel Radcliffe as Harry Potter in "Harry Potter and the Order of the Phoenix" To the disappointment of gossip columnists around the world, the young actor says he has no plans to fritter his cash away on fast cars, drink and celebrity parties. "I don\'t plan to be one of those people who, as soon as they turn 18, suddenly buy themselves a massive sports car collection or something similar," he told an Australian interviewer earlier this month. "I don\'t think I\'ll be particularly extravagant. "The things I like buying are things that cost about 10 pounds -- books and CDs and DVDs." At 18, Radcliffe will be able to gamble in a casino, buy a drink in a pub or see the horror film "Hostel: Part II," currently six places below his number one movie on the UK box office chart. Details of how he\'ll mark his landmark birthday are under wraps. His agent and publicist had no comment on his plans. "I\'ll definitely have some sort of party," he said in an interview. "Hopefully none of you will be reading about it." Radcliffe\'s earnings from the first five Potter films have been held in a trust fund which he has not been able to touch. Despite his growing fame and riches, the actor says he is keeping his feet firmly on the ground. "People are always looking to say \'kid star goes off the rails,\'" he told reporters last month. "But I try very hard not to go that way because it would be too easy for them." His latest outing as the boy wizard in "Harry Potter and the Order of the Phoenix" is breaking records on both sides of the Atlantic and he will reprise the role in the last two films.  Watch I-Reporter give her review of Potter\'s latest ». There is life beyond Potter, however. The Londoner has filmed a TV movie called "']
--------
example outputs before ptq: ['The Great Potter Switch," which will air on Channel 4 in the UK. The actor will also be appearing in a new version of the "Harry Potter" series, which will be released in the US. "I\'m very excited about the new series," Radcliffe said in an interview. "It\'s going to be a great experience for me." He added, "I\'m not sure when the new series will be released, but I\'m sure it will']
--------
example outputs after ptq: ['The Great Potter Scandal" and is currently in the process of writing a book. "I\'m not sure when the book will be finished," he said last month. "I\'m sure I will be able to make a few more movies, however." Radcliffe is also set to appear in a new BBC series, "The Great Potter Scandal," which is being written by <NAME> and <NAME>. The series is set to explore the Potter scand']
Unknown model type Starcoder2ForCausalLM. Continue exporting...
torch.distributed not initialized, assuming single world_size.
torch.distributed not initialized, assuming single world_size.
torch.distributed not initialized, assuming single world_size.
torch.distributed not initialized, assuming single world_size.
torch.distributed not initialized, assuming single world_size.
torch.distributed not initialized, assuming single world_size.
torch.distributed not initialized, assuming single world_size.
current rank: 0, tp rank: 0, pp rank: 0
torch.distributed not initialized, assuming single world_size.
torch.distributed not initialized, assuming single world_size.
Cannot export model to the model_config. The modelopt-optimized model state_dict (including the quantization factors) is saved to /tmp/quant/modelopt_model.0.pth using torch.save for further inspection.
Detailed export error: 'unknown:Starcoder2ForCausalLM'
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/modelopt/torch/export/model_config_export.py", line 364, in export_tensorrt_llm_checkpoint
    for tensorrt_llm_config, weights in torch_to_tensorrt_llm_checkpoint(
  File "/usr/local/lib/python3.10/dist-packages/modelopt/torch/export/model_config_export.py", line 312, in torch_to_tensorrt_llm_checkpoint
    tensorrt_llm_config = convert_to_tensorrt_llm_config(model_config, tp_size_overwrite)
  File "/usr/local/lib/python3.10/dist-packages/modelopt/torch/export/tensorrt_llm_utils.py", line 84, in convert_to_tensorrt_llm_config
    "architecture": MODEL_NAME_TO_HF_ARCH_MAP[decoder_type],
KeyError: 'unknown:Starcoder2ForCausalLM'
Quantized model exported to :/tmp/quant. Total time used 69.52810740470886s
pip list:
Package                       Version              Editable project location
----------------------------- -------------------- -----------------------------------------
absl-py                       2.1.0
accelerate                    0.25.0
aiohttp                       3.9.3
aiosignal                     1.3.1
alabaster                     0.7.16
aniso8601                     9.0.1
annotated-types               0.6.0
antlr4-python3-runtime        4.9.3
appdirs                       1.4.4
asttokens                     2.4.1
async-timeout                 4.0.3
attrdict                      2.0.1
attrs                         23.2.0
audioread                     3.0.1
Babel                         2.15.0
bandit                        1.7.7
beautifulsoup4                4.12.3
bitsandbytes                  0.43.1
black                         19.10b0
blinker                       1.4
boto3                         1.34.106
botocore                      1.34.106
braceexpand                   0.1.7
Brotli                        1.1.0
build                         1.2.1
cdifflib                      1.2.6
certifi                       2024.2.2
cffi                          1.16.0
cfgv                          3.4.0
charset-normalizer            3.3.2
click                         8.0.2
cloudpickle                   3.0.0
colorama                      0.4.6
colored                       2.2.4
coloredlogs                   15.0.1
comm                          0.2.2
contourpy                     1.2.1
coverage                      7.5.1
cryptography                  3.4.8
cuda-python                   12.4.0
cutlass_library               3.5.0                /app/tensorrt_llm/3rdparty/cutlass/python
cycler                        0.12.1
Cython                        0.29.37
datasets                      2.14.7
dbus-python                   1.2.18
decorator                     5.1.1
diffusers                     0.27.0
dill                          0.3.7
Distance                      0.1.3
distlib                       0.3.8
distro                        1.7.0
docker-pycreds                0.4.0
docopt                        0.6.2
docutils                      0.21.2
editdistance                  0.8.1
einops                        0.8.0
evaluate                      0.4.2
exceptiongroup                1.2.1
execnet                       2.1.1
executing                     2.0.1
faiss-cpu                     1.8.0
fasttext                      0.9.2
filelock                      3.13.1
fire                          0.6.0
Flask                         2.2.5
Flask-RESTful                 0.3.10
flatbuffers                   24.3.25
fonttools                     4.51.0
frozenlist                    1.4.1
fsspec                        2023.10.0
ftfy                          6.2.0
g2p-en                        2.1.0
gdown                         5.2.0
gevent                        24.2.1
geventhttpclient              2.0.2
gitdb                         4.0.11
GitPython                     3.1.43
graphviz                      0.20.3
greenlet                      3.0.3
grpcio                        1.62.1
h5py                          3.10.0
httplib2                      0.20.2
huggingface-hub               0.21.4
humanfriendly                 10.0
hydra-core                    1.2.0
identify                      2.5.36
idna                          3.6
ijson                         3.2.3
imagesize                     1.4.1
importlib-metadata            4.6.4
inflect                       7.2.1
iniconfig                     2.0.0
ipadic                        1.0.0
ipython                       8.24.0
ipywidgets                    8.1.2
isort                         5.13.2
itsdangerous                  2.2.0
janus                         1.0.0
jedi                          0.19.1
jeepney                       0.7.1
jieba                         0.42.1
Jinja2                        3.1.3
jiwer                         2.5.2
jmespath                      1.0.1
joblib                        1.4.2
jupyterlab_widgets            3.0.10
kaldi-python-io               1.2.2
kaldiio                       2.18.0
keyring                       23.5.0
kiwisolver                    1.4.5
kornia                        0.7.2
kornia_rs                     0.1.3
lark                          1.1.9
latexcodec                    3.0.0
launchpadlib                  1.10.16
lazr.restfulclient            0.14.4
lazr.uri                      1.0.6
lazy_loader                   0.4
Levenshtein                   0.22.0
librosa                       0.10.2.post1
lightning-utilities           0.11.2
llvmlite                      0.42.0
loguru                        0.7.2
lxml                          5.2.2
Markdown                      3.6
markdown-it-py                3.0.0
markdown2                     2.4.13
MarkupSafe                    2.1.5
marshmallow                   3.21.2
matplotlib                    3.9.0
matplotlib-inline             0.1.7
mdurl                         0.1.2
mecab-python3                 1.0.6
megatron-core                 0.2.0
more-itertools                8.10.0
mpi4py                        3.1.5
mpmath                        1.3.0
msgpack                       1.0.8
multidict                     6.0.5
multiprocess                  0.70.15
mypy                          1.10.0
mypy-extensions               1.0.0
nemo_text_processing          1.0.2
nemo-toolkit                  1.20.0
networkx                      3.2.1
ninja                         1.11.1.1
nltk                          3.8.1
nodeenv                       1.8.0
numba                         0.59.1
numpy                         1.23.5
nvidia-ammo                   0.9.3
nvidia-cublas-cu12            12.1.3.1
nvidia-cuda-cupti-cu12        12.1.105
nvidia-cuda-nvrtc-cu12        12.1.105
nvidia-cuda-runtime-cu12      12.1.105
nvidia-cudnn-cu12             8.9.2.26
nvidia-cufft-cu12             11.0.2.54
nvidia-curand-cu12            10.3.2.106
nvidia-cusolver-cu12          11.4.5.107
nvidia-cusparse-cu12          12.1.0.106
nvidia-modelopt               0.11.2
nvidia-nccl-cu12              2.20.5
nvidia-nvjitlink-cu12         12.4.99
nvidia-nvtx-cu12              12.1.105
oauthlib                      3.2.0
omegaconf                     2.2.3
onnx                          1.16.0
onnx-graphsurgeon             0.5.2
onnxruntime                   1.16.3
onnxruntime_extensions        0.10.1
OpenCC                        1.1.7
optimum                       1.19.1
packaging                     24.0
pandas                        2.2.1
pangu                         4.0.6.1
parameterized                 0.9.0
parso                         0.8.4
pathspec                      0.12.1
pbr                           6.0.0
pexpect                       4.9.0
pillow                        10.3.0
pip                           24.0
plac                          1.4.3
platformdirs                  4.2.1
pluggy                        1.5.0
polygraphy                    0.49.0
pooch                         1.8.1
portalocker                   2.8.2
pre-commit                    3.7.0
progress                      1.6
prompt-toolkit                3.0.43
protobuf                      4.25.3
psutil                        5.9.8
ptyprocess                    0.7.0
PuLP                          2.8.0
pure-eval                     0.2.2
py                            1.11.0
pyannote.core                 5.0.0
pyannote.database             5.1.0
pyannote.metrics              3.2.1
pyarrow                       16.0.0
pyarrow-hotfix                0.6
pybind11                      2.12.0
pybind11-stubgen              2.5.1
pybtex                        0.24.0
pybtex-docutils               1.0.3
pycparser                     2.22
pydantic                      2.7.1
pydantic_core                 2.18.2
pydub                         0.25.1
Pygments                      2.18.0
PyGObject                     3.42.1
PyJWT                         2.3.0
pynini                        2.1.5
pynvml                        11.5.0
pyparsing                     2.4.7
pypinyin                      0.51.0
pypinyin-dict                 0.8.0
pyproject_hooks               1.1.0
PySocks                       1.7.1
pytest                        8.2.0
pytest-cov                    5.0.0
pytest-forked                 1.6.0
pytest-runner                 6.0.1
pytest-xdist                  3.6.1
python-apt                    2.4.0+ubuntu3
python-dateutil               2.9.0.post0
python-rapidjson              1.16
pytorch-lightning             1.9.4
pytz                          2024.1
PyYAML                        6.0
rapidfuzz                     2.13.7
regex                         2023.12.25
requests                      2.31.0
rich                          13.7.1
rotary-emb                    0.1
rouge_score                   0.1.2
ruamel.yaml                   0.18.6
ruamel.yaml.clib              0.2.8
s3transfer                    0.10.1
sacrebleu                     2.4.2
sacremoses                    0.1.1
safetensors                   0.4.2
scikit-learn                  1.4.2
scipy                         1.13.0
SecretStorage                 3.3.1
sentence-transformers         2.7.0
sentencepiece                 0.1.99
sentry-sdk                    2.1.1
setproctitle                  1.3.3
setuptools                    65.5.1
shellingham                   1.5.4
six                           1.16.0
smmap                         5.0.1
snowballstemmer               2.2.0
sortedcontainers              2.4.0
soundfile                     0.12.1
soupsieve                     2.5
sox                           1.5.0
soxr                          0.3.7
Sphinx                        7.3.7
sphinxcontrib-applehelp       1.0.8
sphinxcontrib-bibtex          2.6.2
sphinxcontrib-devhelp         1.0.6
sphinxcontrib-htmlhelp        2.0.5
sphinxcontrib-jsmath          1.0.1
sphinxcontrib-qthelp          1.0.7
sphinxcontrib-serializinghtml 1.1.10
stack-data                    0.6.3
stevedore                     5.2.0
StrEnum                       0.4.15
sympy                         1.12
tabulate                      0.9.0
tensorboard                   2.16.2
tensorboard-data-server       0.7.2
tensorrt                      9.3.0.post12.dev1
tensorrt_llm                  0.10.0.dev2024043000
termcolor                     2.4.0
text-unidecode                1.3
textdistance                  4.6.2
texterrors                    0.4.4
threadpoolctl                 3.5.0
tiktoken                      0.7.0
tokenizers                    0.15.2
toml                          0.10.2
tomli                         2.0.1
torch                         2.3.0
torchmetrics                  1.4.0.post0
torchvision                   0.18.0
tqdm                          4.66.2
traitlets                     5.14.3
transformers                  4.39.0.dev0
transformers-stream-generator 0.0.4
triton                        2.3.0
tritonclient                  2.43.0
typed-ast                     1.5.5
typeguard                     4.2.1
typer                         0.12.3
typing_extensions             4.11.0
tzdata                        2024.1
urllib3                       2.2.1
virtualenv                    20.26.1
wadllib                       1.3.6
wandb                         0.17.0
wcwidth                       0.2.13
webdataset                    0.1.62
Werkzeug                      3.0.3
wget                          3.2
wheel                         0.42.0
widgetsnbextension            4.0.10
wrapt                         1.16.0
xxhash                        3.4.1
yarl                          1.9.4
youtokentome                  1.0.6
zipp                          1.0.0
zope.event                    5.0
zope.interface                6.2
zstandard                     0.22.0

Hardware: 2x 4090

cjluo-omniml commented 6 months ago

No, unfortunately exporting to a TRT-LLM checkpoint is not supported for this model yet, but our quantization works. If you need the support urgently, I recommend checking the Python code of modelopt.torch.export after the pip installation and modifying that package to support it yourself. Also feel free to share the diff (patch) here so we can help incorporate it into the next release.

wxsms commented 6 months ago

Thanks. May I ask, are there any plans for this at the moment?

cjluo-omniml commented 6 months ago

There might be a plan for the coming releases.

wxsms commented 6 months ago

I managed to finish the checkpoint export by simply adding "unknown:Starcoder2ForCausalLM": "GPTForCausalLM" to MODEL_NAME_TO_HF_ARCH_MAP in tensorrt_llm_utils.py. The "unknown" prefix is added by tensorrt_llm/quantization/quantize_by_modelopt.py. The engine built from that checkpoint works. Thanks for your guidance.
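For anyone reading later, the workaround can be sketched roughly as follows. This is only an illustration, not the actual modelopt source: the dictionary below is a stand-in for MODEL_NAME_TO_HF_ARCH_MAP with made-up existing entries, and lookup_architecture and register_architecture are hypothetical helpers that exist only in this sketch. The only grounded details are the key "unknown:Starcoder2ForCausalLM", the value "GPTForCausalLM", and the dictionary lookup shown in the traceback.

```python
# Stand-in for MODEL_NAME_TO_HF_ARCH_MAP in modelopt/torch/export/tensorrt_llm_utils.py.
# The two existing entries are illustrative assumptions, not the library's real table.
MODEL_NAME_TO_HF_ARCH_MAP = {
    "gpt2": "GPTForCausalLM",
    "llama": "LlamaForCausalLM",
}


def lookup_architecture(decoder_type: str, arch_map: dict[str, str]) -> str:
    """Mirror the lookup on the failing traceback line: arch_map[decoder_type]."""
    return arch_map[decoder_type]


def register_architecture(arch_map: dict[str, str], decoder_type: str, hf_arch: str) -> None:
    """Hypothetical helper: add a mapping only if the key is not already present."""
    arch_map.setdefault(decoder_type, hf_arch)


decoder_type = "unknown:Starcoder2ForCausalLM"

# Before the patch, the lookup raises the same KeyError as in the report.
try:
    lookup_architecture(decoder_type, MODEL_NAME_TO_HF_ARCH_MAP)
except KeyError as err:
    print(f"before patch: KeyError {err}")

# The workaround: map the unrecognized decoder type to the GPT architecture.
register_architecture(MODEL_NAME_TO_HF_ARCH_MAP, decoder_type, "GPTForCausalLM")
print("after patch:", lookup_architecture(decoder_type, MODEL_NAME_TO_HF_ARCH_MAP))
```

In the installed package the equivalent change is the single extra dictionary entry described above. Whether mapping Starcoder2 to GPTForCausalLM is correct for every Starcoder2 variant is not something this thread establishes, so it is worth validating the built engine's outputs.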

cjluo-omniml commented 6 months ago

Thanks @wxsms, I will add it to the next release.