TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
System Info
GPU: NVIDIA RTX 4090
TensorRT-LLM 0.13
Question 1: How can I use the OpenAI API server to perform inference on a TensorRT engine model?
root@docker-desktop:/llm/tensorrt-llm-0.13.0/examples/apps# python3 openai_server.py /llm/tensorrt_llm/engines/glm/int8
[TensorRT-LLM] TensorRT-LLM version: 0.13.0
[TensorRT-LLM][INFO] Engine version 0.13.0 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] MPI size: 1, MPI local size: 1, rank: 0
[TensorRT-LLM][INFO] Rank 0 is using GPU 0
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 1
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 1
[TensorRT-LLM][INFO] TRTGptModel maxBeamWidth: 1
[TensorRT-LLM][INFO] TRTGptModel maxSequenceLen: 2048
[TensorRT-LLM][INFO] TRTGptModel maxDraftLen: 0
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: (2048) 40
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 0
[TensorRT-LLM][INFO] TRTGptModel maxNumTokens: 2048
[TensorRT-LLM][INFO] TRTGptModel maxInputLen: 2047 = min(maxSequenceLen - 1, maxNumTokens) since context FMHA and usePackedInput are enabled
[TensorRT-LLM][INFO] TRTGptModel If model type is encoder, maxInputLen would be reset in trtEncoderModel to maxInputLen: min(maxSequenceLen, maxNumTokens).
[TensorRT-LLM][INFO] Capacity Scheduler Policy: GUARANTEED_NO_EVICT
[TensorRT-LLM][INFO] Context Chunking Scheduler Policy: None
[TensorRT-LLM][INFO] Loaded engine size: 10194 MiB
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 196.76 MiB for execution context memory.
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 10184 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 648.06 KB GPU memory for runtime buffers.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 2.98 MB GPU memory for decoder.
[TensorRT-LLM][INFO] Memory usage when calculating max tokens in paged kv cache: total: 23.99 GiB, available: 12.28 GiB
[TensorRT-LLM][INFO] Number of blocks in KV cache primary pool: 4529
[TensorRT-LLM][INFO] Number of blocks in KV cache secondary pool: 0, onboard blocks to primary memory before reuse: true
[TensorRT-LLM][INFO] Max KV cache pages per sequence: 32
[TensorRT-LLM][INFO] Number of tokens per block: 64.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 11.06 GiB for max tokens in paged KV cache (289856).
[10/19/2024-19:07:16] [TRT-LLM] [E] Failed to load tokenizer from /llm/tensorrt_llm/engines/glm/int8: Unrecognized model in /llm/tensorrt_llm/engines/glm/int8. Should have a model_type key in its config.json, or contain one of the following strings in its name: albert, align, altclip, audio-spectrogram-transformer, autoformer, bark, bart, beit, bert, bert-generation, big_bird, bigbird_pegasus, biogpt, bit, blenderbot, blenderbot-small, blip, blip-2, bloom, bridgetower, bros, camembert, canine, chinese_clip, chinese_clip_vision_model, clap, clip, clip_vision_model, clipseg, clvp, code_llama, codegen, cohere, conditional_detr, convbert, convnext, convnextv2, cpmant, ctrl, cvt, data2vec-audio, data2vec-text, data2vec-vision, dbrx, deberta, deberta-v2, decision_transformer, deformable_detr, deit, depth_anything, deta, detr, dinat, dinov2, distilbert, donut-swin, dpr, dpt, efficientformer, efficientnet, electra, encodec, encoder-decoder, ernie, ernie_m, esm, falcon, fastspeech2_conformer, flaubert, flava, fnet, focalnet, fsmt, funnel, fuyu, gemma, gemma2, git, glpn, gpt-sw3, gpt2, gpt_bigcode, gpt_neo, gpt_neox, gpt_neox_japanese, gptj, gptsan-japanese, graphormer, grounding-dino, groupvit, hubert, ibert, idefics, idefics2, imagegpt, informer, instructblip, instructblipvideo, jamba, jetmoe, jukebox, kosmos-2, layoutlm, layoutlmv2, layoutlmv3, led, levit, lilt, llama, llava, llava-next-video, llava_next, longformer, longt5, luke, lxmert, m2m_100, mamba, marian, markuplm, mask2former, maskformer, maskformer-swin, mbart, mctct, mega, megatron-bert, mgp-str, mistral, mixtral, mobilebert, mobilenet_v1, mobilenet_v2, mobilevit, mobilevitv2, mpnet, mpt, mra, mt5, musicgen, musicgen_melody, mvp, nat, nezha, nllb-moe, nougat, nystromformer, olmo, oneformer, open-llama, openai-gpt, opt, owlv2, owlvit, paligemma, patchtsmixer, patchtst, pegasus, pegasus_x, perceiver, persimmon, phi, phi3, pix2struct, plbart, poolformer, pop2piano, prophetnet, pvt, pvt_v2, qdqbert, qwen2, qwen2_moe, rag, realm, recurrent_gemma, reformer, regnet, rembert, resnet, retribert, roberta, roberta-prelayernorm, roc_bert, roformer, rt_detr, rt_detr_resnet, rwkv, sam, seamless_m4t, seamless_m4t_v2, segformer, seggpt, sew, sew-d, siglip, siglip_vision_model, speech-encoder-decoder, speech_to_text, speech_to_text_2, speecht5, splinter, squeezebert, stablelm, starcoder2, superpoint, swiftformer, swin, swin2sr, swinv2, switch_transformers, t5, table-transformer, tapas, time_series_transformer, timesformer, timm_backbone, trajectory_transformer, transfo-xl, trocr, tvlt, tvp, udop, umt5, unispeech, unispeech-sat, univnet, upernet, van, video_llava, videomae, vilt, vipllava, vision-encoder-decoder, vision-text-dual-encoder, visual_bert, vit, vit_hybrid, vit_mae, vit_msn, vitdet, vitmatte, vits, vivit, wav2vec2, wav2vec2-bert, wav2vec2-conformer, wavlm, whisper, xclip, xglm, xlm, xlm-prophetnet, xlm-roberta, xlm-roberta-xl, xlnet, xmod, yolos, yoso
Traceback (most recent call last):
File "/llm/tensorrt-llm-0.13.0/examples/apps/openai_server.py", line 486, in
entrypoint()
File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1157, in call
return self.main(args, kwargs)
File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1078, in main
rv = self.invoke(ctx)
File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, ctx.params)
File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
File "/llm/tensorrt-llm-0.13.0/examples/apps/openai_server.py", line 475, in entrypoint
hf_tokenizer = AutoTokenizer.from_pretrained(tokenizer or model_dir)
File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/tokenization_auto.py", line 846, in from_pretrained
config = AutoConfig.from_pretrained(
File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/configuration_auto.py", line 997, in from_pretrained
raise ValueError(
ValueError: Unrecognized model in /llm/tensorrt_llm/engines/glm/int8. Should have a model_type key in its config.json, or contain one of the following strings in its name: albert, align, altclip, audio-spectrogram-transformer, autoformer, bark, bart, beit, bert, bert-generation, big_bird, bigbird_pegasus, biogpt, bit, blenderbot, blenderbot-small, blip, blip-2, bloom, bridgetower, bros, camembert, canine, chinese_clip, chinese_clip_vision_model, clap, clip, clip_vision_model, clipseg, clvp, code_llama, codegen, cohere, conditional_detr, convbert, convnext, convnextv2, cpmant, ctrl, cvt, data2vec-audio, data2vec-text, data2vec-vision, dbrx, deberta, deberta-v2, decision_transformer, deformable_detr, deit, depth_anything, deta, detr, dinat, dinov2, distilbert, donut-swin, dpr, dpt, efficientformer, efficientnet, electra, encodec, encoder-decoder, ernie, ernie_m, esm, falcon, fastspeech2_conformer, flaubert, flava, fnet, focalnet, fsmt, funnel, fuyu, gemma, gemma2, git, glpn, gpt-sw3, gpt2, gpt_bigcode, gpt_neo, gpt_neox, gpt_neox_japanese, gptj, gptsan-japanese, graphormer, grounding-dino, groupvit, hubert, ibert, idefics, idefics2, imagegpt, informer, instructblip, instructblipvideo, jamba, jetmoe, jukebox, kosmos-2, layoutlm, layoutlmv2, layoutlmv3, led, levit, lilt, llama, llava, llava-next-video, llava_next, longformer, longt5, luke, lxmert, m2m_100, mamba, marian, markuplm, mask2former, maskformer, maskformer-swin, mbart, mctct, mega, megatron-bert, mgp-str, mistral, mixtral, mobilebert, mobilenet_v1, mobilenet_v2, mobilevit, mobilevitv2, mpnet, mpt, mra, mt5, musicgen, musicgen_melody, mvp, nat, nezha, nllb-moe, nougat, nystromformer, olmo, oneformer, open-llama, openai-gpt, opt, owlv2, owlvit, paligemma, patchtsmixer, patchtst, pegasus, pegasus_x, perceiver, persimmon, phi, phi3, pix2struct, plbart, poolformer, pop2piano, prophetnet, pvt, pvt_v2, qdqbert, qwen2, qwen2_moe, rag, realm, recurrent_gemma, reformer, regnet, rembert, resnet, retribert, roberta, roberta-prelayernorm, roc_bert, roformer, rt_detr, rt_detr_resnet, rwkv, sam, seamless_m4t, seamless_m4t_v2, segformer, seggpt, sew, sew-d, siglip, siglip_vision_model, speech-encoder-decoder, speech_to_text, speech_to_text_2, speecht5, splinter, squeezebert, stablelm, starcoder2, superpoint, swiftformer, swin, swin2sr, swinv2, switch_transformers, t5, table-transformer, tapas, time_series_transformer, timesformer, timm_backbone, trajectory_transformer, transfo-xl, trocr, tvlt, tvp, udop, umt5, unispeech, unispeech-sat, univnet, upernet, van, video_llava, videomae, vilt, vipllava, vision-encoder-decoder, vision-text-dual-encoder, visual_bert, vit, vit_hybrid, vit_mae, vit_msn, vitdet, vitmatte, vits, vivit, wav2vec2, wav2vec2-bert, wav2vec2-conformer, wavlm, whisper, xclip, xglm, xlm, xlm-prophetnet, xlm-roberta, xlm-roberta-xl, xlnet, xmod, yolos, yoso
[TensorRT-LLM][INFO] Refreshed the MPI local session
Question 2: When I use the OpenAI server to perform inference on an HF model, where should I configure the trust_remote_code parameter?
root@docker-desktop:/llm/tensorrt-llm-0.13.0/examples/apps# python3 openai_server.py /llm/other/models/glm-4-9b-chat
[TensorRT-LLM] TensorRT-LLM version: 0.13.0
Loading Model: [1/2] Loading HF model to memory
Traceback (most recent call last):
File "/llm/tensorrt-llm-0.13.0/examples/apps/openai_server.py", line 486, in
entrypoint()
File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1157, in call
return self.main(args, kwargs)
File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1078, in main
rv = self.invoke(ctx)
File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
File "/llm/tensorrt-llm-0.13.0/examples/apps/openai_server.py", line 468, in entrypoint
llm = LLM(model_dir,
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/hlapi/llm.py", line 146, in init
self._build_model()
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/hlapi/llm.py", line 305, in _build_model
self._engine_dir = model_loader()
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/hlapi/llm_utils.py", line 1255, in call
return self._build_model()
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/hlapi/llm_utils.py", line 1365, in _build_model
build_task(self.get_engine_dir())
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/hlapi/llm_utils.py", line 1340, in build_task
model_loader(engine_dir=engine_dir)
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/hlapi/llm_utils.py", line 960, in call
pipeline()
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/hlapi/llm_utils.py", line 902, in call
self.step_forward()
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/hlapi/llm_utils.py", line 931, in step_forward
self.step_handlers[self.counter]()
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/hlapi/llm_utils.py", line 1066, in _load_model_from_hf
model_cls = AutoModelForCausalLM.get_trtllm_model_class(
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/automodel.py", line 48, in get_trtllm_model_class
hf_config = transformers.AutoConfig.from_pretrained(
File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/configuration_auto.py", line 968, in from_pretrained
trust_remote_code = resolve_trust_remote_code(
File "/usr/local/lib/python3.10/dist-packages/transformers/dynamic_module_utils.py", line 640, in resolve_trust_remote_code
raise ValueError(
ValueError: Loading /llm/other/models/glm-4-9b-chat requires you to execute the configuration file in that repo on your local machine. Make sure you have read the code there to avoid malicious use, then set the option trust_remote_code=True to remove this error.
For the first question, I think you should provide a tokenizer path to openai_server.py, for example: python3 openai_server.py <your engine> --tokenizer <the tokenizer path>.
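Once the server starts with a valid tokenizer path, it exposes an OpenAI-compatible HTTP API that any OpenAI-style client can call. Below is a minimal client sketch; the host/port (localhost:8000) and the "model" value are assumptions, so adjust them to match how your openai_server.py instance is actually configured.

```python
# Hypothetical client call against the OpenAI-compatible server started above.
# The URL and model name are placeholders, not values taken from the logs.
import requests

url = "http://localhost:8000/v1/chat/completions"  # assumed default host/port
payload = {
    "model": "glm-int8-engine",  # placeholder; use whatever name your server expects
    "messages": [{"role": "user", "content": "Hello, who are you?"}],
    "max_tokens": 64,
}

response = requests.post(url, json=payload, timeout=120)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])
```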
For the second question, you can pass trust_remote_code=True when instantiating the LLM (link).
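A minimal sketch of that change is below, assuming the hlapi LLM constructor forwards trust_remote_code to transformers when it loads the HF checkpoint; the same keyword can be added to the LLM(...) call around line 468 of openai_server.py.

```python
# Minimal sketch, assuming LLM(...) accepts and forwards trust_remote_code.
# Only enable this after reviewing the custom modeling code in the checkpoint.
from tensorrt_llm.hlapi import LLM

llm = LLM(
    "/llm/other/models/glm-4-9b-chat",  # local HF checkpoint that ships custom code
    trust_remote_code=True,
)

# The LLM instance can then build the engine and generate directly, e.g.:
outputs = llm.generate(["Hello, who are you?"])
for output in outputs:
    print(output)
```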