NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0
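For orientation, recent releases also document a high-level Python LLM API along these lines (a minimal sketch; the model name and sampling values are placeholders, and the exact API surface depends on the installed TensorRT-LLM version):

# Minimal sketch of the documented high-level LLM API in recent releases;
# model name and sampling values are placeholders, not from this issue.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
for output in llm.generate(["What is machine learning?"], sampling_params):
    print(output.outputs[0].text)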

[Bug] Zero temperature curl request affects non-zero temperature requests #1632

Open Hao-YunDeng opened 1 month ago

Hao-YunDeng commented 1 month ago

System Info

GPU: NVIDIA A100
Driver Version: 545.23.08
CUDA: 12.3

Versions:

TensorRT-LLM: https://github.com/NVIDIA/TensorRT-LLM.git (commit bf0a5af)
tensorrtllm_backend: https://github.com/triton-inference-server/tensorrtllm_backend.git (commit ae52bce3ed8ecea468a16483e0dacd3d156ae4fe)

Model: zephyr-7b-beta

Who can help?

@kaiyux

Information

Tasks

Reproduction

step 1:

python3 ./tensorrtllm_backend/tensorrt_llm/examples/llama/convert_checkpoint.py --model_dir zephyr-7b-beta --output_dir zephyr-7b-beta-converted --dtype float16

step 2:

trtllm-build --checkpoint_dir zephyr-7b-beta-converted \
    --output_dir zephyr-7b-beta-trt-engine \
    --remove_input_padding enable \
    --context_fmha enable \
    --gpt_attention_plugin float16 \
    --gemm_plugin float16 \
    --paged_kv_cache enable \
    --max_num_tokens 65536 \
    --max_batch_size 32 \
    --max_input_len 16384 \
    --strongly_typed

step 3: tensorrtllm_backend parameters and server launch:

MODEL_PATH=zephyr-7b-beta
MODEL_PIPELINE_NAME=triton_model_repo
MAX_BATCH_SIZE=32
ENGINE_PATH=zephyr-7b-beta-trt-engine
MAX_ATTENTION_WINDOW_SIZE=4096
KV_CACHE_FREE_GPU_MEM_FRACTION=0.5
batch_scheduler_policy=guaranteed_no_evict

python3 tools/fill_template.py -i triton_model_repo/preprocessing/config.pbtxt tokenizer_dir:zephyr-7b-beta/,triton_max_batch_size:${MAX_BATCH_SIZE},preprocessing_instance_count:1
python3 tools/fill_template.py -i ${MODEL_PIPELINE_NAME}/postprocessing/config.pbtxt tokenizer_dir:${MODEL_PATH},triton_max_batch_size:${MAX_BATCH_SIZE},postprocessing_instance_count:1
python3 tools/fill_template.py -i ${MODEL_PIPELINE_NAME}/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:${MAX_BATCH_SIZE},decoupled_mode:False,bls_instance_count:1,accumulate_tokens:False
python3 tools/fill_template.py -i ${MODEL_PIPELINE_NAME}/ensemble/config.pbtxt triton_max_batch_size:${MAX_BATCH_SIZE}
python3 tools/fill_template.py -i ${MODEL_PIPELINE_NAME}/tensorrt_llm/config.pbtxt triton_max_batch_size:${MAX_BATCH_SIZE},decoupled_mode:False,max_beam_width:1,engine_dir:${ENGINE_PATH},max_attention_window_size:${MAX_ATTENTION_WINDOW_SIZE},kv_cache_free_gpu_mem_fraction:${KV_CACHE_FREE_GPU_MEM_FRACTION},exclude_input_in_output:True,enable_kv_cache_reuse:False,batching_strategy:inflight_fused_batching,max_queue_delay_microseconds:0,batch_scheduler_policy:${batch_scheduler_policy}

python3 scripts/launch_triton_server.py --world_size=1 --model_repo=/code/tensorrtllm_backend/${MODEL_PIPELINE_NAME} --http_port=8081 --log --log-file ${MODEL_PIPELINE_NAME}_triton_log.txt

step 4:

A correct curl test (run this in a loop so that you can send a bad request while it is in progress):

curl -X POST http://127.0.0.1:8888/v2/models/ensemble/generate -d '{"text_input": "What is machine learning?", "max_tokens": 20, "bad_words": "", "stop_words": "", "pad_id": 2, "end_id": 2, "top_p":1, "top_k":0, "temperature":0.7}'

At the same time, send a bad curl request with zero temperature:

curl -X POST http://127.0.0.1:8888/v2/models/ensemble/generate -d '{"text_input": "What is machine learning?", "max_tokens": 20, "bad_words": "", "stop_words": "", "pad_id": 2, "end_id": 2, "top_p":1, "top_k":0, "temperature":0.0}'
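For convenience, the two curl calls can be driven concurrently from a short script. The sketch below is only one way to do this (it assumes the Python requests library is installed; the URL and payloads are copied from the curl commands above, and the loop length is arbitrary): it keeps valid requests flowing while a single zero-temperature request is injected.

# Concurrent reproduction sketch; assumes `pip install requests`.
import threading
import requests

URL = "http://127.0.0.1:8888/v2/models/ensemble/generate"
BASE = {"text_input": "What is machine learning?", "max_tokens": 20,
        "bad_words": "", "stop_words": "", "pad_id": 2, "end_id": 2,
        "top_p": 1, "top_k": 0}

def good_loop(n=50):
    # Keep sending valid (temperature 0.7) requests.
    for _ in range(n):
        r = requests.post(URL, json={**BASE, "temperature": 0.7})
        print("good:", r.status_code)

t = threading.Thread(target=good_loop)
t.start()
# Inject a single zero-temperature request while the loop is running.
bad = requests.post(URL, json={**BASE, "temperature": 0.0})
print("bad:", bad.status_code, bad.text[:200])
t.join()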

Expected behavior

The good curl request should get a response:

{"context_logits":0.0,"cum_log_probs":0.0,"generation_logits":0.0,"model_name":"ensemble","model_version":"1","output_log_probs":[0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0],"sequence_end":false,"sequence_id":0,"sequence_start":false,"text_output":"\n machinery learning is a subset of artificial intelligence that focuses on enabling computer systems to automatically learn and improve"}

The bad request, in contrast, should return a 400 error.

Actual behavior

Both the good and the bad requests get a 400 error:

400 Client Error: Bad Request for url: http://127.0.0.1:8888/v2/models/ensemble/generate resp: {"error":"in ensemble 'ensemble', Encountered error for requestId 1627051478: Encountered an error in forward function: [TensorRT-LLM][ERROR] Assertion failed: temperature penalty param (0.000000) is out of limits (0.000000, 340282346638528859811704183484516925440.000000] (/app/tensorrt_llm/cpp/tensorrt_llm/layers/fillBuffers.h:64)\n1 0x7f3978267f71 tensorrt_llm::common::throwRuntimeError(char const, int, std::string const&) + 102\n2 0x7f38af8f0624 void tensorrt_llm::layers::FillBuffers::operator()(std::optional<std::vector<float, std::allocator > > const&, float, std::vector<float, std::allocator >&, float, int const, std::pair<float, float> const&, std::string const&) const + 324\n3 0x7f38af8f0c47 tensorrt_llm::layers::DynamicDecodeLayer::setupPenalties(int, int const, tensorrt_llm::layers::DynamicDecodeLayer::SetupParams const&) + 1223\n4 0x7f38af901b17 tensorrt_llm::layers::DynamicDecodeLayer::setup(int, int, int const*, tensorrt_llm::layers::DynamicDecodeLayer::SetupParams const&) + 167\n5 0x7f38af96f04c tensorrt_llm::runtime::GptDecoder::setup(tensorrt_llm::runtime::SamplingConfig const&, unsigned long, int, std::optional<std::shared_ptr > const&) + 572\n6 0x7f38af97dba3 tensorrt_llm::runtime::GptDecoderBatch::newRequests(std::vector<int, std::allocator > const&, std::vector<tensorrt_llm::runtime::decoder_batch::Request, std::allocator > const&, std::vector<tensorrt_llm::runtime::SamplingConfig, std::allocator > const&) + 483\n7 0x7f38afc5fadf tensorrt_llm::batch_manager::TrtGptModelInflightBatching::setupDecoderStep(std::vector<std::shared_ptr, std::allocator<std::shared_ptr > > const&) + 719\n8 0x7f38afc61d0a tensorrt_llm::batch_manager::TrtGptModelInflightBatching::forward(std::list<std::shared_ptr, std::allocator<std::shared_ptr > >&) + 5434\n9 0x7f38afc12854 tensorrt_llm::batch_manager::GptManager::step(std::list<std::shared_ptr, std::allocator<std::shared_ptr > >&, std::set<unsigned long, std::less, std::allocator >&) + 36\n10 0x7f38afc1a984 tensorrt_llm::batch_manager::GptManager::decoupled_execution_loop() + 404\n11 0x7f3989df2253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7f3989df2253]\n12 0x7f3989b81ac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7f3989b81ac3]\n13 0x7f3989c13850 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x126850) [0x7f3989c13850]"}
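The assertion in the trace says the temperature penalty must lie in (0, FLT_MAX], i.e. it must be strictly greater than zero. Because requests are fused into a single inflight batch, one invalid request makes the whole forward step fail, which is why concurrently batched valid requests also receive the 400 error rather than the backend rejecting only the offending request. Until that validation happens earlier, one possible client-side guard is to translate "temperature 0" (greedy intent) into top_k=1 before sending. The helper below is only a sketch; its name and the exact parameter mapping are my own choices, not part of TensorRT-LLM:

# Hypothetical client-side guard: the runtime rejects temperature == 0,
# so map the greedy intent to top_k=1 with a valid temperature instead.
import copy

def sanitize_sampling_params(payload: dict) -> dict:
    params = copy.deepcopy(payload)
    if params.get("temperature", 1.0) <= 0.0:
        params["temperature"] = 1.0  # any value in (0, FLT_MAX] passes the check
        params["top_k"] = 1          # argmax-only sampling approximates greedy decoding
        params["top_p"] = 1
    return params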

additional notes

None

dafu-wu commented 1 month ago

same issue @kaiyux

fan-niu commented 1 month ago

Same issue, can you help look into this issue? Thanks @kaiyux

Hao-YunDeng commented 1 month ago

@kaiyux any progress on this issue? Thanks

github-actions[bot] commented 3 days ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 15 days.