Hao-YunDeng opened this issue 1 month ago
System Info
- GPU: NVIDIA A100
- Driver Version: 545.23.08
- CUDA: 12.3
- Versions:
  - https://github.com/NVIDIA/TensorRT-LLM.git (bf0a5af)
  - https://github.com/triton-inference-server/tensorrtllm_backend.git (ae52bce3ed8ecea468a16483e0dacd3d156ae4fe)
- Model: zephyr-7b-beta
Who can help?
@kaiyux
Information
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)

Reproduction
step 1:
python3 ./tensorrtllm_backend/tensorrt_llm/examples/llama/convert_checkpoint.py --model_dir zephyr-7b-beta --output_dir zephyr-7b-beta-converted --dtype float16
step 2:
trtllm-build --checkpoint_dir zephyr-7b-beta-converted \
    --output_dir zephyr-7b-beta-trt-engine \
    --remove_input_padding enable \
    --context_fmha enable \
    --gpt_attention_plugin float16 \
    --gemm_plugin float16 \
    --paged_kv_cache enable \
    --max_num_tokens 65536 \
    --max_batch_size 32 \
    --max_input_len 16384 \
    --strongly_typed
step 3 tensorrtllm_backend parameters:
MODEL_PATH=zephyr-7b-beta
MODEL_PIPELINE_NAME=triton_model_repo
MAX_BATCH_SIZE=32
ENGINE_PATH=zephyr-7b-beta-trt-engine
MAX_ATTENTION_WINDOW_SIZE=4096
KV_CACHE_FREE_GPU_MEM_FRACTION=0.5
batch_scheduler_policy=guaranteed_no_evict

python3 tools/fill_template.py -i triton_model_repo/preprocessing/config.pbtxt tokenizer_dir:zephyr-7b-beta/,triton_max_batch_size:${MAX_BATCH_SIZE},preprocessing_instance_count:1
python3 tools/fill_template.py -i ${MODEL_PIPELINE_NAME}/postprocessing/config.pbtxt tokenizer_dir:${MODEL_PATH},triton_max_batch_size:${MAX_BATCH_SIZE},postprocessing_instance_count:1
python3 tools/fill_template.py -i ${MODEL_PIPELINE_NAME}/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:${MAX_BATCH_SIZE},decoupled_mode:False,bls_instance_count:1,accumulate_tokens:False
python3 tools/fill_template.py -i ${MODEL_PIPELINE_NAME}/ensemble/config.pbtxt triton_max_batch_size:${MAX_BATCH_SIZE}
python3 tools/fill_template.py -i ${MODEL_PIPELINE_NAME}/tensorrt_llm/config.pbtxt triton_max_batch_size:${MAX_BATCH_SIZE},decoupled_mode:False,max_beam_width:1,engine_dir:${ENGINE_PATH},max_attention_window_size:${MAX_ATTENTION_WINDOW_SIZE},kv_cache_free_gpu_mem_fraction:${KV_CACHE_FREE_GPU_MEM_FRACTION},exclude_input_in_output:True,enable_kv_cache_reuse:False,batching_strategy:inflight_fused_batching,max_queue_delay_microseconds:0,batch_scheduler_policy:${batch_scheduler_policy}
python3 scripts/launch_triton_server.py --world_size=1 --model_repo=/code/tensorrtllm_backend/${MODEL_PIPELINE_NAME} --http_port=8081 --log --log-file ${MODEL_PIPELINE_NAME}_triton_log.txt
step 4:
A correct curl test (run it in a loop so that the bad request below can arrive while these requests are still in flight):
curl -X POST http://127.0.0.1:8888/v2/models/ensemble/generate -d '{"text_input": "What is machine learning?", "max_tokens": 20, "bad_words": "", "stop_words": "", "pad_id": 2, "end_id": 2, "top_p":1, "top_k":0, "temperature":0.7}'
At the same time, send a bad curl request with zero temperature:
curl -X POST http://127.0.0.1:8888/v2/models/ensemble/generate -d '{"text_input": "What is machine learning?", "max_tokens": 20, "bad_words": "", "stop_words": "", "pad_id": 2, "end_id": 2, "top_p":1, "top_k":0, "temperature":0.0}'
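Until zero temperature is rejected at request validation time instead of deep inside the forward pass, a client-side guard can keep such requests out of the shared batch. A minimal sketch (the helper name and defaults are assumptions; the temperature limit mirrors the (0, FLT_MAX] check reported from fillBuffers.h):

```python
# Hypothetical client-side guard: reject sampling parameters that the
# runtime would assert on, before the request reaches the server.
FLT_MAX = 3.4028234663852886e38  # upper bound of the runtime's range check

def validate_sampling_params(params: dict) -> list[str]:
    """Return a list of problems; an empty list means the request looks safe."""
    errors = []
    temperature = params.get("temperature", 1.0)
    # Mirrors the (0, FLT_MAX] limit: exactly zero is rejected.
    if not (0.0 < temperature <= FLT_MAX):
        errors.append(f"temperature {temperature} out of limits (0, FLT_MAX]")
    top_p = params.get("top_p", 1.0)
    if not (0.0 <= top_p <= 1.0):
        errors.append(f"top_p {top_p} out of [0, 1]")
    return errors
```

Calling this before issuing the POST turns the server-side crash into a local, per-request rejection.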
Expected behavior
The good curl request should get a response:
{"context_logits":0.0,"cum_log_probs":0.0,"generation_logits":0.0,"model_name":"ensemble","model_version":"1","output_log_probs":[0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0],"sequence_end":false,"sequence_id":0,"sequence_start":false,"text_output":"\n machinery learning is a subset of artificial intelligence that focuses on enabling computer systems to automatically learn and improve"}
while only the bad one should return a 400 error.
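When the loop is driven from a script rather than curl, the pass/fail check can be automated. A small sketch, assuming the response shapes shown above (a text_output field on success, an error field on failure):

```python
import json

def check_response(status_code: int, body: str) -> str:
    """Return the generated text on success; raise on an error response."""
    if status_code != 200:
        raise RuntimeError(f"HTTP {status_code}: {body}")
    payload = json.loads(body)
    if "error" in payload:
        raise RuntimeError(payload["error"])
    return payload["text_output"]
```

With this helper, a good request that unexpectedly raises is exactly the bug reported below.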
Actual behavior
Both the good and the bad requests get a 400 error:
400 Client Error: Bad Request for url: http://127.0.0.1:8888/v2/models/ensemble/generate
resp: {"error":"in ensemble 'ensemble', Encountered error for requestId 1627051478: Encountered an error in forward function: [TensorRT-LLM][ERROR] Assertion failed: temperature penalty param (0.000000) is out of limits (0.000000, 340282346638528859811704183484516925440.000000] (/app/tensorrt_llm/cpp/tensorrt_llm/layers/fillBuffers.h:64)
1 0x7f3978267f71 tensorrt_llm::common::throwRuntimeError(char const*, int, std::string const&) + 102
2 0x7f38af8f0624 void tensorrt_llm::layers::FillBuffers::operator()(std::optional<std::vector<float, std::allocator > > const&, float, std::vector<float, std::allocator >&, float*, int const*, std::pair<float, float> const&, std::string const&) const + 324
3 0x7f38af8f0c47 tensorrt_llm::layers::DynamicDecodeLayer::setupPenalties(int, int const*, tensorrt_llm::layers::DynamicDecodeLayer::SetupParams const&) + 1223
4 0x7f38af901b17 tensorrt_llm::layers::DynamicDecodeLayer::setup(int, int, int const*, tensorrt_llm::layers::DynamicDecodeLayer::SetupParams const&) + 167
5 0x7f38af96f04c tensorrt_llm::runtime::GptDecoder::setup(tensorrt_llm::runtime::SamplingConfig const&, unsigned long, int, std::optional<std::shared_ptr > const&) + 572
6 0x7f38af97dba3 tensorrt_llm::runtime::GptDecoderBatch::newRequests(std::vector<int, std::allocator > const&, std::vector<tensorrt_llm::runtime::decoder_batch::Request, std::allocator > const&, std::vector<tensorrt_llm::runtime::SamplingConfig, std::allocator > const&) + 483
7 0x7f38afc5fadf tensorrt_llm::batch_manager::TrtGptModelInflightBatching::setupDecoderStep(std::vector<std::shared_ptr, std::allocator<std::shared_ptr > > const&) + 719
8 0x7f38afc61d0a tensorrt_llm::batch_manager::TrtGptModelInflightBatching::forward(std::list<std::shared_ptr, std::allocator<std::shared_ptr > >&) + 5434
9 0x7f38afc12854 tensorrt_llm::batch_manager::GptManager::step(std::list<std::shared_ptr, std::allocator<std::shared_ptr > >&, std::set<unsigned long, std::less, std::allocator >&) + 36
10 0x7f38afc1a984 tensorrt_llm::batch_manager::GptManager::decoupled_execution_loop() + 404
11 0x7f3989df2253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7f3989df2253]
12 0x7f3989b81ac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7f3989b81ac3]
13 0x7f3989c13850 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x126850) [0x7f3989c13850]"}
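The symptom is consistent with the assertion firing inside the fused decoder-setup/forward step rather than per request: whichever requests share the in-flight batch with the invalid one fail together. A toy illustration of that coupling (not the actual batch manager, just the failure shape):

```python
# Toy model of an in-flight batcher: parameter setup for all requests in
# the batch runs in one step, so one invalid request fails the whole batch.
def forward_step(batch: list[dict]) -> dict[int, str]:
    for req in batch:
        # Mirrors the fillBuffers.h assertion: temperature must be > 0.
        if not req["temperature"] > 0.0:
            raise RuntimeError(
                f"temperature penalty param ({req['temperature']}) is out of limits"
            )
    return {req["id"]: "generated text" for req in batch}

def serve(batch: list[dict]) -> dict[int, str]:
    try:
        return forward_step(batch)
    except RuntimeError as err:
        # Every request in the batch gets the 400, good requests included.
        return {req["id"]: f"400: {err}" for req in batch}
```

Validating the sampling config when the request is enqueued, before it joins a batch, would confine the 400 to the offending request.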
Additional notes
None