NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

Cannot process new request: [TensorRT-LLM][ERROR] Assertion failed: LoRA task 0 not found in cache. Please send LoRA weights with request (/home/jenkins/agent/workspace/LLM/main/L0_PostMerge/llm/cpp/tensorrt_llm/batch_manager/peftCacheManager.cpp:182) #1552

Open sleepwalker2017 opened 6 months ago

sleepwalker2017 commented 6 months ago

System Info

GPU: 2x A30, TRT-LLM branch: main, commit id: 66ef1df492f7bc9c8eeb01d7e14db01838e3f0bd

Who can help?

No response

Reproduction

MODEL_CHECKPOINT=/data/models/vicuna-7b-v1.5/
CONVERTED_CHECKPOINT=Llama-7b-hf-ckpt

DTYPE=float16
TP=2

echo "step 1: convert checkpoint"
# Build lora enabled engine
python convert_checkpoint.py --model_dir ${MODEL_CHECKPOINT} \
                              --output_dir ${CONVERTED_CHECKPOINT} \
                              --dtype ${DTYPE} \
                              --tp_size ${TP} \
                              --pp_size 1
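
As a quick sanity check, the converted checkpoint directory should contain one shard per TP rank; assuming the usual convert_checkpoint.py output layout (names not confirmed against this exact commit):

# With --tp_size 2 the converter is expected to write a config.json
# plus one weights file per rank (rank0/rank1).
ls ${CONVERTED_CHECKPOINT}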

SOURCE_LORA=/data/Llama2-Chinese-7b-Chat-LoRA/
#SOURCE_LORA=/data/llama2-7b-lora.tar.gz
CPP_LORA=chinese-llama-2-lora-7b-cpp

EG_DIR=/tmp/lora-eg

PP=1
MAX_LEN=1024
MAX_BATCH=16
TOKENIZER=/data/models/vicuna-7b-v1.5/
LORA_ENGINE=Llama-2-7b-hf-engine
NUM_LORAS=(8)
NUM_REQUESTS=200

echo "step 2: trtllm-build"
trtllm-build \
    --checkpoint_dir ${CONVERTED_CHECKPOINT} \
    --output_dir ${LORA_ENGINE} \
    --max_batch_size ${MAX_BATCH} \
    --max_input_len $MAX_LEN \
    --max_output_len $MAX_LEN \
    --gpt_attention_plugin float16 \
    --paged_kv_cache enable \
    --remove_input_padding enable \
    --gemm_plugin float16 \
    --lora_plugin float16 \
    --use_paged_context_fmha enable \
    --use_custom_all_reduce disable \
    --lora_target_modules attn_qkv attn_dense mlp_h_to_4h mlp_gate mlp_4h_to_h
echo "step 3: Convert LoRA to cpp format"
# Convert LoRA to cpp format
python ../hf_lora_convert.py \
    -i $SOURCE_LORA \
    --storage-type $DTYPE \
    -o $CPP_LORA
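
For reference, the cpp LoRA format produced here is a small directory of numpy tensors; assuming the usual hf_lora_convert.py output names (worth verifying on the checked-out commit), a quick inspection looks like:

# The converter is expected to emit two numpy files that the C++ runtime
# loads as the adapter's per-module config table and packed weights.
ls ${CPP_LORA}
# model.lora_config.npy  model.lora_weights.npy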

echo "step 4: prepare dataset for non-lora requests"
mkdir -p $EG_DIR/data
python ../../benchmarks/cpp/prepare_dataset.py \
    --output ${EG_DIR}/data/token-norm-dist.json \
    --request-rate -1 \
    --time-delay-dist constant \
    --tokenizer $TOKENIZER \
    token-norm-dist \
    --num-requests $NUM_REQUESTS \
    --input-mean 256 --input-stdev 16 --output-mean 128 --output-stdev 24

echo "step 5: prepare dataset for lora requests"
for nloras in ${NUM_LORAS[@]}; do
    python ../../benchmarks/cpp/prepare_dataset.py \
        --output "${EG_DIR}/data/token-norm-dist-lora-${nloras}.json" \
        --request-rate -1 \
        --time-delay-dist constant \
        --rand-task-id 0 $(( $nloras - 1 )) \
        --tokenizer $TOKENIZER \
        token-norm-dist \
        --num-requests $NUM_REQUESTS \
        --input-mean 256 --input-stdev 16 --output-mean 128 --output-stdev 24
done
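
Note that --rand-task-id 0 $(( nloras - 1 )) tags each request with a LoRA task ID in [0, nloras - 1], but nothing above stages any weights for those IDs. The benchmarks/cpp README includes an extra step along these lines (generate_rand_loras.py and its argument order are assumed from that README and should be checked against the checked-out commit):

echo "step 5b: generate per-task LoRA weights"
# Derive one cpp-format adapter per task ID referenced by the dataset,
# so the benchmark can populate the PEFT cache on first use.
python ../../benchmarks/cpp/utils/generate_rand_loras.py ${CPP_LORA} ${EG_DIR}/loras ${NUM_LORAS}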

mkdir -p ${EG_DIR}/log-base-lora

NUM_LAYERS=32
NUM_LORA_MODS=8
MAX_LORA_RANK=8
EOS_ID=-1
mpirun -n ${TP} --allow-run-as-root --output-filename ${EG_DIR}/log-base-lora \
    ../../cpp/build/benchmarks/gptManagerBenchmark \
    --engine_dir $LORA_ENGINE \
    --type IFB \
    --dataset "${EG_DIR}/data/token-norm-dist-lora-8.json" \
    --lora_host_cache_bytes 8589934592 \
    --lora_num_device_mod_layers $(( 8 * $NUM_LAYERS * $NUM_LORA_MODS * $MAX_LORA_RANK )) \
    --kv_cache_free_gpu_mem_fraction 0.80 \
    --log_level info \
    --eos_id ${EOS_ID}
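
No source of LoRA weights is passed to the benchmark above, so the PEFT cache is empty when the first request tagged with a task ID arrives; that is exactly what the assertion below reports. Assuming the gptManagerBenchmark at this commit supports the flag (it appears in the benchmark README), pointing it at the per-task weight directories resolves this, e.g. by ending the command with:

    --eos_id ${EOS_ID} \
    --lora_dir ${EG_DIR}/loras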

Expected behavior

gptManagerBenchmark runs the LoRA dataset to completion.

actual behavior

[TensorRT-LLM][ERROR] Cannot process new request: [TensorRT-LLM][ERROR] Assertion failed: LoRA task 0 not found in cache. Please send LoRA weights with request (/home/jenkins/agent/workspace/LLM/main/L0_PostMerge/llm/cpp/tensorrt_llm/batch_manager/peftCacheManager.cpp:182)
1       0x5572c6dedde9 tensorrt_llm::common::throwRuntimeError(char const*, int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 100
2       0x7f56c6cd5378 /data/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so(+0x69c378) [0x7f56c6cd5378]
3       0x7f56c8c3f03f tensorrt_llm::batch_manager::TrtGptModelInflightBatching::updatePeftCache(std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> const&) + 127
4       0x7f56c8c03078 tensorrt_llm::batch_manager::GptManager::fetchNewRequests() + 1464
5       0x7f56c8c0342a tensorrt_llm::batch_manager::GptManager::decoupled_execution_loop() + 170
6       0x7f56c64dd253 /lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7f56c64dd253]
7       0x7f56c624cac3 /lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7f56c624cac3]
8       0x7f56c62de850 /lib/x86_64-linux-gnu/libc.so.6(+0x126850) [0x7f56c62de850]

additional notes

none

VincentJing commented 5 months ago

For the LoRA cpp format, you can refer to this link. For the benchmark script, you can refer to this.