NVIDIA / FasterTransformer

Transformer related optimization, including BERT, GPT
Apache License 2.0

Setting request_prompt_lengths causes the previous inference's response to leak #332

Closed · twaka closed this 1 year ago

twaka commented 1 year ago

Description

When using a GPT model on a T4 GPU with Triton server, setting request_prompt_lengths causes the previous inference's response to leak into the next one. In the second request, the end of the response contains tokens from the first request's response. test.py is pasted below.

root@6a8ce480b48d:~# python3 test.py
identity_test.py:53: DeprecationWarning: `np.bool` is a deprecated alias for the builtin `bool`. To silence this warning, use `bool` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.bool_` here.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  np.ones([input_start_ids.shape[0], 1]).astype(np.bool)
I0928 09:16:04.428523 2756 libfastertransformer.cc:1090] Start to forward
I0928 09:16:04.874056 2756 libfastertransformer.cc:1098] Stop to forward
[[[29744 40684   373  9393   287  2321   416   257  1448   286  4837
     422   262  2059   286  3442    11 14727    11   290   262  2059
     286  2669    13   383  1664   338  3037   318  1912   319  4572
    4673   290  2769  4673 16113    11   290   340   468   587   973
     284  4512   257  1271   286 11666  4430  3341    13   198   198
     464  1664   338  3037   318  1912   319  4572  4673   290  2769
    4673 16113    11   290   340   468   587   973   284  4512   257
    1271   286 11666  4430  3341    13   198   198   464  1664   338
    3037   318  1912   319  4572  4673   290  2769  4673 16113    11
     290   340   468   587   614   329   262  1613  1936   812    13
     198   464  1664   468   587  1498   284  1663   663  6426   379
     257  2494]]]
Deeplearning was founded in 2012 by a group of researchers from the University of California, Berkeley, and the University of Washington. The company's technology is based on machine learning and deep learning algorithms, and it has been used to train a number of artificial intelligence systems.

The company's technology is based on machine learning and deep learning algorithms, and it has been used to train a number of artificial intelligence systems.

The company's technology is based on machine learning and deep learning algorithms, and it has been year for the past five years.
The company has been able to grow its revenue at a rate
I0928 09:16:04.877916 2756 libfastertransformer.cc:1090] Start to forward
I0928 09:16:05.232214 2756 libfastertransformer.cc:1098] Stop to forward
[[[29744 40684   318   257  1049   835   284  2193   546   262   995
    1088   345    13    13    13    13    13    13    13    13    13
      13    13    13    13    13    13    13    13    13    13    13
      13    13    13    13    13    13    13    13    13    13    13
      13    13    13    13    13    13    13    13    13    13    13
      13    13    13    13    13    13    13    13    13    13    13
      13    13    13    13    13    13    13    13    13    13    13
      13    13    13    13    13    13    13    13    13    13    13
      13    13    13    13    13    13    13    13    13    13    13
      13    13    13    13   614   329   262  1613  1936   812    13
     198   464  1664   468   587  1498   284  1663   663  6426   379
     257  2494]]]
Deeplearning is a great way to learn about the world around you.......................................................................................... year for the past five years.
The company has been able to grow its revenue at a rate

Reproduced Steps

Prepare the GPT model by following the guide: https://github.com/triton-inference-server/fastertransformer_backend/blob/main/docs/gpt_guide.md

git clone https://github.com/triton-inference-server/fastertransformer_backend.git
cd fastertransformer_backend
export WORKSPACE=$(pwd)
export CONTAINER_VERSION=22.07
export TRITON_DOCKER_IMAGE=triton_with_ft:${CONTAINER_VERSION}
docker build --rm   \
    --build-arg TRITON_VERSION=${CONTAINER_VERSION}   \
    -t ${TRITON_DOCKER_IMAGE} \
    -f docker/Dockerfile \
    .
docker run -it --rm --shm-size=1g --gpus=all -v ${WORKSPACE}:${WORKSPACE} -w ${WORKSPACE} ${TRITON_DOCKER_IMAGE} bash

Now, inside the Docker container:

export WORKSPACE=$(pwd)
export SRC_MODELS_DIR=${WORKSPACE}/models
git clone https://github.com/NVIDIA/FasterTransformer.git # Used for converting the checkpoint and the Triton output
wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json -P models
wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt -P models
wget --content-disposition https://api.ngc.nvidia.com/v2/models/nvidia/megatron_lm_345m/versions/v0.0/zip -O megatron_lm_345m_v0.0.zip
mkdir -p ${SRC_MODELS_DIR}/megatron-models/345m
unzip megatron_lm_345m_v0.0.zip -d ${SRC_MODELS_DIR}/megatron-models/345m
export PYTHONPATH=$PWD/FasterTransformer/:$PYTHONPATH
python3 ${WORKSPACE}/FasterTransformer/examples/pytorch/gpt/utils/megatron_ckpt_convert.py \
        -i ${SRC_MODELS_DIR}/megatron-models/345m/release/ \
        -o ${WORKSPACE}/all_models/gpt/fastertransformer/1 \
        --trained-tensor-parallel-size 1 \
        --infer-gpu-num 1 \
        --head-num 16

Edit model_checkpoint_path in config.pbtxt and run the server:
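For reference, the relevant entry in all_models/gpt/fastertransformer/config.pbtxt looks roughly like the following; the path here is only an example and should point at the converted checkpoint directory produced in the previous step.

parameters {
  key: "model_checkpoint_path"
  value: {
    string_value: "/workspace/all_models/gpt/fastertransformer/1/1-gpu"
  }
}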

CUDA_VISIBLE_DEVICES=0 mpirun -n 1 --allow-run-as-root /opt/tritonserver/bin/tritonserver  --model-repository=${WORKSPACE}/all_models/gpt &

Send requests with request_prompt_lengths using the following Python script, test.py:

import numpy as np
import tritonclient.grpc as grpcclient
import tritonclient.http as httpclient
from tritonclient.utils import np_to_triton_dtype

from tools.utils.gpt_token_encoder import get_encoder
tokenizer = get_encoder("models/gpt2-vocab.json", "models/gpt2-merges.txt")

def prepare_tensor(name, input, protocol):
    client_util = httpclient if protocol == "http" else grpcclient
    t = client_util.InferInput(
        name, input.shape, np_to_triton_dtype(input.dtype))
    t.set_data_from_numpy(input)
    return t

def create_inference_server_client(protocol, url, concurrency, verbose):
    client_util = httpclient if protocol == "http" else grpcclient
    if protocol == "http":
        return client_util.InferenceServerClient(url,
                                                concurrency=concurrency,
                                                verbose=verbose)
    elif protocol == "grpc":
        return client_util.InferenceServerClient(url,
                                                verbose=verbose)

def send_requests(text, end_id=50256):
    protocol = "http"
    url = "localhost:8000"
    model_name = "fastertransformer"
    with create_inference_server_client(protocol,
                                        url,
                                        concurrency=1,
                                        verbose=False) as client:
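        # Tokenize the prompt and build the per-request input and output length tensors.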
        input_start_ids = np.array([tokenizer.encode(text)]).astype(np.uint32)
        input_len = input_start_ids.shape[1] * np.ones([input_start_ids.shape[0], 1]).astype(np.uint32)
        output_len = 100 * np.ones([input_start_ids.shape[0], 1]).astype(np.uint32)

        runtime_top_k = (
            1 * np.ones([input_start_ids.shape[0], 1])).astype(np.uint32)
        runtime_top_p = 1.0 * \
            np.ones([input_start_ids.shape[0], 1]).astype(np.float32)
        beam_search_diversity_rate = 0.0 * \
            np.ones([input_start_ids.shape[0], 1]).astype(np.float32)
        temperature = 1.0 * \
            np.ones([input_start_ids.shape[0], 1]).astype(np.float32)
        len_penalty = 1.0 * \
            np.ones([input_start_ids.shape[0], 1]).astype(np.float32)
        repetition_penalty = 1.0 * \
            np.ones([input_start_ids.shape[0], 1]).astype(np.float32)
        random_seed = 0 * \
            np.ones([input_start_ids.shape[0], 1]).astype(np.uint64)
        # np.bool is deprecated since NumPy 1.20; the builtin bool avoids the warning above.
        is_return_log_probs = True * \
            np.ones([input_start_ids.shape[0], 1]).astype(bool)
        beam_width = (1 *
                      np.ones([input_start_ids.shape[0], 1])).astype(np.uint32)
        start_ids = 50256 * \
            np.ones([input_start_ids.shape[0], 1]).astype(np.uint32)
        end_ids = end_id * \
            np.ones([input_start_ids.shape[0], 1]).astype(np.uint32)
        bad_words_list = np.concatenate([np.zeros([input_start_ids.shape[0], 1, 1]).astype(
            np.int32), (-1 * np.ones([input_start_ids.shape[0], 1, 1])).astype(np.int32)], axis=1)
        stop_word_list = np.concatenate([np.zeros([input_start_ids.shape[0], 1, 1]).astype(
            np.int32), (-1 * np.ones([input_start_ids.shape[0], 1, 1])).astype(np.int32)], axis=1)
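        # Prompt-tuning inputs: a dummy 20-token soft prompt. Supplying these
        # request_prompt_* tensors is what triggers the leak described above.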
        request_prompt_embedding = 0.5 * np.ones([input_start_ids.shape[0], 20, 4096]).astype(np.float16)
        request_prompt_lengths = 20 * np.ones([input_start_ids.shape[0], 1]).astype(np.uint32)
        request_prompt_type = 0 * np.ones([input_start_ids.shape[0], 1]).astype(np.uint32)

        input_data = input_start_ids
        inputs = [
            prepare_tensor("input_ids", input_data, protocol),
            prepare_tensor("input_lengths", input_len, protocol),
            prepare_tensor("request_output_len", output_len, protocol),
            prepare_tensor("runtime_top_k", runtime_top_k, protocol),
            prepare_tensor("runtime_top_p", runtime_top_p, protocol),
            prepare_tensor("beam_search_diversity_rate",
                           beam_search_diversity_rate, protocol),
            prepare_tensor("temperature", temperature, protocol),
            prepare_tensor("len_penalty", len_penalty, protocol),
            prepare_tensor("repetition_penalty", repetition_penalty, protocol),
            prepare_tensor("random_seed", random_seed, protocol),
            prepare_tensor("is_return_log_probs", is_return_log_probs, protocol),
            prepare_tensor("beam_width", beam_width, protocol),
            prepare_tensor("start_id", start_ids, protocol),
            prepare_tensor("end_id", end_ids, protocol),
            prepare_tensor("bad_words_list", bad_words_list, protocol),
            prepare_tensor("stop_words_list", stop_word_list, protocol),
            prepare_tensor("request_prompt_embedding", request_prompt_embedding, protocol),
            prepare_tensor("request_prompt_lengths", request_prompt_lengths, protocol),
            prepare_tensor("request_prompt_type", request_prompt_type, protocol)
        ]

        result = client.infer(model_name, inputs)
        print(result.as_numpy("output_ids"))
        print(tokenizer.decode(result.as_numpy("output_ids")[0][0]))

send_requests("Deeplearning was", end_id=50256)
send_requests("Deeplearning is", end_id=13)
twaka commented 1 year ago

After digging into the issue, I noticed possible causes of the bug in ParallelGpt.cc, in case it helps.

https://github.com/NVIDIA/FasterTransformer/blob/bc077a9078b99244c69cfe0dc44af86fd974bc71/src/fastertransformer/models/multi_gpu_gpt/ParallelGpt.cc#L1002-L1009 Since the prompt is prepended, this code might need to account for the prompt lengths, as GPTJ and GPTNeoX do.
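Note that in the outputs above the stale tail is exactly 20 tokens long, which matches request_prompt_lengths. As a toy Python sketch of the suspected mechanism (this is not the FT code, only an illustration): if the output buffer is reused across requests and the copy length omits the soft-prompt length, the last prompt_len slots are never refreshed, so both responses end with the same stale tokens and the second response appears to repeat the first one's tail.

import numpy as np

prompt_len, total_len = 4, 12
buf = np.arange(100, 100 + total_len)   # stale contents left over from an earlier run

def respond(new_tokens, buf):
    n = total_len - prompt_len          # buggy length: the soft-prompt slots are forgotten
    buf[:n] = new_tokens[:n]            # only the leading slots are refreshed
    return buf.copy()

print(respond(np.arange(1, 13), buf))   # tail [108 109 110 111] is stale
print(respond(np.arange(51, 63), buf))  # the same stale tail appears again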

https://github.com/NVIDIA/FasterTransformer/blob/bc077a9078b99244c69cfe0dc44af86fd974bc71/src/fastertransformer/models/multi_gpu_gpt/ParallelGpt.cc#L1334-L1336 I think prefix_soft_prompt_embedding is not defined as a parameter in the docs, which makes max_prefix_soft_prompt_length always 0.

byshiue commented 1 year ago

This issue is fixed in the FT v5.2 release and the FT backend 1.3 release. Please try again.

twaka commented 1 year ago

Thank you very much. v5.2 solves the issue.