NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

Using Phi-2 with A100 (160GB) and Triton server 24.02 hangs indefinitely #1507

Open kelkarn opened 6 months ago

kelkarn commented 6 months ago

Environment

If applicable, please include the following:

CPU architecture: x86_64
CPU/Host memory size: 440 GiB

GPU properties

GPU name: A100
GPU memory size: 160 GB
I am using the Azure offering of this GPU: Standard NC48ads A100 v4 (48 vCPUs, 440 GiB memory)

Libraries

TensorRT-LLM branch or tag: v0.8.0
Container used: 24.02-trtllm-python-py3

NVIDIA driver version: 535.161.07

OS: Ubuntu 22.04 (Jammy)

Reproduction Steps

I followed the steps here: https://github.com/NVIDIA/TensorRT-LLM/tree/5955b8afbad2ddcc3156202b16c567e94c52248f/examples/phi

From within the examples/phi folder:

  1. Build checkpoint with tp_size = 2, pp_size = 1

    python3 ./convert_checkpoint.py --model_dir "microsoft/phi-2" --output_dir ./phi-2-checkpoint --dtype float16 --tp_size 2
  2. Build engine (--workers set equal to tp_size)

    trtllm-build \
    --checkpoint_dir ./phi-2-checkpoint \
    --output_dir ./phi-2-engine-0 \
    --gemm_plugin float16 \
    --gpt_attention_plugin float16 \
    --remove_input_padding enable \
    --max_input_len 2048 \
    --max_output_len 1024 \
    --workers 2 \
    --max_batch_size 8
  3. Run in Triton (after copying the engine to the models folder)

# world_size = tp_size * pp_size
python3 scripts/launch_triton_server.py --world_size=2 --model_repo=/tensorrtllm_backend/models/phi2 --tensorrt_llm_model_name=phi2
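
Note: my config.pbtxt files still contain template placeholders (they show up as ${...} values in the server log below). If they need to be substituted, a minimal sketch using the backend's tools/fill_template.py helper would look like the following; the key names mirror the placeholders visible in the log, but the exact keys and paths depend on the tensorrtllm_backend version, so treat this as illustrative only:

    # illustrative: substitute the ${...} placeholders in the tensorrt_llm model config
    # (run from /tensorrtllm_backend; path assumed from gpt_model_path in the log)
    python3 tools/fill_template.py -i /tensorrtllm_backend/models/phi2/phi2/config.pbtxt \
        "kv_cache_free_gpu_mem_fraction:0.9,max_beam_width:1,enable_kv_cache_reuse:false"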

Expected Behavior

I expected the Triton server to start normally and print the GRPC/Metrics/HTTP service ports (8001, 8002, 8000) at the end of the startup log.
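
On a successful start, the tail of the log normally looks roughly like this (wording approximate, from a typical Triton startup, not captured from this run):

    Started GRPCInferenceService at 0.0.0.0:8001
    Started HTTPService at 0.0.0.0:8000
    Started Metrics Service at 0.0.0.0:8002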

Actual Behavior

The Triton server just hangs. I am using the 24.02-trtllm-python-py3 container. Here are the raw logs with --log-verbose=1:

root@40a6d0b00a92:/tensorrtllm_backend# python3 scripts/launch_triton_server.py --world_size=2 --model_repo=/tensorrtllm_backend/models/phi2 --tensorrt_llm_model_name=phi2
root@40a6d0b00a92:/tensorrtllm_backend# I0426 21:15:02.644640 107 cache_manager.cc:480] Create CacheManager with cache_dir: '/opt/tritonserver/caches'
I0426 21:15:02.646080 108 cache_manager.cc:480] Create CacheManager with cache_dir: '/opt/tritonserver/caches'
I0426 21:15:03.216801 107 pinned_memory_manager.cc:275] Pinned memory pool is created at '0x7787ce000000' with size 268435456
I0426 21:15:03.218430 108 pinned_memory_manager.cc:275] Pinned memory pool is created at '0x7807de000000' with size 268435456
I0426 21:15:03.223436 107 cuda_memory_manager.cc:107] CUDA memory pool is created on device 0 with size 67108864
I0426 21:15:03.223448 107 cuda_memory_manager.cc:107] CUDA memory pool is created on device 1 with size 67108864
I0426 21:15:03.224890 108 cuda_memory_manager.cc:107] CUDA memory pool is created on device 0 with size 67108864
I0426 21:15:03.224905 108 cuda_memory_manager.cc:107] CUDA memory pool is created on device 1 with size 67108864
I0426 21:15:03.495201 108 model_config_utils.cc:680] Server side auto-completed config: name: "phi2"
max_batch_size: 1024
input {
  name: "input_ids"
  data_type: TYPE_INT32
  dims: -1
  allow_ragged_batch: true
}
input {
  name: "request_output_len"
  data_type: TYPE_INT32
  dims: 1
}
output {
  name: "output_ids"
  data_type: TYPE_INT32
  dims: -1
  dims: -1
}
output {
  name: "sequence_length"
  data_type: TYPE_INT32
  dims: -1
}
output {
  name: "cum_log_probs"
  data_type: TYPE_FP32
  dims: -1
}
output {
  name: "output_log_probs"
  data_type: TYPE_FP32
  dims: -1
  dims: -1
}
output {
  name: "context_logits"
  data_type: TYPE_FP32
  dims: -1
  dims: -1
}
output {
  name: "generation_logits"
  data_type: TYPE_FP32
  dims: -1
  dims: -1
  dims: -1
}
instance_group {
  count: 1
  kind: KIND_GPU
}
parameters {
  key: "FORCE_CPU_ONLY_INPUT_TENSORS"
  value {
    string_value: "no"
  }
}
parameters {
  key: "batch_scheduler_policy"
  value {
    string_value: "guaranteed_no_evict"
  }
}
parameters {
  key: "enable_chunked_context"
  value {
    string_value: "false"
  }
}
parameters {
  key: "enable_kv_cache_reuse"
  value {
    string_value: "${enable_kv_cache_reuse}"
  }
}
parameters {
  key: "enable_trt_overlap"
  value {
    string_value: "false"
  }
}
parameters {
  key: "exclude_input_in_output"
  value {
    string_value: "true"
  }
}
parameters {
  key: "gpt_model_path"
  value {
    string_value: "/tensorrtllm_backend/models/phi2/phi2/1"
  }
}
parameters {
  key: "gpt_model_type"
  value {
    string_value: "inflight_fused_batching"
  }
}
parameters {
  key: "gpu_device_ids"
  value {
    string_value: "${gpu_device_ids}"
  }
}
parameters {
  key: "kv_cache_free_gpu_mem_fraction"
  value {
    string_value: "${kv_cache_free_gpu_mem_fraction}"
  }
}
parameters {
  key: "max_attention_window_size"
  value {
    string_value: "${max_attention_window_size}"
  }
}
parameters {
  key: "max_beam_width"
  value {
    string_value: "${max_beam_width}"
  }
}
parameters {
  key: "max_tokens_in_paged_kv_cache"
  value {
    string_value: "${max_tokens_in_paged_kv_cache}"
  }
}
parameters {
  key: "normalize_log_probs"
  value {
    string_value: "true"
  }
}
backend: "tensorrtllm"
model_transaction_policy {
}

I0426 21:15:03.495360 108 model_lifecycle.cc:469] loading: phi2:1
I0426 21:15:03.495473 107 model_config_utils.cc:680] Server side auto-completed config: name: "ensemble"
platform: "ensemble"
max_batch_size: 1024
input {
  name: "text_input"
  data_type: TYPE_STRING
  dims: -1
}
input {
  name: "max_tokens"
  data_type: TYPE_INT32
  dims: -1
}
output {
  name: "text_output"
  data_type: TYPE_STRING
  dims: -1
}
ensemble_scheduling {
  step {
    model_name: "preprocessing"
    model_version: -1
    input_map {
      key: "QUERY"
      value: "text_input"
    }
    input_map {
      key: "REQUEST_OUTPUT_LEN"
      value: "max_tokens"
    }
    output_map {
      key: "INPUT_ID"
      value: "_INPUT_ID"
    }
    output_map {
      key: "REQUEST_OUTPUT_LEN"
      value: "_REQUEST_OUTPUT_LEN"
    }
  }
  step {
    model_name: "phi2"
    model_version: -1
    input_map {
      key: "input_ids"
      value: "_INPUT_ID"
    }
    input_map {
      key: "request_output_len"
      value: "_REQUEST_OUTPUT_LEN"
    }
    output_map {
      key: "output_ids"
      value: "_TOKENS_BATCH"
    }
  }
  step {
    model_name: "postprocessing"
    model_version: -1
    input_map {
      key: "TOKENS_BATCH"
      value: "_TOKENS_BATCH"
    }
    output_map {
      key: "OUTPUT"
      value: "text_output"
    }
  }
}

I0426 21:15:03.495545 108 backend_model.cc:502] Adding default backend config setting: default-max-batch-size,4
I0426 21:15:03.495573 108 shared_library.cc:112] OpenLibraryHandle: /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so
I0426 21:15:03.495790 107 model_config_utils.cc:680] Server side auto-completed config: name: "phi2"
max_batch_size: 1024
input {
  name: "input_ids"
  data_type: TYPE_INT32
  dims: -1
  allow_ragged_batch: true
}
input {
  name: "request_output_len"
  data_type: TYPE_INT32
  dims: 1
}
output {
  name: "output_ids"
  data_type: TYPE_INT32
  dims: -1
  dims: -1
}
output {
  name: "sequence_length"
  data_type: TYPE_INT32
  dims: -1
}
output {
  name: "cum_log_probs"
  data_type: TYPE_FP32
  dims: -1
}
output {
  name: "output_log_probs"
  data_type: TYPE_FP32
  dims: -1
  dims: -1
}
output {
  name: "context_logits"
  data_type: TYPE_FP32
  dims: -1
  dims: -1
}
output {
  name: "generation_logits"
  data_type: TYPE_FP32
  dims: -1
  dims: -1
  dims: -1
}
instance_group {
  count: 1
  kind: KIND_GPU
}
parameters {
  key: "FORCE_CPU_ONLY_INPUT_TENSORS"
  value {
    string_value: "no"
  }
}
parameters {
  key: "batch_scheduler_policy"
  value {
    string_value: "guaranteed_no_evict"
  }
}
parameters {
  key: "enable_chunked_context"
  value {
    string_value: "false"
  }
}
parameters {
  key: "enable_kv_cache_reuse"
  value {
    string_value: "${enable_kv_cache_reuse}"
  }
}
parameters {
  key: "enable_trt_overlap"
  value {
    string_value: "false"
  }
}
parameters {
  key: "exclude_input_in_output"
  value {
    string_value: "true"
  }
}
parameters {
  key: "gpt_model_path"
  value {
    string_value: "/tensorrtllm_backend/models/phi2/phi2/1"
  }
}
parameters {
  key: "gpt_model_type"
  value {
    string_value: "inflight_fused_batching"
  }
}
parameters {
  key: "gpu_device_ids"
  value {
    string_value: "${gpu_device_ids}"
  }
}
parameters {
  key: "kv_cache_free_gpu_mem_fraction"
  value {
    string_value: "${kv_cache_free_gpu_mem_fraction}"
  }
}
parameters {
  key: "max_attention_window_size"
  value {
    string_value: "${max_attention_window_size}"
  }
}
parameters {
  key: "max_beam_width"
  value {
    string_value: "${max_beam_width}"
  }
}
parameters {
  key: "max_tokens_in_paged_kv_cache"
  value {
    string_value: "${max_tokens_in_paged_kv_cache}"
  }
}
parameters {
  key: "normalize_log_probs"
  value {
    string_value: "true"
  }
}
backend: "tensorrtllm"
model_transaction_policy {
}

I0426 21:15:03.496054 107 model_config_utils.cc:680] Server side auto-completed config: name: "postprocessing"
max_batch_size: 1024
input {
  name: "TOKENS_BATCH"
  data_type: TYPE_INT32
  dims: -1
  dims: -1
}
output {
  name: "OUTPUT"
  data_type: TYPE_STRING
  dims: -1
}
instance_group {
  count: 1
  kind: KIND_CPU
}
default_model_filename: "model.py"
parameters {
  key: "skip_special_tokens"
  value {
    string_value: "True"
  }
}
parameters {
  key: "tokenizer_dir"
  value {
    string_value: "/tensorrtllm_backend/tensorrt_llm/examples/phi/phi2"
  }
}
parameters {
  key: "tokenizer_type"
  value {
    string_value: "auto"
  }
}
backend: "python"

I0426 21:15:03.496294 107 model_config_utils.cc:680] Server side auto-completed config: name: "preprocessing"
max_batch_size: 1024
input {
  name: "QUERY"
  data_type: TYPE_STRING
  dims: -1
}
input {
  name: "REQUEST_OUTPUT_LEN"
  data_type: TYPE_INT32
  dims: -1
}
input {
  name: "BAD_WORDS_DICT"
  data_type: TYPE_STRING
  dims: -1
  optional: true
}
input {
  name: "STOP_WORDS_DICT"
  data_type: TYPE_STRING
  dims: -1
  optional: true
}
input {
  name: "EMBEDDING_BIAS_WORDS"
  data_type: TYPE_STRING
  dims: -1
  optional: true
}
input {
  name: "EMBEDDING_BIAS_WEIGHTS"
  data_type: TYPE_FP32
  dims: -1
  optional: true
}
input {
  name: "END_ID"
  data_type: TYPE_INT32
  dims: -1
  optional: true
}
input {
  name: "PAD_ID"
  data_type: TYPE_INT32
  dims: -1
  optional: true
}
output {
  name: "INPUT_ID"
  data_type: TYPE_INT32
  dims: -1
}
output {
  name: "REQUEST_OUTPUT_LEN"
  data_type: TYPE_INT32
  dims: -1
}
instance_group {
  count: 1
  kind: KIND_CPU
}
default_model_filename: "model.py"
parameters {
  key: "add_special_tokens"
  value {
    string_value: "False"
  }
}
parameters {
  key: "tokenizer_dir"
  value {
    string_value: "/tensorrtllm_backend/tensorrt_llm/examples/phi/phi2"
  }
}
parameters {
  key: "tokenizer_type"
  value {
    string_value: "auto"
  }
}
backend: "python"

I0426 21:15:03.496412 107 model_lifecycle.cc:469] loading: preprocessing:1
I0426 21:15:03.496460 107 model_lifecycle.cc:469] loading: postprocessing:1
I0426 21:15:03.496508 107 model_lifecycle.cc:469] loading: phi2:1
I0426 21:15:03.496628 107 backend_model.cc:502] Adding default backend config setting: default-max-batch-size,4
I0426 21:15:03.496633 107 backend_model.cc:502] Adding default backend config setting: default-max-batch-size,4
I0426 21:15:03.496665 107 shared_library.cc:112] OpenLibraryHandle: /opt/tritonserver/backends/python/libtriton_python.so
I0426 21:15:03.496703 107 backend_model.cc:502] Adding default backend config setting: default-max-batch-size,4
I0426 21:15:03.497955 107 python_be.cc:2075] 'python' TRITONBACKEND API version: 1.18
I0426 21:15:03.497971 107 python_be.cc:2097] backend configuration:
{"cmdline":{"auto-complete-config":"false","backend-directory":"/opt/tritonserver/backends","min-compute-capability":"6.000000","shm-region-prefix-name":"prefix0_","default-max-batch-size":"4"}}
I0426 21:15:03.498001 107 python_be.cc:2236] Shared memory configuration is shm-default-byte-size=1048576,shm-growth-byte-size=1048576,stub-timeout-seconds=30
I0426 21:15:03.498181 107 python_be.cc:2559] TRITONBACKEND_GetBackendAttribute: setting attributes
I0426 21:15:03.498266 107 shared_library.cc:112] OpenLibraryHandle: /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so
I0426 21:15:03.507194 107 python_be.cc:2337] TRITONBACKEND_ModelInitialize: postprocessing (version 1)
I0426 21:15:03.507335 107 python_be.cc:2337] TRITONBACKEND_ModelInitialize: preprocessing (version 1)
I0426 21:15:03.507584 107 model_config_utils.cc:1902] ModelConfig 64-bit fields:
I0426 21:15:03.507606 107 model_config_utils.cc:1904]   ModelConfig::dynamic_batching::default_priority_level
I0426 21:15:03.507612 107 model_config_utils.cc:1904]   ModelConfig::dynamic_batching::default_queue_policy::default_timeout_microseconds
I0426 21:15:03.507619 107 model_config_utils.cc:1904]   ModelConfig::dynamic_batching::max_queue_delay_microseconds
I0426 21:15:03.507625 107 model_config_utils.cc:1904]   ModelConfig::dynamic_batching::priority_levels
I0426 21:15:03.507628 107 model_config_utils.cc:1904]   ModelConfig::dynamic_batching::priority_queue_policy::key
I0426 21:15:03.507632 107 model_config_utils.cc:1904]   ModelConfig::dynamic_batching::priority_queue_policy::value::default_timeout_microseconds
I0426 21:15:03.507635 107 model_config_utils.cc:1904]   ModelConfig::ensemble_scheduling::step::model_version
I0426 21:15:03.507640 107 model_config_utils.cc:1904]   ModelConfig::input::dims
I0426 21:15:03.507645 107 model_config_utils.cc:1904]   ModelConfig::input::reshape::shape
I0426 21:15:03.507651 107 model_config_utils.cc:1904]   ModelConfig::instance_group::secondary_devices::device_id
I0426 21:15:03.507655 107 model_config_utils.cc:1904]   ModelConfig::model_warmup::inputs::value::dims
I0426 21:15:03.507661 107 model_config_utils.cc:1904]   ModelConfig::optimization::cuda::graph_spec::graph_lower_bound::input::value::dim
I0426 21:15:03.507664 107 model_config_utils.cc:1904]   ModelConfig::optimization::cuda::graph_spec::input::value::dim
I0426 21:15:03.507668 107 model_config_utils.cc:1904]   ModelConfig::output::dims
I0426 21:15:03.507672 107 model_config_utils.cc:1904]   ModelConfig::output::reshape::shape
I0426 21:15:03.507677 107 model_config_utils.cc:1904]   ModelConfig::sequence_batching::direct::max_queue_delay_microseconds
I0426 21:15:03.507681 107 model_config_utils.cc:1904]   ModelConfig::sequence_batching::max_sequence_idle_microseconds
I0426 21:15:03.507687 107 model_config_utils.cc:1904]   ModelConfig::sequence_batching::oldest::max_queue_delay_microseconds
I0426 21:15:03.507691 107 model_config_utils.cc:1904]   ModelConfig::sequence_batching::state::dims
I0426 21:15:03.507697 107 model_config_utils.cc:1904]   ModelConfig::sequence_batching::state::initial_state::dims
I0426 21:15:03.507700 107 model_config_utils.cc:1904]   ModelConfig::version_policy::specific::versions
I0426 21:15:03.507817 107 python_be.cc:2031] model configuration:
{
    "name": "postprocessing",
    "platform": "",
    "backend": "python",
    "runtime": "",
    "version_policy": {
        "latest": {
            "num_versions": 1
        }
    },
    "max_batch_size": 1024,
    "input": [
        {
            "name": "TOKENS_BATCH",
            "data_type": "TYPE_INT32",
            "format": "FORMAT_NONE",
            "dims": [
                -1,
                -1
            ],
            "is_shape_tensor": false,
            "allow_ragged_batch": false,
            "optional": false
        }
    ],
    "output": [
        {
            "name": "OUTPUT",
            "data_type": "TYPE_STRING",
            "dims": [
                -1
            ],
            "label_filename": "",
            "is_shape_tensor": false
        }
    ],
    "batch_input": [],
    "batch_output": [],
    "optimization": {
        "priority": "PRIORITY_DEFAULT",
        "input_pinned_memory": {
            "enable": true
        },
        "output_pinned_memory": {
            "enable": true
        },
        "gather_kernel_buffer_threshold": 0,
        "eager_batching": false
    },
    "instance_group": [
        {
            "name": "postprocessing_0",
            "kind": "KIND_CPU",
            "count": 1,
            "gpus": [],
            "secondary_devices": [],
            "profile": [],
            "passive": false,
            "host_policy": ""
        }
    ],
    "default_model_filename": "model.py",
    "cc_model_filenames": {},
    "metric_tags": {},
    "parameters": {
        "tokenizer_dir": {
            "string_value": "/tensorrtllm_backend/tensorrt_llm/examples/phi/phi2"
        },
        "tokenizer_type": {
            "string_value": "auto"
        },
        "skip_special_tokens": {
            "string_value": "True"
        }
    },
    "model_warmup": []
}
I0426 21:15:03.507833 107 python_be.cc:2031] model configuration:
{
    "name": "preprocessing",
    "platform": "",
    "backend": "python",
    "runtime": "",
    "version_policy": {
        "latest": {
            "num_versions": 1
        }
    },
    "max_batch_size": 1024,
    "input": [
        {
            "name": "QUERY",
            "data_type": "TYPE_STRING",
            "format": "FORMAT_NONE",
            "dims": [
                -1
            ],
            "is_shape_tensor": false,
            "allow_ragged_batch": false,
            "optional": false
        },
        {
            "name": "REQUEST_OUTPUT_LEN",
            "data_type": "TYPE_INT32",
            "format": "FORMAT_NONE",
            "dims": [
                -1
            ],
            "is_shape_tensor": false,
            "allow_ragged_batch": false,
            "optional": false
        },
        {
            "name": "BAD_WORDS_DICT",
            "data_type": "TYPE_STRING",
            "format": "FORMAT_NONE",
            "dims": [
                -1
            ],
            "is_shape_tensor": false,
            "allow_ragged_batch": false,
            "optional": true
        },
        {
            "name": "STOP_WORDS_DICT",
            "data_type": "TYPE_STRING",
            "format": "FORMAT_NONE",
            "dims": [
                -1
            ],
            "is_shape_tensor": false,
            "allow_ragged_batch": false,
            "optional": true
        },
        {
            "name": "EMBEDDING_BIAS_WORDS",
            "data_type": "TYPE_STRING",
            "format": "FORMAT_NONE",
            "dims": [
                -1
            ],
            "is_shape_tensor": false,
            "allow_ragged_batch": false,
            "optional": true
        },
        {
            "name": "EMBEDDING_BIAS_WEIGHTS",
            "data_type": "TYPE_FP32",
            "format": "FORMAT_NONE",
            "dims": [
                -1
            ],
            "is_shape_tensor": false,
            "allow_ragged_batch": false,
            "optional": true
        },
        {
            "name": "END_ID",
            "data_type": "TYPE_INT32",
            "format": "FORMAT_NONE",
            "dims": [
                -1
            ],
            "is_shape_tensor": false,
            "allow_ragged_batch": false,
            "optional": true
        },
        {
            "name": "PAD_ID",
            "data_type": "TYPE_INT32",
            "format": "FORMAT_NONE",
            "dims": [
                -1
            ],
            "is_shape_tensor": false,
            "allow_ragged_batch": false,
            "optional": true
        }
    ],
    "output": [
        {
            "name": "INPUT_ID",
            "data_type": "TYPE_INT32",
            "dims": [
                -1
            ],
            "label_filename": "",
            "is_shape_tensor": false
        },
        {
            "name": "REQUEST_OUTPUT_LEN",
            "data_type": "TYPE_INT32",
            "dims": [
                -1
            ],
            "label_filename": "",
            "is_shape_tensor": false
        }
    ],
    "batch_input": [],
    "batch_output": [],
    "optimization": {
        "priority": "PRIORITY_DEFAULT",
        "input_pinned_memory": {
            "enable": true
        },
        "output_pinned_memory": {
            "enable": true
        },
        "gather_kernel_buffer_threshold": 0,
        "eager_batching": false
    },
    "instance_group": [
        {
            "name": "preprocessing_0",
            "kind": "KIND_CPU",
            "count": 1,
            "gpus": [],
            "secondary_devices": [],
            "profile": [],
            "passive": false,
            "host_policy": ""
        }
    ],
    "default_model_filename": "model.py",
    "cc_model_filenames": {},
    "metric_tags": {},
    "parameters": {
        "tokenizer_dir": {
            "string_value": "/tensorrtllm_backend/tensorrt_llm/examples/phi/phi2"
        },
        "tokenizer_type": {
            "string_value": "auto"
        },
        "add_special_tokens": {
            "string_value": "False"
        }
    },
    "model_warmup": []
}
I0426 21:15:03.527326 107 python_be.cc:2381] TRITONBACKEND_ModelInstanceInitialize: preprocessing_0_0 (CPU device 0)
I0426 21:15:03.527339 107 python_be.cc:2381] TRITONBACKEND_ModelInstanceInitialize: postprocessing_0_0 (CPU device 0)
I0426 21:15:03.527383 107 backend_model_instance.cc:69] Creating instance preprocessing_0_0 on CPU using artifact 'model.py'
I0426 21:15:03.527441 107 backend_model_instance.cc:69] Creating instance postprocessing_0_0 on CPU using artifact 'model.py'
I0426 21:15:03.528593 107 stub_launcher.cc:388] Starting Python backend stub:  exec /opt/tritonserver/backends/python/triton_python_backend_stub /tensorrtllm_backend/models/phi2/postprocessing/1/model.py prefix0_2 1048576 1048576 107 /opt/tritonserver/backends/python 336 postprocessing_0_0 DEFAULT
I0426 21:15:03.528645 107 stub_launcher.cc:388] Starting Python backend stub:  exec /opt/tritonserver/backends/python/triton_python_backend_stub /tensorrtllm_backend/models/phi2/preprocessing/1/model.py prefix0_1 1048576 1048576 107 /opt/tritonserver/backends/python 336 preprocessing_0_0 DEFAULT
I0426 21:15:03.556734 108 model_config_utils.cc:1902] ModelConfig 64-bit fields:
I0426 21:15:03.556763 108 model_config_utils.cc:1904]   ModelConfig::dynamic_batching::default_priority_level
I0426 21:15:03.556767 108 model_config_utils.cc:1904]   ModelConfig::dynamic_batching::default_queue_policy::default_timeout_microseconds
I0426 21:15:03.556771 108 model_config_utils.cc:1904]   ModelConfig::dynamic_batching::max_queue_delay_microseconds
I0426 21:15:03.556775 108 model_config_utils.cc:1904]   ModelConfig::dynamic_batching::priority_levels
I0426 21:15:03.556778 108 model_config_utils.cc:1904]   ModelConfig::dynamic_batching::priority_queue_policy::key
I0426 21:15:03.556782 108 model_config_utils.cc:1904]   ModelConfig::dynamic_batching::priority_queue_policy::value::default_timeout_microseconds
I0426 21:15:03.556786 108 model_config_utils.cc:1904]   ModelConfig::ensemble_scheduling::step::model_version
I0426 21:15:03.556789 108 model_config_utils.cc:1904]   ModelConfig::input::dims
I0426 21:15:03.556793 108 model_config_utils.cc:1904]   ModelConfig::input::reshape::shape
I0426 21:15:03.556796 108 model_config_utils.cc:1904]   ModelConfig::instance_group::secondary_devices::device_id
I0426 21:15:03.556799 108 model_config_utils.cc:1904]   ModelConfig::model_warmup::inputs::value::dims
I0426 21:15:03.556803 108 model_config_utils.cc:1904]   ModelConfig::optimization::cuda::graph_spec::graph_lower_bound::input::value::dim
I0426 21:15:03.556807 108 model_config_utils.cc:1904]   ModelConfig::optimization::cuda::graph_spec::input::value::dim
I0426 21:15:03.556811 108 model_config_utils.cc:1904]   ModelConfig::output::dims
I0426 21:15:03.556814 108 model_config_utils.cc:1904]   ModelConfig::output::reshape::shape
I0426 21:15:03.556818 108 model_config_utils.cc:1904]   ModelConfig::sequence_batching::direct::max_queue_delay_microseconds
I0426 21:15:03.556821 108 model_config_utils.cc:1904]   ModelConfig::sequence_batching::max_sequence_idle_microseconds
I0426 21:15:03.556825 108 model_config_utils.cc:1904]   ModelConfig::sequence_batching::oldest::max_queue_delay_microseconds
I0426 21:15:03.556828 108 model_config_utils.cc:1904]   ModelConfig::sequence_batching::state::dims
I0426 21:15:03.556832 108 model_config_utils.cc:1904]   ModelConfig::sequence_batching::state::initial_state::dims
I0426 21:15:03.556835 108 model_config_utils.cc:1904]   ModelConfig::version_policy::specific::versions
[TensorRT-LLM][WARNING] gpu_device_ids is not specified, will be automatically set
[TensorRT-LLM][WARNING] max_beam_width is not specified, will use default value of 1
[TensorRT-LLM][WARNING] max_tokens_in_paged_kv_cache is not specified, will use default value
[TensorRT-LLM][WARNING] kv_cache_free_gpu_mem_fraction is not specified, will use default value of 0.9 or max_tokens_in_paged_kv_cache
[TensorRT-LLM][WARNING] max_attention_window_size is not specified, will use default value (i.e. max_sequence_length)
[TensorRT-LLM][WARNING] enable_kv_cache_reuse is not specified, will be set to false
[TensorRT-LLM][INFO] Engine version 0.8.0 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'lora_target_modules' not found
[TensorRT-LLM][WARNING] Optional value for parameter lora_target_modules will not be set.
[TensorRT-LLM][WARNING] Parameter max_draft_len cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null
[TensorRT-LLM][WARNING] Optional value for parameter quant_algo will not be set.
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null
[TensorRT-LLM][WARNING] Optional value for parameter kv_cache_quant_algo will not be set.
[TensorRT-LLM][INFO] Initializing MPI with thread mode 1
[TensorRT-LLM][WARNING] gpu_device_ids is not specified, will be automatically set
[TensorRT-LLM][WARNING] max_beam_width is not specified, will use default value of 1
[TensorRT-LLM][WARNING] max_tokens_in_paged_kv_cache is not specified, will use default value
[TensorRT-LLM][WARNING] kv_cache_free_gpu_mem_fraction is not specified, will use default value of 0.9 or max_tokens_in_paged_kv_cache
[TensorRT-LLM][WARNING] max_attention_window_size is not specified, will use default value (i.e. max_sequence_length)
[TensorRT-LLM][WARNING] enable_kv_cache_reuse is not specified, will be set to false
[TensorRT-LLM][INFO] Engine version 0.8.0 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'lora_target_modules' not found
[TensorRT-LLM][WARNING] Optional value for parameter lora_target_modules will not be set.
[TensorRT-LLM][WARNING] Parameter max_draft_len cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null
[TensorRT-LLM][WARNING] Optional value for parameter quant_algo will not be set.
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null
[TensorRT-LLM][WARNING] Optional value for parameter kv_cache_quant_algo will not be set.
[TensorRT-LLM][INFO] Initializing MPI with thread mode 1
[TensorRT-LLM][INFO] MPI size: 2, rank: 1
[TensorRT-LLM][INFO] Rank 1 is using GPU 1
[TensorRT-LLM][INFO] MPI size: 2, rank: 0
[TensorRT-LLM][INFO] Rank 0 is using GPU 0
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
I0426 21:15:04.084264 107 python_be.cc:2402] TRITONBACKEND_ModelInstanceInitialize: instance initialization successful postprocessing_0_0 (device 0)
I0426 21:15:04.084469 107 backend_model_instance.cc:772] Starting backend thread for postprocessing_0_0 at nice 0 on device 0...
I0426 21:15:04.084696 107 model_lifecycle.cc:835] successfully loaded 'postprocessing'
I0426 21:15:04.088260 107 python_be.cc:2402] TRITONBACKEND_ModelInstanceInitialize: instance initialization successful preprocessing_0_0 (device 0)
I0426 21:15:04.088398 107 backend_model_instance.cc:772] Starting backend thread for preprocessing_0_0 at nice 0 on device 0...
I0426 21:15:04.088554 107 model_lifecycle.cc:835] successfully loaded 'preprocessing'
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 8
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 8
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: 3072
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] Loaded engine size: 2778 MiB
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 8
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 8
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: 3072
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] Loaded engine size: 2778 MiB
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 2833, GPU 3768 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +2, GPU +10, now: CPU 2835, GPU 3778 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 2833, GPU 3768 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +2, GPU +10, now: CPU 2835, GPU 3778 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +2776, now: CPU 0, GPU 2776 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +2776, now: CPU 0, GPU 2776 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 3010, GPU 5264 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 3010, GPU 5272 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 3010, GPU 5264 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 3010, GPU 5272 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 2776 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 2776 (MiB)
[TensorRT-LLM][INFO] Allocate 71491911680 bytes for k/v cache. 
[TensorRT-LLM][INFO] Using 436352 total tokens in paged KV cache, and 24 blocks per sequence
[TensorRT-LLM][INFO] Allocate 71491911680 bytes for k/v cache. 
[TensorRT-LLM][INFO] Using 436352 total tokens in paged KV cache, and 24 blocks per sequence
I0426 21:15:07.440169 107 backend_model_instance.cc:772] Starting backend thread for phi2_0_0 at nice 0 on device 0...
[TensorRT-LLM][WARNING] max_beam_width is not specified, will use default value of 1
[TensorRT-LLM][WARNING] max_tokens_in_paged_kv_cache is not specified, will use default value
[TensorRT-LLM][WARNING] kv_cache_free_gpu_mem_fraction is not specified, will use default value of 0.9 or max_tokens_in_paged_kv_cache
[TensorRT-LLM][WARNING] max_attention_window_size is not specified, will use default value (i.e. max_sequence_length)
[TensorRT-LLM][WARNING] enable_kv_cache_reuse is not specified, will be set to false
[TensorRT-LLM][INFO] Engine version 0.8.0 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'lora_target_modules' not found
[TensorRT-LLM][WARNING] Optional value for parameter lora_target_modules will not be set.
[TensorRT-LLM][WARNING] Parameter max_draft_len cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null
[TensorRT-LLM][WARNING] Optional value for parameter quant_algo will not be set.
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null
[TensorRT-LLM][WARNING] Optional value for parameter kv_cache_quant_algo will not be set.
[TensorRT-LLM][INFO] MPI size: 2, rank: 0
[TensorRT-LLM][INFO] Rank 0 is using GPU 0
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 8
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 8
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: 3072
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] The logger passed into createInferRuntime differs from one already provided for an existing builder, runtime, or refitter. Uses of the global logger, returned by nvinfer1::getLogger(), will return the existing value.
[TensorRT-LLM][INFO] Loaded engine size: 2778 MiB
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 3041, GPU 76294 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +1, GPU +10, now: CPU 3042, GPU 76304 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +2776, now: CPU 0, GPU 5552 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 3042, GPU 76968 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 3042, GPU 76976 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 5552 (MiB)

After a few minutes, I tried calling the endpoint, but it does not respond because the server is hung:

$ curl -X POST localhost:8000/v2/models/ensemble/generate -d '{"text_input": "What is machine learning?", "max_tokens": 20}'
curl: (56) Recv failure: Connection reset by peer
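
For reference, the standard Triton readiness endpoints can be probed to distinguish "server never became ready" from "model failed after load". This is a generic sketch of the KServe v2 HTTP API, not output captured from this run:

    # generic readiness probes against Triton's HTTP endpoint
    curl -sf localhost:8000/v2/health/ready && echo "server ready" || echo "server not ready"
    curl -sf localhost:8000/v2/models/phi2/ready && echo "phi2 ready" || echo "phi2 not ready"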

Additional Notes

I used Phi-2 from here: https://huggingface.co/microsoft/phi-2

I wonder if this is similar to this other issue: https://github.com/triton-inference-server/tensorrtllm_backend/issues/377

It would be great if NVIDIA could reproduce this.

hijkzzz commented 6 months ago

Hi, for Phi-2, please use these commands:

python ./convert_checkpoint.py --model_dir "microsoft/phi-2" --output_dir ./phi-2-checkpoint --dtype float16

--tp_size should be set with trtllm-build instead:

trtllm-build \
    --checkpoint_dir ./phi-2-checkpoint \
    --output_dir ./phi-2-engine \
    --gemm_plugin float16 \
    --max_batch_size 8 \
    --max_input_len 1024 \
    --max_output_len 1024 \
    --tp_size 2 \
    --workers 2

For historical reasons, this model has some special usage requirements.

kelkarn commented 6 months ago

@hijkzzz - that does not work for me; I get this error:

usage: trtllm-build [-h] [--checkpoint_dir CHECKPOINT_DIR] [--model_config MODEL_CONFIG] [--build_config BUILD_CONFIG] [--model_cls_file MODEL_CLS_FILE]
                    [--model_cls_name MODEL_CLS_NAME] [--timing_cache TIMING_CACHE] [--log_level LOG_LEVEL] [--profiling_verbosity {layer_names_only,detailed,none}]
                    [--enable_debug_output] [--output_dir OUTPUT_DIR] [--workers WORKERS] [--max_batch_size MAX_BATCH_SIZE] [--max_input_len MAX_INPUT_LEN]
                    [--max_output_len MAX_OUTPUT_LEN] [--max_beam_width MAX_BEAM_WIDTH] [--max_num_tokens MAX_NUM_TOKENS]
                    [--max_prompt_embedding_table_size MAX_PROMPT_EMBEDDING_TABLE_SIZE] [--use_fused_mlp] [--gather_all_token_logits] [--gather_context_logits]
                    [--gather_generation_logits] [--strongly_typed] [--builder_opt BUILDER_OPT] [--logits_dtype {float16,float32}] [--weight_only_precision {int8,int4}]
                    [--bert_attention_plugin {float16,float32,bfloat16,disable}] [--gpt_attention_plugin {float16,float32,bfloat16,disable}]
                    [--gemm_plugin {float16,float32,bfloat16,disable}] [--lookup_plugin {float16,float32,bfloat16,disable}] [--lora_plugin {float16,float32,bfloat16,disable}]
                    [--context_fmha {enable,disable}] [--context_fmha_fp32_acc {enable,disable}] [--paged_kv_cache {enable,disable}] [--remove_input_padding {enable,disable}]
                    [--use_custom_all_reduce {enable,disable}] [--multi_block_mode {enable,disable}] [--enable_xqa {enable,disable}]
                    [--attention_qk_half_accumulation {enable,disable}] [--tokens_per_block TOKENS_PER_BLOCK] [--use_paged_context_fmha {enable,disable}]
                    [--use_context_fmha_for_generation {enable,disable}]
trtllm-build: error: unrecognized arguments: --tp_size 2

I am using TRT-LLM v0.8.0 in a 24.02-trtllm-python-py3 Triton container.

hijkzzz commented 6 months ago

(quoting the previous comment)

@hijkzzz - that does not work for me; I get this error:

trtllm-build: error: unrecognized arguments: --tp_size 2

I am using TRT-LLM v0.8.0 in a 24.02-trtllm-python-py3 Triton container.

Please use the latest TRT-LLM.
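
For example, something along these lines pulls a newer release from NVIDIA's PyPI index (assumes network access inside the container; pick the version that matches your setup):

    pip install --upgrade tensorrt_llm --extra-index-url https://pypi.nvidia.com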

kelkarn commented 6 months ago

@hijkzzz is that compatible with Triton 24.02? The support matrix says that only v0.8.0 of TRT-LLM is compatible with Triton 24.02.

kelkarn commented 6 months ago

@byshiue - can you please help me understand what the resolution here is? Are we saying that Phi-2 with TRT-LLM v0.8.0, on an A100 (160GB), with Triton server 24.02 is not expected to work?