YunchaoYang commented 5 months ago

A few options to explore

NVIDIA NeMo, TensorRT-LLM, Triton

NeMo

Run this Generative AI example to build LoRA adapters with Gemma 2B and 7B

Run this example with Mistral 7B for PEFT

TensorRT/TensorRT-LLM
Triton Inference Server

vLLM: this example shows how to use Llama and vLLM to build an app.

YunchaoYang commented 5 months ago

OpenLLM

Xingce Community (星策社区) one-stop LLMOps meetup

LiteLLM

Similar to LangChain; LiteLLM provides a unified API for calling different large language models.

YunchaoYang commented 5 months ago

Triton Inference Server

Hands-on

A hands-on practical tutorial

1. Prepare the model repo
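
As a rough sketch of this step, the snippet below creates the directory layout Triton expects: one directory per model, holding a config.pbtxt and one numeric subdirectory per version. The temporary repository path, the model name `resnet50_pytorch` (borrowed from the load/unload commands later in this thread), the helper name `make_model_repo`, and the empty placeholder config are assumptions for illustration only.

```python
import tempfile
from pathlib import Path


def make_model_repo(root: Path, model_name: str, version: int, model_file: str) -> Path:
    """Create <root>/<model_name>/<version>/ plus an empty config.pbtxt.

    Returns the path where the serialized model file should be copied.
    """
    model_dir = root / model_name
    version_dir = model_dir / str(version)
    version_dir.mkdir(parents=True, exist_ok=True)
    # config.pbtxt sits next to the version directories, not inside them.
    (model_dir / "config.pbtxt").touch()
    return version_dir / model_file


# Hypothetical example: a TorchScript model stored as model.pt under version 1.
repo_root = Path(tempfile.mkdtemp())
target = make_model_repo(repo_root, "resnet50_pytorch", 1, "model.pt")
print(target.relative_to(repo_root))
```

The resulting directory is what would be passed to `--model-repository` when starting the server.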

2. Configure the model

The minimal parameters required to configure a served model:

platform/backend: defines which backend to use
max_batch_size: the maximum batch size
input and output: the names and properties of the input and output tensors

For TensorRT, ONNX, and TensorFlow SavedModel models, config.pbtxt is not required as long as the server is started with --strict-model-config=false.

platform/backend: in most cases set one of the two; the special cases are listed below.

For TensorRT, ONNX Runtime, and PyTorch, either platform or backend can be set.

For TensorFlow, platform is required and backend is optional.

For OpenVINO, Python, and DALI, only backend can be used.

For custom backends, before release 21.05 platform could be set to custom; from 21.05 on, the backend field must be used, and its value is the name of your custom backend library.

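Common configuration cases

Case 1: max_batch_size is a constant greater than 0. Input and output specify a name, a data type, and a shape. Note: dims omits the batch dimension.

Case 2: max_batch_size equals 0. The model's inputs and outputs do not include a batch dimension, so dims is the full, actual shape.

Case 3: PyTorch special case. TorchScript models do not store input and output names, so the names must follow the "string__number" convention. Variable-size dimensions are supported by setting them to -1.

Case 4: the reshape parameter, which reshapes an input or output.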
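
To make cases 1 to 3 concrete, here is a minimal sketch of how the shape a client sends relates to dims, plus a check for the TorchScript naming convention. The helper names `request_shape` and `is_torchscript_name`, and the example tensor name `INPUT__0`, are illustrative assumptions and not part of any Triton API.

```python
import re
from typing import List

# TorchScript tensors are named "<string>__<index>", for example INPUT__0.
_TORCH_NAME = re.compile(r"^.+__\d+$")


def request_shape(max_batch_size: int, dims: List[int], batch: int = 1) -> List[int]:
    """Shape a client sends for one input tensor.

    max_batch_size > 0: dims omits the batch dimension, so it is prepended.
    max_batch_size == 0: dims already is the full shape.
    """
    if max_batch_size == 0:
        return list(dims)
    if not 1 <= batch <= max_batch_size:
        raise ValueError(f"batch must be in 1..{max_batch_size}")
    return [batch] + list(dims)


def is_torchscript_name(name: str) -> bool:
    """True if the name follows the "<string>__<index>" convention."""
    return _TORCH_NAME.match(name) is not None


print(request_shape(8, [3, 224, 224], batch=4))   # case 1
print(request_shape(0, [1, 3, 224, 224]))         # case 2
print(is_torchscript_name("INPUT__0"), is_torchscript_name("input0"))  # case 3
```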

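The version_policy parameter in config.pbtxt lets you decide which versions to serve. Policies:

all: load every version of the model.
latest: load the most recent versions (possibly several; a larger version number means a newer version).
specific: load only the listed versions.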
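
The three policies can be sketched as a small selection function. This is only an illustration of the selection rule described above, not Triton's implementation; the function name `select_versions` is hypothetical, and its `num_versions` and `versions` arguments are named after the corresponding fields of the latest and specific policies.

```python
from typing import List, Optional


def select_versions(available: List[int], policy: str,
                    num_versions: int = 1,
                    versions: Optional[List[int]] = None) -> List[int]:
    """Return the version numbers that would be served under a policy."""
    ordered = sorted(available)  # numeric order: a larger number is newer
    if policy == "all":
        return ordered
    if policy == "latest":
        if num_versions < 1:
            raise ValueError("num_versions must be at least 1")
        return ordered[-num_versions:]
    if policy == "specific":
        wanted = set(versions or [])
        return [v for v in ordered if v in wanted]
    raise ValueError(f"unknown policy: {policy}")


on_disk = [1, 2, 3, 10]
print(select_versions(on_disk, "all"))
print(select_versions(on_disk, "latest", num_versions=2))
print(select_versions(on_disk, "specific", versions=[1, 3]))
```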

Instance Groups

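This corresponds to Triton's concurrent execution capability. The parameter configures multiple instances of a model on the specified devices, which increases serving capacity and throughput.

An instance group configures a set of model instances that run on the same kind of device.
count: the number of model instances run concurrently.
kind: the device type.
gpus: the GPU IDs to use. If this parameter is omitted, Triton runs count instances on every GPU.

Multiple groups can be configured.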
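
The effect of count and gpus can be illustrated with a short helper that lists how many instances land on each device. `KIND_GPU` and `KIND_CPU` are the kind values used in config.pbtxt; the helper name `instances_per_device` and the `visible_gpus` argument are hypothetical and only stand in for whatever GPUs the server can see.

```python
from typing import Dict, List, Optional


def instances_per_device(count: int, kind: str,
                         gpus: Optional[List[int]] = None,
                         visible_gpus: int = 0) -> Dict[str, int]:
    """Map each device to the number of model instances placed on it."""
    if kind == "KIND_CPU":
        return {"cpu": count}
    if kind == "KIND_GPU":
        # Without an explicit gpus list, every visible GPU gets `count` instances.
        targets = gpus if gpus else list(range(visible_gpus))
        return {f"gpu{g}": count for g in targets}
    raise ValueError(f"unsupported kind: {kind}")


print(instances_per_device(2, "KIND_GPU", visible_gpus=2))
print(instances_per_device(2, "KIND_GPU", gpus=[1], visible_gpus=2))
print(instances_per_device(4, "KIND_CPU"))
```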

Scheduling and Batching

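Scheduling: specifies the scheduling policy used to handle incoming requests.

6.1 Default Scheduler

No batching; each request is run with whatever batch size it arrives with.

6.2 Dynamic Batcher

On the server side, merges several input tensors with small batch sizes into one input tensor with a larger batch size. This is the key technique for raising throughput, and it is only suitable for stateless models.

Sub-parameters:

preferred_batch_size: the batch sizes the batcher should try to reach; multiple values are allowed.
max_queue_delay_microseconds: the time limit, in microseconds, for forming a batch.

Advanced sub-parameters:

preserve_ordering: responses are returned in the same order in which the requests arrived.
priority_levels: defines the processing order of requests with different priorities.
Queue policy (default_queue_policy, priority_queue_policy): configures the behavior of the request wait queue.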
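
The idea behind dynamic batching can be sketched with a toy grouping function. This is a deliberately simplified model, not Triton's scheduler: it ignores preferred_batch_size and execution time, assumes every request fits within max_batch_size, and closes a batch only when the next request would overflow it or arrives more than the queue delay after the batch was opened. The function name and the example request list are made up for illustration.

```python
from typing import List, Tuple


def form_batches(requests: List[Tuple[int, int]], max_batch_size: int,
                 max_queue_delay_us: int) -> List[List[int]]:
    """Group requests into batches.

    requests: (arrival_time_us, batch_size) pairs sorted by arrival time.
    Returns a list of batches, each a list of request indices.
    """
    batches: List[List[int]] = []
    current: List[int] = []
    current_size = 0
    window_start = 0
    for idx, (arrival, size) in enumerate(requests):
        too_big = current_size + size > max_batch_size
        too_late = arrival - window_start > max_queue_delay_us
        if current and (too_big or too_late):
            batches.append(current)
            current, current_size = [], 0
        if not current:
            window_start = arrival  # the first request opens a new window
        current.append(idx)
        current_size += size
    if current:
        batches.append(current)
    return batches


# Six small requests: (arrival time in microseconds, batch size).
reqs = [(0, 1), (50, 2), (80, 1), (400, 4), (420, 4), (430, 1)]
print(form_batches(reqs, max_batch_size=8, max_queue_delay_us=100))
```

With these inputs the first three requests share a batch, the next two fill a second batch of size 8, and the last request is left for a third.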

6.3 Sequence Batcher

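A scheduler dedicated to stateful models; it ensures that inference requests belonging to the same sequence are routed to the same model instance.

6.4 Ensemble Scheduler

Combines different modules into a pipeline.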

Optimization Policy

ONNX model optimization: TRT backend for ONNX. TensorFlow model optimization: TF-TRT.

Model Warmup

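Specifies the parameters for model warmup.

Without warmup, initialization may be deferred until the first few inference requests arrive.
Triton reports the Ready state only after warmup has completed.
Model loading takes longer; the client sees the server as ready only after warmup.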

YunchaoYang commented 5 months ago

Start the Triton container

Run the Triton server

tritonserver --model-repository=<MODEL_REPO> \
             --log-verbose <integer> \
             --strict-model-config <boolean> \
             --strict-readiness <boolean> \
             --exit-on-error <boolean> \
             --http-port <integer> \
             --grpc-port <integer> \
             --metrics-port <integer> \
             --model-control-mode <string>

Check whether the server is ready

curl -v <IP>:8000/v2/health/ready
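
The same readiness check can be scripted, for example to wait for the server before sending requests. The sketch below uses only the Python standard library; the address `http://localhost:8000` is an assumption standing in for the `<IP>` and HTTP port of your server, and `is_server_ready` is a hypothetical helper name.

```python
import http.client
import urllib.error
import urllib.request


def is_server_ready(base_url: str, timeout: float = 2.0) -> bool:
    """Return True if GET <base_url>/v2/health/ready answers with HTTP 200."""
    url = base_url.rstrip("/") + "/v2/health/ready"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError, http.client.HTTPException):
        # Connection refused, timeout, or a non-2xx status: treat as not ready.
        return False


# Assumed address; replace with the <IP> and --http-port of your server.
print(is_server_ready("http://localhost:8000"))
```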

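Other startup options

--log-verbose: enables verbose logging; 0 disables it, >= 1 enables it.
--strict-model-config: whether a model configuration must be provided; true or false.
--strict-readiness: controls how the ready state is reported.
--exit-on-error: whether the server exits when some models fail to load, or starts anyway.
--http-port: the HTTP service port; default 8000.
--grpc-port: the gRPC service port; default 8001.
--metrics-port: the metrics reporting port; default 8002.
--model-control-mode: the model management mode. The default is none: all models in the repository are loaded at startup and cannot be unloaded or updated dynamically. With explicit, the server loads no models at startup, and models are loaded or unloaded through the API. With poll, models are updated dynamically: when a new version is added or a configuration is changed, the server reloads the model.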

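Other options

--repository-poll-secs: when the model control mode is poll, the interval at which the repository is checked for changes.
--load-model: when the model control mode is explicit, a model to load at startup.
--pinned-memory-pool-byte-size: the amount of pinned (page-locked) host memory Triton may use; for background on pinned memory see https://cloud.tencent.com/developer/article/2000487. Default 256 MB.
--cuda-memory-pool-byte-size <integer>:<integer>: the amount of CUDA memory Triton may use, given as GPU ID and size. Default 64 MB.
--backend-directory: the backend search path; use it to point at your own library when using a custom backend.
--repoagent-directory: the search path for repository agents, libraries that preprocess the model repository, for example to handle encryption when a model is loaded.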

curl -X POST http://localhost:8000/v2/repository/models/resnet50_pytorch/load
curl -X POST http://localhost:8000/v2/repository/models/resnet50_pytorch/unload
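
For scripting these load and unload calls, a minimal standard-library sketch is shown below. It only builds the request object so the URL and method can be inspected without a running server; the helper name `model_control_request` is hypothetical, and the base URL and model name are taken from the curl commands above.

```python
import urllib.request


def model_control_request(base_url: str, model_name: str,
                          action: str) -> urllib.request.Request:
    """Build the POST request that loads or unloads one model."""
    if action not in ("load", "unload"):
        raise ValueError("action must be 'load' or 'unload'")
    url = f"{base_url.rstrip('/')}/v2/repository/models/{model_name}/{action}"
    # An empty body with method="POST" mirrors `curl -X POST <url>`.
    return urllib.request.Request(url, data=b"", method="POST")


request = model_control_request("http://localhost:8000", "resnet50_pytorch", "load")
print(request.get_method(), request.full_url)

# To actually send it against a running server:
# urllib.request.urlopen(request, timeout=5)
```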

YunchaoYang commented 5 months ago

Configure an Ensemble Model (Pipeline)

Ensemble notes

The output names of each model must match the input names of the model that consumes them next in the sequence.
If a composing model is stateful, the inference request must contain the information that stateful models require.
Each model composing the ensemble has its own scheduler.
If the models in the ensemble are all framework backends, data transmission between them does not have to go through CPU memory.
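
The name-matching note can be checked mechanically before deploying a pipeline. The sketch below is a toy validator, not a Triton tool: it assumes the steps are listed in execution order and represents each step as a plain dictionary whose `input_map` and `output_map` map a model's tensor names to ensemble tensor names. The `preprocess` model and all tensor names are hypothetical.

```python
from typing import Dict, List


def unresolved_inputs(ensemble_inputs: List[str], steps: List[Dict]) -> List[str]:
    """Return "model:tensor" entries whose source tensor is never produced."""
    available = set(ensemble_inputs)
    missing: List[str] = []
    for step in steps:
        for tensor in step["input_map"].values():
            if tensor not in available:
                missing.append(f'{step["model_name"]}:{tensor}')
        # Outputs of this step become available to the steps that follow.
        available.update(step["output_map"].values())
    return missing


steps = [
    {"model_name": "preprocess",
     "input_map": {"RAW": "IMAGE"},
     "output_map": {"OUT": "preprocessed"}},
    {"model_name": "resnet50_pytorch",
     "input_map": {"INPUT__0": "preprocessed"},
     "output_map": {"OUTPUT__0": "scores"}},
]
print(unresolved_inputs(["IMAGE"], steps))
```

An empty list means every step input is fed either by an ensemble input or by an earlier step; a misspelled tensor name shows up as an entry in the returned list.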