alibaba / rtp-llm

RTP-LLM: Alibaba's high-performance LLM inference engine for diverse applications.
Apache License 2.0
437 stars 38 forks source link
gpt inference llama llm llm-serving llmops model-serving

English 中文

News

About

Features

Production Proven

Applied in numerous LLM scenarios, such as:

High Performance

Flexibility and Ease of Use

Advanced Acceleration Techniques

How to Use

Requirements

cd ../

start http service

TOKENIZER_PATH=/path/to/tokenizer CHECKPOINT_PATH=/path/to/model MODEL_TYPE=your_model_type FT_SERVER_TEST=1 python3 -m maga_transformer.start_server

request to server

curl -XPOST http://localhost:8088 -d '{"prompt": "hello, what is your name", "generate_config": {"max_new_tokens": 1000}}'


2. whl
```bash
# Install rtp-llm
cd rtp-llm
# For cuda12 environment, please use requirements_torch_gpu_cuda12.txt
pip3 install -r ./open_source/deps/requirements_torch_gpu.txt
# Use the corresponding whl from the release version, here's an example for the cuda11 version 0.1.0, for the cuda12 whl package please check the release page.
pip3 install maga_transformer-0.1.9+cuda118-cp310-cp310-manylinux1_x86_64.whl
# start http service

cd ../
TOKENIZER_PATH=/path/to/tokenizer CHECKPOINT_PATH=/path/to/model MODEL_TYPE=your_model_type FT_SERVER_TEST=1 python3 -m maga_transformer.start_server
# request to server
curl -XPOST http://localhost:8088 -d '{"prompt": "hello, what is your name", "generate_config": {"max_new_tokens": 1000}}'

Docker Relelase Note

FAQ

  1. libcufft.so

    Error log: OSError: libcufft.so.11: cannot open shared object file: No such file or directory

    Resolution: Please check whether cuda and rtp-llm versions are matched

  2. libth_transformer.so

    Error log: OSError: /rtp-llm/maga_transformer/libs/libth_transformer.so: cannot open shared object file: No such file or directory

    Resolution: If installed via whl or docker(which means not a bazel build), please check your current directory is not rtp-llm, or python will use relative path package instead of installed whl

  3. Bazel build time out

    Error log: ERROR: no such package '@pip_gpu_cuda12_torch//': rules_python_external failed: (Timed out)

    Resolution:

    1. change pip mirror repository in open_source/deps/pip.bzl, add extra_pip_args=["--index_url=xxx"]
    2. pip install requirements manually, especially for pytorch, for that bazel build has a 600s timeout by default, which may not be enough for pytorch downloading

Documentation

Acknowledgments

Our project is mainly based on FasterTransformer, and on this basis, we have integrated some kernel implementations from TensorRT-LLM. FasterTransformer and TensorRT-LLM have provided us with reliable performance guarantees. Flash-Attention2 and cutlass have also provided a lot of help in our continuous performance optimization process. Our continuous batching and increment decoding draw on the implementation of vllm; sampling draws on transformers, with speculative sampling integrating Medusa's implementation, and the multimodal part integrating implementations from llava and qwen-vl. We thank these projects for their inspiration and help.

External Application Scenarios (Continuously Updated)

LLM + Multimodal

Contact Us

DingTalk Group

WeChat Group