TorchServe is a flexible and easy-to-use tool for serving and scaling PyTorch models in production.
Requires Python >= 3.8.
Example inference request against a running TorchServe instance:

```bash
curl http://127.0.0.1:8080/predictions/bert -T input.txt
```
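This assumes a TorchServe instance is already running with a model registered under the name `bert`. A minimal sketch of the steps that would precede the request; the weights file `model.pt` and the built-in `text_classifier` handler are illustrative placeholders:

```bash
# Package the model into a .mar archive in ./model_store
torch-model-archiver --model-name bert --version 1.0 \
    --serialized-file model.pt --handler text_classifier \
    --export-path model_store

# Start TorchServe and register the archive under the name "bert"
torchserve --start --model-store model_store --models bert=bert.mar
```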
Install with pip:

```bash
# Install dependencies (CUDA is optional)
python ./ts_scripts/install_dependencies.py --cuda=cu121

# Latest release
pip install torchserve torch-model-archiver torch-workflow-archiver

# Nightly build
pip install torchserve-nightly torch-model-archiver-nightly torch-workflow-archiver-nightly
```
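To confirm the install, the `torchserve` CLI reports its version (assuming the entry points landed on your PATH):

```bash
torchserve --version
```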
Install with conda:

```bash
# Install dependencies (CUDA is optional)
python ./ts_scripts/install_dependencies.py --cuda=cu121

# Latest release
conda install -c pytorch torchserve torch-model-archiver torch-workflow-archiver

# Nightly build
conda install -c pytorch-nightly torchserve torch-model-archiver torch-workflow-archiver
```
Install with Docker:

```bash
# Latest release
docker pull pytorch/torchserve

# Nightly build
docker pull pytorch/torchserve-nightly
```
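A minimal sketch of running the pulled image with the default inference (8080) and management (8081) ports mapped; see the Docker documentation referenced below for GPU flags and volume mounts:

```bash
docker run --rm -it -p 8080:8080 -p 8081:8081 pytorch/torchserve:latest

# In another shell: TorchServe's health-check endpoint
curl http://localhost:8080/ping
```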
Refer to the TorchServe Docker documentation for details.
TorchServe supports serving models optimized with `torch.compile`. For more examples, see the repository's examples directory.
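As an illustration, here is a minimal, self-contained sketch of the standard `torch.compile` API; the resnet18 model is only a placeholder, and in TorchServe the compilation would typically happen inside a model handler:

```python
import torch
import torchvision.models as models

# Placeholder model; any nn.Module works here
model = models.resnet18(weights=None).eval()

# Standard torch.compile call; backend and mode are regular
# torch.compile arguments
compiled = torch.compile(model, backend="inductor", mode="reduce-overhead")

# The first call triggers compilation; later calls reuse the compiled graph
with torch.inference_mode():
    out = compiled(torch.randn(1, 3, 224, 224))
print(out.shape)  # torch.Size([1, 1000])
```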
We welcome all contributions!
To learn more about how to contribute, see the contributor guide in this repository.
This repository is jointly operated and maintained by Amazon, Meta, and a number of individual contributors listed in the CONTRIBUTORS file. For questions directed at Meta, please send an email to opensource@fb.com. For questions directed at Amazon, please send an email to torchserve@amazon.com. For all other questions, please open an issue in this repository.
TorchServe acknowledges the Multi Model Server (MMS) project, from which it was derived.