Fast-LLM is a cutting-edge open-source library for training large language models with exceptional speed, scalability, and flexibility. Built on PyTorch and Triton, Fast-LLM empowers AI teams to push the limits of generative AI, from research to production.
Optimized for training models of all sizes—from small 1B-parameter models to massive 70B+-parameter models trained across large clusters—Fast-LLM delivers faster training, lower costs, and seamless scalability. Its fine-tuned kernels, advanced parallelism techniques, and efficient memory management make it the go-to choice for diverse training needs.
As a truly open-source project, Fast-LLM allows full customization and extension without proprietary restrictions. Developed transparently by a community of professionals on GitHub, the library benefits from collaborative innovation, with every change discussed and reviewed in the open to ensure trust and quality. Fast-LLM combines professional-grade tools with unified support for GPT-like architectures, offering the cost efficiency and flexibility that serious AI practitioners demand.
> [!NOTE]
> Fast-LLM is not affiliated with Fast.AI, FastHTML, FastAPI, FastText, or other similarly named projects. Our library's name refers to its speed and efficiency in language model training.
- 🚀 **Fast-LLM is Blazingly Fast**
- 📈 **Fast-LLM is Highly Scalable**
- 🎨 **Fast-LLM is Incredibly Flexible**
- 🎯 **Fast-LLM is Super Easy to Use**
- 🌐 **Fast-LLM is Truly Open Source**
We'll walk you through how to use Fast-LLM to train a large language model on a cluster with multiple nodes and GPUs, with example setups for both a Slurm cluster and a Kubernetes cluster.
For this demo, we will train a Mistral-7B model from scratch for 100 steps on random data. The config file `examples/mistral-4-node-benchmark.yaml` is pre-configured for a multi-node setup with 4 DGX nodes, each with 8 A100-80GB or H100-80GB GPUs.
> [!NOTE]
> Fast-LLM scales from a single GPU to large clusters. You can start small and expand based on your resources.
Expect a significant speedup in training time compared to other libraries! For Mistral-7B, Fast-LLM is expected to achieve a throughput of 9,800 tokens/s/H100 (batch size 32, sequence length 8k) on a 4-node cluster with 32 H100 GPUs.
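As a back-of-envelope check, the claimed numbers translate into the following aggregate figures (this is just arithmetic on the stated throughput, not a new measurement):

```shell
# Tokens processed per optimization step: batch size 32, sequence length 8k (8,192 tokens).
tokens_per_step=$((32 * 8192))          # 262,144 tokens per step
# Aggregate cluster throughput at 9,800 tokens/s per H100 across 32 GPUs.
cluster_tokens_per_s=$((9800 * 32))     # 313,600 tokens/s
# Approximate wall-clock seconds for the 100-step demo run.
demo_seconds=$((100 * tokens_per_step / cluster_tokens_per_s))
echo "$tokens_per_step tokens/step, $cluster_tokens_per_s tokens/s, ~$demo_seconds s for 100 steps"
```

So the 100-step demo should complete in under two minutes of pure training time on the 4-node reference cluster.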
Deploy the `nvcr.io/nvidia/pytorch:24.07-py3` Docker image to all nodes (recommended), since it contains all the necessary dependencies.
Install Fast-LLM on all nodes:

```bash
sbatch <<EOF
#!/bin/bash
#SBATCH --nodes=$(scontrol show node | grep -c NodeName)
#SBATCH --ntasks-per-node=1
#SBATCH --ntasks=$(scontrol show node | grep -c NodeName)
#SBATCH --exclusive

srun bash -c 'pip install --no-cache-dir -e "git+https://github.com/ServiceNow/Fast-LLM.git#egg=llm[CORE,OPTIONAL,DEV]"'
EOF
```
Use the example Slurm job script `examples/fast-llm.sbat` to submit the job to the cluster:

```bash
sbatch examples/fast-llm.sbat
```
Monitor the job's progress:

- Check `job_output.log` and `job_error.log` in your working directory for logs.
- Run `squeue -u $USER` to see the job status.

Now, you can sit back and relax while Fast-LLM trains your model at full speed! ☕
Create a Kubernetes PersistentVolumeClaim (PVC) named `fast-llm-home` that will be mounted to `/home/fast-llm` in the container, using `examples/fast-llm-pvc.yaml`:

```bash
kubectl apply -f examples/fast-llm-pvc.yaml
```
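For reference, a minimal PVC manifest of the kind `examples/fast-llm-pvc.yaml` provides might look like this sketch; the access mode and storage size here are illustrative assumptions, not the repository's actual values:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: fast-llm-home          # must match the claim name referenced by the training job
spec:
  accessModes:
    - ReadWriteMany            # assumption: volume is shared across pods on multiple nodes
  resources:
    requests:
      storage: 100Gi           # illustrative size; adjust to your checkpoints and data
```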
Create a PyTorchJob resource using the example configuration file `examples/fast-llm.pytorchjob.yaml`:

```bash
kubectl apply -f examples/fast-llm.pytorchjob.yaml
```
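For orientation, the skeleton of a PyTorchJob manifest (from the Kubeflow training operator) looks roughly like the following; the replica counts, image, GPU counts, and volume names here are illustrative assumptions, not the contents of `examples/fast-llm.pytorchjob.yaml`:

```yaml
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: fast-llm                                 # yields pod names like fast-llm-master-0
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      template:
        spec:
          containers:
            - name: pytorch                      # container name used by `kubectl logs -c pytorch`
              image: nvcr.io/nvidia/pytorch:24.07-py3
              resources:
                limits:
                  nvidia.com/gpu: 8              # assumption: 8 GPUs per DGX node
              volumeMounts:
                - name: home
                  mountPath: /home/fast-llm      # PVC mount point from the previous step
          volumes:
            - name: home
              persistentVolumeClaim:
                claimName: fast-llm-home
    Worker:
      replicas: 3                                # assumption: 3 workers + 1 master = 4 nodes
      template:
        spec:
          containers:
            - name: pytorch
              image: nvcr.io/nvidia/pytorch:24.07-py3
              resources:
                limits:
                  nvidia.com/gpu: 8
```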
Monitor the job status:

- Run `kubectl get pytorchjobs` to see the job status.
- Run `kubectl logs -f fast-llm-master-0 -c pytorch` to follow the logs.

That's it! You're now up and running with Fast-LLM on Kubernetes. 🚀
📖 Want to learn more? Check out our documentation for details on how to use Fast-LLM.
🔨 We welcome contributions to Fast-LLM! Have a look at our contribution guidelines.
🐞 Something doesn't work? Open an issue!
Fast-LLM is licensed by ServiceNow, Inc. under the Apache 2.0 License. See LICENSE for more information.
For security issues, email disclosure@servicenow.com. See our security policy.