Fast-LLM is a cutting-edge open-source library for training large language models with exceptional speed, scalability, and flexibility. Built on PyTorch and Triton, Fast-LLM empowers AI teams to push the limits of generative AI, from research to production.
Optimized for training models of all sizes—from small 1B-parameter models to massive 70B+-parameter models trained across large clusters—Fast-LLM delivers faster training, lower costs, and seamless scalability. Its fine-tuned kernels, advanced parallelism techniques, and efficient memory management make it the go-to choice for diverse training needs.
As a truly open-source project, Fast-LLM allows full customization and extension without proprietary restrictions. Developed transparently by a community of professionals on GitHub, the library benefits from collaborative innovation, with every change discussed and reviewed in the open to ensure trust and quality. Fast-LLM combines professional-grade tools with unified support for GPT-like architectures, offering the cost efficiency and flexibility that serious AI practitioners demand.
> [!NOTE]
> Fast-LLM is not affiliated with Fast.AI, FastHTML, FastAPI, FastText, or other similarly named projects. Our library's name refers to its speed and efficiency in language model training.
- 🚀 **Fast-LLM is Blazingly Fast**
- 📈 **Fast-LLM is Highly Scalable**
- 🎨 **Fast-LLM is Incredibly Flexible**
- 🎯 **Fast-LLM is Super Easy to Use**
- 🌐 **Fast-LLM is Truly Open Source**
We'll walk you through how to use Fast-LLM to train a large language model on a cluster with multiple nodes and GPUs, with example setups for both a Slurm cluster and a Kubernetes cluster.
For this demo, we will train a Mistral-7B model from scratch for 100 steps on random data. The config file `examples/mistral-4-node-benchmark.yaml` is pre-configured for a multi-node setup with 4 DGX nodes, each with 8 A100-80GB or H100-80GB GPUs.
> [!NOTE]
> Fast-LLM scales from a single GPU to large clusters. You can start small and expand based on your resources.
Expect a significant speedup in training time compared to other libraries! For Mistral-7B, Fast-LLM is expected to achieve a throughput of 9,800 tokens/s/H100 (batch size 32, sequence length 8k) on a 4-node cluster with 32 H100 GPUs.
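As a back-of-envelope check, the claimed numbers translate into the following aggregate figures (this is just arithmetic on the stated throughput, not a new measurement):

```shell
# Tokens processed per optimization step: batch size 32, sequence length 8k (8,192 tokens).
tokens_per_step=$((32 * 8192))          # 262,144 tokens per step
# Aggregate cluster throughput at 9,800 tokens/s per H100 across 32 GPUs.
cluster_tokens_per_s=$((9800 * 32))     # 313,600 tokens/s
# Approximate wall-clock seconds for the 100-step demo run.
demo_seconds=$((100 * tokens_per_step / cluster_tokens_per_s))
echo "$tokens_per_step tokens/step, $cluster_tokens_per_s tokens/s, ~$demo_seconds s for 100 steps"
```

So the 100-step demo should complete in under two minutes of pure training time on the 4-node reference cluster.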
Deploy the `nvcr.io/nvidia/pytorch:24.07-py3` Docker image to all nodes (recommended), since it contains all the necessary dependencies.
Install Fast-LLM on all nodes:

```bash
sbatch <<EOF
#!/bin/bash
#SBATCH --nodes=$(scontrol show node | grep -c NodeName)
#SBATCH --ntasks-per-node=1
#SBATCH --ntasks=$(scontrol show node | grep -c NodeName)
#SBATCH --exclusive

srun bash -c 'pip install --no-cache-dir -e "git+https://github.com/ServiceNow/Fast-LLM.git#egg=llm[CORE,OPTIONAL,DEV]"'
EOF
```
Use the example Slurm job script `examples/fast-llm.sbat` to submit the job to the cluster:

```bash
sbatch examples/fast-llm.sbat
```
Monitor the job's progress:

- Check `job_output.log` and `job_error.log` in your working directory for logs.
- Run `squeue -u $USER` to see the job status.

Now, you can sit back and relax while Fast-LLM trains your model at full speed! ☕
Create a Kubernetes PersistentVolumeClaim (PVC) named `fast-llm-home` that will be mounted to `/home/fast-llm` in the container, using `examples/fast-llm-pvc.yaml`:

```bash
kubectl apply -f examples/fast-llm-pvc.yaml
```
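For reference, a minimal PVC manifest of the kind `examples/fast-llm-pvc.yaml` provides might look like this sketch; the access mode and storage size here are illustrative assumptions, not the repository's actual values:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: fast-llm-home          # must match the claim name referenced by the training job
spec:
  accessModes:
    - ReadWriteMany            # assumption: volume is shared across pods on multiple nodes
  resources:
    requests:
      storage: 100Gi           # illustrative size; adjust to your checkpoints and data
```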
Create a PyTorchJob resource using the example configuration file `examples/fast-llm.pytorchjob.yaml`:

```bash
kubectl apply -f examples/fast-llm.pytorchjob.yaml
```
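For orientation, the skeleton of a PyTorchJob manifest (from the Kubeflow training operator) looks roughly like the following; the replica counts, image, GPU counts, and volume names here are illustrative assumptions, not the contents of `examples/fast-llm.pytorchjob.yaml`:

```yaml
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: fast-llm                                 # yields pod names like fast-llm-master-0
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      template:
        spec:
          containers:
            - name: pytorch                      # container name used by `kubectl logs -c pytorch`
              image: nvcr.io/nvidia/pytorch:24.07-py3
              resources:
                limits:
                  nvidia.com/gpu: 8              # assumption: 8 GPUs per DGX node
              volumeMounts:
                - name: home
                  mountPath: /home/fast-llm      # PVC mount point from the previous step
          volumes:
            - name: home
              persistentVolumeClaim:
                claimName: fast-llm-home
    Worker:
      replicas: 3                                # assumption: 3 workers + 1 master = 4 nodes
      template:
        spec:
          containers:
            - name: pytorch
              image: nvcr.io/nvidia/pytorch:24.07-py3
              resources:
                limits:
                  nvidia.com/gpu: 8
```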
Monitor the job status:

- Run `kubectl get pytorchjobs` to see the job status.
- Run `kubectl logs -f fast-llm-master-0 -c pytorch` to follow the logs.

That's it! You're now up and running with Fast-LLM on Kubernetes. 🚀
📖 Want to learn more? Check out our documentation for details on how to use Fast-LLM.
🔨 We welcome contributions to Fast-LLM! Have a look at our contribution guidelines.
🐞 Something doesn't work? Open an issue!
Fast-LLM is licensed by ServiceNow, Inc. under the Apache 2.0 License. See LICENSE for more information.
For security issues, email disclosure@servicenow.com. See our security policy.