FMS Acceleration 🚀

FMS Acceleration is designed to accelerate the fine-tuning and training of large models. This framework comprises a collection of libraries intended to be used with the fms-hf-tuning suite.

The fms-acceleration framework includes accelerators for Full and Parameter Efficient Fine Tuning (PEFT), including:

Low Rank Adaptation (LoRA) acceleration (coming soon)
Bits-and-Bytes (BNB) quantised LoRA: QLoRA acceleration
AutoGPTQ quantised LoRA: GPTQ-LoRA acceleration
Full Fine Tuning acceleration (coming soon)

Our tests show a significant increase in training token throughput using this fms-acceleration framework.

For example:

QLoRA: 22-43% token throughput increase on 1 GPU compared to Hugging Face BNB QLoRA
QLoRA: Straightforward integration with multiple GPUs compared to Hugging Face BNB QLoRA
GPTQ-LoRA: 22-44% token throughput increase on 1 GPU compared to Hugging Face BNB QLoRA
GPTQ-LoRA: Straightforward integration with multiple GPUs compared to Hugging Face BNB QLoRA

The figures above include numbers obtained using fused-op-and-kernels; the actual implementation is coming soon (see below).
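To make the percentages above concrete, here is a minimal sketch of how a relative token throughput increase can be computed from two measured rates. The function name and the numbers in the usage line are illustrative assumptions, not values from the benchmarks.

```python
def throughput_increase_pct(baseline_tokens_per_sec: float,
                            accelerated_tokens_per_sec: float) -> float:
    """Relative throughput gain of the accelerated run over the baseline, in percent."""
    if baseline_tokens_per_sec <= 0:
        raise ValueError("baseline throughput must be positive")
    gain = accelerated_tokens_per_sec - baseline_tokens_per_sec
    return gain / baseline_tokens_per_sec * 100.0


# Hypothetical measurements (tokens per second), for illustration only.
increase = throughput_increase_pct(1000.0, 1220.0)
print(f"{increase:.1f}% increase")
```

A run that processes 1220 tokens per second against a baseline of 1000 would therefore sit at the lower end of the reported range.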

This package is in BETA and is under development. Expect breaking changes!

Plugins

| Plugin | Description | Depends | License | Status |
|---|---|---|---|---|
| framework | This acceleration framework for integration with huggingface trainers | | | Beta |
| accelerated-peft | For PEFT-training, e.g., 4bit QLoRA. | Huggingface, AutoGPTQ | Apache 2.0, MIT | Beta |
| fused-op-and-kernels | Fused LoRA and triton kernels (e.g., fast cross-entropy, rms, rope) | -- | Apache 2.0 (contains extracted code) | Beta |
| MOE-training-acceleration | MegaBlocks inspired triton kernels and accelerations for Mixture-of-Expert models | | Apache 2.0 | Coming Soon |

Usage with FMS HF Tuning

Below we demonstrate how to accelerate your tuning experience with tuning/sft_trainer.py from fms-hf-tuning.

Note: New exciting plugins will be added over time, so please check here for the latest accelerations!

Integration with FMS HF Tuning

fms-acceleration is part of fms-hf-tuning, and instructions to utilize fms-acceleration for tuning are found here. In particular, fms-acceleration plugins can be accessed via command line arguments to fms-hf-tuning (e.g., --auto_gptq triton_v2); this is made available via integrated configuration dataclasses that configure the AccelerationFramework for the user.

Need for an alternative way to access features pre-integration

As new plugins become available, more command line arguments will be made available to fms-hf-tuning to enable them. However, this kind of integration takes time; plugins that are in development / research stages may not be immediately integrated.

Therefore, an intermediary step is required to access plugins in fms-acceleration before they are integrated into fms-hf-tuning. In fact, such a method is critical for benchmarking and testing, which need to happen before integration of any plugin into fms-hf-tuning can even be considered. Hence, we provide a method to configure the acceleration framework via a configuration YAML that is passed into AccelerationFramework via an environment variable; the instructions for this are provided below. Furthermore, experienced users can also leverage this to test plugins early, but be warned that the learning curve for these plugins is steep (since it requires knowing how to write such a configuration). To help with this, the instructions below describe both a basic and an advanced flow.

FMS Acceleration Via Configuration YAML

Note: As mentioned above, the recommended approach for fms-hf-tuning is to use the acceleration config dataclasses. The configuration YAML method documented below is only for testing/research purposes and is not recommended for production. For general use, please refer instead to the instructions here.

Below we illustrate the configuration YAML flow for the accelerated quantised PEFT use case: GPTQ-LoRA tuning with the AutoGPTQ triton_v2 kernel. This state-of-the-art kernel was provided by jeromeku in March 2024:

The configuration YAML flow has both a basic and an advanced usage.

Usage Flows

Basic Configuration YAML Flow 🤡

Most users of fms-hf-tuning only require the basic flow:

Assumption 1: the user has an already prepared configuration, say from sample-configurations.
Assumption 2: the user knows exactly which acceleration plugins are required (based on the configuration).
Assumption 3: the arguments for running sft_trainer.py are the same, save for one extra argument --acceleration_framework_config_file used to pass in the acceleration config.

In this case, the basic flow comprises 3 steps:

First go to fms-hf-tuning and install the framework library:
```
$ pip install -e .[fms-accel]
```
or alternatively install the framework directly:
```
$ pip install git+https://github.com/foundation-model-stack/fms-acceleration.git#subdirectory=plugins/framework
```
The above installs the command line utility fms_acceleration.cli, which is used to install plugins (and for other tasks such as viewing sample configurations).

Install the required framework plugins; we install the fms-acceleration-peft plugin for GPTQ-LoRA tuning with triton v2 as:

python -m fms_acceleration.cli install fms_acceleration_peft

The above is the equivalent of:

pip install git+https://github.com/foundation-model-stack/fms-acceleration.git#subdirectory=plugins/accelerated-peft

Run sft_trainer.py, providing the acceleration configuration via the environment variable ACCELERATION_FRAMEWORK_CONFIG_FILE together with the usual arguments. Given the basic flow assumption, we simply re-use the same sft_trainer.py arguments that we had without the fms_acceleration package:

# when using sample-configurations, arguments can be referred from
# defaults.yaml and scenarios.yaml
ACCELERATION_FRAMEWORK_CONFIG_FILE=framework.yaml \
python sft_trainer.py \
    ...  # arguments

The framework activates relevant plugins given the framework configuration; for more details see framework/README.md.
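As a rough illustration, the same launch could be scripted from Python. This is only a sketch: the helper name build_launch is hypothetical, and it assumes sft_trainer.py sits in the current working directory. Only the ACCELERATION_FRAMEWORK_CONFIG_FILE variable is taken from the instructions above.

```python
import os
import sys


def build_launch(config_file, trainer_args, base_env=None):
    """Return (command, environment) for running sft_trainer.py with acceleration.

    Hypothetical helper: it only prepares the command and environment, it does
    not start training.
    """
    # Copy the environment so the caller's os.environ is left untouched.
    env = dict(os.environ if base_env is None else base_env)
    # The framework reads the configuration YAML path from this variable.
    env["ACCELERATION_FRAMEWORK_CONFIG_FILE"] = config_file
    # Assumption: sft_trainer.py is reachable from the working directory.
    cmd = [sys.executable, "sft_trainer.py", *trainer_args]
    return cmd, env


cmd, env = build_launch("framework.yaml", ["--peft_method", "lora"])
print(" ".join(cmd[1:]))
# To actually launch: subprocess.run(cmd, env=env, check=True)
```

Keeping the command construction separate from the launch makes it easy to inspect what will be run before committing GPU time.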

Set TRANSFORMERS_VERBOSITY=info to see the huggingface trainer printouts and verify that AccelerationFramework is activated!

# this printout will be seen in huggingface trainer logs if acceleration is activated
***** FMS AccelerationFramework *****
Active Plugin: AutoGPTQAccelerationPlugin. Python package: fms_acceleration_peft. Version: 0.0.1.
***** Running training *****
Num examples = 1,549
Num Epochs = 1
Instantaneous batch size per device = 4
Total train batch size (w. parallel, distributed & accumulation) = 4
Gradient Accumulation steps = 1
Total optimization steps = 200
Number of trainable parameters = 13,631,488
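When training logs are captured to a file, the check for the banner can be automated. The sketch below assumes the "Active Plugin" line keeps exactly the format shown in the printout above, which may differ between versions; the function name active_plugins is hypothetical.

```python
import re

# Pattern modelled on the sample printout; the exact log format is an assumption.
_PLUGIN_LINE = re.compile(
    r"Active Plugin:\s*(?P<plugin>\w+)\.\s*"
    r"Python package:\s*(?P<package>[\w.\-]+?)\.\s*"
    r"Version:\s*(?P<version>\S+?)\.?\s*$",
    re.MULTILINE,
)


def active_plugins(log_text):
    """Return (plugin, package, version) tuples found in trainer log text."""
    return [
        (m["plugin"], m["package"], m["version"])
        for m in _PLUGIN_LINE.finditer(log_text)
    ]


sample_log = (
    "***** FMS AccelerationFramework *****\n"
    "Active Plugin: AutoGPTQAccelerationPlugin. "
    "Python package: fms_acceleration_peft. Version: 0.0.1.\n"
    "***** Running training *****\n"
)
for plugin, package, version in active_plugins(sample_log):
    print(plugin, package, version)
```

An empty result suggests that acceleration was not activated, or that the log verbosity was too low for the banner to be printed.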

Advanced Configuration YAML Flow 🥷 🦹

The advanced flow makes further use of fms_acceleration.cli to:

list all available configs and acceleration plugins the configs depend on.
list all available plugins and check which are the installed ones.
identify critical sft_trainer arguments required for correct operation of a particular framework config.

The advanced flow comprises 5 steps:

Same as Step 1 of basic flow.

Use fms_acceleration.cli configs to search for sample configs:

$ python -m fms_acceleration.cli configs

1. accelerated-peft-autogptq (accelerated-peft-autogptq-sample-configuration.yaml) - plugins: ['accelerated-peft']
2. accelerated-peft-bnb (accelerated-peft-bnb-nf4-sample-configuration.yaml) - plugins: ['accelerated-peft']

This is equivalent to searching over the:

Full sample configuration list that shows plugins required for all available configs.
E.g., Accelerated GPTQ-LoRA configuration here.
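For scripting, the configs listing can be turned into a lookup table from configuration shortname to required plugins. This is a sketch that assumes the output keeps the line format shown above; the helper name parse_configs_listing is hypothetical and not part of fms_acceleration.cli.

```python
import ast
import re

# One listing entry: "<n>. <shortname> (<file>) - plugins: [<list>]"
_CONFIG_LINE = re.compile(
    r"^\s*\d+\.\s+(?P<shortname>\S+)\s+\((?P<filename>[^)]+)\)"
    r"\s+-\s+plugins:\s+(?P<plugins>\[.*\])\s*$"
)


def parse_configs_listing(text):
    """Map each configuration shortname to its file name and plugin list."""
    result = {}
    for line in text.splitlines():
        m = _CONFIG_LINE.match(line)
        if m:
            result[m["shortname"]] = {
                "file": m["filename"],
                # The plugin list is printed as a Python literal, e.g. ['accelerated-peft'].
                "plugins": ast.literal_eval(m["plugins"]),
            }
    return result


listing = """
1. accelerated-peft-autogptq (accelerated-peft-autogptq-sample-configuration.yaml) - plugins: ['accelerated-peft']
2. accelerated-peft-bnb (accelerated-peft-bnb-nf4-sample-configuration.yaml) - plugins: ['accelerated-peft']
"""
for shortname, info in parse_configs_listing(listing).items():
    print(shortname, info["plugins"])
```

The resulting plugin lists can then be compared against what is installed before a run is started.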

Install plugins as in Step 2 of the basic flow. In addition, the plugins command displays all available plugins; this list updates as more plugins are developed. Recall that configs lists the required plugins for the sample configurations; make sure all of them are installed.
```
$ python -m fms_acceleration.cli plugins

Choose from the list of plugin shortnames, and do:
* 'python -m fms_acceleration.cli install <pip-install-flags> PLUGIN_NAME'.

List of PLUGIN_NAME [PLUGIN_SHORTNAME]:

1. fms_acceleration_peft [peft]
```
After installation, the list will update to indicate the installed plugins.
Get the correct arguments for sft_trainer.py:
- arguments required for correct operation (e.g., if using accelerated peft, then peft_method is required).
  - Use arguments along with the sample configuration shortname to display the relevant critical arguments; these arguments can also be looked up manually in scenarios.yaml:
```
$ python -m fms_acceleration.cli arguments accelerated-peft-autogptq
```
  Searching for configuration shortnames: ['accelerated-peft-autogptq']
  1. scenario: accelerated-peft-gptq configs: accelerated-peft-autogptq arguments: --learning_rate 2e-4 \ --fp16 True \ --torch_dtype float16 \ --peft_method lora \ --r 16 \ --lora_alpha 16 \ --lora_dropout 0.0 \ --target_modules ['q_proj', 'k_proj', 'v_proj', 'o_proj']
- More info on defaults.yaml and scenarios.yaml is found here.
  - Arguments not critical to the plugins are found in defaults.yaml. These can be chosen freely.
  - Arguments critical to the plugins are found in scenarios.yaml. The relevant section of scenarios.yaml is the one whose framework_config entries match the shortname of the sample configuration of interest.
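The matching rule for scenarios.yaml can be sketched in a few lines. The in-memory layout used here (a list of dicts with name, framework_config and arguments keys) is an assumption made for illustration and may not match the real file; the values are copied from the arguments printout above.

```python
def matching_scenarios(scenarios, shortname):
    """Return the scenarios whose framework_config entries include the shortname."""
    return [s for s in scenarios if shortname in s.get("framework_config", [])]


# Hypothetical, already-loaded scenario entries (layout is an assumption).
scenarios = [
    {
        "name": "accelerated-peft-gptq",
        "framework_config": ["accelerated-peft-autogptq"],
        "arguments": {
            "learning_rate": "2e-4",
            "fp16": True,
            "torch_dtype": "float16",
            "peft_method": "lora",
            "r": 16,
            "lora_alpha": 16,
            "lora_dropout": 0.0,
            "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj"],
        },
    },
]

for scenario in matching_scenarios(scenarios, "accelerated-peft-autogptq"):
    print(scenario["name"], sorted(scenario["arguments"]))
```

Only the arguments of the matched scenarios need to be kept fixed; the rest can come from defaults.yaml.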

CUDA Dependencies

This repo requires CUDA to compute the kernels, and it is convenient to use NVIDIA PyTorch Containers that already come with CUDA installed. We have tested with the following versions:

pytorch:24.01-py3

Benchmarks

The benchmarks can be reproduced with the provided scripts.

includes baseline benches (e.g., standard fine-tuning, standard peft).
benches for various acceleration sample configs.

See the CSV files below for various results:

A100-80GB.

Code Architecture

For a deeper dive into the details, see framework/README.md.

Maintainers

IBM Research, Singapore

Fabian Lim flim@sg.ibm.com
Aaron Chew aaron.chew1@sg.ibm.com
Laura Wynter lwynter@sg.ibm.com
