foundation-model-stack / fms-acceleration

🚀 Collection of libraries used with fms-hf-tuning to accelerate fine-tuning and training of large models.
Apache License 2.0
0 stars 4 forks source link

FMS Acceleration 🚀

FMS Acceleration is designed to accelerate the fine-tuning and training of large models. This framework comprises a collection of libraries intended to be used with the fms-hf-tuning suite.

The fms-acceleration framework includes accelerators for Full and Parameter Efficient Fine Tuning (PEFT), including

Our tests show a significant increase in training token throughput using this fms-acceleration framework.

For example:

The above includes numbers using fusedOps-and-kernels and actual impl coming soon, see below.

This package is in BETA and is under development. Expect breaking changes!

Plugins

Plugin Description Depends License Status
framework This acceleration framework for integration with huggingface trainers Beta
accelerated-peft For PEFT-training, e.g., 4bit QLoRA. Huggingface
AutoGPTQ
Apache 2.0
MIT
Beta
fused-op-and-kernels Fused LoRA and triton kernels (e.g., fast cross-entropy, rms, rope) -- Apache 2.0 (contains extracted code) Beta
MOE-training-acceleration MegaBlocks inspired triton Kernels and acclerations for Mixture-of-Expert models Apache 2.0 Coming Soon

Usage with FMS HF Tuning

Below we demonstrate how to accelerate your tuning experience with tuning/sft_trainer.py from fms-hf-tuning.

Note: New exciting plugins will be added over time, so please check here for the latest accelerations!.

Integration with FMS HF Tuning

fms-acceleration is part of fms-hf-tuning, and instructions to utilize fms-acceleration for tuning are found here. In particular, fms-acceleration plugins can be accessed via command line arguments to fms-hf-tuning (e.g., --auto_gptq triton_v2); this is made available via integrated configuration dataclasses that configures the AccelerationFramework for the user.

Need for an alternative way to access features pre-integration

As new plugins become available, more command line arguments will be made avaiable to fms-hf-tuning to enable them. However, this kind of integration takes time; plugins that are in development / research stages may not be immediately integrated.

Therefore, an intermediary step is required to access plugins in fms-acceleration before they become integrated into fms-hf-tuning. In fact, such a method is critical for benchmarking / testing, that needs to happen before integration of any plugin in fms-hf-tuning can even be considered. Hence, we provide a method to configure the acceleration framework via a configuration YAML, that is passed into AccelerationFramework via an environment variable; the instructions for this is provided below. Futhermore, experienced users can also leverage this to early test plugins, but be warned that the learning curve to use these plugins is high (since it requires knowledge on how to write such a configuration). To aid on this, the following instructions are provide that describes both a basic and advanced flow.

FMS Acceleration Via Configuration YAML

Note: As mentioned above, the recommended approach for fms-hf-tuning is to use the acceleration config dataclasses. This method documented for the configuration YAML is only for testing/research purposes and not recommended for production. For general use, please refer instead to the instructions here.

Below we illustrate a configuration YAML flow using the accelerated quantised PEFT using GPTQ-LoRA tuning with the AutoGPTQ triton_v2 kernel use case; this kernel is state-of-the-art provided by jeromeku on Mar 2024:

There is both a basic and advanced usage for the configuration YAML flow.

Usage Flows

Basic Configuration YAML Flow 🤡

Most users of fms-hf-tuning only require the basic flow:

In this case then the basic flow comprises of 3 steps:

  1. First go to fms-hf-tuning and install the framework library:

    $ pip install -e .[fms-accel]

    or alternatively install the framework directly:

    $ pip install git+https://github.com/foundation-model-stack/fms-acceleration.git#subdirectory=plugins/framework

    The above installs the command line utility fms_acceleration.cli, which is used to install plugins (and also other things like view sample configurations).

  2. install the required framework plugins; we install the fms-acceleration-peft plugin for GPTQ-LoRA tuning with triton v2 as:

    python -m fms_acceleration.cli install fms_acceleration_peft

    The above is the equivalent of:

    pip install git+https://github.com/foundation-model-stack/fms-acceleration.git#subdirectory=plugins/accelerated-peft
  3. Run sft_trainer.py providing the acceleration configuration (via the environment variable ACCELERATION_FRAMEWORK_CONFIG_FILE and arguments; given the basic flow assumption that we simply re-use the same sft_trainer.py arguments as we had without using the fms_acceleration package:

    # when using sample-configurations, arguments can be referred from
    # defaults.yaml and scenarios.yaml
    ACCELERATION_FRAMEWORK_CONFIG_FILE=framework.yaml \
    python sft_trainer.py \
        ...  # arguments

    The framework activates relevant plugins given the framework configuration; for more details see framework/README.md.

    Activate TRANSFORMERS_VERBOSITY=info to see the huggingface trainer printouts and verify that AccelerationFramework is activated!

    # this printout will be seen in huggingface trainer logs if acceleration is activated
    ***** FMS AccelerationFramework *****
    Active Plugin: AutoGPTQAccelerationPlugin. Python package: fms_acceleration_peft. Version: 0.0.1.
    ***** Running training *****
    Num examples = 1,549
    Num Epochs = 1
    Instantaneous batch size per device = 4
    Total train batch size (w. parallel, distributed & accumulation) = 4
    Gradient Accumulation steps = 1
    Total optimization steps = 200
    Number of trainable parameters = 13,631,488

Advanced Configuration YAML Flow 🥷 🦹

The advanced flow makes further use of fms_acceleration.cli to:

The advanced flow comprises of 5 steps:

  1. Same as Step 1 of basic flow.
  2. Use fms_acceleration.cli configs to search for sample configs:

    $ python -m fms_acceleration.cli configs
    
    1. accelerated-peft-autogptq (accelerated-peft-autogptq-sample-configuration.yaml) - plugins: ['accelerated-peft']
    2. accelerated-peft-bnb (accelerated-peft-bnb-nf4-sample-configuration.yaml) - plugins: ['accelerated-peft']

    This is equivalent to the searching over the:

  3. install plugins same as Step 2 of basic flow, noting that in addition we can use plugins to display all available plugins; this list updates as more plugins get developed. Recall that configs list the required plugins for the sample configurations; make sure all of them are installed.

    $ python -m fms_acceleration.cli plugins
    
    Choose from the list of plugin shortnames, and do:
    * 'python -m fms_acceleration.cli install <pip-install-flags> PLUGIN_NAME'.
    
    List of PLUGIN_NAME [PLUGIN_SHORTNAME]:
    
    1. fms_acceleration_peft [peft]

    After install the list will update to indicate the installed plugins.

  4. Get the correct arguments for sft_trainer.py:

    • arguments required for correct operation (e.g., if using accelerated peft, then peft_method is required).

      • Use arguments along with the sample configuration shortname to display the relevant critical arguments; these arguments can be manually referred from scenarios.yaml:
        
        $ python -m fms_acceleration.cli arguments accelerated-peft-autogptq

      Searching for configuration shortnames: ['accelerated-peft-autogptq']

      1. scenario: accelerated-peft-gptq configs: accelerated-peft-autogptq arguments: --learning_rate 2e-4 \ --fp16 True \ --torch_dtype float16 \ --peft_method lora \ --r 16 \ --lora_alpha 16 \ --lora_dropout 0.0 \ --target_modules ['q_proj', 'k_proj', 'v_proj', 'o_proj']
    • More info on defaults.yaml and scenarios.yaml found here.

      • Arguments not critical to the plugins found in defaults.yaml. These can be taken with liberty.
      • Arguments critcal to plugins found in scenarios.yaml. The relevant section of scenarios.yaml, is the one whose framework_config entries, match the shortname of the sample configuration of interest.

CUDA Dependencies

This repo requires CUDA to compute the kernels, and it is convinient to use NVidia Pytorch Containers that already comets with CUDA installed. We have tested with the following versions:

Benchmarks

The benchmarks can be reproduced with the provided scripts.

See below CSV files for various results:

Code Architecture

For deeper dive into details see framework/README.md.

Maintainers

IBM Research, Singapore