Closed: achew010 closed this pull request 2 months ago
@anhuong thanks for the review. For making this the default, I drafted out various possibilities in this issue: https://github.com/foundation-model-stack/fms-hf-tuning/issues/334. We can discuss offline.
Also, we added new automation that ensures PRs follow conventional commits, which you can see is failing -- https://github.com/foundation-model-stack/fms-hf-tuning/actions/runs/10920573842/job/30310716778?pr=280 -- please address this.
Please update the branch with the new changes from main
and then, once the experimental fields are updated, this looks good to merge to me 👍
Note @kmehant: since you requested changes, an approval is needed from your side as well before this can merge.
Description of the change
This PR adds two dataclass arguments that enable padding free and multipack for `sft_trainer.py`, via the new fms-acceleration `attention-and-distributed-packing` plugin, and extends the current `--fast_kernels` dataclass to support optimized full finetuning:

- `--padding_free`: technique to process multiple examples in a single batch without adding padding tokens that waste compute.
- `--multipack`: technique for multi-GPU training that balances the number of tokens processed on each device, minimizing waiting time.
- `--fast_kernels`: previously limited to QPEFT (it used to raise an error if not activated with `--fast_lora`); now also allows optimized full/standard LoRA finetuning.

These are extremely effective methods for improving training throughput.
NOTE: adhering to the design of fms-acceleration, the new plugin is optional and installed separately.
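For a rough sense of how such dataclass arguments parse from the command line, here is a minimal sketch. These are not the actual fms-hf-tuning dataclasses, and the example values (`huggingface`, `16`) are assumptions for illustration only:

```python
# Minimal sketch, NOT the actual fms-hf-tuning dataclasses: shows how list-valued
# dataclass arguments in the style of --padding_free and --multipack can be parsed
# with HfArgumentParser. The example values ("huggingface", 16) are assumptions.
from dataclasses import dataclass, field
from typing import List, Optional

from transformers import HfArgumentParser


@dataclass
class AttentionAndDistributedPackingSketch:
    padding_free: Optional[List[str]] = field(default=None)  # e.g. --padding_free huggingface
    multipack: Optional[List[int]] = field(default=None)     # e.g. --multipack 16


parser = HfArgumentParser(AttentionAndDistributedPackingSketch)
(aadp,) = parser.parse_args_into_dataclasses(
    ["--padding_free", "huggingface", "--multipack", "16"]
)
print(aadp.padding_free, aadp.multipack)  # ['huggingface'] [16]
```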
Notes on Padding Free
- For transformers versions `<= 4.43`, padding free is not yet integrated from our PR into Hugging Face (https://github.com/huggingface/transformers/pull/31629), so the plugin provides it.
- For transformers versions `>= 4.44`, the upstream integration is available.
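To make the padding-free idea concrete, here is a conceptual sketch (not the plugin's implementation): examples are concatenated into one flat sequence, with per-example position ids and cumulative sequence boundaries, instead of padding every row to the longest example:

```python
# Conceptual sketch of padding-free batching (not the plugin's actual code).
# Instead of padding each example to the max length, examples are concatenated
# into a single flat sequence; position_ids restart at 0 for every example and
# cu_seqlens records the boundaries, so an attention kernel that understands
# boundaries will not attend across examples.
from typing import Dict, List


def pad_batch(examples: List[List[int]], pad_id: int = 0) -> List[List[int]]:
    """Conventional padding: every row is padded to the longest example."""
    max_len = max(len(e) for e in examples)
    return [e + [pad_id] * (max_len - len(e)) for e in examples]


def padding_free_batch(examples: List[List[int]]) -> Dict[str, List[int]]:
    """Padding-free: one flat row, with per-example position ids and boundaries."""
    input_ids, position_ids, cu_seqlens = [], [], [0]
    for example in examples:
        input_ids.extend(example)
        position_ids.extend(range(len(example)))          # positions restart per example
        cu_seqlens.append(cu_seqlens[-1] + len(example))  # cumulative boundaries
    return {"input_ids": input_ids, "position_ids": position_ids, "cu_seqlens": cu_seqlens}


examples = [[101, 7, 8, 102], [101, 9, 102], [101, 3, 4, 5, 6, 102]]
print(pad_batch(examples))           # compute is wasted on pad tokens
print(padding_free_batch(examples))  # no pad tokens at all
```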
Notes on Multipack
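For intuition, here is a conceptual sketch of the token-balancing idea behind multipack (not the plugin's actual algorithm): distribute sequences so that each device processes a similar number of tokens per step.

```python
# Conceptual sketch of multipack-style load balancing (not the plugin's actual
# algorithm). Sequences are assigned across devices so each device processes a
# similar token count per step, reducing time spent waiting on straggler ranks.
# For simplicity we track sequence lengths rather than dataset indices.
import heapq
from typing import List


def balance_across_devices(seq_lengths: List[int], num_devices: int) -> List[List[int]]:
    """Greedy longest-first assignment: give the next-longest sequence to the
    device with the fewest tokens so far."""
    heap = [(0, device) for device in range(num_devices)]  # (token_count, device)
    heapq.heapify(heap)
    assignment = [[] for _ in range(num_devices)]
    for length in sorted(seq_lengths, reverse=True):
        tokens, device = heapq.heappop(heap)
        assignment[device].append(length)
        heapq.heappush(heap, (tokens + length, device))
    return assignment


lengths = [512, 384, 1024, 128, 256, 768, 640, 96]
for device, seqs in enumerate(balance_across_devices(lengths, num_devices=2)):
    print(f"device {device}: {sum(seqs)} tokens -> {seqs}")
```

A greedy longest-first assignment like this keeps per-device token counts close, which is the property multipack targets in order to minimize waiting time.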
Notes on FastKernels
- `--fast_kernels True True True` on full finetuning/LoRA runs
- `--fast_kernels True True True --auto_gptq triton_v2 --fused_lora auto_gptq True` for GPTQ-LoRA
- `--fast_kernels True True True --bitsandbytes nf4 --fused_lora bitsandbytes True` for QLoRA
- There is a known issue involving `positional_ids`, but this issue will be addressed in the future.

Benchmarks
PaddingFree and Multipack Benchmarks for Mistral 7B
Notes:
- Per Device Batch Size 4
Full Finetuning Benchmarks for Mistral 7B
Early Version Of This Plugin
We have an unofficial version with more features than the present release, which @kmehant is currently using for ILAB work. In addition to padding-free and multipack, it also has the two additional plugins below:
To use the early version, a quick hack of `sft_trainer` with pretokenized data + a custom tokenizer is available here: https://github.com/fabianlim/fms-hf-tuning/tree/attn-plugin. This will be superseded by this PR in the near future. Use it with these command line arguments:
How to verify the PR
Additional checks/tests were added:

- Parsing of `--padding_free` and `multipack` is correct in `test_dataclass_parse_successfully`
- Illegal arguments for `--padding_free` are caught in `test_dataclass_will_fail_to_accept_illegal_args`
- `test_framework_initialize_and_trains_with_aadp`
- `--padding_free` must be used with flash-attn, otherwise an error is raised
- `--multipack` must be used with `--padding_free`, otherwise an error is raised
- `--packing True` with `--padding_free` will raise an error
- `--fast_kernels` works with full finetuning
- `--fast_lora` called without either `--auto_gptq` or `--bitsandbytes` will raise an error

Ran the full suite of acceleration checks to verify all fms-acceleration unit tests passed.
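The argument checks listed above can be illustrated with a small, self-contained pytest sketch; `validate_aadp_args` below is a hypothetical helper written only for this illustration and is not part of fms-hf-tuning:

```python
# Hypothetical sketch of the kind of argument validation the new tests exercise.
# validate_aadp_args is illustrative only; it is NOT the fms-hf-tuning API.
import pytest


def validate_aadp_args(padding_free: bool, multipack: bool, packing: bool, attn_implementation: str) -> None:
    if padding_free and attn_implementation != "flash_attention_2":
        raise ValueError("--padding_free must be used with flash-attn")
    if multipack and not padding_free:
        raise ValueError("--multipack must be used together with --padding_free")
    if packing and padding_free:
        raise ValueError("--packing True cannot be combined with --padding_free")


def test_padding_free_requires_flash_attn():
    with pytest.raises(ValueError):
        validate_aadp_args(padding_free=True, multipack=False, packing=False, attn_implementation="sdpa")


def test_multipack_requires_padding_free():
    with pytest.raises(ValueError):
        validate_aadp_args(padding_free=False, multipack=True, packing=False, attn_implementation="flash_attention_2")
```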
Was the PR tested