## Description

This PR:

- upgrades the framework to perform OR logic when activating plugins
- creates a `FastKernelsAccelerationPlugin` that is an improved version of `FastQuantizedPeftAccelerationPlugin`:
  - it can add kernels individually
  - it can be activated under a `training` stanza or a `peft.quantized` stanza (see the config sketch after this list)
- adds FOAK support to the Full-Finetuning and Standard PEFT benchmarks
- adds FOAK support for 1 additional model:
  - GPTBigCode. Note that due to GPTBigCode architecture limitations, only `FastCrossEntropyLoss` is supported in this PR; additional support will be tracked in [placeholder issue]
- fixes a bug in `ModelPatcher` to address multiple reloads of the same target path
  - this affected the proper patching of `FastCrossEntropyLoss`
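For illustration, here is a minimal sketch of the two activation paths, written as Python dicts that mirror the YAML stanza structure. The nested key names (`fused_ops_and_kernels`, `fast_loss`, `fast_rms_layernorm`, `fast_rope_embeddings`, `base_layer`, `fused_lora`) are assumptions for illustration and may not match the framework's sample configs exactly.

```python
# Illustrative only: hypothetical config dicts mirroring the stanza structure
# described above; the actual key names in the framework's sample
# configuration files may differ.

# Kernels activated individually under the `training` stanza
# (full-FT / standard PEFT path).
training_activation = {
    "training": {
        "fused_ops_and_kernels": {
            "fast_loss": True,             # FastCrossEntropyLoss
            "fast_rms_layernorm": True,    # FastRMSNorm
            "fast_rope_embeddings": True,  # FastRoPE
        }
    }
}

# The same plugin activated under the `peft.quantized` stanza (QPEFT path).
quantized_peft_activation = {
    "peft": {
        "quantized": {
            "fused_ops_and_kernels": {
                "base_layer": "auto_gptq",
                "fused_lora": True,
                "fast_loss": True,
            }
        }
    }
}
```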
## Improvements to Full Finetuning

7% improvement in full-FT throughput from the following kernels: FastCrossEntropyLoss, FastRMSNorm, FastRoPE (see the benchmark tables below).
| Framework | Model | num gpus | batch size | throughput (toks/s) | Improvement % |
|---|---|---|---|---|---|
| fullFT | Mistral7B | 1 | 4 | 2910 | base |
| foak-fullFT | Mistral7B | 1 | 4 | 3218 | 10.5 |
| PEFT | Mistral7B | 1 | 4 | 3345 | base |
| foak-PEFT | Mistral7B | 1 | 4 | 3797 | 13.5 |
| Framework | Model | num gpus | batch size | throughput (toks/s) | Improvement % |
|---|---|---|---|---|---|
| fullFT | Mistral7B | 2 | 4 | 2886 | base |
| foak-fullFT | Mistral7B | 2 | 4 | 3093 | 7 |
| PEFT | Mistral7B | 2 | 4 | 3227 | base |
| foak-PEFT | Mistral7B | 2 | 4 | 3620 | 12 |
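For reference, the Improvement % column is simply the relative throughput gain over the corresponding base run; a quick check against the 1-GPU numbers:

```python
# Relative throughput gain of a FOAK run over its corresponding base run.
def improvement_pct(base_toks_per_s: float, foak_toks_per_s: float) -> float:
    return 100 * (foak_toks_per_s - base_toks_per_s) / base_toks_per_s

print(round(improvement_pct(2910, 3218), 1))  # ~10.6, reported as 10.5 (foak-fullFT, 1 GPU)
print(round(improvement_pct(3345, 3797), 1))  # 13.5 (foak-PEFT, 1 GPU)
```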
## Compatibility Matrix with Mixed Precision

| torch_dtype | Mixed Precision | Full-FT-FOAK | PEFT-FOAK | QPEFT-FOAK |
|---|---|---|---|---|
| FLOAT16 | - | ✗ Not Allowed | ✗ | ✗ |
| FLOAT16 | FP16 | `ValueError: Attempting to unscale FP16 gradients.` See here | | |
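The `ValueError` in the last row comes from PyTorch's gradient scaler, which refuses to unscale gradients that are themselves FP16. A minimal standalone reproduction of that failure mode (plain PyTorch, independent of this framework):

```python
import torch

# Minimal reproduction: a model held in float16 combined with FP16 mixed
# precision. GradScaler refuses to unscale gradients that are themselves
# FP16 and raises "ValueError: Attempting to unscale FP16 gradients."
model = torch.nn.Linear(8, 8).cuda().half()           # torch_dtype=float16
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()                  # FP16 mixed precision

x = torch.randn(4, 8, device="cuda", dtype=torch.float16)
with torch.autocast("cuda", dtype=torch.float16):
    loss = model(x).float().sum()

scaler.scale(loss).backward()
scaler.step(optimizer)  # raises ValueError: Attempting to unscale FP16 gradients.
```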
## Regression Test for Loss, Memory, Throughput

We ran our alpaca benchmarks with most experiments in bfloat16 (except GPTQ-LoRA, which runs in float16; see issue). We see no significant regression in performance.

_Note: an outlier in the comparison plots shows an anomalous memory increase in a standard full-FT experiment on Mistral7B with no accelerations installed. Since it does not point to any issue with the code in this PR, it is likely caused by slight instability of the benchmarking run._
## Bug Fix to Model Patcher

There is no significant change in FOAK performance from the fix for the improper patching of FastCrossEntropyLoss; however, a slight decrease in the improvement is observed (consistent with issue 70) compared to the previous paddingfree+foak numbers.
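For context, the bug was that `ModelPatcher` could reload and re-patch the same target path more than once. The sketch below shows only the shape of the guard; the class and function names are hypothetical and do not quote the actual `ModelPatcher` code.

```python
# Hypothetical sketch of the fix's idea: remember which target paths have
# already been reloaded so a second rule targeting the same path does not
# trigger a second (improper) reload that clobbers earlier patches.
from typing import Callable, Set


class PatchTargetRegistry:
    def __init__(self) -> None:
        self._reloaded_paths: Set[str] = set()

    def reload_once(self, target_path: str, reload_fn: Callable[[str], None]) -> bool:
        """Reload `target_path` only if it has not been reloaded before."""
        if target_path in self._reloaded_paths:
            return False  # already reloaded; skip to keep existing patches intact
        reload_fn(target_path)
        self._reloaded_paths.add(target_path)
        return True
```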
### FLAN (6000 samples) with PaddingFree

Note: due to issues with FSDP-QLoRA in the latest transformers version (4.45.0dev) mentioned here, Granite with Fast Kernels will be addressed in a later PR instead.
## TODO

- Add the activation kernels (e.g. SwiGLU) to `FastKernelsAccelerationPlugin`, following the pattern of building the fused-lora rule for a `base_type` (see the sketch below).
- Add chunked loss (optional). If not done here, create an issue to track it.
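A rough sketch of what that pattern could look like; every name here (`register_activation_rule`, `build_swiglu_rule_gptq`, the rule dict fields) is hypothetical and only illustrates keying the activation rule off a `base_type`, analogous to how the fused-lora rule is built per base type.

```python
# Hypothetical sketch only: none of these names come from the framework.
# Like the fused-lora rule, the SwiGLU activation rule would be constructed
# per base_type (e.g. "auto_gptq", "bitsandbytes") and registered with the
# plugin's rule table.
from typing import Callable, Dict

ACTIVATION_RULE_BUILDERS: Dict[str, Callable[[], dict]] = {}


def register_activation_rule(base_type: str):
    """Register a rule builder for a given base_type, mirroring the fused-lora pattern."""
    def decorator(builder: Callable[[], dict]):
        ACTIVATION_RULE_BUILDERS[base_type] = builder
        return builder
    return decorator


@register_activation_rule("auto_gptq")
def build_swiglu_rule_gptq() -> dict:
    # In the real plugin this would return a model-patcher rule that swaps the
    # MLP activation for a fused SwiGLU kernel on GPTQ base layers.
    return {"rule_id": "swiglu-auto_gptq", "target": "mlp.act_fn"}
```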