The constant `lr_scheduler` is now a no-op to avoid cpu-offloading overhead; the root cause still requires more investigation. In early experiments we observed that calling `lr_scheduler.step()` while using cpu-offloading slows down training by a factor of ~1.5.
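A no-op scheduler can be as small as the sketch below (illustrative only, not necessarily the exact class used here); calling `step()` on it does nothing, so the cpu-offloading slowdown described above is avoided while the learning rate stays constant.

```python
import torch


class NoOpLRScheduler:
    """Placeholder for a constant learning rate: step() intentionally does nothing."""

    def __init__(self, optimizer: torch.optim.Optimizer):
        self.optimizer = optimizer

    def step(self) -> None:
        # No LR update and no optimizer-state access.
        pass

    def state_dict(self) -> dict:
        return {}

    def load_state_dict(self, state_dict: dict) -> None:
        pass
```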
Added weight decay (wd) to the AdamW optimizer.
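For reference, weight decay is enabled through the `weight_decay` argument of `torch.optim.AdamW`; the model and hyperparameter values below are placeholders, not the ones used in this branch.

```python
import torch
import torch.nn as nn

model = nn.Linear(16, 16)  # stand-in for the actual model
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)
```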
Refactored the auto-wrap policy into its own function, consistent with the other utilities.
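A sketch of what such a standalone wrap-policy helper usually looks like for a Llama-style model; the function name and layer class below are assumptions for illustration, not necessarily what this branch uses.

```python
import functools

from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy
from transformers.models.llama.modeling_llama import LlamaDecoderLayer


def get_auto_wrap_policy():
    # Wrap each decoder layer in its own FSDP unit.
    return functools.partial(
        transformer_auto_wrap_policy,
        transformer_layer_cls={LlamaDecoderLayer},
    )
```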
Added a custom QLoRA module (adapted from PEFT). In early benchmarking experiments it was observed to be more memory efficient than PEFT, but a longer training run may be required to verify there aren't any bugs.
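For orientation, a minimal sketch of the general shape of such a module, assuming a frozen bitsandbytes NF4 base layer plus trainable low-rank adapters; this is illustrative only and not the exact module added here.

```python
import math

import torch
import torch.nn as nn
import bitsandbytes as bnb


class QLoRALinear(nn.Module):
    """Frozen 4-bit base linear plus a trainable low-rank (LoRA) update."""

    def __init__(self, in_features, out_features, r=8, alpha=16, dropout=0.05):
        super().__init__()
        # Base weight is stored in 4-bit NF4 and kept frozen.
        self.base = bnb.nn.Linear4bit(
            in_features,
            out_features,
            bias=False,
            compute_dtype=torch.bfloat16,
            quant_type="nf4",
        )
        self.base.requires_grad_(False)
        # Low-rank adapters stay in higher precision and are the only trainable params.
        self.lora_a = nn.Linear(in_features, r, bias=False)
        self.lora_b = nn.Linear(r, out_features, bias=False)
        self.dropout = nn.Dropout(dropout)
        self.scaling = alpha / r
        nn.init.kaiming_uniform_(self.lora_a.weight, a=math.sqrt(5))
        nn.init.zeros_(self.lora_b.weight)

    def forward(self, x):
        return self.base(x) + self.lora_b(self.lora_a(self.dropout(x))) * self.scaling
```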
Set SDPA as the attention implementation. With the torch 2.2 release it supports the FlashAttention-2 implementation, and the best available kernel is picked by default. This might not be strictly necessary since HF already uses SDPA as the default for known models, but it is worth noting.
import torch

# Optionally use the context manager to ensure one of the fused kernels is run.
# Note: this will throw a "kernel not available" error because the flash kernel
# does not support an explicit attn_mask.
query = torch.rand(32, 8, 128, 64, dtype=torch.float16, device="cuda")
key = torch.rand(32, 8, 128, 64, dtype=torch.float16, device="cuda")
value = torch.rand(32, 8, 128, 64, dtype=torch.float16, device="cuda")
attn_mask = torch.ones(128, dtype=torch.bool, device="cuda")
with torch.backends.cuda.sdp_kernel(enable_flash=True,
                                    enable_math=False,
                                    enable_mem_efficient=False):
    torch.nn.functional.scaled_dot_product_attention(query, key, value, attn_mask=attn_mask)
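Relatedly, on the HF side the attention implementation can also be requested explicitly at load time; a sketch, assuming a recent transformers version, with a placeholder checkpoint name:

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",  # placeholder; use the actual checkpoint
    torch_dtype=torch.bfloat16,
    attn_implementation="sdpa",  # SDPA then dispatches to the best available kernel
)
```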
Turned gradient clipping off by default; similar to `lr_scheduler.step()`, it introduces a significant slowdown (3x-4x) when used with cpu-offloading. llama-recipes uses a scheduler that is only stepped at the end of each epoch; for fancier schedulers such as a cosine annealer that step with every parameter update, this needs more investigation. A sketch of the per-epoch stepping pattern is shown below.
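A minimal, self-contained sketch of that stepping pattern with a toy model and data; `clip_grad_norm` and `max_grad_norm` are illustrative names, not this branch's actual config keys.

```python
import torch
import torch.nn as nn

# Toy setup just to make the loop runnable; the real run uses the FSDP-wrapped
# model, the actual optimizer/scheduler, and the real dataloader.
model = nn.Linear(16, 16)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.9)
dataloader = [torch.randn(4, 16) for _ in range(8)]
clip_grad_norm, max_grad_norm = False, 1.0  # clipping off by default

for epoch in range(2):
    for x in dataloader:
        loss = model(x).pow(2).mean()
        loss.backward()
        if clip_grad_norm:  # avoids the 3x-4x cpu-offloading slowdown when left off
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
        optimizer.step()
        optimizer.zero_grad()
    # llama-recipes style: step the scheduler once per epoch, not per parameter update.
    lr_scheduler.step()
```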
This branch was used to run QLoRA benchmarking experiments across different GPU setups and configs. It also includes a few LoRA experiments.