Lightning-AI / litgpt

Pretrain, finetune, deploy 20+ LLMs on your own data. Uses state-of-the-art techniques: flash attention, FSDP, 4-bit, LoRA, and more.
https://lightning.ai
Apache License 2.0

Address frozen parameter warning with FSDP on nightly torch #1392

Open · opened by carmocca 3 weeks ago

carmocca commented 3 weeks ago

PEFT finetuning (LoRA, adapter) raises the following warning for each FSDP-wrapped layer (transformer block in our case):

```
The following parameters have requires_grad=True:
['transformer.h.0.attn.attn.lora_A', 'transformer.h.0.attn.attn.lora_B']
The following parameters have requires_grad=False:
['transformer.h.0.norm_1.weight', 'transformer.h.0.norm_1.bias', 'transformer.h.0.norm_2.weight', 'transformer.h.0.norm_2.bias', 'transformer.h.0.attn.attn.linear.weight', 'transformer.h.0.attn.attn.linear.bias', 'transformer.h.0.attn.proj.linear.weight', 'transformer.h.0.attn.proj.linear.bias', 'transformer.h.0.mlp.fc.linear.weight', 'transformer.h.0.mlp.fc.linear.bias', 'transformer.h.0.mlp.proj.linear.weight', 'transformer.h.0.mlp.proj.linear.bias']
  warnings.warn(msg)
/home/carlos/nightly-env/lib/python3.10/site-packages/torch/distributed/fsdp/_wrap_utils.py:174: UserWarning: transformer.h.1 has both parameters with requires_grad=True and False. We do not recommend wrapping such modules since the gradient memory usage will be higher than expected (201510912 numel instead of 131072 numel before sharding via reduce-scatter). If possible, wrap the frozen parameters with FSDP separately.
```

This should either be looked at, or silenced if we don't want to act on it.
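For the "silence it" option, a standard warnings filter keyed on the message text would be enough. A minimal sketch (not existing litgpt code; the regex simply matches the warning quoted above):

```python
import warnings

# Sketch only: suppress the nightly-torch FSDP warning about modules that mix
# trainable and frozen parameters. The pattern matches the message quoted above.
warnings.filterwarnings(
    "ignore",
    message=r".*has both parameters with requires_grad=True and False",
    category=UserWarning,
)
```

Note that this only hides the warning; the extra gradient memory it describes would still be allocated, so actually acting on it would mean wrapping the frozen parameters separately as the warning suggests.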

RuABraun commented 2 weeks ago

Is changing the code so that the LoRA parameters live in a separate module an option? I don't see how else the LoRA parameters could be wrapped into a separate FSDP unit. I might be able to help.
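Roughly sketched below is what that restructuring could look like (this is not litgpt's actual LoRALinear; the LoRAAdapter name and shapes are made up for illustration). Moving lora_A/lora_B into a child module gives an FSDP auto-wrap policy something it can target on its own, so the trainable parameters end up in their own flat parameter instead of sharing one with the frozen base weight:

```python
import torch
import torch.nn as nn


class LoRAAdapter(nn.Module):
    """Hypothetical submodule holding only the trainable LoRA matrices."""

    def __init__(self, in_features: int, out_features: int, r: int):
        super().__init__()
        self.lora_A = nn.Parameter(torch.randn(r, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, r))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x @ self.lora_A.T @ self.lora_B.T


class LoRALinear(nn.Module):
    """Sketch: frozen base linear plus a separately-wrappable LoRA submodule."""

    def __init__(self, in_features: int, out_features: int, r: int):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features)
        self.linear.requires_grad_(False)  # frozen pretrained weight
        self.adapter = LoRAAdapter(in_features, out_features, r)  # trainable

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.linear(x) + self.adapter(x)
```

With a structure like this, a lambda_auto_wrap_policy from torch.distributed.fsdp.wrap could match the adapter modules (and the frozen remainder) separately, which is essentially what the warning's "wrap the frozen parameters with FSDP separately" suggestion amounts to.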