Lightning-AI / litgpt

20+ high-performance LLMs with recipes to pretrain, finetune and deploy at scale.
https://lightning.ai
Apache License 2.0

Address frozen parameter warning with FSDP on nightly torch #1392

Open carmocca opened 7 months ago

carmocca commented 7 months ago

PEFT finetuning (LoRA, adapter) raises the following warning for each FSDP-wrapped layer (transformer block in our case):

```
The following parameters have requires_grad=True:
['transformer.h.0.attn.attn.lora_A', 'transformer.h.0.attn.attn.lora_B']
The following parameters have requires_grad=False:
['transformer.h.0.norm_1.weight', 'transformer.h.0.norm_1.bias', 'transformer.h.0.norm_2.weight', 'transformer.h.0.norm_2.bias', 'transformer.h.0.attn.attn.linear.weight', 'transformer.h.0.attn.attn.linear.bias', 'transformer.h.0.attn.proj.linear.weight', 'transformer.h.0.attn.proj.linear.bias', 'transformer.h.0.mlp.fc.linear.weight', 'transformer.h.0.mlp.fc.linear.bias', 'transformer.h.0.mlp.proj.linear.weight', 'transformer.h.0.mlp.proj.linear.bias']
  warnings.warn(msg)
/home/carlos/nightly-env/lib/python3.10/site-packages/torch/distributed/fsdp/_wrap_utils.py:174: UserWarning: transformer.h.1 has both parameters with requires_grad=True and False. We do not recommend wrapping such modules since the gradient memory usage will be higher than expected (201510912 numel instead of 131072 numel before sharding via reduce-scatter). If possible, wrap the frozen parameters with FSDP separately.
```

This should either be addressed or silenced if we don't plan to act on it.
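For reference, a minimal sketch of the silencing option, assuming we only want to filter on the warning text quoted above (the exact message comes from torch's `_wrap_utils.py` and may change across nightlies):

```python
import warnings

# Sketch: suppress torch's mixed-requires_grad FSDP warning if we decide not to act on it.
# The message regex is an assumption based on the warning text quoted above.
warnings.filterwarnings(
    "ignore",
    message=r".*has both parameters with requires_grad=True and False.*",
    category=UserWarning,
)
```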

RuABraun commented 6 months ago

Is changing the code so the LoRA parameters live in a separate module an option? I don't see how else you could wrap the LoRA parameters in a separate FSDP unit. I might be able to help.
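One possible shape for that change, purely as an illustration (the class and attribute names below are hypothetical, not litgpt's current LoRA implementation): move `lora_A`/`lora_B` into their own `nn.Module` so an FSDP auto-wrap policy such as `ModuleWrapPolicy` can target them separately from the frozen base weights.

```python
import math

import torch
import torch.nn as nn
from torch.distributed.fsdp.wrap import ModuleWrapPolicy


class LoRAAdapter(nn.Module):
    """Hypothetical container holding only the trainable LoRA matrices."""

    def __init__(self, in_features: int, out_features: int, r: int) -> None:
        super().__init__()
        self.lora_A = nn.Parameter(torch.empty(r, in_features))
        self.lora_B = nn.Parameter(torch.zeros(out_features, r))
        nn.init.kaiming_uniform_(self.lora_A, a=math.sqrt(5))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # scaling (alpha / r) omitted for brevity
        return x @ self.lora_A.T @ self.lora_B.T


class LoRALinear(nn.Module):
    """Frozen base linear plus the trainable adapter as a separate submodule."""

    def __init__(self, in_features: int, out_features: int, r: int) -> None:
        super().__init__()
        self.linear = nn.Linear(in_features, out_features)
        for p in self.linear.parameters():
            p.requires_grad_(False)
        self.adapter = LoRAAdapter(in_features, out_features, r)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.linear(x) + self.adapter(x)


# With that layout, the adapters can become their own FSDP units, so a wrapped
# transformer block no longer mixes frozen and trainable parameters. In practice
# the policy would also include litgpt's transformer Block class.
policy = ModuleWrapPolicy({LoRAAdapter})
```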

MaxGonzalezSaez-Diez commented 4 months ago

Still occurring.