cccntu / minLoRA

minLoRA: a minimal PyTorch library that allows you to apply LoRA to any PyTorch model.

Freeze manually #2

Open · G-JWLee opened this issue 1 year ago

G-JWLee commented 1 year ago

Hi, thank you for your great work.

I want to use yours for my experiment.

I see that get_lora_params() passes only the LoRA parameters to the optimizer, but if the rest of the model still has requires_grad=True, wouldn't gradients still be computed for the whole model?

Would manually freezing the model be enough to use minLoRA without get_lora_params()?

Also, when merging a LoRA into the model in order to add another LoRA module, do I have to set requires_grad=False on lora_A and lora_B before merging?

Thank you.

cccntu commented 1 year ago

Hi, thanks!

I see that get_lora_params() passes only the LoRA parameters to the optimizer, but if the rest of the model still has requires_grad=True, wouldn't gradients still be computed for the whole model? Would manually freezing the model be enough to use minLoRA without get_lora_params()?

Probably yes, but you need to make sure you don't accidentally freeze the lora parameters.
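
For example, a minimal sketch (assuming the add_lora helper from the README, and that the injected LoRA factors have "lora_A" / "lora_B" in their parameter names):

```python
import torch
import torch.nn as nn
from minlora import add_lora  # assumes minLoRA is installed

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
add_lora(model)  # inject LoRA parametrizations into the Linear layers

# Freeze everything, then keep only the LoRA factors trainable.
# (Assumption: the injected parameters contain "lora_A" / "lora_B" in their names.)
for name, param in model.named_parameters():
    param.requires_grad = ("lora_A" in name) or ("lora_B" in name)

# Only the still-trainable (LoRA) parameters are handed to the optimizer.
optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=1e-3
)
```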

Also, when merging a LoRA into the model in order to add another LoRA module, do I have to set requires_grad=False on lora_A and lora_B before merging?

Probably not. After merging, lora_A and lora_B will no longer exist.
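
Roughly, the flow looks like this sketch (assuming the merge_lora / add_lora helpers from the README):

```python
from minlora import add_lora, merge_lora

merge_lora(model)  # folds the low-rank update into the base weights; lora_A / lora_B are gone
add_lora(model)    # attach a fresh LoRA module on top of the merged weights
```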

G-JWLee commented 1 year ago

Thank you for your kind reply.

However, in the example at https://github.com/cccntu/LoRAnanoGPT/blob/master/train.py, line 236, DDP is used without the 'find_unused_parameters=True' argument. When I run my own experiment in a different setting with DDP, the backbone model has requires_grad=False, and I get an error because the backbone parameters are not used in the gradient computation unless 'find_unused_parameters=True' is specified. Is there something I missed? I believe this API should work with DDP.

Thank you!

cccntu commented 1 year ago

Honestly I don't know. Can you solve it by simply adding 'find_unused_parameters=True'?

I've only used it on one GPU.

Or does using get_lora_params() solve this issue?
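
I mean something like this sketch (local_rank here is a hypothetical variable set by your launch script):

```python
from torch.nn.parallel import DistributedDataParallel as DDP

# Tell DDP not to expect gradients from parameters that don't take part
# in the backward pass (e.g. the frozen backbone).
model = DDP(
    model,
    device_ids=[local_rank],        # hypothetical: set by torchrun / your launcher
    find_unused_parameters=True,
)
```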

justindachille commented 1 year ago

It looks like this method is correct in the sense that it only updates the parameters you pass to the optimizer, but PyTorch will still compute gradients for all weights, since requires_grad is still True, according to this thread:

https://discuss.pytorch.org/t/passing-a-subset-of-the-parameters-to-an-optimizer-equivalent-to-setting-requires-grad-of-subset-only-to-true/42866/2
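
For reference, a small plain-PyTorch check of that behavior (no minLoRA-specific API involved):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(8, 8), nn.Linear(8, 2))

# Pass only the second layer's parameters to the optimizer.
optimizer = torch.optim.SGD(model[1].parameters(), lr=0.1)

loss = model(torch.randn(4, 8)).sum()
loss.backward()
print(model[0].weight.grad is not None)  # True: gradients are still computed for layer 0

# To actually skip that computation, freeze the unused parameters explicitly.
for p in model[0].parameters():
    p.requires_grad_(False)
model.zero_grad(set_to_none=True)

loss = model(torch.randn(4, 8)).sum()
loss.backward()
print(model[0].weight.grad is None)      # True: no gradient is accumulated for frozen weights
```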