Guitaricet / relora

Official code for ReLoRA from the paper Stack More Layers Differently: High-Rank Training Through Low-Rank Updates
https://arxiv.org/abs/2307.05695
Apache License 2.0

Inf checks warning on optimizer #10

Closed ElleLeonne closed 1 year ago

ElleLeonne commented 1 year ago

I'm attempting to integrate this into a larger modular environment (PyTorch Lightning).

When I attempt to reset the optimizer states like you have here:

    import torch

    def reset_optimizer(optimizer):
        # Replace Adam's first and second moment estimates with fresh zero tensors.
        for group in optimizer.param_groups:
            for p in group["params"]:
                param_state = optimizer.state[p]
                param_state["exp_avg"] = torch.zeros_like(p.data)
                param_state["exp_avg_sq"] = torch.zeros_like(p.data)

Upon running the next loop, I get the error:

File "/home/user/anaconda3/envs/science/lib/python3.11/site-packages/torch/cuda/amp/grad_scaler.py", line 372, in step
    assert len(optimizer_state["found_inf_per_device"]) > 0, "No inf checks were recorded for this optimizer."

I'm sure you must've encountered this at some point yourself, so I'm curious how you managed to avoid this. Previously I had managed by just re-initializing the optimizer through accelerate, and I expected not using accelerate would fix the issue too, but it appears not to.

Guitaricet commented 1 year ago

Hi! Unfortunately (or fortunately, lol), I didn't hit this exact issue, probably because I was working with bf16 instead of fp16. A possible workaround (no guarantees) is to multiply the values by zero, or use some other in-place operation, instead of creating a new tensor. You can check out this functional interface to ReLoRA for inspiration.

https://github.com/Guitaricet/gpt-neox/blob/relora/megatron/relora/optim.py

(The functional interface is still a work in progress and is not yet well tested.)

Or you can check out the dev branch of this repository (this is the code I'm actively working with) https://github.com/Guitaricet/relora/blob/15cf7b1f7e883727f1ed226dc035858accbcfd10/peft_pretraining/training_utils.py#L161
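For reference, the in-place idea mentioned above could look roughly like the sketch below. It is not taken from the linked code: the helper name `reset_optimizer_state_inplace` is hypothetical, it assumes an Adam-style optimizer whose per-parameter state uses the `exp_avg` and `exp_avg_sq` keys, and it uses `zero_()` rather than a multiplication by zero (multiplying would leave any NaN or inf entries as NaN). Whether this avoids the GradScaler assertion in a given setup is untested here.

```python
import torch


def reset_optimizer_state_inplace(optimizer: torch.optim.Optimizer) -> None:
    """Zero Adam's moment estimates in place, keeping the existing tensors.

    Hypothetical helper for illustration; assumes Adam-style state keys.
    """
    for group in optimizer.param_groups:
        for p in group["params"]:
            # .get() avoids creating empty state entries for parameters
            # that the optimizer has not stepped yet.
            state = optimizer.state.get(p)
            if not state:
                continue
            for key in ("exp_avg", "exp_avg_sq"):
                if key in state:
                    # In-place: same tensor object, dtype, and device.
                    state[key].zero_()


# Usage on a toy model: take one step so the state exists, then reset it.
model = torch.nn.Linear(4, 2)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
model(torch.randn(8, 4)).sum().backward()
optimizer.step()

reset_optimizer_state_inplace(optimizer)
```

Unlike assigning `torch.zeros_like(p.data)`, this leaves the `step` counter and the identity of the state tensors untouched, so anything else holding a reference to them still sees the same objects.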

ElleLeonne commented 1 year ago

Neato, I'll take a look. Closing for now, thanks