facebookresearch / optimizers

For optimization algorithm research and development.

Failed to compute eigendecomposition #13

Open aykamko opened 1 year ago

aykamko commented 1 year ago

We're seeing this error message about 5 minutes into training.

WARNING:distributed_shampoo.utils.matrix_functions:Failed to compute eigendecomposition in torch.float32 precision with exception linalg.eigh: The algorithm failed to converge because the input matrix is ill-conditioned or has too many repeated eigenvalues (error code: 1).! Retrying in double precision...

Any ideas how we can fix this / avoid this?

hjmshi commented 12 months ago

Hi @aykamko, thanks for your question! This is a warning related to the eigendecomposition solver failing with lower precision. Are you seeing a subsequent error after this?

This could be due to multiple reasons:

  1. If you're setting the learning rate too large, this can cause inf or nan values to get inserted into the preconditioner matrix. In this case, we would not expect the double-precision retry that follows this warning to resolve the issue. You'll likely begin to see nan values in the loss after the matrix root inverse is computed.
  2. Alternatively, we have found that the eigendecomposition solver can be unstable, especially for some low-rank matrices. One approach to avoid this is to set a larger start_preconditioning_step, which will ensure that the matrix is more well-behaved prior to applying the eigh solver.
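The retry behaviour behind the warning, and the difference between the two causes above, can be sketched roughly as follows. This is a minimal illustration and not the library's actual code: `has_non_finite`, `eigh_with_fallback`, `solve`, and the precision labels are hypothetical names, and `solve` stands in for an eigensolver call (for example `torch.linalg.eigh` on the matrix cast to the given precision) that raises `RuntimeError` when it fails to converge.

```python
import logging
import math

logger = logging.getLogger(__name__)


def has_non_finite(matrix):
    """Return True if any entry of a nested-list matrix is inf or nan."""
    return any(not math.isfinite(x) for row in matrix for x in row)


def eigh_with_fallback(solve, matrix, precisions=("float32", "float64")):
    """Call `solve(matrix, precision)` at each precision until one succeeds.

    `solve` is a hypothetical stand-in for an eigensolver; it is assumed to
    raise RuntimeError when the decomposition does not converge.
    """
    if not precisions:
        raise ValueError("at least one precision is required")

    # Cause 1: inf/nan entries are not repaired by a higher precision,
    # so report them immediately instead of retrying.
    if has_non_finite(matrix):
        raise ValueError("matrix contains inf or nan; check the learning rate")

    last_error = None
    for precision in precisions:
        try:
            return solve(matrix, precision)
        except RuntimeError as exc:
            # Cause 2: the solver was unstable at this precision; try the next.
            logger.warning("eigh failed in %s (%s); retrying", precision, exc)
            last_error = exc
    raise last_error
```

Under these assumptions, a single warning followed by normal training suggests the second cause, while non-finite entries point to the first and are worth checking before any retry.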

It would be helpful if you could provide your current Shampoo configuration as well as the optimizer configuration you were previously using for your model. We can also help with setting the appropriate hyperparameters for your case.

aykamko commented 11 months ago

Thanks for the response!

Previous config used AdamW:

lr = 1e-4
betas = (0.9, 0.999)
eps = 1e-8
weight_decay = 1e-2

Current Shampoo config:

        lr: 1e-4
        betas: [0.9, 0.999]
        epsilon: 1e-12
        weight_decay: 1e-02
        max_preconditioner_dim: 8192
        precondition_frequency: 100
        use_decoupled_weight_decay: True
        grafting_type: 4  # GraftingType.ADAM
        grafting_epsilon: 1e-08
        grafting_beta2: 0.999

In the meantime, I'll try to set a larger start_preconditioning_step.
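As a rough sketch of that change, the settings posted above can be written as a plain dict with `start_preconditioning_step` added. The helper `with_later_preconditioning` is a hypothetical name, and the value 1000 is an arbitrary illustration, not a value recommended anywhere in this thread. How the dict is handed to the optimizer depends on the training setup and is not shown.

```python
# Shampoo settings as posted above, expressed as a plain dict.
shampoo_config = {
    "lr": 1e-4,
    "betas": (0.9, 0.999),
    "epsilon": 1e-12,
    "weight_decay": 1e-2,
    "max_preconditioner_dim": 8192,
    "precondition_frequency": 100,
    "use_decoupled_weight_decay": True,
    "grafting_type": 4,  # GraftingType.ADAM, as noted in the posted config
    "grafting_epsilon": 1e-8,
    "grafting_beta2": 0.999,
}


def with_later_preconditioning(config, start_step):
    """Return a copy of `config` with `start_preconditioning_step` set."""
    if start_step < 0:
        raise ValueError("start_step must be non-negative")
    updated = dict(config)  # leave the original settings untouched
    updated["start_preconditioning_step"] = start_step
    return updated


# 1000 is an assumed, illustrative value only.
delayed_config = with_later_preconditioning(shampoo_config, 1000)
```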

I also saw this warning in your README:

Note: We have observed known instabilities with the torch.linalg.eigh operator on CUDA 11.6-12.1, specifically for low-rank matrices, which may appear with using a small start_preconditioning_step. Please avoid these versions of CUDA if possible. See: https://github.com/pytorch/pytorch/issues/94772.

We have CUDA 12.2 driver installed, but our PyTorch is built for 12.1 (downloaded from pip). Could that be the issue?
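One way to check a setup against the range quoted from the README is to compare the CUDA version the PyTorch build targets, which `torch.version.cuda` reports as a string such as "12.1" and which is separate from the installed driver version. The sketch below is pure Python; `parse_cuda_version` and `in_affected_range` are hypothetical helper names, and the bounds are taken directly from the README note. Falling inside the range does not by itself establish that this is the cause of the warning.

```python
def parse_cuda_version(version):
    """Parse a string such as "12.1" into a (major, minor) tuple of ints."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def in_affected_range(version, low=(11, 6), high=(12, 1)):
    """Return True if `version` lies in the README's CUDA 11.6-12.1 range."""
    return low <= parse_cuda_version(version) <= high


# A PyTorch wheel built for CUDA 12.1 falls inside the range,
# even when the installed driver is 12.2.
build_is_affected = in_affected_range("12.1")
```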

hjmshi commented 11 months ago

@aykamko, the settings look right here. Let's see what happens with a larger preconditioning step. 😊

vishaal27 commented 1 week ago

Hi @aykamko, did increasing the start_preconditioning_step work for you?

tsunghsienlee commented 4 days ago

Hi @aykamko, did increasing the start_preconditioning_step work for you?

Hi @vishaal27, did you encounter an issue similar to @aykamko's in your usage?