aykamko opened this issue 1 year ago
Hi @aykamko, thanks for your question! This is a warning related to the eigendecomposition solver failing with lower precision. Are you seeing a subsequent error after this?
This could be due to multiple reasons:

1. Your gradients may contain inf or nan values, which would cause inf or nan values to get inserted into the preconditioner matrix. In this case, we would expect increasing the precision (which is done after this warning) not to resolve the issue - you'll likely begin to see nan values in the loss after the matrix root inverse is computed.
2. The preconditioner matrices may be ill-conditioned early in training. In this case, we suggest setting a larger start_preconditioning_step, which will ensure that the matrix is more well-behaved prior to applying the eigh solver.

It would be helpful if you could provide your current configuration of Shampoo as well as the optimizer configuration you were previously using for your model. We can also help with setting the appropriate hyperparameters for your case.
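To rule out the first cause, one quick check is to verify that all gradients are finite before each optimizer step. This is only a minimal sketch using standard PyTorch calls; the model and optimizer names are placeholders for your own training loop:

```python
import torch

def has_nonfinite_grads(model: torch.nn.Module) -> bool:
    """Return True if any parameter gradient contains inf or nan values."""
    for name, param in model.named_parameters():
        if param.grad is not None and not torch.isfinite(param.grad).all():
            print(f"Non-finite gradient detected in {name}")
            return True
    return False

# Inside the training loop (model / optimizer are placeholders):
# loss.backward()
# if has_nonfinite_grads(model):
#     optimizer.zero_grad()  # skip this step instead of polluting the preconditioner
# else:
#     optimizer.step()
```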
Thanks for the response!
Previous config used AdamW:
lr = 1e-4
betas = (0.9, 0.999)
eps = 1e-8
weight_decay = 1e-2
Current Shampoo config:
lr: 1e-4
betas: [0.9, 0.999]
epsilon: 1e-12
weight_decay: 1e-02
max_preconditioner_dim: 8192
precondition_frequency: 100
use_decoupled_weight_decay: True
grafting_type: 4 # GraftingType.ADAM
grafting_epsilon: 1e-08
grafting_beta2: 0.999
In the meantime, I'll try to set a larger start_preconditioning_step.
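For reference, a rough sketch of what that would look like in code, reusing the hyperparameter names from the config above plus start_preconditioning_step. The import path, exact constructor signature, and the value 1000 are assumptions and may differ across versions of the repo:

```python
import torch

# Import path and constructor signature may differ by version of the Shampoo repo.
from distributed_shampoo import DistributedShampoo, GraftingType

model = torch.nn.Linear(1024, 1024)  # placeholder model

optimizer = DistributedShampoo(
    model.parameters(),
    lr=1e-4,
    betas=(0.9, 0.999),
    epsilon=1e-12,
    weight_decay=1e-2,
    max_preconditioner_dim=8192,
    precondition_frequency=100,
    start_preconditioning_step=1000,  # example value: delay preconditioning past early, ill-conditioned steps
    use_decoupled_weight_decay=True,
    grafting_type=GraftingType.ADAM,  # value 4 in the config above
    grafting_epsilon=1e-8,
    grafting_beta2=0.999,
)
```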
I also saw this warning in your README:
Note: We have observed known instabilities with the torch.linalg.eigh operator on CUDA 11.6-12.1, specifically for low-rank matrices, which may appear with using a small start_preconditioning_step. Please avoid these versions of CUDA if possible. See: https://github.com/pytorch/pytorch/issues/94772.
We have CUDA 12.2 driver installed, but our PyTorch is built for 12.1 (downloaded from pip). Could that be the issue?
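A quick way to confirm which CUDA toolkit the installed PyTorch wheel was actually built against (separate from the driver version reported by nvidia-smi) is:

```python
import torch

print(torch.__version__)          # build string, e.g. ending in +cu121 for a CUDA 12.1 wheel
print(torch.version.cuda)         # CUDA toolkit version the wheel was compiled against
print(torch.cuda.is_available())  # whether the runtime sees a GPU through the installed driver
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))
```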
@aykamko, the settings look right here. Let's see what happens with a larger preconditioning step. 😊
Hi @aykamko, did increasing the start_preconditioning_step work for you?
Hi @vishaal27, did you encounter a similar issue to @aykamko's in your usage?
We're seeing this error message about 5 minutes into training.
Any ideas on how we can fix or avoid this?