This is expected behavior; typically, once d has reached a good value, d_hat starts to decrease and goes negative afterwards. In your logs, d_hat must have been positive at some point, since d has grown to 0.0038. Is the optimizer working for you when you run the full training job?
We have also developed methods that don't use a max operation, such as Mechanic: https://github.com/optimizedlearning/mechanic.
Thanks for your answer!
In my case, the problem is that d_hat becomes negative at step=20 (total=5000 steps for one epoch), as shown in my log (log_every=10). So it effectively sets an initial learning rate of dlr: 0.0038455569066480266 at the beginning, which never changes afterwards. So far, I haven't obtained better results than Adam. I guess I need to tune other hyper-parameters or change the LR scheduler? Or maybe the task is not a good fit.
My task is recommendation model training. My baseline uses Adam with lr=0.001 and a ReduceLROnPlateau scheduler.
Thanks for your suggestion. I will try Mechanic later.
It's normal for d to stop changing close to the beginning of optimization. Generally, with a good choice of scheduler, I would expect it to match the performance of Adam with a hand-tuned LR, but not necessarily exceed it. I would recommend a linear decay scheduler or cosine annealing; they both work pretty well.
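A minimal sketch of what that setup could look like, assuming a placeholder `model` and `num_epochs` (D-Adaptation is normally run with lr=1.0 so that d determines the actual step size, and a standard PyTorch scheduler is layered on top):

```python
import torch
from dadaptation import DAdaptAdam

model = torch.nn.Linear(16, 1)   # placeholder model for illustration
num_epochs = 50                  # placeholder epoch count

# lr=1.0 leaves the step size entirely to the adapted d
optimizer = DAdaptAdam(model.parameters(), lr=1.0)

# Cosine annealing (or a linear decay schedule) applied on top of D-Adaptation
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=num_epochs)

for epoch in range(num_epochs):
    # ... run the usual forward/backward passes and optimizer.step() here ...
    scheduler.step()
```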
In the code, d takes the max of the following terms. But in my running log, I found that d_hat is negative and becomes smaller and smaller. Thus, d never changes, since min(d_hat, d*growth_rate) is negative. Is this normal?
https://github.com/facebookresearch/dadaptation/blob/main/dadaptation/dadapt_adam.py#L217C13-L217C50
d = max(d, min(d_hat, d*growth_rate))
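For illustration, here is a toy sketch (the growth_rate and d_hat values are hypothetical) of why d stays frozen once d_hat goes negative:

```python
# With a positive d and a negative d_hat, min(d_hat, d * growth_rate) is
# negative, so max(d, ...) simply returns the old d unchanged.
d = 0.0038455569066480266   # value reached early in training
growth_rate = 1.02          # hypothetical growth factor
d_hat = -1.5e-4             # negative estimate, as seen in the logs

new_d = max(d, min(d_hat, d * growth_rate))
print(new_d == d)           # True: d is frozen while d_hat stays negative
```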
Logs: (log_every=10)