facebookresearch / dadaptation

D-Adaptation for SGD, Adam and AdaGrad

Running issues of d_hat < 0 #42

Closed zhujiem closed 6 months ago

zhujiem commented 6 months ago

In the code, d takes the max of the following terms. But in my running log, I found that d_hat is negative and keeps getting smaller. Thus, d never changes, since min(d_hat, d*growth_rate) is negative. Is this normal?

https://github.com/facebookresearch/dadaptation/blob/main/dadaptation/dadapt_adam.py#L217C13-L217C50

d = max(d, min(d_hat, d*growth_rate))
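To illustrate what I mean, here is a toy sketch (not the library's code; the d_hat values are made up to mimic my log, and growth_rate is assumed to be at its default of infinity):

```python
# Toy sketch of the update quoted above: once d_hat goes negative,
# min(d_hat, d * growth_rate) is negative, so max() keeps d frozen.
d = 1e-6                      # initial d (d0)
growth_rate = float("inf")    # assumed default

for d_hat in [0.0038, -0.058, -0.060, -0.077]:
    d = max(d, min(d_hat, d * growth_rate))
    print(d)
# prints 0.0038 every time after the first step: d grows while d_hat
# is positive, then never moves again once d_hat < 0.
```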

Logs: (log_every=10)

2024-03-04 11:06:03,377 P44085 INFO ************ Epoch=1 start ************
2024-03-04 11:06:07,560 P44085 INFO lr: 1 dlr: 1e-06 d_hat: 0.0, d: 1e-06. sk_l1=1.4e-08 numerator_weighted=0.0e+00
2024-03-04 11:06:08,892 P44085 INFO lr: 1 dlr: 0.0038455569066480266 d_hat: -0.05789491198278567, d: 0.0038455569066480266. sk_l1=4.0e-03 numerator_weighted=-1.2e-07
2024-03-04 11:06:09,928 P44085 INFO lr: 1 dlr: 0.0038455569066480266 d_hat: -0.06040422325479096, d: 0.0038455569066480266. sk_l1=4.5e-03 numerator_weighted=-1.4e-07
2024-03-04 11:06:10,977 P44085 INFO lr: 1 dlr: 0.0038455569066480266 d_hat: -0.06598773259722039, d: 0.0038455569066480266. sk_l1=4.7e-03 numerator_weighted=-1.6e-07
2024-03-04 11:06:12,013 P44085 INFO lr: 1 dlr: 0.0038455569066480266 d_hat: -0.06679562915084818, d: 0.0038455569066480266. sk_l1=4.7e-03 numerator_weighted=-1.6e-07
2024-03-04 11:06:13,049 P44085 INFO lr: 1 dlr: 0.0038455569066480266 d_hat: -0.06732794356660553, d: 0.0038455569066480266. sk_l1=4.7e-03 numerator_weighted=-1.6e-07
2024-03-04 11:06:14,095 P44085 INFO lr: 1 dlr: 0.0038455569066480266 d_hat: -0.06763984920921864, d: 0.0038455569066480266. sk_l1=4.6e-03 numerator_weighted=-1.6e-07

...

2024-03-04 11:07:21,011 P44085 INFO lr: 1 dlr: 0.0038455569066480266 d_hat: -0.07685106699482641, d: 0.0038455569066480266. sk_l1=3.4e-03 numerator_weighted=-1.3e-07
2024-03-04 11:07:22,036 P44085 INFO lr: 1 dlr: 0.0038455569066480266 d_hat: -0.07695848427227471, d: 0.0038455569066480266. sk_l1=3.4e-03 numerator_weighted=-1.3e-07
2024-03-04 11:07:23,061 P44085 INFO lr: 1 dlr: 0.0038455569066480266 d_hat: -0.07709818082261206, d: 0.0038455569066480266. sk_l1=3.3e-03 numerator_weighted=-1.3e-07
2024-03-04 11:07:24,089 P44085 INFO lr: 1 dlr: 0.0038455569066480266 d_hat: -0.07720528635792316, d: 0.0038455569066480266. sk_l1=3.3e-03 numerator_weighted=-1.3e-07
2024-03-04 11:07:25,116 P44085 INFO lr: 1 dlr: 0.0038455569066480266 d_hat: -0.07732868208281982, d: 0.0038455569066480266. sk_l1=3.3e-03 numerator_weighted=-1.3e-07
2024-03-04 11:07:26,142 P44085 INFO lr: 1 dlr: 0.0038455569066480266 d_hat: -0.07743008661997182, d: 0.0038455569066480266. sk_l1=3.3e-03 numerator_weighted=-1.3e-07
2024-03-04 11:07:27,170 P44085 INFO lr: 1 dlr: 0.0038455569066480266 d_hat: -0.07754304511218482, d: 0.0038455569066480266. sk_l1=3.3e-03 numerator_weighted=-1.3e-07
adefazio commented 6 months ago

This is expected behavior: typically, once d has reached a good value, d_hat starts to decrease and eventually goes negative. In your logs, d_hat must have been positive at some point, since d has grown to 0.0038. Is the optimizer working for you when you run the full training job?

We have also developed methods that don't use a max operation, such as Mechanic: https://github.com/optimizedlearning/mechanic.

zhujiem commented 6 months ago

Thanks for your answer!

In my case, the problem is that d_hat becomes negative at step=20 (out of 5000 steps per epoch), as shown in my log (log_every=10). So it functions like setting an initial learning rate of dlr: 0.0038455569066480266 at the beginning that then never changes. So far, I haven't obtained better results than Adam. I guess I need to tune other hyper-parameters or change the LR scheduler? Or maybe the task is just not a good fit.

My task is recommendation model training. My baseline uses Adam with lr=0.001 and a ReduceLROnPlateau scheduler.

Thanks for your suggestion. I will try Mechanic later.

adefazio commented 6 months ago

It's normal for d to stop changing close to the beginning of optimization. Generally, with a good choice of scheduler, I would expect it to match the performance of Adam with a hand-tuned LR, but not necessarily exceed it. I would recommend a linear decay schedule or cosine annealing; they both work pretty well.
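For example, something along these lines (a minimal sketch; `model`, `loader`, `compute_loss`, and `num_training_steps` are placeholders for your own training setup):

```python
import torch
from dadaptation import DAdaptAdam

# lr stays at 1.0 with D-Adaptation; the estimated d provides the actual step size.
optimizer = DAdaptAdam(model.parameters(), lr=1.0)
# Cosine annealing over the full run; T_max should be the total number of steps.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=num_training_steps)

for batch in loader:
    loss = compute_loss(model, batch)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()
```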