NVIDIA / modulus-sym

Framework providing pythonic APIs, algorithms and utilities to be used with Modulus core to physics inform model training as well as higher level abstraction for domain experts
https://developer.nvidia.com/modulus
Apache License 2.0
147 stars 60 forks

:bug: [BUG]: Lbfgs optimizer set the initial state as the final state #21

Open zhangzhen117 opened 1 year ago

zhangzhen117 commented 1 year ago

Hi, I'd like to report a bug in the L-BFGS optimizer. I added training-loss logging to the L-BFGS run and found that the optimizer reports the initial loss (1.361e-02) as the final loss. The same value is shown in TensorBoard. It is not clear which state the inference is based on. I think this is only an output error: the model trains fine, and the saved network.pth file is correct.

[14:08:36] - attempting to restore from: outputs/NS_inverse
[14:08:36] - Success loading optimizer: outputs/NS_inverse/optim_checkpoint.0.pth
[14:08:36] - Success loading model: outputs/NS_inverse/uvp_network.0.pth
[14:08:40] - lbfgs optimizer selected. Setting max_steps to 0
[14:08:43] - [step: 0] lbfgs optimization in running
[14:08:52] - [iter: 0] loss: 1.361e-02
[14:09:07] - [iter: 200] loss: 1.342e-02
[14:09:20] - [iter: 400] loss: 1.326e-02
[14:09:34] - [iter: 600] loss: 1.316e-02
[14:09:48] - [iter: 800] loss: 1.306e-02
[14:10:02] - [iter: 1000] loss: 1.293e-02
[14:10:16] - [iter: 1200] loss: 1.282e-02
[14:10:29] - [iter: 1400] loss: 1.272e-02
[14:10:43] - [iter: 1600] loss: 1.261e-02
[14:10:57] - [iter: 1800] loss: 1.254e-02
[14:11:11] - [iter: 2000] loss: 1.244e-02
[14:11:25] - [iter: 2200] loss: 1.235e-02
[14:11:38] - [iter: 2400] loss: 1.226e-02
[14:11:52] - [iter: 2600] loss: 1.219e-02
[14:12:06] - [iter: 2800] loss: 1.211e-02
[14:12:20] - lbfgs optimization completed after 3000 steps
[14:12:20] - [step: 0] record constraint batch time: 5.271e-01s
[14:12:33] - [step: 0] record inferencers time: 1.297e+01s
[14:12:33] - [step: 0] saved checkpoint to outputs/NS_inverse
[14:12:33] - [step: 0] loss: 1.361e-02
[14:12:33] - [step: 0] reached maximum training steps, finished training!
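
One possible explanation, offered as a guess rather than a confirmed diagnosis: PyTorch's `torch.optim.LBFGS.step(closure)` returns the loss from its first closure evaluation, not the loss after the last inner iteration. If the trainer logs that return value, the reported "final" loss would equal the initial one even though the parameters were updated. I have not checked the Modulus Sym source, so whether it actually logs this value is an assumption. The sketch below uses a toy quadratic loss (hypothetical, standing in for the real PINN loss) to show the behavior in plain PyTorch.

```python
import torch

torch.manual_seed(0)

# Toy problem standing in for the real training loss (hypothetical example).
w = torch.nn.Parameter(torch.tensor([3.0, -2.0]))
optimizer = torch.optim.LBFGS([w], lr=1.0, max_iter=50)


def loss_fn():
    # Simple convex loss with its minimum at w = 0.
    return (w ** 2).sum()


def closure():
    # L-BFGS calls this repeatedly inside a single step().
    optimizer.zero_grad()
    loss = loss_fn()
    loss.backward()
    return loss


initial_loss = loss_fn().item()

# step() runs up to max_iter inner iterations, but its return value
# is the loss from the first closure call.
returned_loss = optimizer.step(closure).item()

# Re-evaluate the loss with the updated parameters to get the true final value.
final_loss = loss_fn().item()

print(f"initial loss:       {initial_loss:.3e}")
print(f"returned by step(): {returned_loss:.3e}")
print(f"re-evaluated loss:  {final_loss:.3e}")
```

If this is what happens here, re-evaluating the loss after `step()` returns (as in the last lines of the sketch) would give the value that matches the trained weights, which would also fit the observation that the saved checkpoint is correct.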