geoelements / gns

Graph Network Simulator
https://www.geoelements.org/gns/

Logging #79

Closed kks32 closed 4 months ago

kks32 commented 4 months ago

Describe the PR

Add logging with TensorBoard and a progress bar for each epoch.

Related Issues/PRs

#74

Additional context

Also added a formatter to auto-format our Python files. As a result, the diff shows that I touched a lot of files, but those changes are just formatting. Please review only train.py.

Outputs look like this:

rank = None, cuda = False
Epoch 0: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 994/994 [03:44<00:00,  4.42batch/s, avg_loss=0.34, loss=0.06, lr=1.00e-04]
Epoch 1: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 994/994 [03:58<00:00,  4.17batch/s, avg_loss=0.20, loss=0.19, lr=9.99e-05]
Training: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [07:43<00:00, 231.64s/epoch]

Loss histories are available in logs via tensorboard.

[Screenshot: TensorBoard, 2024-06-24]

The logs also include metadata and hyperparameters.
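
For readers skimming this thread, the logging pattern described above can be sketched roughly as follows. This is a minimal illustration, not the code in train.py: `ListWriter`, `log_epoch`, and the tag names are hypothetical. `ListWriter` is an in-memory stand-in that only mimics the `add_scalar(tag, value, step)` call shape of `torch.utils.tensorboard.SummaryWriter`, so the sketch runs without PyTorch installed. The returned dictionary carries the same three fields as the progress bars in the output above and could be passed to a tqdm bar via `set_postfix`.

```python
from typing import Dict, Iterable, List, Tuple


class ListWriter:
    """Hypothetical in-memory stand-in for a TensorBoard writer.

    It only mimics the add_scalar(tag, value, step) call shape of
    torch.utils.tensorboard.SummaryWriter; nothing is written to disk.
    """

    def __init__(self) -> None:
        self.records: List[Tuple[str, float, int]] = []

    def add_scalar(self, tag: str, value: float, step: int) -> None:
        self.records.append((tag, value, step))


def log_epoch(losses: Iterable[float], lr: float, writer, epoch: int,
              start_step: int = 0) -> Dict[str, str]:
    """Log per-batch scalars and return the progress-bar postfix fields."""
    total = 0.0
    count = 0
    loss = float("nan")
    for loss in losses:
        step = start_step + count
        # One point per batch on the loss and learning-rate curves.
        writer.add_scalar("loss/train", loss, step)
        writer.add_scalar("lr", lr, step)
        total += loss
        count += 1
    avg_loss = total / count if count else float("nan")
    # One point per epoch for the running average.
    writer.add_scalar("loss/train_epoch_avg", avg_loss, epoch)
    # Formatted like the avg_loss / loss / lr fields in the output above.
    return {"avg_loss": f"{avg_loss:.2f}", "loss": f"{loss:.2f}", "lr": f"{lr:.2e}"}


writer = ListWriter()
postfix = log_epoch([0.6, 0.3, 0.06], lr=1e-4, writer=writer, epoch=0)
print(postfix)
```

Because `log_epoch` only relies on `add_scalar`, a real `SummaryWriter` could be swapped in for `ListWriter` without changing the function.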

yjchoi1 commented 4 months ago

I have reviewed the changes in train.py, and they look great with the TensorBoard monitoring. I think nepochs and nsteps are both fine, but nsteps would give more control over stopping in the middle of training, particularly when train.npz contains a lot of training examples.
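
To make the epochs-versus-steps trade-off concrete, here is a small sketch of a loop that honors both limits. It is only an illustration under assumptions: `training_schedule` is a hypothetical helper, not part of GNS, and `nepochs` and `nsteps` are used here as plain function arguments rather than the project's actual flags.

```python
from typing import Iterator, Optional, Sequence, Tuple


def training_schedule(batches: Sequence, nepochs: int,
                      nsteps: Optional[int] = None) -> Iterator[Tuple[int, int, object]]:
    """Yield (epoch, step, batch) until nepochs or nsteps is reached, whichever comes first."""
    step = 0
    for epoch in range(nepochs):
        for batch in batches:
            if nsteps is not None and step >= nsteps:
                return  # stop mid-epoch once the step budget is used up
            yield epoch, step, batch
            step += 1


# 4 batches per epoch and 2 epochs, but a budget of 6 steps:
# training stops halfway through the second epoch.
visited = [(epoch, step) for epoch, step, _ in training_schedule(range(4), nepochs=2, nsteps=6)]
print(visited)
```

With `nsteps=None` the loop runs all full epochs; with a step budget it can stop partway through a large dataset, which is the extra control described in the comment above.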

kks32 commented 4 months ago

Thanks @yjchoi1. I'm working on cleaning up and adding new features. We can discuss epochs vs. steps as an RFC. We are planning a rework for V2, so please add RFCs to improve the GNS code.

https://github.com/geoelements/gns/milestone/2