amazon-science / earth-forecasting-transformer

Official implementation of Earthformer
Apache License 2.0

question is about epoch #23

Closed: fizzking closed this issue 1 year ago

gaozhihan commented 1 year ago

This phenomenon does not necessarily mean that these training epochs ended unexpectedly. One possible cause is that the console output gets flushed in a way that prevents the carriage return (\r) from returning to the start of the line, so it fails to overwrite the previous progress log. Could you please use TensorBoard.dev to check the log, especially the scalars logged with self.log(..., on_step=True, ...), e.g. train_loss, to see whether they were interrupted unexpectedly?
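
For reference, here is a minimal, illustrative sketch (not the repository's actual training module; the class name and tensor shapes are made up) of how PyTorch Lightning writes such per-step scalars into the TensorBoard event file:

import torch
import torch.nn.functional as F
import pytorch_lightning as pl

class ToyForecastModule(pl.LightningModule):
    """Illustrative only; stands in for the actual Earthformer Lightning module."""

    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(8, 8)

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = F.mse_loss(self.layer(x), y)
        # on_step=True records one "train_loss" point per optimizer step in the
        # TensorBoard event file, so an epoch that truly ended early shows up as
        # a truncated curve even if the console progress bar merely looks garbled.
        self.log("train_loss", loss, on_step=True, on_epoch=True, prog_bar=True)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)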

https://github.com/amazon-science/earth-forecasting-transformer/blob/093085ca0f7844c47d352fb55e32768b8f0bf07b/scripts/cuboid_transformer/nbody/README.md?plain=1#L8-L12

Change the folder name tmp_nbody to the name you specified by --save when running the training script.
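
For example, if training was launched with --save my_nbody_exp (a hypothetical run name), the folder to point TensorBoard.dev at would be my_nbody_exp instead of tmp_nbody, e.g.:

tensorboard dev upload --logdir my_nbody_exp

Run the command from the directory that contains that saved folder.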

sxjscience commented 1 year ago

Yes, it will generate a tensorboard file and you may visualize the training loss via tensorboard.

fizzking commented 1 year ago

Yes, it will generate a tensorboard file and you may visualize the training loss via tensorboard.

I tried running the command to upload the experiment log to TensorBoard.dev, but after I logged into my Google account, the page would not display (screenshots attached).

gaozhihan commented 1 year ago

The TensorBoard.dev authorization should work correctly with a proper network connection. If it still fails, you can visualize your training log locally with TensorBoard, following the official instructions:

tensorboard --logdir YOUR_EXP_DIR
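
Here YOUR_EXP_DIR is the folder named by --save (tmp_nbody in the README example); by default TensorBoard then serves the dashboard at http://localhost:6006.
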
gaozhihan commented 1 year ago

Thanks for your issue. Please feel free to reopen it if you have any further questions.