amazon-science / earth-forecasting-transformer

Official implementation of Earthformer
Apache License 2.0

question is about epoch #23

Closed: fizzking closed this issue 1 year ago

gaozhihan commented 1 year ago

This phenomenon does not necessarily mean that these training epochs ended unexpectedly. One possible cause is that the console output gets flushed in a way that prevents the carriage return (\r) from returning to the start of the line, so it fails to overwrite the previous progress log. Could you please use TensorBoard.dev to check the log, especially the scalars logged with self.log(..., on_step=True, ...), e.g. train_loss, to see whether they were interrupted unexpectedly?
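
For reference, here is a minimal, illustrative sketch (not the repository's actual training module; the class name and tensor shapes are made up) of how PyTorch Lightning writes such per-step scalars into the TensorBoard event file:

import torch
import torch.nn.functional as F
import pytorch_lightning as pl

class ToyForecastModule(pl.LightningModule):
    """Illustrative only; stands in for the actual Earthformer Lightning module."""

    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(8, 8)

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = F.mse_loss(self.layer(x), y)
        # on_step=True records one "train_loss" point per optimizer step in the
        # TensorBoard event file, so an epoch that truly ended early shows up as
        # a truncated curve even if the console progress bar merely looks garbled.
        self.log("train_loss", loss, on_step=True, on_epoch=True, prog_bar=True)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)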

https://github.com/amazon-science/earth-forecasting-transformer/blob/093085ca0f7844c47d352fb55e32768b8f0bf07b/scripts/cuboid_transformer/nbody/README.md?plain=1#L8-L12

Change the folder name tmp_nbody to the name you specified by --save when running the training script.
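
For example, if training was launched with --save my_nbody_exp (a hypothetical run name), the folder to point TensorBoard.dev at would be my_nbody_exp instead of tmp_nbody, e.g.:

tensorboard dev upload --logdir my_nbody_exp

Run the command from the directory that contains that saved folder.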

sxjscience commented 1 year ago

Yes, it will generate a tensorboard file and you may visualize the training loss via tensorboard.

fizzking commented 1 year ago

Yes, it will generate a tensorboard file and you may visualize the training loss via tensorboard.

I tried running the command to upload the experiment log to TensorBoard.dev, but after I logged into my Google account, the page would not display (screenshots attached).

gaozhihan commented 1 year ago

The TensorBoard.dev authorization should work correctly with a proper network connection. If it still fails, you can visualize your training log locally with TensorBoard, following the official instructions:

tensorboard --logdir YOUR_EXP_DIR
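
Here YOUR_EXP_DIR is the folder named by --save (tmp_nbody in the README example); by default TensorBoard then serves the dashboard at http://localhost:6006.
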
gaozhihan commented 1 year ago

Thanks for your issue. Please feel free to reopen it if you have any further questions.