Closed: odbadrakh closed this issue 7 months ago
Hi @odbadrakh, we support multiple logging backends. I would recommend using tensorboard. Tensorboard logging should be enabled by default when you train a model. You can find instructions on how to use it in the readme file under the chapter "Logging". Best regards, Jonas
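If opening the TensorBoard web page is inconvenient, the loss curve can also be inspected as plain text. The following is only a minimal sketch, assuming the scalar data has been saved as a CSV file with the columns "Wall time", "Step" and "Value" (the layout TensorBoard's scalar download uses, as far as I know). The function names `read_scalars` and `summarize` are hypothetical helpers, not part of schnetpack, and the sample numbers are made up purely for illustration.

```python
import csv
import io


def read_scalars(csv_text, step_col="Step", value_col="Value"):
    """Parse CSV text into a list of (step, value) pairs.

    The column names are assumptions; adjust them to match your file.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return [(int(float(row[step_col])), float(row[value_col])) for row in reader]


def summarize(points):
    """Return a small text report: one line per step plus the lowest value."""
    if not points:
        raise ValueError("no data points found")
    best_step, best_value = min(points, key=lambda p: p[1])
    lines = [f"step {step:>6d}  loss {value:.6f}" for step, value in points]
    lines.append(f"lowest loss {best_value:.6f} at step {best_step}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Made-up sample values, for illustration only.
    sample = "Wall time,Step,Value\n1.0,0,0.5\n2.0,1,0.25\n3.0,2,0.3\n"
    print(summarize(read_scalars(sample)))
```

For a real run you would read the file contents with `open(path).read()` and pass them to `read_scalars`. Note that the step counter is not necessarily the epoch number; that depends on how the metric was logged.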
Hello Jonas,
Thanks very much for the response.
When I use the command “tensorboard --logdir=
Port 6006 is in use by another program. Either identify and stop that program, or start the server with a different port.
How should I use this command? I am running on a compute cluster.
Another question is how to restart the training from the checkpoints. I am sorry to ask these trivial questions; I am entirely new to all of this.
Thank you very much again,
Od K Odbadrakh National Institute for Computational Sciences University of Tennessee, Knoxville Oak Ridge National Laboratory Oak Ridge, TN Email: @.***
On Feb 15, 2024, at 5:13 PM, Jonas Lederer wrote the reply shown above.
The easiest way to check your tensorboard log files would be to copy them to your local machine. You could also use sshfs to mount the log directory and view the logs in real time.
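Regarding the "Port 6006 is in use" message: TensorBoard accepts a `--port` option, so one way around it is to pick a port that is still free. The snippet below is a rough sketch using only the Python standard library. `find_free_port` and `tensorboard_command` are hypothetical helper names, and the log directory in the example is a placeholder that you need to replace with your own run directory.

```python
import socket


def find_free_port(start=6006, attempts=20, host="127.0.0.1"):
    """Return the first port at or above `start` that can currently be bound."""
    stop = min(start + attempts, 65536)  # stay within the valid port range
    for port in range(start, stop):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            try:
                sock.bind((host, port))
            except OSError:
                continue  # port is taken, try the next one
            return port
    raise RuntimeError(f"no free port found in range {start}-{stop - 1}")


def tensorboard_command(logdir, port):
    """Build the command line to start TensorBoard on the given port."""
    return f"tensorboard --logdir={logdir} --port={port}"


if __name__ == "__main__":
    port = find_free_port()
    # Placeholder path: replace it with your own run directory.
    print(tensorboard_command("<path/to/your/run>", port))
```

The check is only a snapshot: another process could take the port between the check and the start of TensorBoard, in which case you simply run it again. On a compute cluster you would typically also need to forward the chosen port to your own machine (for example with an SSH tunnel, `ssh -L`) before opening it in a browser.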
You can resume your training from the last checkpoint by adding run.id=<directory/of/trained/model> to the CLI command.
Dear All,
I am new to schnetpack. I am training a model using config.yaml files, without doing any scripting in Python.
My question:
Is there any directive to write out the progress of the training, such as the convergence of the loss vs. epoch number?
Thank you,