atomistic-machine-learning / schnetpack

SchNetPack - Deep Neural Networks for Atomistic Systems

Printing results using config.yaml #607

Closed odbadrakh closed 7 months ago

odbadrakh commented 7 months ago

Dear All,

I am new to SchNetPack. I am training a model using config.yaml files, without doing any scripting in Python.

My question:

Is there a directive to write out the progress of the training, such as the convergence of the loss versus epoch number?

Thank you,

jnsLs commented 7 months ago

Hi @odbadrakh, we support multiple logging backends. I would recommend using TensorBoard. TensorBoard logging should be enabled by default when you train a model. You can find instructions on how to use it in the README under the chapter "Logging". Best regards, Jonas
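For readers who want a plain-text view of loss versus epoch, here is a minimal sketch. It assumes the run was configured with a CSV-style logging backend that writes a metrics file containing an `epoch` column and a `val_loss` column. Both column names and the sample data are assumptions for illustration, not taken from this thread, so check them against the header of your own metrics file.

```python
import csv
import io


def loss_by_epoch(csv_file, loss_column="val_loss"):
    """Return {epoch: last logged loss for that epoch}.

    Rows where the epoch or loss cell is empty are skipped, since
    training and validation metrics may be written on separate rows.
    """
    history = {}
    for row in csv.DictReader(csv_file):
        epoch = row.get("epoch")
        value = row.get(loss_column)
        if not epoch or not value:
            continue
        history[int(epoch)] = float(value)
    return history


# Made-up sample standing in for a real metrics file (hypothetical layout).
sample = io.StringIO(
    "epoch,step,train_loss,val_loss\n"
    "0,10,0.90,\n"
    "0,10,,0.80\n"
    "1,20,0.50,\n"
    "1,20,,0.45\n"
)

for epoch, loss in sorted(loss_by_epoch(sample).items()):
    print(f"epoch {epoch}: val_loss = {loss:.4f}")
```

To use it on a real file, replace the `io.StringIO` sample with `open("path/to/metrics.csv", newline="")`, where the path is a placeholder for wherever your run stores its logs.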

odbadrakh commented 7 months ago

Hello Jonas,

Thanks very much for the response.

When I use the command “tensorboard --logdir=”, I get this error message:

Port 6006 is in use by another program. Either identify and stop that program, or start the server with a different port.

How should I use this command? I am running on a compute cluster.

Another question is how to restart the training from the checkpoints. I am sorry to ask these trivial questions; I am entirely new to all of this.

Thank you very much again,

Od K Odbadrakh, National Institute for Computational Sciences, University of Tennessee, Knoxville; Oak Ridge National Laboratory, Oak Ridge, TN


jnsLs commented 7 months ago

The easiest way to check your TensorBoard log files would be to copy them to your local machine. You could also use sshfs to view them in real time.
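If the "Port 6006 is in use" error also appears where you view the logs, one option is to start TensorBoard on a different port with its `--port` flag. The sketch below asks the operating system for a currently unused port and prints the matching command. The helper names and the log directory path are placeholders chosen for illustration.

```python
import shlex
import socket


def find_free_port(host="127.0.0.1"):
    """Ask the OS for a currently unused TCP port by binding to port 0.

    The port is released again when the socket closes, so another
    process could in principle claim it before TensorBoard starts.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))
        return sock.getsockname()[1]


def tensorboard_command(logdir, port):
    """Build the TensorBoard command line as an argument list."""
    return ["tensorboard", "--logdir", logdir, "--port", str(port)]


port = find_free_port()
# "path/to/run/dir" is a placeholder for your own log directory.
print(shlex.join(tensorboard_command("path/to/run/dir", port)))
```

Run the printed command in a shell, then open the reported address in a browser.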

You can resume your training from the last checkpoint by adding run.id=<directory/of/trained/model> in the CLI.
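As a rough sketch of what that looks like in practice, the snippet below assembles a resume command. It assumes the training entry point is SchNetPack's `spktrain` command and that the other overrides are the same ones used for the original run. The experiment name and run id shown are placeholders, not values from this thread.

```python
import shlex


def resume_command(run_id, extra_overrides=()):
    """Build a training command that reuses an existing run id.

    `extra_overrides` should repeat whatever config overrides were
    passed when the run was first started (placeholder values below).
    """
    if not run_id:
        raise ValueError("run_id must be a non-empty string")
    return ["spktrain", *extra_overrides, f"run.id={run_id}"]


# Placeholder experiment name and run id; substitute your own.
cmd = resume_command("my_previous_run", ["experiment=qm9_atomwise"])
print(shlex.join(cmd))
```

The printed line can be pasted into a shell or a cluster job script.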