comet-ml / issue-tracking

Questions, Help, and Issues for Comet ML
https://www.comet.ml

Workarounds for COMET ERROR: Heartbeat processing error #539

Closed kirilllzaitsev closed 8 months ago

kirilllzaitsev commented 8 months ago

Before Asking:

What is your question related to?

What is your question?

While running training on a remote server, the message COMET ERROR: Heartbeat processing error appeared at some point, after which all Comet-based logging stopped. No further errors showed up. On Comet, the experiment is displayed as ended, although the actual job on the server is still running.

Are there workarounds for this case? For example, safeguards that store the data locally when it can't be uploaded to Comet at that moment, or retries at a later point to resume the experiment?

Code

Not applicable.

What have you tried?

Not applicable.

dsblank commented 8 months ago

Yes, there are fallbacks for situations where you might have an inconsistent connection to your Comet server:

  1. The current comet_ml SDK should create an OfflineExperiment zip file if you lose the connection. It should fall back, save the data, and even resume if the connection is re-established. The current version of comet_ml is 3.38.1.
  2. You can also use OfflineExperiment directly: see https://www.comet.com/docs/v2/api-and-sdk/python-sdk/experiment-overview/#offlineexperiment
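
To illustrate the general "fall back, save locally, retry later" pattern described above, here is a minimal sketch using only the Python standard library. It is not how comet_ml implements its fallback; `FallbackLogger`, `send`, and `backup_path` are hypothetical names chosen for this example, and `send` stands in for whatever function uploads one record.

```python
import json
from pathlib import Path
from typing import Callable


class FallbackLogger:
    """Hypothetical helper: try to upload each record, spill to disk on failure."""

    def __init__(self, send: Callable[[dict], None], backup_path: str) -> None:
        self.send = send                      # assumed upload function (not a comet_ml API)
        self.backup_path = Path(backup_path)  # local JSON Lines file for unsent records

    def log(self, record: dict) -> bool:
        """Return True if the record was uploaded, False if it was stored locally."""
        try:
            self.send(record)
            return True
        except OSError:  # ConnectionError and TimeoutError are subclasses of OSError
            with self.backup_path.open("a", encoding="utf-8") as f:
                f.write(json.dumps(record) + "\n")
            return False

    def flush(self) -> int:
        """Retry stored records in order; return how many were uploaded."""
        if not self.backup_path.exists():
            return 0
        lines = self.backup_path.read_text(encoding="utf-8").splitlines()
        sent = 0
        for line in lines:
            try:
                self.send(json.loads(line))
            except OSError:
                break  # still offline: keep this record and everything after it
            sent += 1
        remaining = lines[sent:]
        if remaining:
            self.backup_path.write_text("\n".join(remaining) + "\n", encoding="utf-8")
        else:
            self.backup_path.unlink()
        return sent
```

A training loop could call `log(...)` on every step and `flush()` periodically or at the end of the run, so records written while the connection was down are sent once it is back. For real Comet experiments, the SDK fallback and OfflineExperiment mentioned above are the supported routes.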