[ ] Third Party Integrations (Huggingface, TensorboardX, Pytorch Lightning etc.)
What is your question?
While running training on a remote server, the message COMET ERROR: Heartbeat processing error appeared at some point, after which all Comet logging stopped. No further errors were shown. On Comet, the experiment is displayed as ended, although the actual job on the server is still running.
Are there workarounds for this case? For example, safeguards that store the data locally if it can't be uploaded to Comet at a given moment, or retries at a later point to resume the experiment?
Yes, there are fallbacks for situations where you might have an inconsistent connection to your Comet server:
The current comet_ml SDK should create an OfflineExperiment zip file if you lose the connection. It should fall back, save the data locally, and even resume once the connection is re-established. The current version of comet_ml is 3.38.1.
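To illustrate the general "save locally, retry later" pattern asked about, here is a minimal sketch that uses only the Python standard library. It is not how comet_ml implements its fallback. The class `BufferedMetricLogger`, its methods, and the `send` callable are all hypothetical names; `send` stands in for whatever function actually uploads a record.

```python
import json
from pathlib import Path
from typing import Callable


class BufferedMetricLogger:
    """Hypothetical helper: send records via `send`, spill to disk on failure."""

    def __init__(self, send: Callable[[dict], None], spill_path: Path) -> None:
        self._send = send                    # assumed to raise OSError when offline
        self._spill_path = Path(spill_path)  # local JSON Lines file for unsent records

    def log_metric(self, name: str, value: float, step: int) -> bool:
        """Try to send one record; return False if it was stored locally instead."""
        record = {"name": name, "value": value, "step": step}
        try:
            self._send(record)
            return True
        except OSError:  # also covers ConnectionError and TimeoutError
            with self._spill_path.open("a", encoding="utf-8") as fh:
                fh.write(json.dumps(record) + "\n")
            return False

    def flush_spilled(self) -> int:
        """Retry stored records in order; return how many were sent."""
        if not self._spill_path.exists():
            return 0
        lines = self._spill_path.read_text(encoding="utf-8").splitlines()
        records = [json.loads(line) for line in lines if line.strip()]
        sent = 0
        for record in records:
            try:
                self._send(record)
            except OSError:
                break  # still offline: keep this record and everything after it
            sent += 1
        remaining = records[sent:]
        if remaining:
            payload = "".join(json.dumps(r) + "\n" for r in remaining)
            self._spill_path.write_text(payload, encoding="utf-8")
        else:
            self._spill_path.unlink()
        return sent
```

A training loop could call `log_metric` on every step and `flush_spilled` every so often, for example at the end of each epoch. Records that were spilled arrive later than ones sent live, so each record carries its own `step` to let the receiver restore the order.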
Code
Not applicable.
What have you tried?
Not applicable.