[ ] Third Party Integrations (Huggingface, TensorboardX, Pytorch Lightning etc.)
Before Asking:
What is your question related to?
What is your question?
I'm training a neural net on a cluster, using Comet for logging. Sometimes a ReadTimedoutError is observed, probably due to cluster-related networking issues.
The job continues, but the logging freezes somewhere around the tenth error.
I am certain that a stable connection was restored at some point, but the logs from the job are gone.
How can I handle such connection issues gracefully? Ideally, everything would be stored locally while there is no connection and sent to Comet once the connection is restored.
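
To make the request concrete, here is a minimal sketch of the store-and-forward behaviour described above. It is not Comet's API: `BufferedLogger`, the `send` callable, and the spool file are hypothetical names used only for illustration, and the uploader is assumed to raise `OSError` (or a subclass such as `ConnectionError` or `TimeoutError`) when the network is down.

```python
import json
import os


class BufferedLogger:
    """Store-and-forward wrapper around a hypothetical send(record) callable."""

    def __init__(self, send, spool_path):
        self.send = send              # hypothetical uploader, assumed to raise OSError on failure
        self.spool_path = spool_path  # local JSON-lines file holding unsent records

    def log(self, record):
        """Send a record, or spool it locally. Returns True if it was sent."""
        # Drain the backlog first so records arrive in their original order.
        if self._flush():
            try:
                self.send(record)
                return True
            except OSError:
                pass  # connection dropped again; fall through to spooling
        self._spool(record)
        return False

    def _spool(self, record):
        # Append the record to the local file as one JSON document per line.
        with open(self.spool_path, "a", encoding="utf-8") as f:
            f.write(json.dumps(record) + "\n")

    def _flush(self):
        """Resend spooled records. Returns True if nothing is left to send."""
        if not os.path.exists(self.spool_path):
            return True
        with open(self.spool_path, encoding="utf-8") as f:
            pending = [json.loads(line) for line in f if line.strip()]
        sent = 0
        try:
            for record in pending:
                self.send(record)
                sent += 1
        except OSError:
            pass  # still offline; keep whatever was not sent
        remaining = pending[sent:]
        if remaining:
            # Rewrite the spool with only the records that are still unsent.
            with open(self.spool_path, "w", encoding="utf-8") as f:
                for record in remaining:
                    f.write(json.dumps(record) + "\n")
            return False
        os.remove(self.spool_path)
        return True
```

In this sketch every call to `log` first retries the backlog, so metrics recorded during an outage are delivered in order as soon as a later call succeeds. A real solution would also need to flush at the end of the job and handle a process crash in the middle of a flush.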