comet-ml / issue-tracking

Questions, Help, and Issues for Comet ML
https://www.comet.ml

How to recover logging after encountering a Connection timed out error #524

Closed kirilllzaitsev closed 11 months ago

kirilllzaitsev commented 11 months ago

What is your question?

I'm training a neural net on a cluster, using Comet for logging. Sometimes a ReadTimeoutError occurs, probably due to cluster-related networking issues:

[10/08/2023 23:17:51 WARNING] Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ReadTimeoutError("HTTPSConnectionPool(host='www.comet.com', port=443): Read timed out. (read timeout=10)")': /clientlib/status-report/update

The job continues, but logging freezes at around the tenth such error:

[10/08/2023 23:31:23 WARNING] Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ReadTimeoutError("HTTPSConnectionPool(host='www.comet.com', port=443): Read timed out. (read timeout=10)")': /clientlib/batch/logger/experiment/metric

I am certain that a stable connection was restored at some point, but the logs from the job are gone.

How can I handle such connection issues gracefully? Ideally, everything would be stored locally while there is no connection and sent to Comet once the connection is restored.
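The store-and-forward behavior described above can be sketched in a library-agnostic way. This is only an illustration under assumptions: `SpooledLogger`, `send`, `spool_path`, and `retry_on` are hypothetical names and are not part of the Comet SDK; `send` stands in for whatever call uploads one record.

```python
import json
from pathlib import Path
from typing import Callable, Tuple, Type


class SpooledLogger:
    """Hypothetical wrapper: spool records to disk while `send` fails, replay later."""

    def __init__(
        self,
        send: Callable[[dict], None],
        spool_path: str,
        # OSError covers the built-in ConnectionError and TimeoutError;
        # pass the exception types your HTTP client actually raises.
        retry_on: Tuple[Type[BaseException], ...] = (OSError,),
    ) -> None:
        self._send = send
        self._spool = Path(spool_path)
        self._retry_on = retry_on

    def log(self, record: dict) -> None:
        # Replay older records first so that ordering is preserved.
        if not self.flush():
            self._append(record)
            return
        try:
            self._send(record)
        except self._retry_on:
            self._append(record)

    def flush(self) -> bool:
        """Resend spooled records; return True if nothing is left on disk."""
        if not self._spool.exists():
            return True
        with self._spool.open(encoding="utf-8") as fh:
            pending = [json.loads(line) for line in fh if line.strip()]
        for i, record in enumerate(pending):
            try:
                self._send(record)
            except self._retry_on:
                # Keep only the records that have not been sent yet.
                remainder = "".join(json.dumps(r) + "\n" for r in pending[i:])
                self._spool.write_text(remainder, encoding="utf-8")
                return False
        self._spool.unlink()
        return True

    def _append(self, record: dict) -> None:
        with self._spool.open("a", encoding="utf-8") as fh:
            fh.write(json.dumps(record) + "\n")
```

With this sketch, a training loop would call `log({...})` on every step and `flush()` once more at the end of the job. Records live in a JSON Lines file, so they survive a crash of the process, at the cost of one small file write per failed send.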

kirilllzaitsev commented 11 months ago

Duplicate of https://github.com/comet-ml/issue-tracking/issues/463

dsblank commented 11 months ago

Answered on #463. Closing this duplicate. If you have more issues or questions, please re-open or start a new ticket. Thank you!