allegroai / clearml

ClearML - Auto-Magical CI/CD to streamline your AI workload. Experiment Management, Data Management, Pipeline, Orchestration, Scheduling & Serving in one MLOps/LLMOps solution
https://clear.ml/docs
Apache License 2.0

Connection issue with ClearML. #1003

Open P1G5ML opened 1 year ago

P1G5ML commented 1 year ago

Hello all,

Hope all is well with whoever reads this!

I have been racking my brain over a connection issue with ClearML for the past few days without finding a solution, and I'm hoping someone can provide some input. To me it looks like a local issue, but after trying everything I could find, I'm still not having any luck.

With ClearML, running just the basic default setup via clearml-init, I get a loop of connection errors that ends with the credential verification failing.

This repeats:

Verifying credentials ...
Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7fef777ef160>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution')': /auth.login
Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7fef777ef310>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution')': /auth.login

I'm left to assume it's a local issue, because after trying over and over again, the credentials do eventually verify and everything works. That lasts anywhere from a couple of minutes to a day or two before I'm back to the connection issue above. I later get the same "Retrying" error again, even when running a simple test that just reports data and hardware info to ClearML, on a machine that was working shortly before.

Any help at all, or anything to try, would be greatly appreciated. Thank you!!

jkhenning commented 1 year ago

Hi @P1G5ML, I think you're right to suspect a local network issue. I would investigate any proxy, load balancer, or firewall between the client and the server. You can also run curl <server-address> against the server: any JSON response you get back indicates a working connection.
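
Since the log above reports "[Errno -3] Temporary failure in name resolution", it may also help to test DNS lookups on their own, separately from ClearML. The following is only a minimal sketch using Python's standard library: it tries to resolve a hostname several times and records each success or failure, which can show whether resolution fails intermittently. The function name check_resolution and its parameters are made up for this illustration, and the hostname in the usage comment is an assumption (replace it with the API server host from your clearml.conf).

```python
import socket
import time


def check_resolution(host, attempts=3, delay=1.0, resolver=socket.getaddrinfo):
    """Resolve `host` several times and return a list of (ok, detail) tuples.

    On success, detail is a sorted list of resolved IP addresses.
    On failure, detail is the resolver's error message.
    `resolver` is injectable only so the function can be tested offline.
    """
    results = []
    for i in range(attempts):
        try:
            infos = resolver(host, 443)
            # Each entry is (family, type, proto, canonname, sockaddr);
            # sockaddr[0] is the IP address.
            addresses = sorted({info[4][0] for info in infos})
            results.append((True, addresses))
        except socket.gaierror as exc:
            # This is the error class behind "Temporary failure in name resolution".
            results.append((False, str(exc)))
        if i < attempts - 1:
            time.sleep(delay)
    return results


# Example usage (hostname is an assumption; use your own server's host):
# for ok, detail in check_resolution("api.clear.ml", attempts=5, delay=2.0):
#     print("resolved" if ok else "FAILED", detail)
```

If some attempts fail while others succeed, that would point at the local DNS resolver or network configuration rather than at ClearML itself.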