allegroai / clearml-server

ClearML - Auto-Magical CI/CD to streamline your AI workload. Experiment Management, Data Management, Pipeline, Orchestration, Scheduling & Serving in one MLOps/LLMOps solution
https://clear.ml/docs

Opening web app connection refused #195

Open P1G5ML opened 1 year ago

P1G5ML commented 1 year ago

Hello all,

I hope whoever reads this is having a great day!!

I've been running into constant issues with ClearML and am hoping for some help. I have one machine with the ClearML Server + Docker environment set up. I'm having one problem running things on the Docker install, but that is a separate issue.

Mainly, the two other machines are stuck at 'err_connection_refused' errors (or along those lines) just when trying to open the ClearML web interface on localhost:8080. (NOT INVOLVING TRYING TO CONNECT TO THE local SERVER.) Just the simple ClearML install + verify credentials. I've tried everything I can think of (opening the ports, disabling the firewall, etc.) and am coming up short.

On the two machines where I am having issues, I have also tried making credentials on another machine and moving the info over.

The odd thing that is getting me is that the issue is not entirely consistent.

When trying to verify the credentials (made on another computer), it errors out with "could not verify the credentials". But after trying enough times it eventually works: it verifies the credentials and reports local information to the ClearML web app. That works for anywhere from an hour to a few days, but it has always gone back to a 'NewConnectionRefused' loop when trying to run a local test/project and view the info reported on the ClearML web app (the project never gets to run, it is just stuck in the loop):

image

It gets really odd (for me), because it will be stuck in that loop until, every now and then (seemingly at random), it finally connects, sometimes after looping around for a few minutes. (Most of the time it will just be stuck in the loop.)

Even when it finally does connect, though, the issue comes back before too long and I can't get it to connect again.

Hoping someone has run into something similar or has some guidance on anything I can try at all!

jkhenning commented 1 year ago

@P1G5ML What you're seeing is not an issue with credentials, but purely a network issue. The retrying error means the client (Python code) cannot reach the server socket (or starts a connection and the connection is broken at the socket level). This is not an application (or client) issue, and the fact that it sometimes works indicates there's either a severe network outage or some network configuration (firewall/load balancer/proxy) in the way - is it possible that something is interfering with port 8008 and redirecting it to another target?
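
One way to check the socket-level diagnosis above from an affected machine is a plain TCP probe, repeated a few times so intermittent failures show up. This is only a minimal sketch using the Python standard library, not part of ClearML: the helper names `probe_port` and `probe_repeatedly` are made up for illustration, the `host` value is a placeholder to replace with the address the client is configured to use, and the port list assumes the default ClearML Server ports (8080 web, 8008 API, 8081 file server).

```python
import socket
import time


def probe_port(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name resolution failures.
        return False


def probe_repeatedly(host, port, attempts=3, delay=0.5, timeout=3.0):
    """Probe host:port `attempts` times, pausing `delay` seconds between tries."""
    results = []
    for i in range(attempts):
        results.append(probe_port(host, port, timeout))
        if i < attempts - 1:
            time.sleep(delay)
    return results


if __name__ == "__main__":
    host = "localhost"  # assumption: replace with the server address in use
    # Assumed default ClearML Server ports; adjust if the deployment differs.
    for name, port in (("web", 8080), ("api", 8008), ("files", 8081)):
        results = probe_repeatedly(host, port)
        print(f"{name} port {port}: {sum(results)}/{len(results)} attempts connected")
```

If the probe fails the same way the client does, the problem is below the application layer (firewall, proxy, or nothing listening on that port) rather than in the credentials.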

P1G5ML commented 1 year ago

UPDATE @jkhenning

Appreciate your input!

I figured it was an issue related to this. So, at least for now, I have forgone setting up ClearML using the web workspace (the website with the username/password) entirely. I just set up all worker machines exclusively against the locally running server, and it seems to be working like that. Looking at the local ClearML workspace, I have all the machines set up as workers with Docker environments.

I won't be able to use the web workspace like this, but I can SSH into the local machine and view the same page locally for the same effect. Good enough for me at the moment!

Again, thanks for your help @jkhenning!!!