allegroai / clearml-agent

ClearML Agent - ML-Ops made easy. ML-Ops scheduler & orchestration solution
https://clear.ml/docs/
Apache License 2.0
241 stars, 92 forks

"it seems *api_server* is misconfigured" error #27

Open majdzr opened 4 years ago

majdzr commented 4 years ago

Hello, thanks for the millionth time for this great project. It literally saves me every day.

However, I am having a problem with trains-agent.

Setup:

  1. A Windows 10 local machine, connecting via SSH, with the Trains ports 8008, 8080 and 8081 forwarded, to
  2. A remote Ubuntu machine running trains-server and trains-agent (docker mode).
  3. On the remote machine, a docker container with my code is running (with --net=host, so it can communicate freely with the server).

Scenarios:

  1. If I run my code manually from a terminal on the remote machine, it runs nicely: trains-server logs it and I see it on my local machine in the webapp (on localhost:8080).
  2. I run trains-agent (docker mode, with the same base image) on the same remote machine. I see the worker in the webapp. The worker pulls the cloned task (the same task as in 1, cloned from the webapp), starts the docker base image I provided, and installs the dependencies; I see all of this in the same webapp. Then it fails with the following error: trains_agent: ERROR: Connection Error: it seems *api_server* is misconfigured. Is this the TRAINS API server http://localhost:8008 ?

Any idea what is going on?

Thanks, Majd

jkhenning commented 4 years ago

Hi @majdzr,

It seems the issue is that the trains-agent (which runs in docker mode) can successfully resolve localhost:8008 and connect to the server, while the Trains SDK running inside the docker container receives the same address but fails to reach it (inside a container, localhost normally refers to the container itself, not the host). Try changing the api_server in the trains-agent configuration to <remote-machine-ip>:8008 and see if it works...

majdzr commented 4 years ago

Thanks for your reply.

I have replaced localhost with the remote machine's IP in the trains.conf file, without much luck. After restarting the daemon, I still receive the same error (with the IP instead of localhost). Any ideas?

jkhenning commented 4 years ago

Are you sure this IP is reachable from inside the docker container? Can you try to run the container manually, open a shell and ping it?
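As a complement to ping (which only tests ICMP), the check can also be done at the TCP level, against the actual server ports. The following is a minimal sketch using only the Python standard library; the function names `can_connect` and `check_server` are hypothetical helpers made up for this illustration and are not part of Trains, and the port numbers are the ones mentioned in the original post.

```python
import socket
from typing import Dict, Iterable


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # create_connection resolves the host name and performs the TCP handshake
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures
        return False


def check_server(host: str, ports: Iterable[int] = (8008, 8080, 8081)) -> Dict[int, bool]:
    """Check each server port (the ones listed in the original post) on host."""
    return {port: can_connect(host, port) for port in ports}
```

Run from a shell inside the container, `check_server("<remote-machine-ip>")` returns a mapping from port to reachability. If port 8008 comes back `False` there but `True` on the host, the problem is the container's networking rather than the server itself.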

If that's not the issue, you can always use the custom docker arguments configuration so that the agent passes custom arguments to docker (like --net=host or similar).

majdzr commented 4 years ago

Thanks for your reply. Indeed, setting --net=host in the configuration file as an extra argument solved the mystery. Thanks!

By the way, is there any magic solution for mapping an external directory into the docker container (mainly to read data)? Obviously I can use the same logic as above, but I was wondering if this is the best approach.

jkhenning commented 4 years ago

> By the way, any magic solution for mapping external directory for the docker (mainly to read data)? Obviously I can use the same logic as above but I was wondering if this is the best.

Well, assuming this is a fixed folder that contains all your data and that you will always use, I think the easiest way is to mount it at the same path inside every docker container. You can add it using the same extra_docker_arguments configuration setting, with the value -v host_data:/mnt/data, where host_data is the absolute path of the data folder on the host.
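One detail worth guarding against: Docker treats a `-v` source that is not an absolute path as the name of a volume, not as a host folder, so the host side of the mapping should be spelled out in full. The sketch below builds the argument list with that check in place. The helper name `volume_args` and the path `/data/datasets` are hypothetical and chosen only for illustration; the optional `:ro` suffix is Docker's read-only mount flag, which may suit a folder that is only read from.

```python
from pathlib import PurePosixPath
from typing import List


def volume_args(host_dir: str, container_dir: str, read_only: bool = False) -> List[str]:
    """Build the docker -v arguments that bind-mount host_dir at container_dir."""
    # A non-absolute source would be interpreted by Docker as a named volume
    if not PurePosixPath(host_dir).is_absolute():
        raise ValueError(f"host path must be absolute: {host_dir!r}")
    if not PurePosixPath(container_dir).is_absolute():
        raise ValueError(f"container path must be absolute: {container_dir!r}")
    spec = f"{host_dir}:{container_dir}"
    if read_only:
        spec += ":ro"  # mount read-only inside the container
    return ["-v", spec]


# Combined with the networking flag from earlier in the thread
extra_docker_arguments = ["--net=host"] + volume_args("/data/datasets", "/mnt/data")
```

The resulting list holds the same values that would go into the agent's extra docker arguments setting, one list element per command-line token.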