determined-ai / determined

Determined is an open-source machine learning platform that simplifies distributed training, hyperparameter tuning, experiment tracking, and resource management. Works with PyTorch and TensorFlow.
https://determined.ai
Apache License 2.0

Support flat network topology between master, agents and containers #4100

Open lqf96 opened 2 years ago

lqf96 commented 2 years ago

Problem

Per the explanation in https://github.com/determined-ai/determined/issues/906#issuecomment-664494066, Determined assumes there is no direct connectivity between the master and the containers. Instead, exposed container ports are published to the agent's external IP address. This causes connectivity problems when I try to deploy Determined by putting the master, agents and containers into the same Swarm overlay network. When a container is ready, the Determined master and agent derive the wrong IP address and port for it, and the master returns a 502 Bad Gateway error when trying to proxy Notebook, Tensorboard or Shell services. Therefore, I'd like Determined to add support for a flat network topology, where the master, agents and containers are all assumed to be directly reachable from each other without any forwarding.

Solution

I have experimented with a possible solution in my direct-connectivity branch. The idea is to add an extra config item called direct_connectivity to the master config file. When this item is set to true, we assume the network topology is flat. In this case, workload containers have their ports exposed but not published, and the master and agents connect to the containers using their original IPs and ports instead of the forwarded ones. This approach seems to work at least for JupyterLab, and I can refactor and rebase my fork and make a pull request if you deem it viable.
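
As a rough illustration of the address selection described above, here is a minimal Python sketch. It is not code from Determined or from the branch: only the `direct_connectivity` flag comes from this issue, while `ContainerEndpoint`, `proxy_target`, the field names and the sample addresses are hypothetical names made up for the example.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class ContainerEndpoint:
    """Hypothetical record of where a workload container's service can be reached."""
    container_ip: str    # container's own IP on the shared (e.g. overlay) network
    container_port: int  # port the service listens on inside the container
    agent_ip: str        # external IP of the agent hosting the container
    published_port: Optional[int] = None  # host port; set only if the port is published


def proxy_target(endpoint: ContainerEndpoint, direct_connectivity: bool) -> Tuple[str, int]:
    """Pick the (host, port) pair that the master would proxy requests to."""
    if direct_connectivity:
        # Flat topology: the port is exposed but not published, so connect
        # to the container's original address.
        return endpoint.container_ip, endpoint.container_port
    # Forwarded topology: the port is published on the agent, so connect
    # to the agent's external address and the forwarded port.
    if endpoint.published_port is None:
        raise ValueError("a published port is required when direct_connectivity is false")
    return endpoint.agent_ip, endpoint.published_port


# Sample values are illustrative only.
endpoint = ContainerEndpoint("10.0.1.7", 8888, "192.168.0.5", 32768)
flat_target = proxy_target(endpoint, direct_connectivity=True)
forwarded_target = proxy_target(endpoint, direct_connectivity=False)
```

With the flag set, the proxy would dial the container at `10.0.1.7:8888`; without it, the proxy would dial the agent's published port at `192.168.0.5:32768`, which is the path that fails when nothing is actually published there.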

ioga commented 2 years ago

hello @lqf96 thanks for reaching out. the branch seems good to us. could you please submit it as a pull request?

lqf96 commented 2 years ago

Hi @ioga, yes, I can do it over the weekend... However, I'm not sure whether any tests are required, or where I should add them. Maybe I also need to add this functionality to the documentation?

lqf96 commented 1 year ago

Hi @ioga, sorry for the long delay, but I've just submitted a pull request for this functionality... If you're interested, feel free to review it and let me know your suggestions. Thanks!