NVIDIA / Bobber

Containerized testing of system components that impact AI workload performance
MIT License

Synchronize SSH keys #38

Open roclark opened 3 years ago

roclark commented 3 years ago

Bobber relies on SSH keys that are baked into the images to enable multi-node communication. This forces users to build the image on one machine, save the image locally, copy it to all remote nodes, and load the copied image on those hosts. This process is long and tedious. Replacing it with a synchronization method makes it possible to run the build on each host, with no need to copy images to remote nodes.

Closes #1

Signed-Off-By: Robert Clark <roclark@nvidia.com>

roclark commented 3 years ago

Putting this here as a draft for now, as I want to expand the documentation and do further multi-node testing. The basic premise is to stop saving a container image on one node and copying it to all other nodes. Instead, build and launch the container on every node, then run bobber sync --hosts host1,host2,host3,... from a single node to generate an SSH key that is copied to the Bobber containers on all remote hosts.
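To make the flow concrete, here is a rough sketch of the commands such a sync step might issue. It is not the code in this PR: the container name, the staging path, and the key location inside the container are all hypothetical placeholders, and the sketch only builds argument lists rather than running anything. It also covers only the private key; a real version would need to install the matching public key as an authorized key in each container.

```python
from typing import List

# All three values below are hypothetical placeholders, not Bobber's real settings.
CONTAINER = "bobber"                  # assumed name of the running container
REMOTE_TMP = "/tmp/bobber_id_rsa"     # assumed staging path on each remote host
CONTAINER_KEY = "/root/.ssh/id_rsa"   # assumed key location inside the container


def keygen_command(key_path: str) -> List[str]:
    """Build the ssh-keygen call for a passphrase-less RSA key pair at key_path."""
    return ["ssh-keygen", "-q", "-t", "rsa", "-b", "4096", "-N", "", "-f", key_path]


def distribute_commands(hosts_arg: str, key_path: str) -> List[List[str]]:
    """Build the per-host commands that push the key into each remote container.

    hosts_arg is the comma-separated value passed to --hosts.
    """
    hosts = [h.strip() for h in hosts_arg.split(",") if h.strip()]
    commands: List[List[str]] = []
    for host in hosts:
        # Step 1: stage the key on the remote host.
        commands.append(["scp", key_path, f"{host}:{REMOTE_TMP}"])
        # Step 2: copy the staged key from the host into the container.
        commands.append(
            ["ssh", host, "docker", "cp", REMOTE_TMP, f"{CONTAINER}:{CONTAINER_KEY}"]
        )
    return commands
```

Each returned list could be handed to subprocess.run one at a time, which keeps the plan easy to log or dry-run before anything touches a remote node.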

Some of my thoughts/questions:

  1. Is the bash script secure? I tried to add as many layers of input parsing as possible to ensure that we only get expected input, but there is always a concern when shelling out and SSH-ing to remote nodes. Hard-coding that process should eliminate a good chunk of the risk.
  2. Should this be the only documented method, or should we still list the existing method as well?
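On the first question, one way to narrow the risk is to validate the --hosts value strictly before it reaches any SSH call. The sketch below is an illustration in Python rather than the bash script under review, and parse_hosts is a hypothetical helper name. It accepts only hostname-style labels (letters, digits, and inner hyphens, separated by dots), so shell metacharacters and values that start with a hyphen, which ssh could read as an option, are rejected.

```python
import re
from typing import List

# One DNS-style label: 1-63 letters, digits, or hyphens, not starting or ending with "-".
_LABEL = re.compile(r"(?!-)[A-Za-z0-9-]{1,63}(?<!-)")


def parse_hosts(hosts_arg: str) -> List[str]:
    """Split a comma-separated host list and reject anything that is not a plain hostname.

    Raises ValueError on empty entries, over-long names, or unexpected characters.
    """
    hosts: List[str] = []
    for raw in hosts_arg.split(","):
        host = raw.strip()
        if not host or len(host) > 253:
            raise ValueError(f"invalid host entry: {raw!r}")
        if not all(_LABEL.fullmatch(label) for label in host.split(".")):
            raise ValueError(f"invalid host entry: {raw!r}")
        hosts.append(host)
    return hosts
```

Validated hosts can then be passed to subprocess.run as an argument list without shell=True, so no local shell parses them. The remote command is still interpreted by a shell on the far side of the SSH connection, which is why keeping that part hard-coded, as described above, matters.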

roclark commented 3 years ago

Still planning on keeping this open for now, but this will likely be closed in favor of #65.