Support for AMs created out of pools of pre-provisioned VMs using containers

ibaldin commented 7 years ago

This is proposed by @clarisca as part of SciDAS - support for creating ORCA AMs that operate on a pool of pre-provisioned VMs of potentially varying sizes. There are several issues that must be considered

the different sizes of the VMs (should the topology description take this into account, or should a single AM operate only on a pool of 'like' VM sizes). In fact one can imagine this working on a pool of hardware nodes as well - not necessarily VMs.
how to clean up the VM after the user exits - containers offer a reasonable approach, regardless of whether the workload is interactive (user logs in) or not (i.e. worker node of some computational framework)
how to deal with retrieving, caching and installing user-specified images via ImageProxy

I propose doing this in two phases - Phase I involves only modifying ImageProxy and creating a new handler for issuing VMs. Phase II would add support for RDF specification of these AMs along with extensions to processing RDF and doing proper delegation of resources. The latter would also require controller extensions, as well as AM and broker controls.

The assumption in both cases that such AM operates only using commodity internet. Not clear how dynamic connectivity may work here. I suppose there could be a pool of VLANs available, but then not clear how to connect VMs to them dynamically. They can be plumbed statically upon creation, at the expense of security and performance isolation. In the former case no Net AM is needed, in the latter a traditional Net AM should work just fine.

Phase I:

Create a handler that is similar to xcat handler - picks a VM/baremetal node out of a pool of available nodes. On join:

Picks an available node. We have to make a simplifying assumption that no matter what size VM user requests, they get what is available. This will have limitations in that the manifest will not be reflective of the actual size of the VM.
Gets the IP address of the node associated with the VM/baremetal node to pass back to the user
Downloads and installs a Docker or similar container image and starts the container.
Installs user keys

On leave:

stops the container

Modify ImageProxy to support new image format (container of choice).

Phase II:

Introduce AM topology description extension that allows describing AM as a pool of VMs or baremetal nodes.
Modify boot code to parse AM description and be able to extract VM/baremetal size information and pass it to broker as part of delegation
Modify broker to deal with delegations of this type, where accounting by core is no longer possible and resources must be delegated in increments of VMs
Modify controller to deal with broker delegations of this type and produce proper manifests.

@paul-ruth @anriban

@vjorlikowski is it possible to leverage xCAT and xCAT machinery for issuing nodes and IP addresses for this?

Comments welcome.

hinchliff commented 7 years ago

Some notes on the Docker aspects, without fully understanding the Handlers or ImageProxy :)

Downloads and installs a Docker or similar container image and starts the container.

In the simplest case, the user will want to run a Docker image that is available in the main Docker Hub registry. The user would only need to provide the image name (and optionally the image tag), e.g.:
- ubuntu
- ubuntu:17.04
- spotify/cassandra
- spotify/cassandra:cluster And then Orca can do a docker pull on that image name to download the image.
The user could also elect to use a Docker image that is hosted in a private repository. There may be additional authentication/login issues to handle with private repositories. But a docker pull should be able to download the image with the simple URL, e.g. docker pull myregistry.local:5000/testing/test-image
If the user doesn't upload the image to a repository, the user could use docker save to copy the image to a tar file. This use case is closest to the current ImageProxy model. Orca would need to download the tar file, and then load it into docker using docker load.

Installs user keys

We could potentially require the user to pre-populate their docker container with any keys they need.

Orca can install the keys on the Host (vm or baremetal), and then either volume mount or cp the keys to make them available to the container.

The closest thing to current behavior would probably be to copy the keys to a default location (for root user SSH).

We could give the user the option of declining to have keys copied into the container, if the image doesn't need or already provides them.

Or we could give the user the option of specifying the container path where keys will get copied to.

Port numbers will need to be managed somehow, since presumably we will need administrator SSH access to the Host. Docker allows exposed ports to be mapped between the host and container with the publish option of docker run. (So an SSH server in the container, nominally running at port 22 inside the container could be accessed using a different Host port.)

stops the container

We will probably want to name the containers (probably be UUID?) when doing the docker run, so that they can be explicitly stoped and rmed when desired.

We'll probably need to have support for user-specified docker run options.

publish -- docker won't allow network access to the container unless the ports are explicitly exposed when doing the docker run.
device -- This may not need to be explicitly set by the user, but if GPU access is required, this will need to be specified in docker run.
And probably others.

Several other docker run options are commonly used, but we could in theory force the user to specify these in the container image (dockerfile) instead of runtime options:

command
entrypoint

ibaldin commented 7 years ago

Sounds like additional things to worry about:

Open ports (need to be part of the configuration/request somehow)
How should we deal with interactive logins? Should dockers be required to have their own SSH running on a different port, or should we manipulate the login system on the host node to start a docker-enclosed shell instance upon login into host VM ssh for the user?

ibaldin commented 7 years ago

BTW, not a fan of pre-populating dockers with user SSH keys - bad security practice (even if those are public)

clarisca commented 7 years ago

@hinchliff : my preference to pass anything (including credentials) into the container would be either via a wrapper through the ENTRYPOINT (e.g., import credentials from ephemeral nodes in Zookeeper) or use the -e option. We used the latter in the RADII/HydroShare integration. For a more production environment I favor the former. I presume that the container would be checkout from the VM through post-boot script in the provisioning phase.

Regarding managing ports we should probably talk to Mike S who has done this. He is an expert in containers. I'll setup a meeting with him.

hinchliff commented 7 years ago

Should dockers be required to have their own SSH running on a different port

The containers can (internally) run SSH (or any service) on any port, and Docker will give the host the ability to map those container ports to different host port numbers: --publish or -p. e.g.

$ docker run -p 127.0.0.1:80:8080 ubuntu bash This binds port 8080 of the container to port 80 on 127.0.0.1 of the host machine.

Managing that might get complicated, so talking with someone who has already done it would be useful.

ibaldin commented 7 years ago

That's why I'm wondering if we'd rather create restricted shells for users that use host's SSHd to log them into bash running inside their container.

clarisca commented 7 years ago

@hinchliff : this approach is to my knowledge widely adopted and is what we used for our RADII/HydroShare prototype. I have emailed Mike but he has not responded yet.

Regarding your comment "managing that might get complicated". What is "that"?

hinchliff commented 7 years ago

'That' is managing any port mappings. Orca will need to track that user requested ports p1, p2... pn for their container, and we gave them corresponding ports q1, q2... qn on the host. Might get more complicated when you start to allow more than one container on each host.

clarisca commented 7 years ago

We could use the feature that maps a container port to a range of ports in the host, e.g., "docker run -d -p 7000-8000:4000 myApp" This would bind port 4000 in the container to a random port between 7000 and 8000 on the host, depending upon the port that is available in the host at that time. We could then report back the allocated port to ORCA directly or through some side-channel mechanism (Zookeeper). Source: https://bobcares.com/blog/docker-port-expose/ Would that work?

hinchliff commented 7 years ago

Interesting. I guess you can also just let Docker completely pick the ports:

To randomly map any network port inside a container to a port in the Docker host, the ‘-P’ option can be used in ‘docker run’: docker run -d -P webapp To see the port mapping for this container, you can use ‘docker ps’ command after creating the container.

Makes it a little bit more of a trick to find the port mapping to report back to the user, but less management.

ibaldin commented 7 years ago

docker exec -i -t <guid> /bin/bash

attaches a shell to docker. We may need a dynamic wrapper script to do this, so this can be added to /etc/passwd for a new user entry, but the end result is the same - user upon login using host's SSH (no need to package SSH into docker, only bash) ends up inside the docker environment.

So the handler on join needs to add user (update /etc/passwd file and create home directory) and push user key into /home/user/.ssh/ on host VM. On leave, it is simply deleting the user (userdel) together with the home directory.

No need to manage ports at all. The caveat is disallowing root login for users (since it is root@host VM), but this is now a standard GENI practice - most tools expect to use username@host, not root@host. Multiple user logins should be allowable - a matter of creating multiple /etc/passwd entries mapped to the same container (and remembering to delete them).

clarisca commented 7 years ago

@ibaldin : I used docker exec to run commands. For instance, I have used docker exec to run icommands in a container configured as an iRODS client node. I still think some port configuration is needed for a long-running application-container, e.g., iRODS resource node and HTCondor. We will meet with Mike and further discuss our use case.

mjstealey commented 7 years ago

no need to package SSH into docker, only bash

@ibaldin - Agreed that handing off to docker exec -u <YOUR_USER> -ti <GUID> /bin/bash is the way to go.

Do note that if you package SSH into a docker container where the host also uses SSH, you'll need to remap the SSH port from the host's point of view to the container. i.e. using port 2022 instead of 22.

You could also allow SSH to be run by a non-root user within the container so if the user does log in via SSH they would not be root on the container.

Example: In the docker-entrypoint script you'd want to define that SSH is owned by a non-root user and run on a port other than 22.

...
chown -R <YOUR_USER>:<YOUR_GROUP> /etc/ssh
sed -i "/\<UsePrivilegeSeparation\>/c\UsePrivilegeSeparation no" /etc/ssh/sshd_config
sed -i "/\<Port 22\>/c\Port 2022" /etc/ssh/sshd_config
runuser -p -u <YOUR_USER> -g <YOUR_GROUP> /usr/sbin/sshd
...

clarisca commented 7 years ago

@mjstealey : I will take some of our time on Friday to ask you more about this. I don't understand how this approach alone addresses the deployment of a long running service such as a HTCondor Worker which maintains a long-lived connection with a HTCondor Master (an overlay) and may open new connections to other nodes. It seems to me that port management has to be handled by something either outside Docker or inside. I think it is possible that there are different application models, e.g., ssh vs distributed service that have different requirements and need different configurations. Or perhaps above addresses all of them and I don't see it yet.

mjstealey commented 7 years ago

Makes it a little bit more of a trick to find the port mapping to report back to the user, but less management.

@hinchliff - using docker inspect <CONTAINER_ID> will return a whole bunch of configuration information as JSON.

So, for a process that uses lots of ports like iRODS, you could do something like this to get the Networking.Ports information

$ docker run -d -p 1247:1247 -p 1248:1248 -p 20000-20199:20000-20199 --name irods4.2 mjstealey/irods-provider-postgres:latest

$ docker ps
CONTAINER ID        IMAGE                                      COMMAND                  CREATED             STATUS              PORTS                                                                              NAMES
34f6a996a3f2        mjstealey/irods-provider-postgres:latest   "/irods-docker-ent..."   26 seconds ago      Up 22 seconds       0.0.0.0:1247-1248->1247-1248/tcp, 0.0.0.0:20000-20199->20000-20199/tcp, 5432/tcp   irods4.2

$ docker inspect --format=" {{ .NetworkSettings.Ports }} " irods4.2
 map[20117/tcp:[{0.0.0.0 20117}] 20144/tcp:[{0.0.0.0 20144}] 20024/tcp:[{0.0.0.0 20024}] 20062/tcp:[{0.0.0.0 20062}] 20112/tcp:[{0.0.0.0 20112}] 20083/tcp:[{0.0.0.0 20083}] 20119/tcp:[{0.0.0.0 20119}] 20152/tcp:[{0.0.0.0 20152}] 20158/tcp:[{0.0.0.0 20158}] 20190/tcp:[{0.0.0.0 20190}] 20012/tcp:[{0.0.0.0 20012}] 20021/tcp:[{0.0.0.0 20021}] 20030/tcp:[{0.0.0.0 20030}] 20105/tcp:[{0.0.0.0 20105}] 20106/tcp:[{0.0.0.0 20106}] 20140/tcp:[{0.0.0.0 20140}] 20148/tcp:[{0.0.0.0 20148}] 20009/tcp:[{0.0.0.0 20009}] 20074/tcp:[{0.0.0.0 20074}] 20076/tcp:[{0.0.0.0 20076}] 20065/tcp:[{0.0.0.0 20065}] 20071/tcp:[{0.0.0.0 20071}] 20084/tcp:[{0.0.0.0 20084}] 20099/tcp:[{0.0.0.0 20099}] 20108/tcp:[{0.0.0.0 20108}] 20038/tcp:[{0.0.0.0 20038}] 20049/tcp:[{0.0.0.0 20049}] 20063/tcp:[{0.0.0.0 20063}] 20122/tcp:[{0.0.0.0 20122}] 20150/tcp:[{0.0.0.0 20150}] 20178/tcp:[{0.0.0.0 20178}] 20193/tcp:[{0.0.0.0 20193}] 20032/tcp:[{0.0.0.0 20032}] 20043/tcp:[{0.0.0.0 20043}] 20123/tcp:[{0.0.0.0 20123}] 20100/tcp:[{0.0.0.0 20100}] 20104/tcp:[{0.0.0.0 20104}] 20118/tcp:[{0.0.0.0 20118}] 20128/tcp:[{0.0.0.0 20128}] 20135/tcp:[{0.0.0.0 20135}] 20013/tcp:[{0.0.0.0 20013}] 20068/tcp:[{0.0.0.0 20068}] 20080/tcp:[{0.0.0.0 20080}] 20067/tcp:[{0.0.0.0 20067}] 20091/tcp:[{0.0.0.0 20091}] 20172/tcp:[{0.0.0.0 20172}] 1247/tcp:[{0.0.0.0 1247}] 1248/tcp:[{0.0.0.0 1248}] 20040/tcp:[{0.0.0.0 20040}] 20109/tcp:[{0.0.0.0 20109}] 20147/tcp:[{0.0.0.0 20147}] 20166/tcp:[{0.0.0.0 20166}] 20184/tcp:[{0.0.0.0 20184}] 20186/tcp:[{0.0.0.0 20186}] 20019/tcp:[{0.0.0.0 20019}] 20051/tcp:[{0.0.0.0 20051}] 20078/tcp:[{0.0.0.0 20078}] 20058/tcp:[{0.0.0.0 20058}] 20060/tcp:[{0.0.0.0 20060}] 20061/tcp:[{0.0.0.0 20061}] 20086/tcp:[{0.0.0.0 20086}] 20121/tcp:[{0.0.0.0 20121}] 20031/tcp:[{0.0.0.0 20031}] 20041/tcp:[{0.0.0.0 20041}] 20055/tcp:[{0.0.0.0 20055}] 20127/tcp:[{0.0.0.0 20127}] 20145/tcp:[{0.0.0.0 20145}] 20160/tcp:[{0.0.0.0 20160}] 20136/tcp:[{0.0.0.0 20136}] 20164/tcp:[{0.0.0.0 20164}] 20168/tcp:[{0.0.0.0 20168}] 20170/tcp:[{0.0.0.0 20170}] 20174/tcp:[{0.0.0.0 20174}] 20014/tcp:[{0.0.0.0 20014}] 20037/tcp:[{0.0.0.0 20037}] 20134/tcp:[{0.0.0.0 20134}] 20196/tcp:[{0.0.0.0 20196}] 20114/tcp:[{0.0.0.0 20114}] 20133/tcp:[{0.0.0.0 20133}] 20010/tcp:[{0.0.0.0 20010}] 20011/tcp:[{0.0.0.0 20011}] 20131/tcp:[{0.0.0.0 20131}] 20094/tcp:[{0.0.0.0 20094}] 20113/tcp:[{0.0.0.0 20113}] 20120/tcp:[{0.0.0.0 20120}] 20154/tcp:[{0.0.0.0 20154}] 20165/tcp:[{0.0.0.0 20165}] 20002/tcp:[{0.0.0.0 20002}] 20039/tcp:[{0.0.0.0 20039}] 20085/tcp:[{0.0.0.0 20085}] 20056/tcp:[{0.0.0.0 20056}] 20079/tcp:[{0.0.0.0 20079}] 20107/tcp:[{0.0.0.0 20107}] 20111/tcp:[{0.0.0.0 20111}] 20000/tcp:[{0.0.0.0 20000}] 20006/tcp:[{0.0.0.0 20006}] 20034/tcp:[{0.0.0.0 20034}] 20191/tcp:[{0.0.0.0 20191}] 20015/tcp:[{0.0.0.0 20015}] 20023/tcp:[{0.0.0.0 20023}] 20177/tcp:[{0.0.0.0 20177}] 20088/tcp:[{0.0.0.0 20088}] 20137/tcp:[{0.0.0.0 20137}] 20169/tcp:[{0.0.0.0 20169}] 20185/tcp:[{0.0.0.0 20185}] 20045/tcp:[{0.0.0.0 20045}] 20052/tcp:[{0.0.0.0 20052}] 20057/tcp:[{0.0.0.0 20057}] 20189/tcp:[{0.0.0.0 20189}] 20097/tcp:[{0.0.0.0 20097}] 20098/tcp:[{0.0.0.0 20098}] 20130/tcp:[{0.0.0.0 20130}] 20171/tcp:[{0.0.0.0 20171}] 20077/tcp:[{0.0.0.0 20077}] 20095/tcp:[{0.0.0.0 20095}] 20096/tcp:[{0.0.0.0 20096}] 20151/tcp:[{0.0.0.0 20151}] 20162/tcp:[{0.0.0.0 20162}] 20020/tcp:[{0.0.0.0 20020}] 20028/tcp:[{0.0.0.0 20028}] 20138/tcp:[{0.0.0.0 20138}] 20035/tcp:[{0.0.0.0 20035}] 20142/tcp:[{0.0.0.0 20142}] 20176/tcp:[{0.0.0.0 20176}] 20101/tcp:[{0.0.0.0 20101}] 20132/tcp:[{0.0.0.0 20132}] 20143/tcp:[{0.0.0.0 20143}] 20167/tcp:[{0.0.0.0 20167}] 20195/tcp:[{0.0.0.0 20195}] 20066/tcp:[{0.0.0.0 20066}] 20069/tcp:[{0.0.0.0 20069}] 20092/tcp:[{0.0.0.0 20092}] 20081/tcp:[{0.0.0.0 20081}] 20008/tcp:[{0.0.0.0 20008}] 20027/tcp:[{0.0.0.0 20027}] 20044/tcp:[{0.0.0.0 20044}] 20153/tcp:[{0.0.0.0 20153}] 20198/tcp:[{0.0.0.0 20198}] 20046/tcp:[{0.0.0.0 20046}] 20053/tcp:[{0.0.0.0 20053}] 20073/tcp:[{0.0.0.0 20073}] 20188/tcp:[{0.0.0.0 20188}] 20146/tcp:[{0.0.0.0 20146}] 20155/tcp:[{0.0.0.0 20155}] 20163/tcp:[{0.0.0.0 20163}] 20087/tcp:[{0.0.0.0 20087}] 20093/tcp:[{0.0.0.0 20093}] 20115/tcp:[{0.0.0.0 20115}] 20157/tcp:[{0.0.0.0 20157}] 20175/tcp:[{0.0.0.0 20175}] 20025/tcp:[{0.0.0.0 20025}] 20036/tcp:[{0.0.0.0 20036}] 20075/tcp:[{0.0.0.0 20075}] 20182/tcp:[{0.0.0.0 20182}] 20156/tcp:[{0.0.0.0 20156}] 20173/tcp:[{0.0.0.0 20173}] 20197/tcp:[{0.0.0.0 20197}] 20029/tcp:[{0.0.0.0 20029}] 20033/tcp:[{0.0.0.0 20033}] 20149/tcp:[{0.0.0.0 20149}] 20072/tcp:[{0.0.0.0 20072}] 20129/tcp:[{0.0.0.0 20129}] 20159/tcp:[{0.0.0.0 20159}] 20180/tcp:[{0.0.0.0 20180}] 20183/tcp:[{0.0.0.0 20183}] 20018/tcp:[{0.0.0.0 20018}] 20042/tcp:[{0.0.0.0 20042}] 20047/tcp:[{0.0.0.0 20047}] 20102/tcp:[{0.0.0.0 20102}] 20141/tcp:[{0.0.0.0 20141}] 20181/tcp:[{0.0.0.0 20181}] 20187/tcp:[{0.0.0.0 20187}] 20054/tcp:[{0.0.0.0 20054}] 20059/tcp:[{0.0.0.0 20059}] 20070/tcp:[{0.0.0.0 20070}] 20048/tcp:[{0.0.0.0 20048}] 20001/tcp:[{0.0.0.0 20001}] 20016/tcp:[{0.0.0.0 20016}] 20017/tcp:[{0.0.0.0 20017}] 20082/tcp:[{0.0.0.0 20082}] 20089/tcp:[{0.0.0.0 20089}] 20110/tcp:[{0.0.0.0 20110}] 20179/tcp:[{0.0.0.0 20179}] 20192/tcp:[{0.0.0.0 20192}] 20003/tcp:[{0.0.0.0 20003}] 20007/tcp:[{0.0.0.0 20007}] 20064/tcp:[{0.0.0.0 20064}] 5432/tcp:[] 20103/tcp:[{0.0.0.0 20103}] 20124/tcp:[{0.0.0.0 20124}] 20139/tcp:[{0.0.0.0 20139}] 20161/tcp:[{0.0.0.0 20161}] 20199/tcp:[{0.0.0.0 20199}] 20004/tcp:[{0.0.0.0 20004}] 20026/tcp:[{0.0.0.0 20026}] 20090/tcp:[{0.0.0.0 20090}] 20116/tcp:[{0.0.0.0 20116}] 20125/tcp:[{0.0.0.0 20125}] 20126/tcp:[{0.0.0.0 20126}] 20194/tcp:[{0.0.0.0 20194}] 20005/tcp:[{0.0.0.0 20005}] 20022/tcp:[{0.0.0.0 20022}] 20050/tcp:[{0.0.0.0 20050}]]

mjstealey commented 7 years ago

I don't understand how this approach alone addresses the deployment of a long running service such as a HTCondor Worker which maintains a long-lived connection with a HTCondor Master

@clarisca - It doesn't. Rather it was a comment on being able to attach to a container as a non-root user if such a thing is desired. Unsure if there will be a maintainer concept that would need access to a container within a node once it's instantiated.

By default docker runs as root, and any operations performed within the container would be as root unless explicitly specified otherwise. So the options are.

Allow access via docker exec to a non-root predefined user within the container
Make sshd a user owned process on a port other than 22 to a non-root predefined user within the container (aka root could not log into the container via ssh)

ibaldin commented 7 years ago

Had a meeting with Alan. He will think about the various design aspects, I'll do the flowchart with tasks. We may ask Mert help try the approach manually to make sure things work as expected.

hinchliff commented 7 years ago

I started a wiki page on the design aspects

ibaldin commented 7 years ago

@vjorlikowski suggests looking at https://github.com/jpetazzo/pipework PipeWork to constrain containers to specific network interfaces.

RENCI-NRIG / orca5

Support for AMs created out of pools of pre-provisioned VMs using containers #115