Bind-mounting Downloads

emalm commented 8 years ago

I've just added https://github.com/cloudfoundry-incubator/diego-dev-notes/blob/master/proposals/bind-mounting-downloads.md as a starting point for how we'll handle bind-mounting downloaded assets (read: buildpacks) to containers. Let's use this issue for discussion.

onsi commented 8 years ago

Great stuff!

Some questions:

Should we validate the the new behavior around the User field will work for the buildpacks?
If BindMount is present and CacheKey is not should we error during validation?
Instead of

The executor will have to process the container actions before creating the container to add the bind-mount directories to the container spec it sends to Garden. It will also have to retain and release the bind-mounted directories so the cacheddownloader know which ones containers are still using

should garden support attaching bind mounts dynamically? I suspect it would be less net effort across the two teams.

Moreover, if we had an implicit contract that the directories contained their data when BindMount is invoked then Garden-Windows could implement a BindMount that simply copies files in (unless they have better ideas :wink:).
"Are there malicious downloads we need to insulate ourselves from?" In practice, given the specific context at hand, I believe we're safe, though there aren't strong guarantees in the general. The assets we are considering BindMounting are the lifecycle binaries and the admin buildpacks. Both of these are under operator control and can be deemed safe.

benmoss commented 8 years ago

We just experimented with this and it looks like we can use symlinks to accomplish the same behavior, insofar as we understand it.

emalm commented 8 years ago

@onsi: I checked on the DEAs, and their buildpack files are owned by root and are provided via a read-only mount. We probably do want to spike out that buildpack staging does work as expected in this situation, but assuming we preserve the other-user read and execute bits correctly from the archive I expect that it'll be fine.

I think we do want to have the BBS validate that the CacheKey field is set if the DownloadAction has opted into bind-mounting.

We can talk to @goonzoid and @julz about the possibility of bind-mounting after container creation (http://jpetazzo.github.io/2015/01/13/docker-mount-dynamic-volumes/ may be relevant) but I don't think it would be terribly difficult for the executor to manage that in a pre-processing step.

I do expect that the CF use-cases are safe, since we're just talking buildpacks and maybe lifecycle binaries, but I think it's worth considering the general case. Maybe by the time it becomes a real concern for other use-cases we're able to integrate with a Volume Manager that can correctly constrain arbitrary external resources.

@benmoss: Awesome, that sounds promising! Is there any way to enforce a read-only constraint? I guess Windows containers could be on their honor not to mess with the contents. :) Alternately, you're not (yet) affected by big buildpacks, so maybe even just copying assets into the container is performant enough for now.

mhoran commented 8 years ago

@ematpl we can set the permissions on the source of the symlink to disallow write access, which should enforce read-only within the container (we tested this).

emalm commented 8 years ago

Fantastic, thanks, @mhoran.

goonzoid commented 8 years ago

Happy to look into what it would take to support dynamic bind mounts, but from what @ematpl says it sounds like it may actually be less net work to do a pre-processing step. Please let me know if anyone still thinks it's worth us looking at.

I do expect that the CF use-cases are safe, since we're just talking buildpacks and maybe lifecycle binaries, but I think it's worth considering the general case. Maybe by the time it becomes a real concern for other use-cases we're able to integrate with a Volume Manager that can correctly constrain arbitrary external resources.

Yep, agree completely. I think anything more complicated than the current proposed use case should prompt us to think more about a volume manager component that Diego can talk to directly.

julz commented 8 years ago

Yeah, dynamic bind mounts would be a world of hurt. We should keep in mind that some of these use cases may not require either bind mounts or stream in, they may just require using a layered filesystem. For example one really great way to get the droplet in to the container, without streamin, without bind mounts and with full quota support, is to just build a diff layer at staging time and extract it ok top of the rootfs at runtime. Yes, I just invented docker, but it's a really good idea...

emalm commented 8 years ago

We talked through some of the ramifications of the pre-processing step, and it's more complicated than it seems at first glance. One complication is how the executor would handle failures correctly. Let's say an executor Container definition has a DownloadAction wrapped in a TryAction. Right now, if that DownloadAction fails on download, the TryAction muffles the failure, and action execution proceeds unimpeded. If that DownloadAction is converted to opt into bind-mounting, and the in-advance directory+download setup fails, the executor now has a much harder time determining whether to proceed. Should it store the success/failure result from that preprocessing step and evaluate it in the context of the embedded DownloadAction, or should it just fail before container creation regardless? In any case, it's also a lot more code and state for Diego to maintain in the executor, when we'd much rather keep it local to the execution of the specified collection of actions. @tedsuo and @jfmyers9 have been spending a bunch of time in these corners of the executor recently, and might be better equipped to comment on the precise technical complexities.

I'm interested in what a real-deal Volume Manager could give us, but (1) we don't have it available to use through the Garden API right now, and (2) there's no Windows implementation yet. I see that there could be additional benefits for, say, managing layers containing droplets while still having them count towards disk quota usage, but for now we're just trying to solve making these larger static assets available to containers without expensive copying.

From the discussion ongoing right now in the garden Slack channel, it sounds like there are definitive technical and security problems with doing a bind-mount after container creation, but I'd like to make sure that's the case before we write off the dynamic option as infeasible at the Garden layer.

goonzoid commented 8 years ago

As I understand it (@julz can correct me if I'm wrong or miss something), there are a couple of problems:

For unprivileged containers (which we always want to use if possible) it's impossible to do a mount of any sort because we drop the capabilities required for all processes running in that container. Not dropping those capabilities represents a real and significant security risk.
Even if a process does have the right capabilities, mounting a new thing inside the container involves (at least temporarily) mounting the device containing the filesystem inside the container, which effectively exposes the whole host filesystem to processes running inside the container.

Did I miss anything @julz?

julz commented 8 years ago

We can mount inside unprivileged containers if we do certain tricks (basically nsenter), but the problem is that within either an unprivileged or a privileged container you are in a different mount namespace, and this means you cannot see any of the files from the host in order to mount them. The hack that Eric posted above is basically the only approach I know to doing bind mounts into a running container, and it's super super super insecure. The reason amounts to #2 above, but let me just underline the scale of the exposure we'd be trying to figure out a way to prevent 'cos I think it's important. We'd be doing two really bad things, firstly we'd be exposing /dev/sdN (where /dev/sdN is the device node containing the whole entire root filesystem of the host, where /etc/passwd is) inside the container. THEN we'd be actually mounting this (not just the bind mounted directory, the whole host filesystem) in the container before bind mounting out the bit we want. Both of these actions are things a regular user can't do, for hopefully obvious reasons, but we'd be doing them for the user. Any process running in the container would then (temporarily, but that's not much comfort) have access to the host root filesystem. We can unmount everything at that point and maybe it's ok because as long as you do the dynamic bind mount before you run the user process maybe nothing goes wrong, but if anything leaks (or if the user finds a way to cause anything to leak) we have a huge exposure. And all this involves a non-trivial amount of pretty awful code to write.

If that's not enough this hack relies on their being an actual device node that we can mount inside the container, so it almost certainly wouldn't work in bosh-lite where such a device doesn't exist (because the root filesystem is aufs).

I understand that doing this via ahead-of-time bind mounts might affect the generality of the diego model in some edge cases, but I feel like given the amount of exposure and complexity we're talking about here that should be seriously considered (are any of the use cases we currently support or that we envisage supporting before we have proper volume management actually using any of these edge cases like a TryAction wrapping a DownloadAction-which-would-become-a-BindMount?)

emalm commented 8 years ago

Stories are in under epic https://www.pivotaltracker.com/epic/show/2200340 in the Diego tracker. We'll plan to make the executor logic more complicated to supply static bind-mounts to Garden at container-creation time.

emalm commented 8 years ago

Talked with @tedsuo this afternoon about the Action modeling as he started https://www.pivotaltracker.com/story/show/108774978, and we're thinking about an alternate approach that's a little more like the proposal from March. To the Task Definition and the DesiredLRP RunInfo models, we would add an optional CachedDownloads field that takes a list of CachedDownload items. These have basically the same guts as the DownloadAction in the caching case: CacheKey, From, To, Artifact, etc. The executor then knows to download those assets in parallel before container creation and supply them as read-only bind-mounted directories on the creation spec. If they fail to download or mount correctly, the container is completed in a failed state.

To preserve backwards compatibility and behavior for old clients, the BBS API adds new endpoints for getting a Task or a DesiredLRP RunInfo and provides the CachedDownloads field only in the responses on the new endpoints. For old endpoints, it translates the CachedDownload items into DownloadActions and puts them as the first set of actions on the Task's (only) Action or the LRP's Setup action.

I'll take a crack at revising the proposal document tonight with the details of this alternate approach, including the validations required of the new CachedDownload model, the new endpoints required and the transformations of the old endpoints, the changes required in the stager and nsync to take advantage of this functionality, and any error cases that seem relevant.

emalm commented 8 years ago

Added an initial draft of that approach in the Alternate Approach: Adding CachedDownloads section.

emalm commented 8 years ago

Implemented alternate approach and shipped in Diego 0.1446.0. Closing.

cloudfoundry / diego-notes

Bind-mounting Downloads #30