cncf-tags / container-device-interface

Apache License 2.0
216 stars 39 forks source link

Add support for predefined hooks #203

Open deitch opened 7 months ago

deitch commented 7 months ago

I have been working with the NVIDIA container toolkit which generates a CDI yaml file, including hooks. The hooks all call nvidia-ctk to do 3 things:

These are pretty basic standard activities. It seems like it would make sense to have containerEdits within the spec capable of handling these. The spec includes container edit things like mounts, env, deviceNodes, additionalGIDs, and of course hooks. I think that creating symlinks, changing permissions on dir/file, and maybe updating ldcache (I am mixed on that one) are the kinds of capabilities that would fit as first-class citizens within containerEdits.

Thoughts?

reidpr commented 6 months ago

I’m working on CDI for Charliecloud, a lightweight, fully-unprivileged container implementation for HPC applications (hpc/charliecloud#1902). (In fact, it looks like @elezar was already helpful in some offline correspondence.) This would also be useful for us.

In fact, regardless of what the standard says, I anticipate not running nvidia-ctk hook to do these things, but rather interpreting such clauses as @deitch suggests and just do the things internally. Such a kludge seems better to me than calling external programs for these “basic standard” tasks, but I’d much rather follow documented behavior.

I would be happy to provide a draft PR for folks to look at.

elezar commented 6 months ago

@reidpr we did discuss this in our latest working group meeting (notes here) and will most likely not be implementing this as part of the spec for the time being.

One of the primary concerns is that even for "basic standard" tasks the applicable interfaces may not be applicable. Take a hook that "updates the ldcache" as an example. In the case of the NVIDIA Container Toolkit -- which implements this as the nvidia-ctk hook update-ldcache command (or nvidia-cdi-hook update-ldcache with the change from https://github.com/NVIDIA/nvidia-container-toolkit/pull/474) -- we take this to mean, run ldconfig from the HOST to update /etc/ld.so.cache IN THE CONTAINER using the specified folders. Here we need to make provision for different paths to ldconfig on the host (e.g. on ubuntu the correct path is ldconfig.real since ldconfig is a shell wrapper) as well as containers that don't have an ldcache (e.g. busybox).

If we need to capture this logic in some kind of binary, this still needs to be available on the host and would complicate CDI spec consumption since we would have to distribute this binary somehow.

Our current strategy for how to handle this would be to provide better documentation and examples for what these hooks do so that their behaviour can be replicated across various environments. This does place the burden on the hook vendors to provide this documentation though and cannot necessarily be enforced by the spec.

One question I have is whether there is some middle ground where we don't implement the hooks in the spec, but more strictly prescribe their behaviour? That is to say that as a vendor, NVIDIA would still distribute the nvidia-cdi-hook binary that would include an update-ldcache subcommand, but that the pre-and post conditions for this command are defined by a public spec. As the spec owners, we could even provide a set of test cases to validate this. This should in theory allow hooks to be more interchangeable and allow users to determine whether hooks are applicable given their environments.

deitch commented 6 months ago

One of the primary concerns is that even for "basic standard" tasks the applicable interfaces may not be applicable. Take a hook that "updates the ldcache" as an example. In the case of the NVIDIA Container Toolkit -- which implements this as the nvidia-ctk hook update-ldcache command (or nvidia-cdi-hook update-ldcache with the change from https://github.com/NVIDIA/nvidia-container-toolkit/pull/474) -- we take this to mean, run ldconfig from the HOST to update /etc/ld.so.cache IN THE CONTAINER using the specified folders.

I can see that as a concern. At the same time, I do think it is a narrowly prescribed set of activities. I originally thought 3, now there is a variant on 1, so total of 4.

Would we want to implement the truly standard ones, i.e. chmod and symlinks?

elezar commented 6 months ago

For clarification, the chmod hook that was added to nvidia-ctk was to workaround a crun bug and should be removed from this discussion.

reidpr commented 6 months ago

Thank you @elezar.

Take a hook that "updates the ldcache" as an example ... we take this to mean, run ldconfig from the HOST to update /etc/ld.so.cache IN THE CONTAINER .... Here we need to make provision for different paths to ldconfig on the host ...

I see your point. I wasn’t in that discussion obviously, but in my view that type of logic belongs more in the container implementation rather than vendor hooks. However you’ve been thinking about this longer than me.

I have a couple follow-ups.

Can you clarify the reasoning on why NVIDIA (or HPE, or ...) should be figuring out host ldconfig stuff rather than Charliecloud (or Podman, or ...)? What am I missing?

What if the host does not have ldconfig? (E.g., Alpine is based on musl, which doesn’t.)

as well as containers that don't have an ldcache...

Good point.

Do these containers need ldconfig run at all, though?

Is it reasonably common to have containers without an ldconfig executable within the container but still needing an ldcache update?

Our current strategy for how to handle this would be to provide better documentation and examples for what these hooks do .... [I]s some middle ground where we don't implement the hooks in the spec, but more strictly prescribe their behaviour?

I could work with this.

I really like the declarative aspect of CDI. Strictly prescribed hooks would be consistent with that, though (in my view) in kind of a roundabout way.

reidpr commented 6 months ago

latest working group meeting (notes here)

Are these meetings open to the public? It sounds like something I’d be interested in helping with.

elezar commented 6 months ago

Are these meetings open to the public? It sounds like something I’d be interested in helping with.

Yes, the meetings are public. All the required details should be in the linked document. Ideally they are recorded too, but there is sometimes some lag between when these are available on YouTube.

github-actions[bot] commented 3 months ago

This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed.

deitch commented 3 months ago

I will comment to keep it open.

elezar commented 3 months ago

I have created #225 as an initial attempt to capture the intent of the "predefined" hooks.

deitch commented 3 months ago

Cool, thank you @elezar . What do we think about chmod? Or is that just an artifact going away anyways, as you commented above?

github-actions[bot] commented 1 week ago

This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed. To skip these checks, apply the "lifecycle/frozen" label.

deitch commented 1 week ago

Not stale.

elezar commented 1 week ago

Cool, thank you @elezar . What do we think about chmod? Or is that just an artifact going away anyways, as you commented above?

The chmod hook was added to our tooling as a workaround for a particular runc bug. I don't think we should formalize this and rather focus on the remaining hooks.