Open deitch opened 7 months ago
I’m working on CDI for Charliecloud, a lightweight, fully-unprivileged container implementation for HPC applications (hpc/charliecloud#1902). (In fact, it looks like @elezar was already helpful in some offline correspondence.) This would also be useful for us.
In fact, regardless of what the standard says, I anticipate not running nvidia-ctk hook
to do these things, but rather interpreting such clauses as @deitch suggests and just do the things internally. Such a kludge seems better to me than calling external programs for these “basic standard” tasks, but I’d much rather follow documented behavior.
I would be happy to provide a draft PR for folks to look at.
@reidpr we did discuss this in our latest working group meeting (notes here) and will most likely not be implementing this as part of the spec for the time being.
One of the primary concerns is that even for "basic standard" tasks the applicable interfaces may not be applicable. Take a hook that "updates the ldcache" as an example. In the case of the NVIDIA Container Toolkit -- which implements this as the nvidia-ctk hook update-ldcache
command (or nvidia-cdi-hook update-ldcache
with the change from https://github.com/NVIDIA/nvidia-container-toolkit/pull/474) -- we take this to mean, run ldconfig
from the HOST to update /etc/ld.so.cache
IN THE CONTAINER using the specified folders. Here we need to make provision for different paths to ldconfig
on the host (e.g. on ubuntu the correct path is ldconfig.real
since ldconfig
is a shell wrapper) as well as containers that don't have an ldcache (e.g. busybox).
If we need to capture this logic in some kind of binary, this still needs to be available on the host and would complicate CDI spec consumption since we would have to distribute this binary somehow.
Our current strategy for how to handle this would be to provide better documentation and examples for what these hooks do so that their behaviour can be replicated across various environments. This does place the burden on the hook vendors to provide this documentation though and cannot necessarily be enforced by the spec.
One question I have is whether there is some middle ground where we don't implement the hooks in the spec, but more strictly prescribe their behaviour? That is to say that as a vendor, NVIDIA would still distribute the nvidia-cdi-hook
binary that would include an update-ldcache
subcommand, but that the pre-and post conditions for this command are defined by a public spec. As the spec owners, we could even provide a set of test cases to validate this. This should in theory allow hooks to be more interchangeable and allow users to determine whether hooks are applicable given their environments.
One of the primary concerns is that even for "basic standard" tasks the applicable interfaces may not be applicable. Take a hook that "updates the ldcache" as an example. In the case of the NVIDIA Container Toolkit -- which implements this as the nvidia-ctk hook update-ldcache command (or nvidia-cdi-hook update-ldcache with the change from https://github.com/NVIDIA/nvidia-container-toolkit/pull/474) -- we take this to mean, run ldconfig from the HOST to update /etc/ld.so.cache IN THE CONTAINER using the specified folders.
I can see that as a concern. At the same time, I do think it is a narrowly prescribed set of activities. I originally thought 3, now there is a variant on 1, so total of 4.
Would we want to implement the truly standard ones, i.e. chmod and symlinks?
For clarification, the chmod
hook that was added to nvidia-ctk
was to workaround a crun
bug and should be removed from this discussion.
Thank you @elezar.
Take a hook that "updates the ldcache" as an example ... we take this to mean, run
ldconfig
from the HOST to update/etc/ld.so.cache
IN THE CONTAINER .... Here we need to make provision for different paths toldconfig
on the host ...
I see your point. I wasn’t in that discussion obviously, but in my view that type of logic belongs more in the container implementation rather than vendor hooks. However you’ve been thinking about this longer than me.
I have a couple follow-ups.
Can you clarify the reasoning on why NVIDIA (or HPE, or ...) should be figuring out host ldconfig
stuff rather than Charliecloud (or Podman, or ...)? What am I missing?
What if the host does not have ldconfig
? (E.g., Alpine is based on musl, which doesn’t.)
as well as containers that don't have an ldcache...
Good point.
Do these containers need ldconfig
run at all, though?
Is it reasonably common to have containers without an ldconfig
executable within the container but still needing an ldcache update?
Our current strategy for how to handle this would be to provide better documentation and examples for what these hooks do .... [I]s some middle ground where we don't implement the hooks in the spec, but more strictly prescribe their behaviour?
I could work with this.
I really like the declarative aspect of CDI. Strictly prescribed hooks would be consistent with that, though (in my view) in kind of a roundabout way.
latest working group meeting (notes here)
Are these meetings open to the public? It sounds like something I’d be interested in helping with.
Are these meetings open to the public? It sounds like something I’d be interested in helping with.
Yes, the meetings are public. All the required details should be in the linked document. Ideally they are recorded too, but there is sometimes some lag between when these are available on YouTube.
This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed.
I will comment to keep it open.
I have created #225 as an initial attempt to capture the intent of the "predefined" hooks.
Cool, thank you @elezar . What do we think about chmod
? Or is that just an artifact going away anyways, as you commented above?
This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed. To skip these checks, apply the "lifecycle/frozen" label.
Not stale.
Cool, thank you @elezar . What do we think about chmod? Or is that just an artifact going away anyways, as you commented above?
The chmod hook was added to our tooling as a workaround for a particular runc bug. I don't think we should formalize this and rather focus on the remaining hooks.
I have been working with the NVIDIA container toolkit which generates a CDI yaml file, including hooks. The hooks all call
nvidia-ctk
to do 3 things:These are pretty basic standard activities. It seems like it would make sense to have
containerEdits
within the spec capable of handling these. The spec includes container edit things likemounts
,env
,deviceNodes
,additionalGIDs
, and of coursehooks
. I think that creating symlinks, changing permissions on dir/file, and maybe updating ldcache (I am mixed on that one) are the kinds of capabilities that would fit as first-class citizens withincontainerEdits
.Thoughts?