NVIDIA / container-canary

A tool for testing and validating container requirements against versioned manifests
Apache License 2.0
245 stars 15 forks source link

feature request: allow use of '--network host' #63

Closed jameslamb closed 3 months ago

jameslamb commented 3 months ago

Description

I'd like to be able to convince container-canary to run containers with --network host.

@trxcllnt and I have been using container-canary in a docker-in-docker context in CI, and he observed some networking instability in that environment that might be be resolved by avoiding double-nested network isolation.

Benefits of this work

Allows finer-grained control over how container-canary is run.

If this does seem to help for the docker-in-docker case specifically, would improve container-canary's usability in a common pattern used in CI.

Acceptance Criteria

Approach

Options Considered

I see 3 options for exposing this configuration, @jacobtomlinson I'll happily defer to your opinion on which to pursue (assuming you're open to this idea generally).


Option 1: hostNetwork

The k8s V1 podSpec has an option hostNetwork, for exactly this purpose. (k8s docs).

hostNetwork (boolean)

Host networking requested for this pod. Use the host's network namespace. If this option is set, the ports that will be used must be specified. Default to false.

So that'd look like the container-canary config picking up a new boolean option:

hostNetwork: true

Option 2: named network argument

container-canary could instead support passing any network name via a top-level config argument.

network: host

This would allow greater customization, of the form described in "User-defined networks" (Docker docs).


Option 3: arbitrary array of extra arguments

container-canary could support passing arbitrary flags to the container runtime via an array.

runOptions:
  --network: host
  --gpus: all

Where the implication is that for each key: value pair there, {key} {value} is passed to docker run. Similar to this:

https://github.com/NVIDIA/container-canary/blob/c47389618e547db3d572ddfd62e3357449880ec9/internal/container/docker.go#L41-L47

GitHub Actions has a similar interface, where there are some container characteristics that are set via top-level configuration, and then any other arbitrary flags can be passed via jobs.<job_id>.container.options (GitHub Actions docs)


Proposal

I (@jameslamb) personally prefer option 3. It offers high flexibility to experiment, so could add a lot of capability to container-canary all at once (beyond just --network host).

For example, it'd immediately also add support for running tests that require GPUs (you could just pass --gpus all).

And having the arguments be keys in the YAML map means that if, in the future, we wanted to have container-canary prohibit setting some option or require that it be set by some other top-level configuration value, it'd be straightforward to issue deprecation warnings and eventually errors like "Passing --gpus via runOptions is now allowed. Please use top-level key gpus: instead".

trxcllnt commented 3 months ago

+1 for this, except make runOptions a list of arguments passed directly to docker run, .e.g runOptions: [--network host --gpus all]. This provides maximum flexibility and maps directly to the CLI.

jacobtomlinson commented 3 months ago

This seems like a good goal to me.

One thing I want to avoid is leaking the implementation details around canary using docker under the hood. I tried to build things in a way where we could support podman or nerdctl (or whatever comes next) in the future. So I don't want to add anything docker specific to the configuration. In this case I would lean towards option 1 because it both copies the Kubernetes API and also allows it to be implemented under the hood for each runtime.

However, I've seen other projects like kind try and maintain the same kind of disconnect between the user API and the implementation and result in things being hard or impossible (like setting the --gpus=all flag). So I don't want the goal of implementation detail purity to work against us.

I also quite like option 3 because it would just allow passing arbitrary flags to the CLI under the hood, and I like @trxcllnt's suggestion of making it a simple list. However I worry a little about a future where we want to support podman and all of these manifests have docker specific flags set in them.

Perhaps the way to deal with this us to just expose the implementation directly and have something like dockerRunOptions which allows is to add podmanRunOptions in the future. The only downside to this is that people will end up making manifests that are tailored to a certain runtime and will likely not work with a different one.

Any thoughts on a solution to this?

jameslamb commented 3 months ago

Perhaps the way to deal with this us to just expose the implementation directly and have something like dockerRunOptions which allows is to add podmanRunOptions in the future

I think we should pursue this dockerRunOptions approach, and making it a list (instead of a map) is fine with me.

The idea to namespace by container runtime is a good one. There are only a small number of container runtimes I can imagine this tool ever needing to support, so the proliferation of those {runtime}RunOptions shouldn't be too bad.

And with all of this defined in static configuration, there's a path available to use warnings / errors to move configs over to top-level (portable-across-runtimes) options as new ones are added. Similar to the the example I gave above, by raising warnings / errors like this:

"[WARN] Passing --gpus via dockerRunOptions is not allowed. Please use top-level key gpus: instead".

I like this the most because it allows for experimentation. In the absence of this, right now to test if --network host even does actually fix the particular CI that inspired this issue, I'd have to modify container-canary's source code, rebuild it, and then use that built-from-source version to test. With the ability to pass through arbitrary flags, that's no longer necessary.

jacobtomlinson commented 3 months ago

That sounds totally reasonable to me. Let's go with option 3 and in the future move things that we know people want to option 1 for better portability.

jacobtomlinson commented 3 months ago

I just published v0.4.0 which includes option 3.

jameslamb commented 3 months ago

Excellent, thank you so much!