NVIDIA / container-canary

A tool for testing and validating container requirements against versioned manifests
Apache License 2.0
245 stars 15 forks source link

Make container startup timeout configurable #65

Closed jameslamb closed 3 months ago

jameslamb commented 3 months ago

Fixes #61

Contributes to https://github.com/rapidsai/docker/issues/667

container-canary runs a container and then executes commands inside it. When it starts up that container, if it doesn't see the container come up within 10 seconds, it raises an exception.

This proposes making that timeout configurable, via a CLI argument --startup-timeout. This will make it easier to use container-canary with images / environments where container starting is slower (e.g. where something expensive is done at startup or where local disks are slow).

jameslamb commented 3 months ago

I think that these changes are working as intended, but I'm struggling to test them.

I tested that the configuration value is getting all the way down to the place where it needs to be by adding a print statement, like this:

return fmt.Errorf("-- timeoutSeconds: %d", timeoutSeconds)
if time.Since(startTime) > (time.Second * time.Duration(timeoutSeconds)) {
   ...

And then doing this:

make clean testprep build

bin/canary validate \
  --file ./examples/kubeflow.yaml \
  --startup-timeout 5 \
  container-canary/kubeflow:shouldpass

bin/canary validate \
  --file ./examples/kubeflow.yaml \
  --startup-timeout 9 \
  container-canary/kubeflow:shouldpass
Error: -- timeoutSeconds: 5
Error: -- timeoutSeconds: 5

Error: -- timeoutSeconds: 9
Error: -- timeoutSeconds: 9

So this is at least as correct as the existing code.

But I'm not sure how to actually test the enforcement of the timeout. This startup timeout applies to the time between when docker run -d returns, here:

https://github.com/NVIDIA/container-canary/blob/2e70e2bb289c6e5a5180c8deefd86f2150256343/internal/container/docker.go#L63

and when the container is first reported as "running", here:

https://github.com/NVIDIA/container-canary/blob/2e70e2bb289c6e5a5180c8deefd86f2150256343/internal/container/docker.go#L67

I'm not sure how to easily influence the amount of time between those two states. That does not include any of these:

The only things I can think of that cause this startup timeout to be exceeded are like this:

@jacobtomlinson @KyleFromNVIDIA can you think of a good way to reliably test this? Or should we just merge this as-is without testing changes?

jameslamb commented 3 months ago

This can be either a Go interface or a shim command that gets substituted in place of docker, or something else.

Thanks for these ideas, they're good ones I hadn't considered. A shim for docker, like you did over in rapids-build-backend for nvcc (code link), sounds like it could be a powerful way to test more of how this tool interacts with the container runtime it's shelling out to.

I think that's way beyond my Go abilities though 😬

I agree with you about this being small and self-explanatory, and I'm also confident it could just be merged as-is. Let's see what @jacobtomlinson thinks.

jameslamb commented 3 months ago

Alright sounds good, thanks!