artefactual-sdps / enduro

A tool to support ingest and automation in digital preservation workflows
https://enduro.readthedocs.io/
Apache License 2.0

Switch to a Ubuntu docker base image #1005

Closed djjuhasz closed 2 months ago

djjuhasz commented 3 months ago

The Ubuntu base image allows shell access to a running container for debugging purposes, unlike the Google distroless image.

codecov[bot] commented 3 months ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Please upload report for BASE (main@980f78e). Learn more about missing BASE report. Report is 13 commits behind head on main.

Additional details and impacted files

```diff
@@           Coverage Diff            @@
##             main    #1005   +/-   ##
=======================================
  Coverage        ?   53.12%
=======================================
  Files           ?      102
  Lines           ?     5835
  Branches        ?        0
=======================================
  Hits            ?     3100
  Misses          ?     2478
  Partials        ?      257
```

:umbrella: View full report in Codecov by Sentry.

sevein commented 3 months ago

Have you considered using ephemeral containers? They're available in both k8s and compose.

djjuhasz commented 3 months ago

@sevein I haven't considered using ephemeral containers because I didn't know they existed. It sounds like a good solution, but I'm not sure how to get it working with our k3d/Tilt dev environment. I just tried:

kubectl debug -it enduro-am-5dcb8756ff-4vtfr --image=busybox --target=enduro-am-worker

And I get a shell :tada: but when I do ls /home the directory is empty (I expected a /home/enduro directory).

I found https://github.com/k3d-io/k3d/discussions/885 which discusses how to get ephemeral containers working with k3d, but I don't really understand the details. It sounds like we may need to create a custom k3d config file that adds the ephemeral container feature gate, but I'm not sure if this is actually the problem with my attempt to get a debug shell in the enduro-am container. :confused: Any suggestions?
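For reference, the k3d discussion linked above suggests passing the feature gate through to the API server via a k3d config file. This is only a sketch of what that might look like (the schema version and flag syntax are assumptions, not something confirmed in this thread), and it may be unnecessary on newer clusters: the EphemeralContainers feature gate is enabled by default from Kubernetes 1.23 and the feature is GA in 1.25.

```yaml
# Hypothetical k3d config enabling the EphemeralContainers feature gate
# on the API server. Likely only needed on Kubernetes < 1.23.
apiVersion: k3d.io/v1alpha5
kind: Simple
options:
  k3s:
    extraArgs:
      - arg: --kube-apiserver-arg=feature-gates=EphemeralContainers=true
        nodeFilters:
          - server:*
```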

sevein commented 3 months ago

I haven't tried them myself, but /home/enduro is part of the enduro-am-worker image's filesystem, and your debug container is running busybox, which brings its own filesystem. Not sure if that will work.
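If the goal is just to look at the worker's files from the busybox debug container, one trick that may work (an assumption on my part, not something verified in this thread): `kubectl debug --target` puts the debug container in the worker's process namespace, so the worker's root filesystem should be reachable through procfs. Access can be denied depending on users and privileges, so this may require the debug container to run as root.

```shell
# Inside the ephemeral debug container started with --target=enduro-am-worker:
ps                                   # find the worker's PID (often 1)
ls /proc/1/root/home                 # browse the worker's own filesystem
cat /proc/1/root/home/enduro/.config/enduro.toml
```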

djjuhasz commented 3 months ago

@sevein one of my main use cases for wanting a shell is to be able to examine the local copy of a package and make sure the contents look correct. Is it possible to do that kind of thing with an ephemeral debug container? The reason I used the ubuntu base image is so I could shell into the enduro-am worker to check if I could sftp to my host from inside the container, but the busybox image doesn't appear to have an sftp client. :(

I guess I don't really understand the reason for using a distroless container in our dev environment. The reduced attack surface isn't really a concern, I think, because there is no outside access to the environment. The distroless image (20.7 MB) is smaller than the Ubuntu 22.04 image (77.9 MB), but I'm not really concerned about an extra 50 MB.

On the other hand, shell access to inspect the internal state of running containers seems like a big positive for a development environment.

sevein commented 3 months ago

I understand. I think we'd want to use the distroless image in production, and using the same image in both prod and dev is a nice-to-have if affordable, but if this is slowing you down then maybe it could be addressed in the future.

> @sevein one of my main use cases for wanting a shell is to be able to examine the local copy of a package and make sure the contents look correct. Is it possible to do that kind of thing with an ephemeral debug container?

It looks like it's possible, but you'd need to use a shared volume (e.g. see this deployment).
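A sketch of what that shared-volume arrangement might look like (all names, images, and paths here are illustrative, not taken from the actual deployment): an emptyDir mounted into both the worker and a long-running debug sidecar, so the sidecar can inspect packages the worker writes there.

```yaml
# Hypothetical pod spec sharing a package directory between the worker
# and a debug sidecar via an emptyDir volume.
apiVersion: v1
kind: Pod
metadata:
  name: enduro-am
spec:
  volumes:
    - name: shared-data
      emptyDir: {}
  containers:
    - name: enduro-am-worker
      image: enduro-am-worker:latest      # illustrative image name
      volumeMounts:
        - name: shared-data
          mountPath: /home/enduro/shared  # illustrative path
    - name: debug
      image: busybox:1.36
      command: ["sleep", "infinity"]
      volumeMounts:
        - name: shared-data
          mountPath: /data
```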

> The reason I used the ubuntu base image is so I could shell into the enduro-am worker to check if I could sftp to my host from inside the container, but the busybox image doesn't appear to have an sftp client. :(

You could maybe use ubuntu:latest instead of busybox and install whatever package you need.
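Something along these lines, perhaps (a sketch; the pod name is copied from the earlier attempt and the SFTP target is hypothetical — `host.k3d.internal` is the hostname k3d provides for reaching the host machine, assuming a k3d cluster):

```shell
# Start an ephemeral debug container from a fuller image...
kubectl debug -it enduro-am-5dcb8756ff-4vtfr \
  --image=ubuntu:22.04 --target=enduro-am-worker -- bash

# ...then, inside the debug container, install what you need:
apt-get update && apt-get install -y openssh-client
sftp user@host.k3d.internal   # test the SFTP connection back to the host
```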

djjuhasz commented 3 months ago

Okay thanks for your ideas @sevein. I'll talk it over with @jraddaoui when he's back next week and see if he want to stick with distroless or switch to ubuntu (or another base image with glibc and a shell).

jraddaoui commented 3 months ago

@djjuhasz @sevein, hard topic, I didn't know about ephemeral containers nor kubectl debug either, but I gave it a try following this example. I got pretty close but, when you add a PVC and volumes in the /home directory to the containers, it overwrites the contents of that directory, including Enduro's binary and configuration. I also tried with kubectl cp to copy and check the contents locally, but that requires the 'tar' binary in the container.

Why would you need to check the contents of the home directory? AFAIK Enduro's binary and configuration are the only things in there. Sharing and inspecting the contents of another directory could be possible, though.


With the ephemeral containers I could see the running enduro-am-worker process:

$ kubectl debug -it enduro-am-7c84797fc9-7tmb9 --image=busybox:1.28 --target=enduro-am-worker
Targeting container "enduro-am-worker". If you don't see processes from this container it may be because the container runtime doesn't support this feature.
Defaulting debug container name to debugger-5vvvn.
If you don't see a command prompt, try pressing enter.
/ # ps
PID   USER     TIME  COMMAND
    1 1000      0:03 /home/enduro/bin/enduro-am-worker --config /home/enduro/.config/enduro.toml
   40 root      0:00 sh
   46 root      0:00 ps

But I don't see how that could be used to test the SFTP connection. Even in a container with a shell, you'd still need to install and set up the client. At that point, if you just want to test that it works from within the cluster, maybe you could create a temporary image/pod, unrelated to enduro-am-worker but with similar limitations.
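A throwaway pod for that kind of connectivity test could look like this (a sketch; the pod name and SFTP endpoint are made up):

```shell
# Temporary interactive pod, deleted automatically on exit (--rm).
kubectl run sftp-test --rm -it --image=ubuntu:22.04 -- bash

# Inside the pod:
apt-get update && apt-get install -y openssh-client
sftp user@sftp.example.com   # hypothetical endpoint to test from in-cluster
```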


Using distroless also worries me regarding other dependencies. Thinking about xmllint for XSD validation in the projects where we are planning to use it: that, or any other dependency like it, will require a multi-stage approach where things are installed in and copied from another stage.

In any case, I agree with @djjuhasz about having a shell in the development environment, and I also agree with @sevein about using distroless in production when possible. It's not that I really like this solution, but we could have development targets in the Dockerfile. We already do this with the dashboard, targeting the builder stage instead of the final one to enable autoload and live updates.

We could add a base-dev stage using debian:12-slim or similar and create three enduro*-dev targets from it. One of the product trio goals for this quarter is to improve CI/CD in this project; I haven't been able to put much time into it lately, but I hope to soon, which would help with testing the distroless production images. Again, not a big fan, but what do you think?
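The multi-stage layout being proposed might be sketched like this (stage names, build paths, and the Go toolchain image are all assumptions for illustration, not Enduro's actual Dockerfile):

```dockerfile
# Hypothetical sketch: one builder stage, a distroless production target,
# and a debian-based dev target that keeps a shell and apt.
FROM golang:1.22 AS builder
WORKDIR /src
COPY . .
RUN CGO_ENABLED=0 go build -o /out/enduro-am-worker ./cmd/worker  # illustrative path

# Production: minimal, no shell.
FROM gcr.io/distroless/base-debian12 AS enduro-am-worker
COPY --from=builder /out/enduro-am-worker /home/enduro/bin/enduro-am-worker
ENTRYPOINT ["/home/enduro/bin/enduro-am-worker"]

# Development: same binary, but with a shell and package manager available.
FROM debian:12-slim AS enduro-am-worker-dev
COPY --from=builder /out/enduro-am-worker /home/enduro/bin/enduro-am-worker
ENTRYPOINT ["/home/enduro/bin/enduro-am-worker"]
```

Tilt/docker build would then select `--target enduro-am-worker-dev` locally and `--target enduro-am-worker` for production images.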

djjuhasz commented 3 months ago

@jraddaoui I'm open to using debian:12-slim, but just FYI, the Ubuntu 20.04 images are pretty much the exact same size:

[Two screenshots from 2024-08-19 comparing image sizes]

I'm also open to switching back to Alpine Linux, installing python3 and bagit-python, and creating a simple command-line wrapper to do bag validation. My experience with the bagit-gython embedded Python experiment is that it adds significant complexity without any major benefit over installing and wrapping the Python script.
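In practice such a wrapper would just install bagit-python and call `bagit.Bag(path).validate()`. To show the core of what bag validation checks, here is a minimal stdlib-only sketch (the function name is mine, and it only verifies payload checksums against `manifest-sha256.txt`, a small subset of what bagit-python validates):

```python
"""Minimal sketch of BagIt payload-checksum validation (stdlib only)."""
import hashlib
from pathlib import Path


def validate_payload(bag_dir: str) -> list[str]:
    """Return a list of error messages; an empty list means the payload verified."""
    bag = Path(bag_dir)
    errors = []
    manifest = bag / "manifest-sha256.txt"
    for line in manifest.read_text().splitlines():
        if not line.strip():
            continue
        # Each manifest line is "<sha256-hex>  <relative/path>".
        expected, relpath = line.split(maxsplit=1)
        target = bag / relpath
        if not target.is_file():
            errors.append(f"missing: {relpath}")
            continue
        actual = hashlib.sha256(target.read_bytes()).hexdigest()
        if actual != expected:
            errors.append(f"checksum mismatch: {relpath}")
    return errors
```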

djjuhasz commented 3 months ago

@jraddaoui as to why I want access to the "/home" directory: I'm trying to access a failed SIP at /home/a3m/.local/share/a3m/share/failed to figure out why the a3m "verify checksum" task failed. I'm trying to get access to that file again with the distroless container, and I can't figure out how to get at the filesystem or its contents. As you mentioned above, kubectl cp requires tar, which isn't installed in the distroless container. :(

jraddaoui commented 3 months ago

You could do that from the a3m container, it uses Ubuntu 22.04:

https://github.com/artefactual-labs/a3m/blob/main/Dockerfile
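Concretely, something like this should work (a sketch; the pod name is made up, and `-c` selects the a3m container within the pod):

```shell
# The a3m image is Ubuntu-based, so it has a shell and tar.
kubectl exec -it enduro-a3m-0 -c a3m -- bash
ls /home/a3m/.local/share/a3m/share/failed

# Or copy the failed SIP out to the host; kubectl cp needs tar in the
# target container, which Ubuntu provides.
kubectl cp enduro-a3m-0:/home/a3m/.local/share/a3m/share/failed ./failed -c a3m
```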

djjuhasz commented 3 months ago

> You could do that from the a3m container, it uses Ubuntu 22.04:
>
> https://github.com/artefactual-labs/a3m/blob/main/Dockerfile

Ah, clever @jraddaoui! Thanks, that's a good temporary workaround until we decide on a more permanent solution.

djjuhasz commented 2 months ago

Closing this for now. We may choose to re-open it in the future.