WordPress / openverse

Openverse is a search engine for openly-licensed media. This monorepo includes all application code.
https://openverse.org
MIT License
235 stars 186 forks source link

Fix `audiowaveform` isolation issues in API dockerfile #4680

Open sarayourfriend opened 1 month ago

sarayourfriend commented 1 month ago

Current Situation

Despite realies/audiowaveform building audiowaveform using BUILD_STATIC=1, the output is still a dynamically linked binary. From within the latest image:

/ # ldd /usr/local/bin/audiowaveform
        /lib/ld-musl-x86_64.so.1 (0x7f17b78f9000)
        libstdc++.so.6 => /usr/lib/libstdc++.so.6 (0x7f17b765a000)
        libgcc_s.so.1 => /usr/lib/libgcc_s.so.1 (0x7f17b7636000)
        libc.musl-x86_64.so.1 => /lib/ld-musl-x86_64.so.1 (0x7f17b78f9000)

The API Dockerfile copies these four libraries directly from the realies/audiowaveform image and overlays them on the base system. The audiowaveform binary we copy out of that image was built and tested with those libraries. However, as noted in our API Dockerfile, because we run apt update after copying these libraries, audiowaveform may not necessarily be running with the version of those libraries it was built and tested against.

Suggested Improvement

I propose we resolve this once and for all as follows:

Instead of copying the binary as built from realies/audiowaveform, download the Debian Bookworm .deb (the 1.10.1-1-12_<arch>.deb one, 12 indicating the Debian version) from https://github.com/bbc/audiowaveform/releases/tag/1.10.1 and install it alongside other system dependencies in our existing RUN apt [...] line. We should also specify the Bookworm variant of the python -slim tag, even though that is the default. The correct .deb needs to be pulled based on the architecture of the build host.

The main complexity of this is ensuring the correct architecture .deb is downloaded depending on the build host. On the other hand, it would allow us to delete a huge portion of the API Dockerfile dedicated to copying (and making it possible to copy) the audiowaveform binary out of realies/audiowaveform. It also removes an intermediate middle dependency between us and audiowaveform.

Benefit

Resolve this minor edge case in API build stability.

Additional context

I also considered the following alternatives:

  1. Continue copying the binary, but isolate it and its libraries in /opt/audiowaveform, and then set LD_LIBRARY_PATH when invoking audiowaveform. Do this without needing to modify the Python code that calls the binary by creating a shim /usr/local/bin/audiowaveform along the lines of...: LD_LIBRARY_PATH="/opt/audiowaveform/lib:/opt/audiowaveform/usr/lib" /opt/audiowaveform/usr/local/bin/audiowaveform "$@"
  2. Switch from the -slim to -alpine variant for the final Python image. The (partially) "static" build only leaves out standard library dependencies. These are essentially never going to change unless audiowaveform changes to a new language. We could therefore assume that so long as we ensure libstdc++ and libgcc are installed with apk when we update the base system (trivial), we wouldn't need to copy these final linked libraries anymore. libc.musl will already be in the alpine base system.

The first of these alternatives is, in theory, a smaller change, but might take some fiddling (I haven't actually tried it but it should work in theory as long as ld works the way it should!). The second might work, but I don't know whether the rest of our Python application runs on Alpine without issue.

Overall, the approach I chose to propose moving forward with is the simplest both in the immediate way it can be implemented and in that it significantly simplifies our API Dockerfile. It is the safest and cleanest option, all things considered.

dhruvkb commented 1 month ago

audiowaveform may not necessarily be running with the version of those libraries it was built and tested against

The comment this links to is outdated. We are no longer using latest for the waveform image but rather a pinned version.

https://github.com/WordPress/openverse/blob/d9215a2d2141ce4d0de4b3c6f5795be2b459770c/api/Dockerfile#L11

Is the build process still unreliable, if we've pinned the version? To clarify I am not opposed to (rather very much in favour of) the idea, I only want to know more about the rationale behind the change.

sarayourfriend commented 1 month ago

It's unreliable if apt update or a base system update (e.g., new Debian version) includes g++ changes incompatible with whatever audiowaveform:1.10.1 was compiled against.

The unreliability isn't from selecting the dependencies or the locations of things, it's from the fact that the binary is dynamically linked to libraries that it was never validated against (and in particular built versions of those libraries from a distribution almost explicitly incompatible with the Debian approach to things).