volmasoft opened 4 years ago
I have documented this here as an issue as I'll raise an issue on the main accumulo project to do the same work and cross-reference.
Currently we have 4 TAR files in the image, 3 of which are downloaded (Hadoop, Zookeeper, Accumulo) to get their libraries.
The first cache layer is the download layer where we download the 3 files:
RUN set -eux; \
  download_bin() { \
    local f="$1"; shift; \
    local hash="$1"; shift; \
    local distFile="$1"; shift; \
    local success=; \
    local distUrl=; \
    for distUrl in ${APACHE_DIST_URLS}; do \
      if wget -nv -O "/tmp/${f}" "${distUrl}${distFile}"; then \
        success=1; \
        # Checksum the download
        echo "${hash}" "/tmp/${f}" | sha1sum -c -; \
        break; \
      fi; \
    done; \
    [ -n "${success}" ]; \
  }; \
  \
  download_bin "apache-zookeeper.tar.gz" "${ZOOKEEPER_HASH}" "zookeeper/zookeeper-${ZOOKEEPER_VERSION}/apache-zookeeper-${ZOOKEEPER_VERSION}-bin.tar.gz"; \
  download_bin "hadoop.tar.gz" "${HADOOP_HASH}" "hadoop/core/hadoop-${HADOOP_VERSION}/hadoop-${HADOOP_VERSION}.tar.gz"; \
  download_bin "accumulo.tar.gz" "${ACCUMULO_HASH}" "accumulo/${ACCUMULO_VERSION}/accumulo-${ACCUMULO_VERSION}-bin.tar.gz";
Then we untar the files into their correct locations in 3 separate RUN instructions:
RUN tar xzf /tmp/hadoop.tar.gz -C /opt/ && ln -s /opt/hadoop-${HADOOP_VERSION} /opt/hadoop
RUN tar xzf /tmp/apache-zookeeper.tar.gz -C /opt/ && ln -s /opt/apache-zookeeper-${ZOOKEEPER_VERSION}-bin /opt/apache-zookeeper
RUN tar xzf /tmp/accumulo.tar.gz -C /opt/ && ln -s /opt/accumulo-${ACCUMULO_VERSION} /opt/accumulo && sed -i 's/\${ZOOKEEPER_HOME}\/\*/\${ZOOKEEPER_HOME}\/\*\:\${ZOOKEEPER_HOME}\/lib\/\*/g' /opt/accumulo/conf/accumulo-env.sh
Here are some stats for the current implementation:
- Before the download: Docker image size 510MB
- After the download: Docker image size 913MB; file sizes on disk: 384MB in /tmp (Accumulo 33M, Zookeeper 8.9M, Hadoop 343M)
- After the untar and relocation: Docker image size 1.86GB; file sizes on disk: 384M in /tmp, 964M in /opt
Summary: there are two issues: (a) splitting the download and the untar across separate cache layers risks invalidating the download layer, and (b) the image carries the tar files on top of their extracted contents.
I'm going to update the download_bin() function to download the binaries, extract them, and delete the original downloads in a single step.
This should (a) stop us from invalidating the cache layer and (b) reduce the image size by at least the tar file sizes (384MB), which would bring the overall container down to approx 1.5GB, still big but around a 22% reduction.
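As a minimal sketch of the combined step, the following function verifies, extracts, and deletes the archive in one go. The name `fetch_and_extract` and its arguments are illustrative (not the real `download_bin()`), and the mirror-looping `wget` from the real function is elided so the sketch stays self-contained; the point is that the tarball is removed in the same layer that extracts it.

```shell
#!/bin/sh
# Sketch only: in the real download_bin() the archive would first be
# fetched with wget from the Apache mirror list. Here we assume it is
# already on disk, and focus on verify -> extract -> delete in one step
# so the archive bytes are never committed to a Docker layer.
set -eu

fetch_and_extract() {
  archive="$1"   # path to the downloaded tarball
  hash="$2"      # expected sha1 checksum of the tarball
  dest="$3"      # directory to extract into

  # Verify the checksum before trusting the archive.
  echo "${hash}  ${archive}" | sha1sum -c -

  # Extract, then immediately remove the tarball so its bytes do not
  # persist alongside the extracted contents.
  tar xzf "${archive}" -C "${dest}"
  rm -f "${archive}"
}
```

Inside a Dockerfile this whole sequence would sit in a single RUN instruction, which is what keeps the tarball out of every committed layer.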
This approach would then be mirrored in the accumulo-docker project to keep consistency.
Isn't this what multi-stage builds are for?
https://docs.docker.com/develop/develop-images/multistage-build/
> Isn't this what multi-stage builds are for?
> https://docs.docker.com/develop/develop-images/multistage-build/
You could use a multistage build here but I am not sure it is necessary.
@madrob I would usually associate multi-stage builds with compilation-based tools, where you might need a tonne of libraries installed to build but the compiled binary is standalone.
This change was more about being smart with each cache layer: by splitting the download and untar across two commands, we essentially doubled the size.
I'm happy to look at a multistage build approach if you have a good idea of where to draw the line/split? I couldn't spot an easy one that made sense. The only idea I came up with is using a builder to acquire the binaries (hadoop, accumulo, zookeeper, accumulo-proxy) and then using the second image definition to grab these binaries.
I welcome your thoughts, though, as I'm no expert in multi-stage builds; I've used them at work and on home projects a few times, but mostly for making consistent compilation environments (e.g. with GCC or test tools installed) that aren't needed for running the app.
If it helps, I did push a branch working on this ticket (https://github.com/apache/accumulo-proxy/pull/23), but I still need to verify something on it; I got distracted going down a rabbit hole of seeing whether I could get accumulo-proxy running on an Alpine-Linux-backed JDK to reduce the size further.
Have a gander, see what you think?
Given I'd like to take the same approach for the main accumulo-docker image (it suffers from the same problem) it'd be good to get eyes on this.
It's worth noting that the pull request I uploaded (#22) has brought the generated image size down from 1.86GB to 1.46GB, a reduction of 21.5%.
Two further ideas on dropping disk usage, let me know if you think these are worth pursuing?
Idea 1: remove the Hadoop docs (potential saving: 500M)
Hadoop ships with a tonne of docs, which are useful in general but not when you're only using Hadoop for its libraries (as in this container environment).
Does anyone know whether it's acceptable to install the Hadoop release (in this case hadoop-3.2.1) and then remove the hadoop-3.2.1/share/doc folder?
I'd still need to conduct some testing, but this would potentially save us 499.1M (noting Hadoop's total install size is 897.8M).
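If dropping the docs turns out to be acceptable, a minimal sketch of how it could slot into the existing Hadoop untar step follows; note the removal has to happen in the same RUN instruction as the extraction, otherwise the docs are still committed to the intermediate layer.

```dockerfile
# Sketch only: prune Hadoop's bundled docs inside the same RUN instruction
# that extracts the tarball, so the ~500M never lands in a committed layer.
RUN tar xzf /tmp/hadoop.tar.gz -C /opt/ \
    && rm -rf /opt/hadoop-${HADOOP_VERSION}/share/doc \
    && ln -s /opt/hadoop-${HADOOP_VERSION} /opt/hadoop
```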
Idea 2: move to an Alpine base image (potential saving: 400M)
I took a quick detour today down the openjdk:8-alpine3.9 rabbit hole.
It turns out (at least on the face of it) that moving to Alpine isn't too difficult for accumulo-proxy; the only thing missing is bash (Alpine ships with sh).
Whilst I could probably rewrite accumulo-proxy to not need bash and to work in either bash or sh environments, the rabbit hole continues when you start looking at things like our use of accumulo classpath, which also requires bash (or rewriting).
A quick but not great solution is to take the Alpine version and add bash to it.
I took this for a quick tour and got the image size down to 1.06GB, so that's approx. another 400M saved.
I quickly mocked an example of combining these 3 changes:
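The mocked example itself isn't reproduced above, so here is a hedged sketch of what combining the three changes might look like, shown for Hadoop only; the ARG names (HADOOP_VERSION, HADOOP_HASH, APACHE_DIST_URL) and the single-mirror wget stand in for the real version pins and mirror-fallback logic.

```dockerfile
# Illustrative sketch, not the actual mocked example.
FROM openjdk:8-alpine3.9

# Alpine ships with sh only; the accumulo scripts need bash.
RUN apk add --no-cache bash

ARG HADOOP_VERSION
ARG HADOOP_HASH
ARG APACHE_DIST_URL

# Download, verify, extract, prune the docs, and delete the tarball in a
# single RUN instruction, so neither the archive nor the docs are
# committed to any layer.
RUN set -eux; \
    wget -nv -O /tmp/hadoop.tar.gz "${APACHE_DIST_URL}hadoop/core/hadoop-${HADOOP_VERSION}/hadoop-${HADOOP_VERSION}.tar.gz"; \
    echo "${HADOOP_HASH}  /tmp/hadoop.tar.gz" | sha1sum -c -; \
    tar xzf /tmp/hadoop.tar.gz -C /opt/; \
    rm -rf /opt/hadoop-${HADOOP_VERSION}/share/doc /tmp/hadoop.tar.gz; \
    ln -s /opt/hadoop-${HADOOP_VERSION} /opt/hadoop
```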
I would expect us to also be able to make the same changes to accumulo-docker.
Nice. To me it is fine just installing bash (via apk on Alpine) vs rewriting. Thanks for looking into this @volmasoft
@mjwall do you have any views/experience on whether it's acceptable to drop a whole portion of a Hadoop install (the docs), and whether this is restricted due to licensing etc.? I'm no expert in this area, so I don't want to do something that could cause issues.
I'll wrap the alpine change in today and update the pull.
@mjwall @keith-turner conscious the pull request is open still here: https://github.com/apache/accumulo-proxy/pull/23
Any further work required? If we're good to merge, then I can take a look at the Accumulo Docker container to keep consistency.
@volmasoft I think there were some outstanding comments made by @keith-turner on the PR #23 that haven't yet been addressed, regarding the preference for the jre-slim Java image. We could probably merge it once all the comments have been addressed.
Apologies for the delay, I'll aim to get this boxed off this week.
During the pull request here: https://github.com/apache/accumulo-proxy/pull/20 @mjwall spotted that we were being a bit inefficient with our container size by storing the tar before extracting it.
This should be cleaned up and ideally done in a single step by updating the download_bin() method.
For consistency's sake we should ideally do the same in the accumulo-docker repo: https://github.com/apache/accumulo-docker/blob/master/Dockerfile