volmasoft opened 4 years ago
I have documented this here as an issue as I'll raise an issue on the main accumulo project to do the same work and cross-reference.
Currently we have 4 TAR files in the image, 3 of which are downloaded (Hadoop, Zookeeper, Accumulo) to get their libraries.
The first cache layer is the download layer where we download the 3 files:
RUN set -eux; \
  download_bin() { \
    local f="$1"; shift; \
    local hash="$1"; shift; \
    local distFile="$1"; shift; \
    local success=; \
    local distUrl=; \
    for distUrl in ${APACHE_DIST_URLS}; do \
      if wget -nv -O "/tmp/${f}" "${distUrl}${distFile}"; then \
        success=1; \
        # Checksum the download
        echo "${hash}" "/tmp/${f}" | sha1sum -c -; \
        break; \
      fi; \
    done; \
    [ -n "${success}" ]; \
  }; \
  \
  download_bin "apache-zookeeper.tar.gz" "${ZOOKEEPER_HASH}" "zookeeper/zookeeper-${ZOOKEEPER_VERSION}/apache-zookeeper-${ZOOKEEPER_VERSION}-bin.tar.gz"; \
  download_bin "hadoop.tar.gz" "${HADOOP_HASH}" "hadoop/core/hadoop-${HADOOP_VERSION}/hadoop-${HADOOP_VERSION}.tar.gz"; \
  download_bin "accumulo.tar.gz" "${ACCUMULO_HASH}" "accumulo/${ACCUMULO_VERSION}/accumulo-${ACCUMULO_VERSION}-bin.tar.gz";
Then we untar the files into their correct locations in 3 separate RUN instructions:
RUN tar xzf /tmp/hadoop.tar.gz -C /opt/ && ln -s /opt/hadoop-${HADOOP_VERSION} /opt/hadoop
RUN tar xzf /tmp/apache-zookeeper.tar.gz -C /opt/ && ln -s /opt/apache-zookeeper-${ZOOKEEPER_VERSION}-bin /opt/apache-zookeeper
RUN tar xzf /tmp/accumulo.tar.gz -C /opt/ && ln -s /opt/accumulo-${ACCUMULO_VERSION} /opt/accumulo && sed -i 's/\${ZOOKEEPER_HOME}\/\*/\${ZOOKEEPER_HOME}\/\*\:\${ZOOKEEPER_HOME}\/lib\/\*/g' /opt/accumulo/conf/accumulo-env.sh
Here are some stats for the current implementation:
- Before the download: Docker image size 510MB
- After the download: Docker image size 913MB; file sizes on disk: 384MB in /tmp (Accumulo 33M, Zookeeper 8.9M, Hadoop 343M)
- After the untar and relocation: Docker image size 1.86GB; file sizes on disk: 384M in /tmp, 964M in /opt
Summary: there are two issues: (a) splitting the download and the untar across separate cache layers risks invalidating the download layer, and (b) the image carries the tar files on top of their extracted contents.
I'm going to update the download_bin() function to download the binaries, extract them, and delete the original downloads in a single step.
This should (a) stop us from invalidating the cache layer and (b) reduce the image size by at least the tar file sizes (384MB), which would bring the overall container down to approx 1.5GB, still big but around a 22% reduction.
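As a minimal sketch of the combined step, the following function verifies, extracts, and deletes the archive in one go. The name `fetch_and_extract` and its arguments are illustrative (not the real `download_bin()`), and the mirror-looping `wget` from the real function is elided so the sketch stays self-contained; the point is that the tarball is removed in the same layer that extracts it.

```shell
#!/bin/sh
# Sketch only: in the real download_bin() the archive would first be
# fetched with wget from the Apache mirror list. Here we assume it is
# already on disk, and focus on verify -> extract -> delete in one step
# so the archive bytes are never committed to a Docker layer.
set -eu

fetch_and_extract() {
  archive="$1"   # path to the downloaded tarball
  hash="$2"      # expected sha1 checksum of the tarball
  dest="$3"      # directory to extract into

  # Verify the checksum before trusting the archive.
  echo "${hash}  ${archive}" | sha1sum -c -

  # Extract, then immediately remove the tarball so its bytes do not
  # persist alongside the extracted contents.
  tar xzf "${archive}" -C "${dest}"
  rm -f "${archive}"
}
```

Inside a Dockerfile this whole sequence would sit in a single RUN instruction, which is what keeps the tarball out of every committed layer.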
This approach would then be mirrored in the accumulo-docker project to keep consistency.
Isn't this what multi-stage builds are for?
https://docs.docker.com/develop/develop-images/multistage-build/
> Isn't this what multi-stage builds are for?
> https://docs.docker.com/develop/develop-images/multistage-build/
You could use a multistage build here but I am not sure it is necessary.
@madrob I would usually associate multi-stage builds with compilation-based tools, where you might need a tonne of libraries installed to build but the compiled binary is standalone.
This change was more about being smart with each cache layer: by splitting the download and untar across two commands, we essentially doubled the size.
I'm happy to look at a multistage build approach if you have a good idea of where to draw the line/split? I couldn't spot an easy one that made sense. The only idea I came up with is using a builder to acquire the binaries (hadoop, accumulo, zookeeper, accumulo-proxy) and then using the second image definition to grab these binaries.
I welcome your thoughts, though, as I'm no expert in multi-stage builds; I've used them at work and on home projects a few times, but mostly for making consistent compilation environments (e.g. with GCC or test tools installed) that aren't needed for running the app.
If it helps, I did push a branch working on this ticket (https://github.com/apache/accumulo-proxy/pull/23), but I still need to verify something on it; I got distracted going down a rabbit hole of seeing whether I could get accumulo-proxy running on an Alpine-Linux-backed JDK to reduce the size further.
Have a gander, see what you think?
Given I'd like to take the same approach for the main accumulo-docker image (it suffers from the same problem) it'd be good to get eyes on this.
It's worth noting that the pull request I uploaded (#22) has brought the generated image size down from 1.86GB to 1.46GB, a reduction of 21.5%.
Two further ideas on dropping disk usage, let me know if you think these are worth pursuing?
Idea 1: remove the Hadoop docs (potential saving: 500M)
Hadoop ships with a tonne of docs, which are useful in general but not when you're only using Hadoop for its libraries (as in this container environment).
Does anyone know whether it's acceptable to install the Hadoop release (in this case hadoop-3.2.1) and then remove the hadoop-3.2.1/share/doc folder?
I'd still need to conduct some testing, but this would potentially save us 499.1M (noting Hadoop's total install size is 897.8M).
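If dropping the docs turns out to be acceptable, a minimal sketch of how it could slot into the existing Hadoop untar step follows; note the removal has to happen in the same RUN instruction as the extraction, otherwise the docs are still committed to the intermediate layer.

```dockerfile
# Sketch only: prune Hadoop's bundled docs inside the same RUN instruction
# that extracts the tarball, so the ~500M never lands in a committed layer.
RUN tar xzf /tmp/hadoop.tar.gz -C /opt/ \
    && rm -rf /opt/hadoop-${HADOOP_VERSION}/share/doc \
    && ln -s /opt/hadoop-${HADOOP_VERSION} /opt/hadoop
```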
Idea 2: move to an Alpine base image (potential saving: 400M)
I took a quick detour today down the openjdk:8-alpine3.9 rabbit hole.
It turns out (at least on the face of it) that moving to Alpine isn't too difficult for accumulo-proxy; the only thing missing is bash (Alpine ships with sh).
Whilst I could probably rewrite accumulo-proxy to not need bash and to work in either bash or sh environments, the rabbit hole continues when you start looking at things like our use of accumulo classpath, which also requires bash (or rewriting).
A quick but not great solution is to take the Alpine version and add bash to it.
I took this for a quick tour and got the image size down to 1.06GB, so that's approx. another 400M saved.
I quickly mocked an example of combining these 3 changes:
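The mocked example itself isn't reproduced above, so here is a hedged sketch of what combining the three changes might look like, shown for Hadoop only; the ARG names (HADOOP_VERSION, HADOOP_HASH, APACHE_DIST_URL) and the single-mirror wget stand in for the real version pins and mirror-fallback logic.

```dockerfile
# Illustrative sketch, not the actual mocked example.
FROM openjdk:8-alpine3.9

# Alpine ships with sh only; the accumulo scripts need bash.
RUN apk add --no-cache bash

ARG HADOOP_VERSION
ARG HADOOP_HASH
ARG APACHE_DIST_URL

# Download, verify, extract, prune the docs, and delete the tarball in a
# single RUN instruction, so neither the archive nor the docs are
# committed to any layer.
RUN set -eux; \
    wget -nv -O /tmp/hadoop.tar.gz "${APACHE_DIST_URL}hadoop/core/hadoop-${HADOOP_VERSION}/hadoop-${HADOOP_VERSION}.tar.gz"; \
    echo "${HADOOP_HASH}  /tmp/hadoop.tar.gz" | sha1sum -c -; \
    tar xzf /tmp/hadoop.tar.gz -C /opt/; \
    rm -rf /opt/hadoop-${HADOOP_VERSION}/share/doc /tmp/hadoop.tar.gz; \
    ln -s /opt/hadoop-${HADOOP_VERSION} /opt/hadoop
```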
I would expect us to also be able to make the same changes to accumulo-docker.
Nice. To me it is fine just installing bash (via apk on Alpine) vs rewriting. Thanks for looking into this @volmasoft
@mjwall do you have any views/experience on whether it's acceptable to drop a whole portion of a Hadoop install (the docs), and whether this is restricted due to licensing etc.? I'm no expert in this area, so I don't want to do something that could cause issues.
I'll wrap the alpine change in today and update the pull.
@mjwall @keith-turner conscious the pull request is open still here: https://github.com/apache/accumulo-proxy/pull/23
Any further work required? If we're good to merge, then I can take a look at the Accumulo Docker container to keep consistency.
@volmasoft I think there were some outstanding comments made by @keith-turner on the PR #23 that haven't yet been addressed, regarding the preference for the jre-slim Java image. We could probably merge it once all the comments have been addressed.
Apologies for the delay, I'll aim to get this boxed off this week.
During the pull request here: https://github.com/apache/accumulo-proxy/pull/20 @mjwall spotted that we were being a bit inefficient with our container size by storing the tar before extracting it.
This should be cleaned up and ideally done in a single step by updating the download_bin() method.
For consistency's sake we should ideally do the same in the accumulo-docker repo: https://github.com/apache/accumulo-docker/blob/master/Dockerfile