EU-EDPS / website-evidence-collector

Project moved to https://code.europa.eu/EDPS/website-evidence-collector ! The tool Website Evidence Collector (WEC) automates the website evidence collection of storage and transfer of personal data. https://edps.europa.eu/press-publications/edps-inspection-software_en
https://code.europa.eu/EDPS/website-evidence-collector
European Union Public License 1.2

Docker build breaks due to unpublished outdated packages in the Alpine repo #43

Open ghost opened 3 years ago

ghost commented 3 years ago

Dear all,

in #42, the following problem was described:

The current Dockerfile pins fixed version numbers for some dependencies, with the intention of having a reasonably reproducible setup:

RUN apk add --no-cache \
      chromium~=80.0.3987 \
      nss \
      freetype \
      freetype-dev \
      harfbuzz \
      ca-certificates \
      ttf-freefont \
      nodejs \
      yarn~=1.22.4 \

However, as those versions of chromium and yarn are outdated, the Alpine project no longer distributes them:

step 4/16 : RUN apk add --no-cache       chromium~=80.0.3987       nss       freetype       freetype-dev       harfbuzz       ca-certificates       ttf-freefont       nodejs       yarn~=1.22.4       bash procps drill coreutils libidn curl       parallel jq grep aha
 ---> Running in 5ca2fe0d3cde
fetch https://dl-cdn.alpinelinux.org/alpine/edge/main/x86_64/APKINDEX.tar.gz
fetch https://dl-cdn.alpinelinux.org/alpine/edge/community/x86_64/APKINDEX.tar.gz
ERROR: unsatisfiable constraints:
  chromium-86.0.4240.111-r0:
    breaks: world[chromium~80.0.3987]
  yarn-1.22.10-r0:
    breaks: world[yarn~1.22.4]
The command '/bin/sh -c apk add --no-cache       chromium~=80.0.3987       nss       freetype       freetype-dev       harfbuzz       ca-certificates       ttf-freefont       nodejs       yarn~=1.22.4       bash procps drill coreutils libidn curl       parallel jq grep aha' returned a non-zero code: 2
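
Each "breaks: world[...]" entry in the error pairs a pin from the Dockerfile with the package version the repository currently serves. As a minimal sketch, assuming the build log has been saved as plain text in the unprefixed format shown above (the function name list_broken_pins is made up for illustration), the pairs could be extracted like this:

```shell
#!/bin/sh
# Hypothetical helper: read `apk add` error output on stdin and print one
# "available version -> broken pin" line per unsatisfiable constraint.
list_broken_pins() {
  awk '
    # A line such as "  chromium-86.0.4240.111-r0:" names the available package.
    /^[ \t]*[^ \t]+:$/ { pkg = $1; sub(/:$/, "", pkg); next }
    # The following "breaks: world[...]" line names the pin it conflicts with.
    /breaks: world\[/ {
      pin = $0
      sub(/.*world\[/, "", pin)
      sub(/\].*/, "", pin)
      print pkg " -> " pin
    }
  '
}
```

Usage would be along the lines of `list_broken_pins < build.log`, where build.log is a hypothetical file holding the output above.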

The problem was already brought forward here: https://superuser.com/a/1486407/1039133

Possible options are:

  1. Remove the pinning of specific versions as far as possible. I know we need yarn < 2.0 due to breaking changes. Reproducibility would then require building the Docker image once and keeping it for as long as reproducibility is needed.
  2. Change to a different distribution that does not unpublish old packages.
  3. Use alternative Alpine repositories that archive old packages, e.g.: apk add --no-cache --update-cache --repository http://nl.alpinelinux.org/alpine/v3.8/main alsa-lib-dev=1.1.6-r0 (see https://superuser.com/a/1369979).
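
Option 1 could look roughly like the following sketch, which rewrites the two pins in a copy of the Dockerfile: the chromium pin is dropped and yarn is only bounded from above. It assumes that apk accepts a `<` constraint in the same place where it accepts `~=` (quoted so that the shell does not read `<` as a redirection); the function name relax_pins is invented for this example.

```shell
#!/bin/sh
# Hypothetical helper: filter Dockerfile text from stdin to stdout,
# dropping the chromium pin and replacing the yarn pin with an upper bound.
relax_pins() {
  sed -E \
    -e 's/chromium~=[0-9.]+/chromium/' \
    -e 's/yarn~=[0-9.]+/"yarn<2"/'
}
```

For instance, `relax_pins < Dockerfile > Dockerfile.relaxed` would leave the original file untouched so the result can be reviewed before building.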

vincentcox commented 3 years ago

Thanks for providing the options; they pointed me in the right direction. I am using this website to check which versions we can use to come as close as possible to the ones in the Dockerfile: https://pkgs.alpinelinux.org/packages?name=yarn&branch=v3.10.

So I wrote the following Dockerfile, based on the one in this repo, and applied the necessary changes.

FROM alpine:3.10

LABEL maintainer="Robert Riemann <robert.riemann@edps.europa.eu>"

LABEL org.label-schema.description="Website Evidence Collector running in a tiny Alpine Docker container" \
      org.label-schema.name="website-evidence-collector" \
      org.label-schema.usage="https://github.com/EU-EDPS/website-evidence-collector/blob/master/README.md" \
      org.label-schema.vcs-url="https://github.com/EU-EDPS/website-evidence-collector" \
      org.label-schema.vendor="European Data Protection Supervisor (EDPS)" \
      org.label-schema.license="EUPL-1.2"

# Installs latest Chromium (77) package.
RUN apk add --no-cache --update-cache --repository http://nl.alpinelinux.org/alpine/v3.8/main alsa-lib-dev=1.1.6-r0
RUN apk add \
      chromium~=77.0.3865 \
      nss \
      freetype \
      freetype-dev \
      harfbuzz \
      ca-certificates \
      ttf-freefont \
      nodejs \
      yarn~=1.16 \
# Packages linked to testssl.sh
      bash procps drill coreutils libidn curl \
# Toolbox for advanced interactive use of WEC in container
      parallel jq grep aha

# Add user so we don't need --no-sandbox and match first linux uid 1000
RUN addgroup --system --gid 1001 collector \
      && adduser --system --uid 1000 --ingroup collector --shell /bin/bash collector \
      && mkdir -p /home/collector/Downloads /output \
      && chown -R collector:collector /home/collector \
      && chown -R collector:collector /output

COPY . /opt/website-evidence-collector/

# Install Testssl.sh
RUN curl -SL https://github.com/drwetter/testssl.sh/archive/3.0.tar.gz | \
      tar -xz --directory /opt

# Run everything after as non-privileged user.
USER collector

WORKDIR /home/collector

# Tell Puppeteer to skip installing Chrome. We'll be using the installed package.
ENV PUPPETEER_SKIP_CHROMIUM_DOWNLOAD true

RUN yarn global add file:/opt/website-evidence-collector --prefix /home/collector

# Let Puppeteer use system Chromium
ENV PUPPETEER_EXECUTABLE_PATH /usr/bin/chromium-browser

ENV PATH="/home/collector/bin:/opt/testssl.sh-3.0:${PATH}"
# Let website evidence collector run chrome without sandbox
# ENV WEC_BROWSER_OPTIONS="--no-sandbox"
# Configure default command in Docker container
ENTRYPOINT ["/home/collector/bin/website-evidence-collector"]
WORKDIR /
VOLUME /output

So the changed parts are:

FROM alpine:3.10
....
RUN apk add --no-cache --update-cache --repository http://nl.alpinelinux.org/alpine/v3.8/main alsa-lib-dev=1.1.6-r0
RUN apk add \
      chromium~=77.0.3865 \
....
      yarn~=1.16 \

Build it:

docker build -t website-evidence-collector .

Please note that in the repo's Dockerfile, the trailing dot (the build context) is missing from the comments explaining how to build the image.

Run it:

mkdir output
chmod 777 output # Could be cleaner and more secure, but fine for a proof of concept
docker run --rm -it --cap-add=SYS_ADMIN -v $(pwd)/output:/output website-evidence-collector https://vincentcox.com --overwrite

If you consider this a feasible fix, I can open a pull request with all the changes (including the updated instructions on how to build and use it).

Hmm, I just saw that you pushed a hotfix: https://github.com/EU-EDPS/website-evidence-collector/commit/c5c4b989a1f51d9e12e81b3afa3f9d4ae7ac4230. Let me check it out.

vincentcox commented 3 years ago

So I am using your Dockerfile, but the build gets stuck at this step:

Step 11/16 : RUN yarn global add file:/opt/website-evidence-collector --prefix /home/collector
 ---> Running in 0363b73f8c9a
yarn global v1.22.10
[1/4] Resolving packages...
warning file:/opt/website-evidence-collector > request-promise-native@1.0.9: request-promise-native has been deprecated because it extends the now deprecated request package, see https://github.com/request/request/issues/3142
warning file:/opt/website-evidence-collector > request@2.88.2: request has been deprecated, see https://github.com/request/request/issues/3142
warning file:/opt/website-evidence-collector > request > har-validator@5.1.5: this library is no longer supported
warning file:/opt/website-evidence-collector > pug > pug-code-gen > constantinople > babel-types > babel-runtime > core-js@2.6.12: core-js@<3 is no longer maintained and not recommended for usage due to the number of issues. Please, upgrade your dependencies to the actual version of core-js@3.
[2/4] Fetching packages...
error An unexpected error occurred: "EACCES: permission denied, scandir '/opt/website-evidence-collector/output/browser-profile'".
info If you think this is a bug, please open a bug report with the information provided in "/home/collector/.config/yarn/global/yarn-error.log".
info Visit https://yarnpkg.com/en/docs/cli/global for documentation about this command.
The command '/bin/sh -c yarn global add file:/opt/website-evidence-collector --prefix /home/collector' returned a non-zero code: 1

Any idea why this is happening?

ghost commented 3 years ago

I could reproduce this problem.

Try deleting the folder /opt/website-evidence-collector/output/browser-profile. This solved the issue for me. I do not understand why this folder can break the build process.

vincentcox commented 3 years ago

OK, it builds now if I add this to the Dockerfile:

RUN rm -rf /opt/website-evidence-collector/output/browser-profile
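
A likely explanation, stated here as an assumption rather than something verified in this thread: COPY places files in the image owned by root, and the browser profile directory left behind by an earlier local run is probably not readable by other users, so the collector user cannot scan it. Instead of deleting the folder inside the image, it could be kept out of the build context altogether with a .dockerignore entry. A minimal sketch, where the helper name add_ignore_rule is hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: append a rule to an ignore file unless it is already there.
add_ignore_rule() {
  file=$1
  rule=$2
  touch "$file"
  # -x matches whole lines, -F treats the rule as a fixed string.
  grep -qxF -- "$rule" "$file" || printf '%s\n' "$rule" >> "$file"
}

# Run from the repository root before `docker build`:
# add_ignore_rule .dockerignore output
```

Calling the helper twice leaves a single entry, so it is safe to keep in a build script.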

Unfortunately, it's still the same issue as https://github.com/EU-EDPS/website-evidence-collector/issues/42.

Do you see the same issue if you run this?

docker run --rm -it --cap-add=SYS_ADMIN -v $(pwd)/output:/output website-evidence-collector https://vincentcox.com --overwrite

It takes a lot of time and keeps using more and more RAM. It's strange that it also happens with Docker, which should be platform independent. It affects not only my website but also sites of a client for whom I am building a dashboard (unfortunately I can't share them publicly here).

So I'm afraid I'll stick with this one: https://github.com/EU-EDPS/website-evidence-collector/issues/43#issuecomment-734236432