OpenTermsArchive / engine

Tracks contractual documents and exposes changes to the terms of online services.
https://opentermsarchive.org
European Union Public License 1.2

Creating a Docker image with Open Terms Archive #1001

Closed fabianospinelli closed 1 year ago

fabianospinelli commented 1 year ago

Dear all, I'm trying to create a Docker image with Open Terms Archive for the Joint Research Centre in Ispra (VA). I first installed it on a Mac, where everything worked very well: I tested it with some declaration files and did not encounter any problems. The problems arose when I tried to build a Docker image. I prepared this Dockerfile starting from the latest version of Ubuntu:


FROM ubuntu:latest

WORKDIR /root

RUN apt-get update \
    && apt-get upgrade -y \
    && apt install -y curl \
    && apt install -y git \
    && apt-get install -y chromium-browser

RUN curl -fsSL https://deb.nodesource.com/setup_19.x | bash - \
    && apt install -y nodejs

RUN mkdir /root/open-terms-archive \
 && mkdir /root/open-terms-archive/declarations \
 && mkdir /root/open-terms-archive/config \
 && cd /root/open-terms-archive

RUN npm install --save @opentermsarchive/engine

COPY declarations.json /root/open-terms-archive/declarations

COPY default.json /root/open-terms-archive/config

After the build, when I tried to run the Open Terms Archive engine, I got this error:

2023-03-24 08:24:56 info                                                                              Start Open Terms Archive

2023-03-24 08:24:56 info                                                                              Examining 1 documents from 1 services for refiltering…
2023-03-24 08:24:56 info                                                                              Examined 1 documents from 1 services for refiltering
2023-03-24 08:24:56 info                                                                              Recorded 0 new versions

2023-03-24 08:24:56 info                                                                              Tracking changes of 1 documents from 1 services…
2023-03-24 08:24:56 error                                                                             unhandledRejection: Failed to launch the browser process!
/root/node_modules/puppeteer/.local-chromium/linux-1002410/chrome-linux/chrome: error while loading shared libraries: libnss3.so: cannot open shared object file: No such file or directory

TROUBLESHOOTING: https://github.com/puppeteer/puppeteer/blob/main/docs/troubleshooting.md

Error: Failed to launch the browser process!
/root/node_modules/puppeteer/.local-chromium/linux-1002410/chrome-linux/chrome: error while loading shared libraries: libnss3.so: cannot open shared object file: No such file or directory

TROUBLESHOOTING: https://github.com/puppeteer/puppeteer/blob/main/docs/troubleshooting.md

    at onClose (/root/node_modules/puppeteer/lib/cjs/puppeteer/node/BrowserRunner.js:255:20)
    at Interface.<anonymous> (/root/node_modules/puppeteer/lib/cjs/puppeteer/node/BrowserRunner.js:248:68)
    at Interface.emit (node:events:524:35)
    at Interface.close (node:internal/readline/interface:534:10)
    at Socket.onend (node:internal/readline/interface:260:10)
    at Socket.emit (node:events:524:35)
    at endReadableNT (node:internal/streams/readable:1359:12)
    at process.processTicksAndRejections (node:internal/process/task_queues:82:21)

So, I tried to verify which libraries are missing with the following command:

    libnss3.so => not found
    libnssutil3.so => not found
    libsmime3.so => not found
    libnspr4.so => not found
    libatk-1.0.so.0 => not found
    libatk-bridge-2.0.so.0 => not found
    libcups.so.2 => not found
    libdrm.so.2 => not found
    libxkbcommon.so.0 => not found
    libXcomposite.so.1 => not found
    libXdamage.so.1 => not found
    libXfixes.so.3 => not found
    libXrandr.so.2 => not found
    libgbm.so.1 => not found
    libpango-1.0.so.0 => not found
    libcairo.so.2 => not found
    libasound.so.2 => not found
    libatspi.so.0 => not found

So, I installed all the missing libraries with the following commands (the idea is to integrate them into the Dockerfile):

apt-get install -y libatk1.0-0
apt-get install -y libatk-bridge2.0-0
apt-get install -y libcups2
apt-get install -y libxkbcommon-x11-0
apt-get install -y libxcomposite-dev
apt-get install -y libxdamage1
apt-get install -y libxrandr2
apt-get install -y libpangocairo-1.0-0
apt-get install -y libasound2
apt-get install -y libgbm-dev

At this point, if I try to run the Open Terms Archive engine I obtain another error and I cannot understand how to resolve it. Could you help me?

2023-03-24 08:40:38 info                                                                              Start Open Terms Archive

2023-03-24 08:40:38 info                                                                              Examining 1 documents from 1 services for refiltering…
2023-03-24 08:40:38 info                                                                              Examined 1 documents from 1 services for refiltering
2023-03-24 08:40:38 info                                                                              Recorded 0 new versions

2023-03-24 08:40:38 info                                                                              Tracking changes of 1 documents from 1 services…
2023-03-24 08:40:39 error                                                                             unhandledRejection: Failed to launch the browser process!
[0324/084039.048170:ERROR:zygote_host_impl_linux.cc(90)] Running as root without --no-sandbox is not supported. See https://crbug.com/638180.

TROUBLESHOOTING: https://github.com/puppeteer/puppeteer/blob/main/docs/troubleshooting.md

Error: Failed to launch the browser process!
[0324/084039.048170:ERROR:zygote_host_impl_linux.cc(90)] Running as root without --no-sandbox is not supported. See https://crbug.com/638180.

TROUBLESHOOTING: https://github.com/puppeteer/puppeteer/blob/main/docs/troubleshooting.md

    at onClose (/root/node_modules/puppeteer/lib/cjs/puppeteer/node/BrowserRunner.js:255:20)
    at Interface.<anonymous> (/root/node_modules/puppeteer/lib/cjs/puppeteer/node/BrowserRunner.js:248:68)
    at Interface.emit (node:events:524:35)
    at Interface.close (node:internal/readline/interface:534:10)
    at Socket.onend (node:internal/readline/interface:260:10)
    at Socket.emit (node:events:524:35)
    at endReadableNT (node:internal/streams/readable:1359:12)
    at process.processTicksAndRejections (node:internal/process/task_queues:82:21)

I also tried to follow the Puppeteer troubleshooting page, but without success: https://github.com/puppeteer/puppeteer/blob/main/docs/troubleshooting.md
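
The second error comes from Chromium refusing to start when the process runs as root and `--no-sandbox` is not passed. A minimal sketch of a guard that could be run before starting the engine, relying only on `id -u`; the function name `is_root` and the warning text are illustrative:

```shell
#!/bin/sh
# Illustrative guard: succeed if the given numeric user id is root (uid 0).
# With no argument, it checks the effective uid of the current process.
is_root() {
  uid=${1:-$(id -u)}
  [ "$uid" -eq 0 ]
}

if is_root; then
  echo "Running as root: Chromium will refuse to launch without --no-sandbox" >&2
fi
```

In a container, `docker run` starts processes as root unless the image sets a `USER`, which would explain why the same setup worked on the Mac but not in the image.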

Ndpnt commented 1 year ago

Hi @fabianospinelli,

Did you try running the command as a non-root user? I do not see the recommended instruction in your Dockerfile:

…
# Run everything after as non-privileged user.
USER pptruser
…

fabianospinelli commented 1 year ago

Yes, I also tried that solution. Here is the Dockerfile I used to build the image:

FROM ubuntu:latest

WORKDIR /root

RUN apt-get update \
    && apt-get upgrade -y \
    && apt install -y curl \
    && apt install -y git \
    && apt-get install -y chromium-browser

RUN curl -fsSL https://deb.nodesource.com/setup_19.x | bash - \
    && apt install -y nodejs

# Add user so we don't need --no-sandbox.
RUN groupadd -r pptruser && useradd -r -g pptruser -G audio,video pptruser \
    && mkdir -p /home/pptruser/Downloads \
    && chown -R pptruser:pptruser /home/pptruser

RUN npm install --save @opentermsarchive/engine

RUN apt-get install -y libatk1.0-0 libatk-bridge2.0-0 libcups2 libxkbcommon-x11-0 libxcomposite-dev \
    libxdamage1 libxrandr2 libpangocairo-1.0-0 libasound2 libgbm-dev libnss3

WORKDIR /home/pptruser

RUN mkdir /home/pptruser/open-terms-archive \
 && mkdir /home/pptruser/open-terms-archive/declarations \
 && mkdir /home/pptruser/open-terms-archive/config \
 && cd /home/pptruser/open-terms-archive

COPY declarations.json /home/pptruser/open-terms-archive/declarations

COPY default.json /home/pptruser/open-terms-archive/config

# Run everything after as non-privileged user.
USER pptruser

After the build succeeds, if I run Open Terms Archive from /home/pptruser/open-terms-archive, I get the following error:

pptruser@f0a397eb5789:~/open-terms-archive$ npx ota track
npm ERR! could not determine executable to run

npm ERR! A complete log of this run can be found in:
npm ERR!     /home/pptruser/.npm/_logs/2023-03-29T09_25_59_823Z-debug-0.log

I also tried to update "npm" to the latest version by adding this statement to the Dockerfile:

RUN npm install -g npm@latest

but the result is always the same. Any suggestions?

Ndpnt commented 1 year ago

The error in your latest message occurs because the engine was installed outside the open-terms-archive directory. It can be fixed by installing the engine within the open-terms-archive directory with:

WORKDIR /home/pptruser/open-terms-archive
RUN npm install @opentermsarchive/engine

But this won't solve the problem of running Puppeteer in Docker. I'm sorry, but I tried for more than two hours and did not manage to get things working. If you succeed on your side, it would be nice if you shared your Dockerfile with us; otherwise, I suggest you take a look at our Ansible recipes to fully set up a server with the Open Terms Archive engine.
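
On the `npx` error discussed above: `npx` first looks for the command among the binaries installed in the current project's `node_modules/.bin` directory, so a quick check there shows whether the engine landed in the right place. A minimal sketch, assuming the engine exposes an `ota` binary (as the `npx ota track` call in this thread suggests); the function name `has_local_bin` is hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: succeed only if <project_dir>/node_modules/.bin/<name>
# exists and is executable.
has_local_bin() {
  project_dir=$1
  name=$2
  [ -x "$project_dir/node_modules/.bin/$name" ]
}

# Assumed usage, with the directory from this thread:
#   has_local_bin /home/pptruser/open-terms-archive ota || echo "engine not installed here"
```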

fabianospinelli commented 1 year ago

Hi Nicolas, you are right, so I moved the installation statement to the right place. I also added another statement to change the ownership of the newly installed files to the user "pptruser". Here is the new Dockerfile:

FROM ubuntu:latest

WORKDIR /root

RUN apt-get update \
    && apt-get upgrade -y \
    && apt install -y curl \
    && apt install -y git \
    && apt-get install -y chromium-browser

RUN curl -fsSL https://deb.nodesource.com/setup_19.x | bash - \
    && apt install -y nodejs

# Add user so we don't need --no-sandbox.
RUN groupadd -r pptruser && useradd -r -g pptruser -G audio,video pptruser \
    && mkdir -p /home/pptruser/Downloads \
    && chown -R pptruser:pptruser /home/pptruser

RUN mkdir /home/pptruser/open-terms-archive \
 && mkdir /home/pptruser/open-terms-archive/declarations \
 && mkdir /home/pptruser/open-terms-archive/config \
 && cd /home/pptruser/open-terms-archive

WORKDIR /home/pptruser/open-terms-archive

RUN npm install --save @opentermsarchive/engine \
 && npm install -g npm@latest

RUN apt-get install -y libatk1.0-0 libatk-bridge2.0-0 libcups2 libxkbcommon-x11-0 libxcomposite-dev \
    libxdamage1 libxrandr2 libpangocairo-1.0-0 libasound2 libgbm-dev libnss3

RUN chown -R pptruser:pptruser /home/pptruser/open-terms-archive/

WORKDIR /home/pptruser

COPY declarations.json /home/pptruser/open-terms-archive/declarations

COPY default.json /home/pptruser/open-terms-archive/config

# Run everything after as non-privileged user.
USER pptruser

It seems I resolved part of the problem, but when running Open Terms Archive I get this new error:

pptruser@1366670c10d6:~/open-terms-archive$ npx --no-sandbox ota track
2023-03-29 15:26:57 info                                                                              Start Open Terms Archive

2023-03-29 15:26:57 info                                                                              Examining 1 documents from 1 services for refiltering…
2023-03-29 15:26:58 info                                                                              Examined 1 documents from 1 services for refiltering
2023-03-29 15:26:58 info                                                                              Recorded 0 new versions

2023-03-29 15:26:58 info                                                                              Tracking changes of 1 documents from 1 services…
2023-03-29 15:26:58 error                                                                             unhandledRejection: Failed to launch the browser process!
[0329/152658.370405:FATAL:zygote_host_impl_linux.cc(117)] No usable sandbox! Update your kernel or see https://chromium.googlesource.com/chromium/src/+/main/docs/linux/suid_sandbox_development.md for more information on developing with the SUID sandbox. If you want to live dangerously and need an immediate workaround, you can try using --no-sandbox.
#0 0x559ca51e4339 base::debug::CollectStackTrace()
#1 0x559ca515af23 base::debug::StackTrace::StackTrace()
#2 0x559ca5158070 logging::LogMessage::~LogMessage()
#3 0x559ca3158c2b content::ZygoteHostImpl::Init()
#4 0x559ca4cd5c0f content::ContentMainRunnerImpl::Initialize()
#5 0x559ca4cd3bfd content::RunContentProcess()
#6 0x559ca4cd3d4e content::ContentMain()
#7 0x559ca4d2b20a headless::(anonymous namespace)::RunContentMain()
#8 0x559ca4d2af15 headless::HeadlessShellMain()
#9 0x559ca160c1e3 ChromeMain
#10 0x7f469d679d90 (/usr/lib/x86_64-linux-gnu/libc.so.6+0x29d8f)
#11 0x7f469d679e40 __libc_start_main
#12 0x559ca160c02a _start

TROUBLESHOOTING: https://github.com/puppeteer/puppeteer/blob/main/docs/troubleshooting.md

Error: Failed to launch the browser process!
[0329/152658.370405:FATAL:zygote_host_impl_linux.cc(117)] No usable sandbox! Update your kernel or see https://chromium.googlesource.com/chromium/src/+/main/docs/linux/suid_sandbox_development.md for more information on developing with the SUID sandbox. If you want to live dangerously and need an immediate workaround, you can try using --no-sandbox.
#0 0x559ca51e4339 base::debug::CollectStackTrace()
#1 0x559ca515af23 base::debug::StackTrace::StackTrace()
#2 0x559ca5158070 logging::LogMessage::~LogMessage()
#3 0x559ca3158c2b content::ZygoteHostImpl::Init()
#4 0x559ca4cd5c0f content::ContentMainRunnerImpl::Initialize()
#5 0x559ca4cd3bfd content::RunContentProcess()
#6 0x559ca4cd3d4e content::ContentMain()
#7 0x559ca4d2b20a headless::(anonymous namespace)::RunContentMain()
#8 0x559ca4d2af15 headless::HeadlessShellMain()
#9 0x559ca160c1e3 ChromeMain
#10 0x7f469d679d90 (/usr/lib/x86_64-linux-gnu/libc.so.6+0x29d8f)
#11 0x7f469d679e40 __libc_start_main
#12 0x559ca160c02a _start

TROUBLESHOOTING: https://github.com/puppeteer/puppeteer/blob/main/docs/troubleshooting.md

    at onClose (/home/pptruser/open-terms-archive/node_modules/puppeteer/lib/cjs/puppeteer/node/BrowserRunner.js:255:20)
    at Interface.<anonymous> (/home/pptruser/open-terms-archive/node_modules/puppeteer/lib/cjs/puppeteer/node/BrowserRunner.js:248:68)
    at Interface.emit (node:events:524:35)
    at Interface.close (node:internal/readline/interface:534:10)
    at Socket.onend (node:internal/readline/interface:260:10)
    at Socket.emit (node:events:524:35)
    at endReadableNT (node:internal/streams/readable:1359:12)
    at process.processTicksAndRejections (node:internal/process/task_queues:82:21)

How can I disable the sandbox? Do you have any idea? I will also try your Ansible solution. I tried just looking at the Ansible scripts and saw that they use Docker. So what's the difference between using this solution and using Docker directly?

Ndpnt commented 1 year ago

Disabling the sandbox is strongly discouraged, and it is currently not possible without modifying the engine. It can only be done when the Puppeteer browser is instantiated with the options --no-sandbox and --disable-setuid-sandbox, like this:

const browser = await puppeteer.launch({ args: ['--no-sandbox', '--disable-setuid-sandbox'] });  

I tried just looking at the Ansible scripts and saw that they use Docker.

I'm not sure I understand what you mean here because, as far as I know, Ansible does not use Docker.

So what's the difference between using this solution and using Docker directly?

Docker and Ansible serve different purposes:

For example, to highlight the difference, Ansible can be used to manage and deploy Docker containers.

fabianospinelli commented 1 year ago

Hi, I finally managed to solve the problem with Docker. The Dockerfile below builds a fully functional image of Open Terms Archive. Over the next few days I will also provide the Docker Compose file and the files (JSON, .env, etc.) that are copied during the image build to configure the environment (declarations, Git, etc.).

FROM ubuntu:latest

WORKDIR /root

RUN apt-get update \
    && apt-get upgrade -y \
    && apt install -y curl \
    && apt install -y git \
    && apt-get install -y chromium-browser

RUN curl -fsSL https://deb.nodesource.com/setup_19.x | bash - \
    && apt install -y nodejs

# Add user so we don't need --no-sandbox.
RUN groupadd -r pptruser && useradd -r -g pptruser -G audio,video pptruser \
    && mkdir -p /home/pptruser/Downloads \
    && chown -R pptruser:pptruser /home/pptruser

RUN apt-get update \
    && apt-get install -y wget gnupg \
    && wget -q -O - https://dl-ssl.google.com/linux/linux_signing_key.pub | gpg --dearmor -o /usr/share/keyrings/googlechrome-linux-keyring.gpg \
    && sh -c 'echo "deb [arch=amd64 signed-by=/usr/share/keyrings/googlechrome-linux-keyring.gpg] http://dl.google.com/linux/chrome/deb/ stable main" >> /etc/apt/sources.list.d/google.list' \
    && apt-get update \
    && apt-get install -y google-chrome-stable fonts-ipafont-gothic fonts-wqy-zenhei fonts-thai-tlwg fonts-khmeros fonts-kacst fonts-freefont-ttf libxss1 \
      --no-install-recommends \
    && rm -rf /var/lib/apt/lists/* 

RUN mkdir /home/pptruser/open-terms-archive \
 && mkdir /home/pptruser/open-terms-archive/declarations \
 && mkdir /home/pptruser/open-terms-archive/config \
 && cd /home/pptruser/open-terms-archive

WORKDIR /home/pptruser/open-terms-archive

RUN npm install --save @opentermsarchive/engine \
 && npm install -g npm@latest

COPY declarations/OpenTermsArchive.json /home/pptruser/open-terms-archive/declarations

COPY declarations/Siretessile.json /home/pptruser/open-terms-archive/declarations

COPY default.json /home/pptruser/open-terms-archive/config

COPY env /home/pptruser/open-terms-archive/.env

RUN chown -R pptruser:pptruser /home/pptruser/open-terms-archive/

WORKDIR /home/pptruser

# Run everything after as non-privileged user.
USER pptruser

RUN npm i puppeteer \
    && (node -e "require('child_process').execSync(require('puppeteer').executablePath() + ' --credits', {stdio: 'inherit'})" > THIRD_PARTY_NOTICES)

Ndpnt commented 1 year ago

Hi @fabianospinelli, Well done for finally solving this issue 👍. If you could create a public repository of a fully functional OTA configuration with Docker, we would be happy to reference it in the documentation for users who want to use Docker 🙂.

MattiSG commented 1 year ago

Congratulations on getting Open Terms Archive running with Docker! 😃 I understand that this issue has been solved and will close it now 🙂

MattiSG commented 1 year ago

Hi @fabianospinelli! It's been a month since you indicated you managed to run Open Terms Archive with Docker and expressed your intention to share the files necessary to that end 🙂 Can we help you with this publishing process?

fabianospinelli commented 1 year ago

Hi @MattiSG, and thanks for your reply. I'm happy to be able to contribute to OTA with the Docker part we developed. Let me know how to share what we have done and I will do it in the next few days.

Ndpnt commented 1 year ago

Hi @fabianospinelli, Can you create a public GitHub repository containing your working Dockerfile with instructions on how to use it to run an Open Terms Archive engine and how to configure it?