confluentinc / confluent-kafka-javascript

Confluent's Apache Kafka JavaScript client
https://www.npmjs.com/package/@confluentinc/kafka-javascript
MIT License

Unable to construct viable Docker image using `node:20-alpine` #48

Open apeloquin-agilysys opened 1 month ago

apeloquin-agilysys commented 1 month ago

Our build uses an ubuntu-latest GitHub runner to build a Docker image.

Our Dockerfile follows the example provided in this repo.

FROM node:20-alpine
COPY ./dist /app/
WORKDIR /app
RUN apk --no-cache add \
  bash \
  g++ \
  ca-certificates \
  lz4-dev \
  musl-dev \
  cyrus-sasl-dev \
  openssl-dev \
  make \
  python3 \
  gcompat # added to provide missing ld-linux-x86-64.so.2
RUN apk add --no-cache --virtual .build-deps gcc zlib-dev libc-dev bsd-compat-headers py-setuptools bash
RUN npm install --omit=dev

EXPOSE 4000
CMD [ "node", "app.js" ]

The deployed pod is hosted in AKS, and both the runners and host nodes are amd64 arch.

Without the @confluentinc/kafka-javascript dependency in the package.json, the application will start without issue on the container.

With the @confluentinc/kafka-javascript dependency in the package.json (and no reference from the application), the application will immediately fail with:

Segmentation fault (core dumped)

While troubleshooting, we discovered that if we reinstalled the package on the running container, the application would then start up normally.

Our initial thought was that the wrong flavor of librdkafka was being downloaded.

By adding the following to the Dockerfile, I was able to capture the node-pre-gyp output:

WORKDIR /app/node_modules/@confluentinc/kafka-javascript
RUN npx node-pre-gyp install --update-binary
WORKDIR /app

#12 0.816 node-pre-gyp info using node-pre-gyp@1.0.11
#12 0.816 node-pre-gyp info using node@20.13.1 | linux | x64
#12 0.906 node-pre-gyp http GET https://github.com/confluentinc/confluent-kafka-javascript/releases/download/v0.1.15-devel/confluent-kafka-javascript-v0.1.15-devel-node-v115-linux-musl-x64.tar.gz

Again, launching this container results in the segmentation fault on startup.

Starting the container, and running the following:

cd node_modules/\@confluentinc/kafka-javascript/
npx node-pre-gyp install --update-binary
cd /app

...seemingly performs the same operation we saw during the Docker image construction:

node-pre-gyp info using node-pre-gyp@1.0.11
node-pre-gyp info using node@20.13.1 | linux | x64
http GET https://github.com/confluentinc/confluent-kafka-javascript/releases/download/v0.1.15-devel/confluent-kafka-javascript-v0.1.15-devel-node-v115-linux-musl-x64.tar.gz

...yet after this operation is performed, the application starts without issue.

Please help us to understand what is going on here, and how we can solve this problem.

apeloquin-agilysys commented 1 month ago

When I do a diff of the node_modules/@confluentinc/kafka-javascript/build/Release/ directories before/after running the node-pre-gyp on the started container, the noticeable difference is many instances of:

If I download directly from confluent-kafka-javascript-v0.1.15-devel-node-v115-linux-musl-x64.tar.gz, I see the references are all /root/.cache and /v, so it's unclear to me how the Docker image ends up with an apparently different version despite resolving to the same download URL.

milindl commented 1 month ago

Hey - I repro'd this issue, but I'm not sure of the cause yet. The confluent-kafka-javascript.node is different at the start and at the end after running npx node-pre-gyp install --update-binary (I checked with the md5sum).

Here's my process:

  1. npm init an app in the 'dist' folder and install @confluentinc/kafka-javascript. MD5 sum of confluent-kafka-javascript.node = X
  2. Build and run the Dockerfile. Here too, the MD5 sum of confluent-kafka-javascript.node = X
  3. Run the node-pre-gyp command. Now the MD5 sum of confluent-kafka-javascript.node = Y, and the linkings have also changed (checked by running ldd).

Suggested workaround for now:

COPY ./dist /app/
WORKDIR /app
+ RUN rm -rf node_modules

(you can also just delete node_modules/@confluentinc if you want to be more specific).

As far as I can understand, the npm install within the Dockerfile isn't re-pulling the right platform/libc combo of confluent-kafka-javascript.node, and just goes on with whatever is there within the node_modules unless it's empty.

milindl commented 1 month ago

Also, since there is a pre-compiled binary now, the Dockerfile can be trimmed considerably, to something like:

FROM node:20-alpine
COPY ./dist /app/
WORKDIR /app
RUN rm -rf node_modules/\@confluentinc
RUN npm install --omit=dev

EXPOSE 4000
CMD [ "node", "app.js" ]

I will update the example.

milindl commented 1 month ago

I have a fix in mind: changing the npm install script from node-pre-gyp install --fallback-to-build to node-pre-gyp install --fallback-to-build --update-binary. However, that would download the remote binary more often than necessary, so I'm not making that change immediately.

I'll discuss that and other possible solutions with my team, and provide a fix.