Open DJRickyB opened 2 years ago
Pinging @elastic/es-delivery (Team:Delivery)
I don't have a lot of understanding here but another concern is compatibility. Unlike our Docker distribution, in which we own the runtime environment (mostly) enabling this for all Linux packages in general is tougher. To what degree does the Cloudflare zlib depend on system dynamic libraries or can we build a statically-linked binary to minimize any system compatibility concerns.
+1 to building a statically-linked library, to make cloudflare-zlib universally available in all distributions that can support it instead of the LD_LIBRARY_PATH
solution we have currently in the Docker distribution. There are a number of articles, this one is a good start, mentioning LD_LIBRARY_PATH
should only be used in testing or development and not in production. I am sure it is fine in our isolated Docker distribution, I imagine users not using Docker would like to realize the benefits of Cloudflare zlib as well and we probably should not be recommending any modification to LD_LIBRARY_PATH
.
+1 to building a statically-linked library
It's one thing to build a static library, but that can only get used if the program that currently expects a dynamic library is rebuilt to link the static library at build time.
In this case I think that program that would need to get rebuilt is java
right? So statically linking zlib is really a proposal to produce our own build of Java, and not using the exact build system that comes with Java but with a modified build system that statically links zlib.
I agree LD_LIBRARY_PATH
is considered bad practice - if it lists directories that don't contain the shared library then it opens up the possibility of an attacker putting a malicious library with the same name in one of those directories.
The other possibility that's not quite as radical as rebuilding java
from scratch is to set an RPATH
in the existing java
program. This takes priority over LD_LIBRARY_PATH
when looking for libraries and can be set without rebuilding using a tool called patchelf
. We do this for some libraries we ship with ml-cpp in https://github.com/elastic/ml-cpp/blob/ecb9c6c036d1104c68c90281d554312285c2ac6d/3rd_party/3rd_party.sh#L286
The way library search works is that first RPATH
is searched, then LD_LIBRARY_PATH
, then RUNPATH
and finally the system default locations. RPATH
and RUNPATH
are embedded in the executable and LD_LIBRARY_PATH
is an environment variable.
So if you set RPATH
to $ORIGIN/../lib
then a program will prefer to load its shared libraries from a lib
directory parallel to the bin
directory it's in.
java
currently sets a RUNPATH
that looks in the same directory its in and in the parallel lib
directory, for example:
$ readelf -d java | head -40
Dynamic section at offset 0x1d08 contains 33 entries:
Tag Type Name/Value
0x0000000000000003 (PLTGOT) 0x2f88
0x0000000000000002 (PLTRELSZ) 288 (bytes)
0x0000000000000017 (JMPREL) 0x7d8
0x0000000000000014 (PLTREL) RELA
0x0000000000000007 (RELA) 0x718
0x0000000000000008 (RELASZ) 192 (bytes)
0x0000000000000009 (RELAENT) 24 (bytes)
0x000000006ffffff9 (RELACOUNT) 4
0x0000000000000015 (DEBUG) 0x0
0x0000000000000006 (SYMTAB) 0x278
0x000000000000000b (SYMENT) 24 (bytes)
0x0000000000000005 (STRTAB) 0x470
0x000000000000000a (STRSZ) 381 (bytes)
0x000000006ffffef5 (GNU_HASH) 0x5f0
0x0000000000000004 (HASH) 0x628
0x0000000000000001 (NEEDED) Shared library: [libz.so.1]
0x0000000000000001 (NEEDED) Shared library: [libjli.so]
0x0000000000000001 (NEEDED) Shared library: [libpthread.so.0]
0x0000000000000001 (NEEDED) Shared library: [libdl.so.2]
0x0000000000000001 (NEEDED) Shared library: [libc.so.6]
0x000000000000000c (INIT) 0x8f8
0x000000000000000d (FINI) 0xd3c
0x000000000000001a (FINI_ARRAY) 0x2cf8
0x000000000000001c (FINI_ARRAYSZ) 8 (bytes)
0x0000000000000019 (INIT_ARRAY) 0x2d00
0x000000000000001b (INIT_ARRAYSZ) 8 (bytes)
0x000000000000001d (RUNPATH) Library runpath: [$ORIGIN:$ORIGIN/../lib] <-- HERE !
0x000000000000001e (FLAGS) BIND_NOW
0x000000006ffffffb (FLAGS_1) Flags: NOW PIE
0x000000006ffffff0 (VERSYM) 0x6c8
0x000000006ffffffe (VERNEED) 0x6f4
0x000000006fffffff (VERNEEDNUM) 1
0x0000000000000000 (NULL) 0x0
So I think if you put the dynamic library Cloudflare zlib in $JAVA_HOME/lib
then it will get used in preference to the system version unless LD_LIBRARY_PATH
is set. Or you could make it more robust by setting an RPATH
on java
so that it's immune to LD_LIBRARY_PATH
manipulation. And if you wanted that RPATH
could include a separate directory for Cloudflare's zlib in addition to $ORIGIN:$ORIGIN/../lib
.
I'd like to consider another option here which is providing guidance for folks deploying on Linux (and not using Docker) on how to install cloudflare-zlib themselves rather than us hack a distribution that tries to load it for them. While we certainly want to expose this optimization to as many users as possible, we don't want to sacrifice or complicate portability, which increases the more native code we package in Elasticsearch.
I'd like to consider another option here which is providing guidance for folks deploying on Linux (and not using Docker) on how to install cloudflare-zlib themselves rather than us hack a distribution that tries to load it for them. While we certainly want to expose this optimization to as many users as possible, we don't want to sacrifice or complicate portability, which increases the more native code we package in Elasticsearch.
I had considered this solution and I am -1 as it pushes the burden to users to configure it properly. There could be an added support burden as it may not always be clear which zlib is in use.
I had considered this solution and I am -1 as it pushes the burden to users to configure it properly. There could be an added support burden as it may not always be clear which zlib is in use.
Of course, I just wanted to make this explicit and ensure that we are triaging that burden against the technical complexity and maintenance cost of supporting bundling this functionality in. Let's keep in mind this is an optimization so it should be treated slightly differently than something that affects actual functionality.
There could be an added support burden as it may not always be clear which zlib is in use.
I'd also argue that issues related to us effectively "hacking" Java to use our bundled dynamic library are sure to arise, and will have a cost as well.
Thanks all. After recent discussion I've updated the title and the description to better reflect more concerns
While working on a a local build of elasticsearch in docker for my company I discovered the changes in 8.0+ to use the cloudflare zlib (#81245, and https://github.com/elastic/dockerfiles/pull/95). The problem with what has been merged is that these do not actually use the cloudflare zlib. The cloudflare zlib repo (https://github.com/cloudflare/zlib) is a fork of upstream zlib (https://github.com/madler/zlib). If you look you will see that the tags present in the cloudflare zlib repo are not from cloudflare, they are forked from upstream.
cfb@sandman:foo/> wget https://github.com/madler/zlib/archive/refs/tags/v1.2.8.tar.gz -O upstream
--2022-03-04 11:07:44-- https://github.com/madler/zlib/archive/refs/tags/v1.2.8.tar.gz
Resolving github.com (github.com)... 192.30.255.112
Connecting to github.com (github.com)|192.30.255.112|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://codeload.github.com/madler/zlib/tar.gz/refs/tags/v1.2.8 [following]
--2022-03-04 11:07:45-- https://codeload.github.com/madler/zlib/tar.gz/refs/tags/v1.2.8
Resolving codeload.github.com (codeload.github.com)... 192.30.255.120
Connecting to codeload.github.com (codeload.github.com)|192.30.255.120|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 604952 (591K) [application/x-gzip]
Saving to: ‘upstream’
upstream 100%[================================================================================================================================================================================================================================================================>] 590.77K 3.00MB/s in 0.2s
2022-03-04 11:07:45 (3.00 MB/s) - ‘upstream’ saved [604952/604952]
cfb@sandman:foo/> wget https://github.com/cloudflare/zlib/archive/refs/tags/v1.2.8.tar.gz -O cloudflare
--2022-03-04 11:07:59-- https://github.com/cloudflare/zlib/archive/refs/tags/v1.2.8.tar.gz
Resolving github.com (github.com)... 192.30.255.112
Connecting to github.com (github.com)|192.30.255.112|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://codeload.github.com/cloudflare/zlib/tar.gz/refs/tags/v1.2.8 [following]
--2022-03-04 11:07:59-- https://codeload.github.com/cloudflare/zlib/tar.gz/refs/tags/v1.2.8
Resolving codeload.github.com (codeload.github.com)... 192.30.255.120
Connecting to codeload.github.com (codeload.github.com)|192.30.255.120|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 604952 (591K) [application/x-gzip]
Saving to: ‘cloudflare’
cloudflare 100%[================================================================================================================================================================================================================================================================>] 590.77K 3.11MB/s in 0.2s s
2022-03-04 11:08:00 (3.11 MB/s) - ‘cloudflare’ saved [604952/604952]
cfb@sandman:foo/> sha256sum *
e380bd1bdb6447508beaa50efc653fe45f4edc1dafe11a251ae093e0ee97db9a cloudflare
e380bd1bdb6447508beaa50efc653fe45f4edc1dafe11a251ae093e0ee97db9a upstream
The net effect of what has been merged at this time is that elastic is not seeing any performance improvements as its not running the cloudflare code. Additionally, by using the older upstream tag the docker images have the following CVEs that would be fixed by using the zlib package provided in Ubuntu 20.04.
https://nvd.nist.gov/vuln/detail/CVE-2016-9840 - CVSS V3.x Score: 8.8 https://nvd.nist.gov/vuln/detail/CVE-2016-9841 - CVSS V3.x Score: 9.8 https://nvd.nist.gov/vuln/detail/CVE-2016-9842 - CVSS V3.x Score: 8.8 https://nvd.nist.gov/vuln/detail/CVE-2016-9843 - CVSS V3.x Score: 9.8
Since it seems there is some debate as to exactly what/how to provide this improved performance I would recommend the changes to this repo (#81245) and the public docker build repo (https://github.com/elastic/dockerfiles/pull/95) be reverted until this discussion is resolved.
It seems to me that the issue with Cloudflare zlib is it's very much not intended for general consumption. The only way to stay up to date with the latest upstream zlib is to consume HEAD from that repo which is far to prone to errors and I don't think we have sufficient coverage against our Docker packaging format to give us strong confidence here.
My preference would be to revert this change until we can investigate a better way to ensure we are integrating an up-to-date fork of zlib and have better test coverage in this area. @jpountz thoughts?
@mark-vieira +1 to revert for now
We should also investigate the use of Intel's IPP zlib as another alternative option.
As discussed in #81208, using Cloudflare's zlib implementation improves performance in the following cases:
index.codec: best_compression
, which we use by default for observability and security integrations' data.transport.compression_scheme: deflate
(not the default for local transport, but is the default for remote clusters when compression is enabled as of #76580)Whereas #81208 scoped the initial approach to our Docker image, where we control both the distribution and the environment, this issue asks if we can go one step further and bundle it in our tarballs. Some general notes about why this is worth considering:
elasticsearch
script, alleviating need to control the environment variables externally.81245 added a build step to the Dockerfile for the cloudflare-zlib library. If we could instead host or fetch the desired library we could remove a build step.
If instead we packaged an improved zlib in the Elasticsearch distribution, we could reap the following advantages:
best_compression
, which can be considered for more uses potentially if it comes with a lower expected overheadDisadvantages:
We recently brought up this issue in a Fix-It meeting, and open questions included:
zlib-ng
, which is actively maintained and spells out a few of its own advantages versus Cloudflare nicely here: https://github.com/zlib-ng/zlib-ng/discussions/871#discussioncomment-497539