elastic / elasticsearch

Free and Open, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch
Other
69.49k stars 24.6k forks source link

Should we package an alternate zlib implementation in our distributions? #81662

Open DJRickyB opened 2 years ago

DJRickyB commented 2 years ago

As discussed in #81208, using Cloudflare's zlib implementation improves performance in the following cases:

Whereas #81208 scoped the initial approach to our Docker image, where we control both the distribution and the environment, this issue asks if we can go one step further and bundle it in our tarballs. Some general notes about why this is worth considering:

  1. When #81245 was implemented, it steered the concern of linking the alternate zlib into the elasticsearch script, alleviating need to control the environment variables externally.
  2. 81245 added a build step to the Dockerfile for the cloudflare-zlib library. If we could instead host or fetch the desired library we could remove a build step.

If instead we packaged an improved zlib in the Elasticsearch distribution, we could reap the following advantages:

Disadvantages:

We recently brought up this issue in a Fix-It meeting, and open questions included:

elasticmachine commented 2 years ago

Pinging @elastic/es-delivery (Team:Delivery)

mark-vieira commented 2 years ago

I don't have a lot of understanding here but another concern is compatibility. Unlike our Docker distribution, in which we own the runtime environment (mostly) enabling this for all Linux packages in general is tougher. To what degree does the Cloudflare zlib depend on system dynamic libraries or can we build a statically-linked binary to minimize any system compatibility concerns.

inqueue commented 2 years ago

+1 to building a statically-linked library, to make cloudflare-zlib universally available in all distributions that can support it instead of the LD_LIBRARY_PATH solution we have currently in the Docker distribution. There are a number of articles, this one is a good start, mentioning LD_LIBRARY_PATH should only be used in testing or development and not in production. I am sure it is fine in our isolated Docker distribution, I imagine users not using Docker would like to realize the benefits of Cloudflare zlib as well and we probably should not be recommending any modification to LD_LIBRARY_PATH.

droberts195 commented 2 years ago

+1 to building a statically-linked library

It's one thing to build a static library, but that can only get used if the program that currently expects a dynamic library is rebuilt to link the static library at build time.

In this case I think that program that would need to get rebuilt is java right? So statically linking zlib is really a proposal to produce our own build of Java, and not using the exact build system that comes with Java but with a modified build system that statically links zlib.

I agree LD_LIBRARY_PATH is considered bad practice - if it lists directories that don't contain the shared library then it opens up the possibility of an attacker putting a malicious library with the same name in one of those directories.

The other possibility that's not quite as radical as rebuilding java from scratch is to set an RPATH in the existing java program. This takes priority over LD_LIBRARY_PATH when looking for libraries and can be set without rebuilding using a tool called patchelf. We do this for some libraries we ship with ml-cpp in https://github.com/elastic/ml-cpp/blob/ecb9c6c036d1104c68c90281d554312285c2ac6d/3rd_party/3rd_party.sh#L286

The way library search works is that first RPATH is searched, then LD_LIBRARY_PATH, then RUNPATH and finally the system default locations. RPATH and RUNPATH are embedded in the executable and LD_LIBRARY_PATH is an environment variable.

So if you set RPATH to $ORIGIN/../lib then a program will prefer to load its shared libraries from a lib directory parallel to the bin directory it's in.

java currently sets a RUNPATH that looks in the same directory its in and in the parallel lib directory, for example:

$ readelf -d java | head -40

Dynamic section at offset 0x1d08 contains 33 entries:
  Tag        Type                         Name/Value
 0x0000000000000003 (PLTGOT)             0x2f88
 0x0000000000000002 (PLTRELSZ)           288 (bytes)
 0x0000000000000017 (JMPREL)             0x7d8
 0x0000000000000014 (PLTREL)             RELA
 0x0000000000000007 (RELA)               0x718
 0x0000000000000008 (RELASZ)             192 (bytes)
 0x0000000000000009 (RELAENT)            24 (bytes)
 0x000000006ffffff9 (RELACOUNT)          4
 0x0000000000000015 (DEBUG)              0x0
 0x0000000000000006 (SYMTAB)             0x278
 0x000000000000000b (SYMENT)             24 (bytes)
 0x0000000000000005 (STRTAB)             0x470
 0x000000000000000a (STRSZ)              381 (bytes)
 0x000000006ffffef5 (GNU_HASH)           0x5f0
 0x0000000000000004 (HASH)               0x628
 0x0000000000000001 (NEEDED)             Shared library: [libz.so.1]
 0x0000000000000001 (NEEDED)             Shared library: [libjli.so]
 0x0000000000000001 (NEEDED)             Shared library: [libpthread.so.0]
 0x0000000000000001 (NEEDED)             Shared library: [libdl.so.2]
 0x0000000000000001 (NEEDED)             Shared library: [libc.so.6]
 0x000000000000000c (INIT)               0x8f8
 0x000000000000000d (FINI)               0xd3c
 0x000000000000001a (FINI_ARRAY)         0x2cf8
 0x000000000000001c (FINI_ARRAYSZ)       8 (bytes)
 0x0000000000000019 (INIT_ARRAY)         0x2d00
 0x000000000000001b (INIT_ARRAYSZ)       8 (bytes)
 0x000000000000001d (RUNPATH)            Library runpath: [$ORIGIN:$ORIGIN/../lib] <-- HERE !
 0x000000000000001e (FLAGS)              BIND_NOW
 0x000000006ffffffb (FLAGS_1)            Flags: NOW PIE
 0x000000006ffffff0 (VERSYM)             0x6c8
 0x000000006ffffffe (VERNEED)            0x6f4
 0x000000006fffffff (VERNEEDNUM)         1
 0x0000000000000000 (NULL)               0x0

So I think if you put the dynamic library Cloudflare zlib in $JAVA_HOME/lib then it will get used in preference to the system version unless LD_LIBRARY_PATH is set. Or you could make it more robust by setting an RPATH on java so that it's immune to LD_LIBRARY_PATH manipulation. And if you wanted that RPATH could include a separate directory for Cloudflare's zlib in addition to $ORIGIN:$ORIGIN/../lib.

mark-vieira commented 2 years ago

I'd like to consider another option here which is providing guidance for folks deploying on Linux (and not using Docker) on how to install cloudflare-zlib themselves rather than us hack a distribution that tries to load it for them. While we certainly want to expose this optimization to as many users as possible, we don't want to sacrifice or complicate portability, which increases the more native code we package in Elasticsearch.

inqueue commented 2 years ago

I'd like to consider another option here which is providing guidance for folks deploying on Linux (and not using Docker) on how to install cloudflare-zlib themselves rather than us hack a distribution that tries to load it for them. While we certainly want to expose this optimization to as many users as possible, we don't want to sacrifice or complicate portability, which increases the more native code we package in Elasticsearch.

I had considered this solution and I am -1 as it pushes the burden to users to configure it properly. There could be an added support burden as it may not always be clear which zlib is in use.

mark-vieira commented 2 years ago

I had considered this solution and I am -1 as it pushes the burden to users to configure it properly. There could be an added support burden as it may not always be clear which zlib is in use.

Of course, I just wanted to make this explicit and ensure that we are triaging that burden against the technical complexity and maintenance cost of supporting bundling this functionality in. Let's keep in mind this is an optimization so it should be treated slightly differently than something that affects actual functionality.

There could be an added support burden as it may not always be clear which zlib is in use.

I'd also argue that issues related to us effectively "hacking" Java to use our bundled dynamic library are sure to arise, and will have a cost as well.

DJRickyB commented 2 years ago

Thanks all. After recent discussion I've updated the title and the description to better reflect more concerns

cburgess commented 2 years ago

While working on a a local build of elasticsearch in docker for my company I discovered the changes in 8.0+ to use the cloudflare zlib (#81245, and https://github.com/elastic/dockerfiles/pull/95). The problem with what has been merged is that these do not actually use the cloudflare zlib. The cloudflare zlib repo (https://github.com/cloudflare/zlib) is a fork of upstream zlib (https://github.com/madler/zlib). If you look you will see that the tags present in the cloudflare zlib repo are not from cloudflare, they are forked from upstream.

cfb@sandman:foo/> wget https://github.com/madler/zlib/archive/refs/tags/v1.2.8.tar.gz -O upstream
--2022-03-04 11:07:44--  https://github.com/madler/zlib/archive/refs/tags/v1.2.8.tar.gz
Resolving github.com (github.com)... 192.30.255.112
Connecting to github.com (github.com)|192.30.255.112|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://codeload.github.com/madler/zlib/tar.gz/refs/tags/v1.2.8 [following]
--2022-03-04 11:07:45--  https://codeload.github.com/madler/zlib/tar.gz/refs/tags/v1.2.8
Resolving codeload.github.com (codeload.github.com)... 192.30.255.120
Connecting to codeload.github.com (codeload.github.com)|192.30.255.120|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 604952 (591K) [application/x-gzip]
Saving to: ‘upstream’

upstream                                                                                            100%[================================================================================================================================================================================================================================================================>] 590.77K  3.00MB/s    in 0.2s

2022-03-04 11:07:45 (3.00 MB/s) - ‘upstream’ saved [604952/604952]

cfb@sandman:foo/> wget https://github.com/cloudflare/zlib/archive/refs/tags/v1.2.8.tar.gz -O cloudflare
--2022-03-04 11:07:59--  https://github.com/cloudflare/zlib/archive/refs/tags/v1.2.8.tar.gz
Resolving github.com (github.com)... 192.30.255.112
Connecting to github.com (github.com)|192.30.255.112|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://codeload.github.com/cloudflare/zlib/tar.gz/refs/tags/v1.2.8 [following]
--2022-03-04 11:07:59--  https://codeload.github.com/cloudflare/zlib/tar.gz/refs/tags/v1.2.8
Resolving codeload.github.com (codeload.github.com)... 192.30.255.120
Connecting to codeload.github.com (codeload.github.com)|192.30.255.120|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 604952 (591K) [application/x-gzip]
Saving to: ‘cloudflare’

cloudflare                                                                                          100%[================================================================================================================================================================================================================================================================>] 590.77K  3.11MB/s    in 0.2s    s

2022-03-04 11:08:00 (3.11 MB/s) - ‘cloudflare’ saved [604952/604952]

cfb@sandman:foo/> sha256sum *
e380bd1bdb6447508beaa50efc653fe45f4edc1dafe11a251ae093e0ee97db9a  cloudflare
e380bd1bdb6447508beaa50efc653fe45f4edc1dafe11a251ae093e0ee97db9a  upstream

The net effect of what has been merged at this time is that elastic is not seeing any performance improvements as its not running the cloudflare code. Additionally, by using the older upstream tag the docker images have the following CVEs that would be fixed by using the zlib package provided in Ubuntu 20.04.

https://nvd.nist.gov/vuln/detail/CVE-2016-9840 - CVSS V3.x Score: 8.8 https://nvd.nist.gov/vuln/detail/CVE-2016-9841 - CVSS V3.x Score: 9.8 https://nvd.nist.gov/vuln/detail/CVE-2016-9842 - CVSS V3.x Score: 8.8 https://nvd.nist.gov/vuln/detail/CVE-2016-9843 - CVSS V3.x Score: 9.8

Since it seems there is some debate as to exactly what/how to provide this improved performance I would recommend the changes to this repo (#81245) and the public docker build repo (https://github.com/elastic/dockerfiles/pull/95) be reverted until this discussion is resolved.

mark-vieira commented 2 years ago

It seems to me that the issue with Cloudflare zlib is it's very much not intended for general consumption. The only way to stay up to date with the latest upstream zlib is to consume HEAD from that repo which is far to prone to errors and I don't think we have sufficient coverage against our Docker packaging format to give us strong confidence here.

My preference would be to revert this change until we can investigate a better way to ensure we are integrating an up-to-date fork of zlib and have better test coverage in this area. @jpountz thoughts?

jpountz commented 2 years ago

@mark-vieira +1 to revert for now

b-deam commented 2 years ago

We should also investigate the use of Intel's IPP zlib as another alternative option.