elastic / elasticsearch-hadoop

:elephant: Elasticsearch real-time search and analytics natively integrated with Hadoop
https://www.elastic.co/products/hadoop
Apache License 2.0

7.17.11 failed backwards compatibility for 5.5.3 #2200

Closed by sj-ganwh 4 months ago

sj-ganwh commented 4 months ago

What kind of issue is this?

Issue description

Due to CVE-2023-46674, we need to upgrade Elasticsearch-hadoop to >= 7.17.11 or >= 8.9.0, and we plan to move to 7.17.11. The README at https://github.com/elastic/elasticsearch-hadoop states that "ES-Hadoop 6.x and higher are compatible with Elasticsearch 1.X, 2.X, 5.X, and 6.X", so ES-Hadoop 7.x and higher should be compatible with Elasticsearch 1.X, 2.X, 5.X, 6.X and 7.X. But when we replace elasticsearch-hadoop-5.5.3.jar with elasticsearch-hadoop-7.17.11.jar, the query below fails.

Steps to reproduce

Code:

-- add jar hdfs:///lib/jdbc/elasticsearch-hadoop-5.5.3.jar;
add jar hdfs:///lib/jdbc/elasticsearch-hadoop-7.17.11.jar;

create table default.tmp_es (
    id bigint,
    name string,
    update_time timestamp
)
stored by 'org.elasticsearch.hadoop.hive.EsStorageHandler'
tblproperties (
    "es.nodes"="172.26.1.1:9200",
    "es.net.http.header.Authorization"="Basic xxx",
    "es.nodes.wan.only"="true",
    "es.nodes.discovery"="false",
    "es.http.retries"="10",
    -- index/type
    "es.resource"="tmp_es/doc",
    "es.mapping.id"="id",
    -- dst:src
    "es.mapping.names"="update_time:updateTime"
);

select * from default.tmp_es;

Stack trace:

Failed with exception java.io.IOException:org.elasticsearch.hadoop.EsHadoopIllegalArgumentException: Cannot detect ES version - typically this happens if the network/Elasticsearch cluster is not accessible or when targeting a WAN/Cloud instance without the proper setting 'es.nodes.wan.only'

Version Info

OS: CentOS 7
JVM: jdk8
Hadoop/Spark: hadoop 2.7
Hive: 2.1.1
ES-Hadoop: 7.17.11
ES: 5.5.3

sj-ganwh commented 4 months ago

Tested, elasticsearch-hadoop-7.1.1.jar is compatible with Elasticsearch 5.5.3.

sj-ganwh commented 4 months ago

CVE-2023-46674 further reading: https://discuss.elastic.co/t/elasticsearch-hadoop-7-17-11-8-9-0-security-update-esa-2023-28/348663

sj-ganwh commented 4 months ago

I changed major.before(EsMajorVersion.V_6_X to major.before(EsMajorVersion.V_5_X in https://github.com/sj-ganwh/elasticsearch-hadoop/commit/52d5505d627b714dd369164382d17eb564b30e63, rebuilt from that commit, and it worked. These are the artifacts: https://github.com/sj-ganwh/elasticsearch-hadoop/actions/runs/8014107473 In my case, I only use a Hive external table to read from ES 5.5.3. Does this kind of 'force fix' have any side effects?
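To make the effect of that one-line change concrete, here is a minimal Python sketch of a major-version gate. It is not the ES-Hadoop source (which is Java); the names `MajorVersion`, `V_5_X`, `V_6_X`, and `is_rejected` are hypothetical stand-ins, and the sketch models only the `before(...)` comparison mentioned above, not anything else the connector does during setup.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MajorVersion:
    """Hypothetical stand-in for a major-version value such as 5 or 6."""
    major: int

    @classmethod
    def parse(cls, version: str) -> "MajorVersion":
        # "5.5.3" -> MajorVersion(5); only the leading component matters here.
        return cls(int(version.split(".", 1)[0]))

    def before(self, other: "MajorVersion") -> bool:
        return self.major < other.major


V_5_X = MajorVersion(5)
V_6_X = MajorVersion(6)


def is_rejected(cluster_version: str, minimum: MajorVersion) -> bool:
    """Return True if the cluster version falls below the minimum major version."""
    return MajorVersion.parse(cluster_version).before(minimum)


if __name__ == "__main__":
    # With a 6.x minimum, a 5.5.3 cluster is rejected; with a 5.x minimum it is not.
    print(is_rejected("5.5.3", V_6_X))
    print(is_rejected("5.5.3", V_5_X))
```

Lowering the threshold only changes which clusters get past this gate. It says nothing about whether the rest of the connector behaves correctly against the older cluster, which is the open question in the comment above.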

jbaiera commented 4 months ago

In 7.14 we added a validation check to the library to ensure that ES-Hadoop is contacting an Elasticsearch distribution, and not something else, during its setup process: https://github.com/elastic/elasticsearch-hadoop/pull/1696 The response header that this check relies on is only present in more recent versions of Elasticsearch. The connector could technically still work with older versions of Elasticsearch, but without the header it will refuse to proceed past the initial handshake. Elasticsearch should be on a version that sends the header before ES-Hadoop is upgraded to 7.14 or later.
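A header-based product check of the kind described above can be sketched roughly as follows. This is an illustrative Python sketch, not the connector's Java implementation. The header name `X-Elastic-Product`, its expected value, and the function `passes_product_check` are assumptions made for the example, since the comment above only refers to "the header".

```python
from typing import Mapping

# Assumption: the thread does not name the header or its value;
# these constants are placeholders for illustration only.
PRODUCT_HEADER = "X-Elastic-Product"
EXPECTED_PRODUCT = "Elasticsearch"


def passes_product_check(headers: Mapping[str, str]) -> bool:
    """Return True only if the response headers identify the expected product."""
    # HTTP header names are case-insensitive, so normalize before the lookup.
    normalized = {name.lower(): value for name, value in headers.items()}
    return normalized.get(PRODUCT_HEADER.lower()) == EXPECTED_PRODUCT


if __name__ == "__main__":
    # A response without the header fails the check even if the server is a
    # genuine but older cluster, which matches the behavior described above.
    print(passes_product_check({"content-type": "application/json"}))
    print(passes_product_check({"x-elastic-product": "Elasticsearch"}))
```

Under these assumptions, an older cluster that never sends the header is indistinguishable from a non-Elasticsearch server, which is why the handshake is rejected rather than downgraded.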