elastic / elasticsearch-hadoop

:elephant: Elasticsearch real-time search and analytics natively integrated with Hadoop
https://www.elastic.co/products/hadoop
Apache License 2.0

Does Elasticsearch-Hadoop support HTTPS proxy connections? #2230

Open AVN9399 opened 4 months ago

AVN9399 commented 4 months ago

What kind of issue is this?

Issue description

In my virtual machine, my Spark application is trying to push data to an Elasticsearch server using the JavaEsSparkSQL.saveToEs() function. The request has to pass through an HTTPS proxy. Spark seems able to connect to the proxy server, but it does not appear to recognize the proxy's username and password.

The proxy is fully accessible using curl or from other parts of my Java code.

I've tried numerous approaches, but none have been successful. I'm unsure if Elasticsearch-Hadoop supports HTTPS proxy connections.

Steps to reproduce

Code:

My spark-defaults.conf

spark.executor.extraJavaOptions   -Dhttps.proxyHost=proxyhost -Dhttps.proxyPort=proxyport -Dhttps.proxyUser=XXXX -Dhttps.proxyPassword=XXXX -Djdk.http.auth.tunneling.disabledSchemes= -Djdk.http.auth.proxying.disabledSchemes=
spark.driver.extraJavaOptions     -Dderby.system.home=/tmp/derby/ -Dhttps.proxyHost=proxyhost  -Dhttps.proxyPort=proxyport -Dhttps.proxyUser=XXXX  -Dhttps.proxyPassword=XXXX  -Djdk.http.auth.tunneling.disabledSchemes=   -Djdk.http.auth.proxying.disabledSchemes=
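
For comparison, here is a minimal sketch of passing the proxy credentials through the connector's own settings instead of JVM system properties. It assumes the documented elasticsearch-hadoop keys `es.net.proxy.https.host`, `es.net.proxy.https.port`, `es.net.proxy.https.user` and `es.net.proxy.https.pass`, plus `es.net.ssl` and the `es.nodes.wan.only` setting named in the error message. The class name `EsProxyOptionsSketch`, the helper `proxyOptions`, the port `3128` and the index name in the comment are hypothetical placeholders. Whether this resolves the 407 in this environment has not been tested.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class EsProxyOptionsSketch {

    // Builds a map of connector-level options. All values are placeholders
    // supplied by the caller; nothing here contacts a proxy or a cluster.
    static Map<String, String> proxyOptions(String host, String port,
                                            String user, String pass) {
        Map<String, String> cfg = new LinkedHashMap<>();
        cfg.put("es.net.ssl", "true");            // target cluster is reached over HTTPS
        cfg.put("es.nodes.wan.only", "true");     // setting suggested by the error message
        cfg.put("es.net.proxy.https.host", host); // proxy host
        cfg.put("es.net.proxy.https.port", port); // proxy port
        cfg.put("es.net.proxy.https.user", user); // proxy username
        cfg.put("es.net.proxy.https.pass", pass); // proxy password
        return cfg;
    }

    public static void main(String[] args) {
        Map<String, String> cfg = proxyOptions("proxyhost", "3128", "XXXX", "XXXX");
        // Print the options with the password masked.
        cfg.forEach((k, v) ->
                System.out.println(k + "=" + (k.endsWith(".pass") ? "****" : v)));
        // With Spark and elasticsearch-hadoop on the classpath, the map could be
        // passed to the overload that accepts a configuration map, for example:
        // JavaEsSparkSQL.saveToEs(dataset, "index-name", cfg);
    }
}
```

The same keys can also be set in spark-defaults.conf with a `spark.` prefix (for example `spark.es.net.proxy.https.user`), since Spark only forwards properties that start with `spark.`.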

Stack trace:

24/05/21 17:08:17 DEBUG HeaderProcessor: Added HTTP Headers to method: [X-Opaque-ID: [spark] [portail] [Projet_P16016_SEARCH ENGINE To Elastic Search Data] [app-20240521170655-0092]
, User-Agent: elasticsearch-hadoop/8.12.0 spark/3.1.1
, Content-Type: application/json
, Accept: application/json
]
24/05/21 17:08:17 DEBUG CommonsHttpTransport: Using regular user provider to wrap rest request
24/05/21 17:08:17 TRACE CommonsHttpTransport: Tx [HTTPS proxyhost:proxyport][GET]@[elasticsearchserver:443][]?[null] w/ payload [null]
24/05/21 17:08:17 WARN HttpMethodDirector: Required proxy credentials not available for BASIC <any realm>@proxyhost:proxyport
24/05/21 17:08:17 WARN HttpMethodDirector: Preemptive authentication requested but no default proxy credentials available
24/05/21 17:08:17 INFO AuthChallengeProcessor: Basic authentication scheme selected
24/05/21 17:08:17 INFO HttpMethodDirector: Failure authenticating with BASIC 'ECH'@proxyhost:proxyport
24/05/21 17:08:17 TRACE CommonsHttpTransport: Rx [HTTPS proxy proxyhost:proxyport]@[10.86.XX.XXX] [407-Proxy Authentication Required]
...
...
...
24/05/21 17:08:17 TRACE CommonsHttpTransport: Closing HTTP transport to 109es125.fr1.esaas.tech.orange:443
Exception in thread "main" org.elasticsearch.hadoop.EsHadoopIllegalArgumentException: Cannot detect ES version - typically this happens if the network/Elasticsearch cluster is not accessible or when targeting a WAN/Cloud instance without the proper setting 'es.nodes.wan.only'
        at org.elasticsearch.hadoop.rest.InitializationUtils.discoverClusterInfo(InitializationUtils.java:403)
        at org.elasticsearch.spark.sql.EsSparkSQL$.saveToEs(EsSparkSQL.scala:99)
        at org.elasticsearch.spark.sql.EsSparkSQL$.saveToEs(EsSparkSQL.scala:81)
        at org.elasticsearch.spark.sql.api.java.JavaEsSparkSQL$.saveToEs(JavaEsSparkSQL.scala:51)
        at org.elasticsearch.spark.sql.api.java.JavaEsSparkSQL.saveToEs(JavaEsSparkSQL.scala)
        at com.orange.bigdata.app.elk.v2.IndexationSEBruteOptimiseParEntite.main(IndexationSEBruteOptimiseParEntite.java:273)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
        at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:951)
        at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
        at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
        at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
        at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1030)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1039)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: org.elasticsearch.hadoop.rest.EsHadoopInvalidRequest: [GET] on [] failed; server[109es125.fr1.esaas.tech.orange:443] returned [407|Proxy Authentication Required]
...
...
        at org.elasticsearch.hadoop.rest.RestClient.checkResponse(RestClient.java:487)
        at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:444)
        at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:438)
        at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:406)
        at org.elasticsearch.hadoop.rest.RestClient.mainInfo(RestClient.java:755)
        at org.elasticsearch.hadoop.rest.InitializationUtils.discoverClusterInfo(InitializationUtils.java:393)
        ... 17 more

Version Info

OS: Linux
JVM: Temurin-jdk-8
Hadoop/Spark: Spark3
ES-Hadoop: 8.12.0
ES: 7.17.2

Feature description