elastic / elasticsearch-hadoop

:elephant: Elasticsearch real-time search and analytics natively integrated with Hadoop
https://www.elastic.co/products/hadoop
Apache License 2.0

Unable to index with elasticsearch-spark on serverless elasticsearch #2222

Open RalphSchuurman opened 3 weeks ago

RalphSchuurman commented 3 weeks ago

What kind of issue is this?

Issue description

I received access to Elasticsearch Serverless and would like to move over to it, but I cannot get the elasticsearch-spark connector to work. I am using Databricks with the 13.3 LTS runtime, Scala 2.12, and Spark 3.4.1. I chose org.elasticsearch:elasticsearch-spark-30_2.12:8.11.0 because the client from the elasticsearch-serverless library reports 8.11.0 as the version:

from elasticsearch_serverless import Elasticsearch
client = Elasticsearch("serverless-endpoint", api_key="xxx")
client.info()

gives

ObjectApiResponse({'name': 'serverless', 'cluster_name': 'xxx', 'cluster_uuid': 'xxx', 'version': {'number': '8.11.0', 'build_flavor': 'serverless', 'build_type': 'docker', 'build_hash': '00000000', 'build_date': '2023-10-31', 'build_snapshot': False, 'lucene_version': '9.7.0', 'minimum_wire_compatibility_version': '8.11.0', 'minimum_index_compatibility_version': '8.11.0'}, 'tagline': 'You Know, for Search'})
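
The two fields of interest in that response are the version number and the build flavor. A minimal sketch of pulling them out, assuming the response is available as a plain dict (the helper name `describe_cluster` is hypothetical, and the sample is trimmed from the output above):

```python
def describe_cluster(info: dict) -> tuple[str, str]:
    """Return (version number, build flavor) from a root-endpoint response."""
    version = info.get("version", {})
    return version.get("number", "unknown"), version.get("build_flavor", "unknown")


# Sample trimmed from the client.info() output shown above.
sample = {
    "name": "serverless",
    "version": {"number": "8.11.0", "build_flavor": "serverless"},
    "tagline": "You Know, for Search",
}

number, flavor = describe_cluster(sample)
print(f"version={number} flavor={flavor}")
```

With a live client, passing `client.info().body` instead of `sample` should work the same way, assuming the response object exposes its payload as `.body`.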

Steps to reproduce

Code:


endpoint = 'serverless-endpoint'
username = 'username'
password = 'password'
index_name = 'index'

# df is an existing Spark DataFrame with an "Identifier" column.
(df.write.format("org.elasticsearch.spark.sql")
    .option("es.nodes", endpoint)
    .option("es.port", "443")
    .option("es.mapping.id", "Identifier")  # column used as the document _id
    .option("es.net.ssl", "true")
    .option("es.nodes.wan.only", "true")
    .option("es.net.http.auth.user", username)
    .option("es.net.http.auth.pass", password)
    .option("es.field.read.empty.as.null", "true")
    .option("es.batch.write.retry.count", "5")
    .option("es.batch.write.retry.wait", "25")
    .option("es.write.operation", "upsert")
    .mode("append")
    .save(index_name))
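
Option names are plain strings, so a misspelled key is easy to miss. A minimal sketch of a check that could run before the write, assuming the options are first collected in a dict; the helper `check_option_names` is hypothetical, and `KNOWN_OPTIONS` lists only the settings used in this report, not the connector's full set:

```python
import difflib

# Settings used in the write call above, in their documented spelling.
# Not a complete list of connector settings.
KNOWN_OPTIONS = [
    "es.nodes",
    "es.port",
    "es.mapping.id",
    "es.net.ssl",
    "es.nodes.wan.only",
    "es.net.http.auth.user",
    "es.net.http.auth.pass",
    "es.field.read.empty.as.null",
    "es.batch.write.retry.count",
    "es.batch.write.retry.wait",
    "es.write.operation",
]


def check_option_names(options: dict) -> dict:
    """Map each unrecognised option name to its closest known name, or None."""
    suspects = {}
    for name in options:
        if name not in KNOWN_OPTIONS:
            close = difflib.get_close_matches(name, KNOWN_OPTIONS, n=1)
            suspects[name] = close[0] if close else None
    return suspects


options = {
    "es.nodes.wan.only": "true",
    "es.batch.write.retry.count": "5",
    "es.bath.write.retry.wait": "25",  # deliberately misspelled for illustration
}
print(check_option_names(options))
```

An empty result means every name matched the list. A check like this only covers spelling and has no bearing on whether the cluster accepts the connection.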

Stack trace:

Py4JJavaError: An error occurred while calling o609.save.
: org.elasticsearch.hadoop.EsHadoopIllegalArgumentException: Cannot detect ES version - typically this happens if the network/Elasticsearch cluster is not accessible or when targeting a WAN/Cloud instance without the proper setting 'es.nodes.wan.only'

Changing es.nodes.wan.only to false does not change the outcome.

Version Info

OS: Databricks with 13.3 LTS runtime, Scala 2.12, and Spark 3.4.1
Hadoop/Spark: org.elasticsearch:elasticsearch-spark-30_2.12:8.11.0
ES: Elasticsearch Serverless

masseyke commented 3 weeks ago

Hi @RalphSchuurman. Serverless Elasticsearch currently supports only a subset of full Elasticsearch functionality. ES-Hadoop/Spark is not supported, and there are no immediate plans to support it.