elastic / elasticsearch-hadoop

:elephant: Elasticsearch real-time search and analytics natively integrated with Hadoop
https://www.elastic.co/products/hadoop
Apache License 2.0

es.write.operation documentation is deceptive about default values when used via Spark #2206

Open robwithhair opened 3 months ago

robwithhair commented 3 months ago

What kind of issue is this?

Issue description

The documentation states that the default for es.write.operation is index, but when the connector is used via Spark with output mode "update", the effective default is upsert. This is only discoverable by reading the code.

This is deceptive because the documentation implies that the default of index applies in Spark update mode, whereas testing and a visual review of the code both indicate that the default is overridden to "upsert".

Steps to reproduce

Code:

N/A, as this is a documentation fix.

Stack trace:

N/A

jbaiera commented 3 months ago

This could be better detailed in the docs for sure.

When using update mode in Spark SQL, the connector changes the operation to "upsert" for two reasons: 1) it needs that request mode to satisfy the invariants defined by Spark, and 2) it anticipates that you need that setting in order to use that mode, so it sets it for you rather than making you say in multiple places that you want to update data.
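As a rough illustration of the defaulting described above, here is a minimal Python model. It is not the connector's code: the function name `default_write_operation` and the plain `options` dict are hypothetical, and the sketch only covers the case where es.write.operation is left unset. It assumes the documented default of index applies in any output mode other than "update".

```python
# Illustrative model only; this is NOT the connector's implementation.
# `default_write_operation` is a hypothetical helper invented for this sketch.

DOCUMENTED_DEFAULT = "index"  # default for es.write.operation per the docs


def default_write_operation(output_mode):
    """Return the operation that takes effect when es.write.operation is unset.

    Assumption: only Spark's "update" output mode changes the default;
    every other mode falls back to the documented default.
    """
    if output_mode == "update":
        return "upsert"
    return DOCUMENTED_DEFAULT


# Spelling the operation out in the options keeps the intent visible
# instead of relying on the implicit override.
options = {"es.write.operation": default_write_operation("update")}

print(default_write_operation("append"))
print(default_write_operation("update"))
print(options)
```

Setting es.write.operation to upsert explicitly when writing in update mode makes the effective behavior visible in the job configuration, which avoids the surprise this issue describes.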

Fun fact: there are actually quite a lot of things in Spark that we plug into in order to modify the connector's behavior based on your API usage. Examples include pushing down queries to ES (by default we don't filter results from the server, but we generate queries based on the query plan if we're able to) and limiting the fields returned from the server (we'll intercept the field projection from Spark if it's available, so we don't pull a bunch of fields from each document that aren't needed for the operation). It's tough to list these all out because in some cases we merge existing configurations together, in other cases we override them, and sometimes we just offload some of the concern onto the library code so users don't have to worry about configurations.