Open valguz opened 4 years ago
This is really a 2-parter: Fixing the immediate issue and also making sure that we test with punctuation in field names across the entire project.
Looking at https://github.com/elastic/elasticsearch-hadoop/blob/master/mr/src/main/java/org/elasticsearch/hadoop/util/StringUtils.java#L42 and at where the concatenation is done here https://github.com/elastic/elasticsearch-hadoop/blob/master/spark/sql-20/src/main/scala/org/elasticsearch/spark/sql/SchemaUtils.scala#L295, it looks, from an outsider's perspective, like representing the data internally as something other than a string may be the way to go. We did a very hacky workaround: we forked this library and removed the final declaration from the delimiter variable so that we could set it in our app to something like a record separator:
org.elasticsearch.hadoop.util.StringUtils.DEFAULT_DELIMITER = "\u001e"
That requires us to define that separator in other places as well. For instance, es.read.field.exclude becomes:
.option("es.read.field.exclude", "some_field_here\u001e other_field_here")
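To make the failure mode concrete, here is a minimal sketch of why joining field names into one delimited string breaks when a name contains the delimiter. It does not reproduce the library's actual code: `DelimiterSketch`, `joinFields`, and `splitFields` are hypothetical names invented for this illustration, and the field name "last, first" is made up.

```scala
// Illustration only: these helpers are hypothetical, not the elasticsearch-hadoop API.
object DelimiterSketch {
  // Join field names into a single string using the given delimiter.
  def joinFields(fields: Seq[String], delimiter: String): String =
    fields.mkString(delimiter)

  // Split the joined string back into field names.
  // Pattern.quote keeps the delimiter from being interpreted as a regex.
  def splitFields(joined: String, delimiter: String): Seq[String] =
    joined.split(java.util.regex.Pattern.quote(delimiter), -1).toSeq

  def main(args: Array[String]): Unit = {
    // Two fields, one of which has a comma in its name.
    val fields = Seq("last, first", "age")

    // Comma delimiter: the comma inside the name is mistaken for a separator.
    val viaComma = splitFields(joinFields(fields, ","), ",")
    println(s"comma delimiter round-trips: ${viaComma == fields}")

    // Record separator delimiter: the names survive the round trip,
    // as long as no field name contains that character.
    val recordSeparator = "\u001e"
    val viaRs = splitFields(joinFields(fields, recordSeparator), recordSeparator)
    println(s"record separator round-trips: ${viaRs == fields}")
  }
}
```

Swapping the delimiter only moves the problem to a less likely character, which is why keeping the names in a list rather than a joined string seems like the sturdier fix.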
I look forward to following this issue and seeing what the ES Hadoop team comes up with. Thanks for posting this @valguz
I've got a draft PR up that fixes this one, but we're a little worried that it might introduce unexpected new problems. Is anyone still running into this? I'm not sure how common it is to use commas in field names (I had never seen it before coming across this ticket, and would have guessed that Elasticsearch did not allow it).
What kind of issue is this?
The easier it is to track down the bug, the faster it is solved.
Often a solution already exists! Don’t send pull requests to implement new features without first getting our support. Sometimes we leave features out on purpose to keep the project small.
Issue description
Description: ES supports fields with commas in them, however, the Hadoop library doesn't seem to support this.
Steps to reproduce
Code:
On ES, create an index:
Put some data in it:
Create a new program (this one is in scala):
Error results as follows:
Stack traces: