elastic / elasticsearch-hadoop

:elephant: Elasticsearch real-time search and analytics natively integrated with Hadoop
https://www.elastic.co/products/hadoop
Apache License 2.0

shard preference concatenation with | gives query error #874

Closed: megri closed this issue 8 years ago

megri commented 8 years ago

What kind of issue is this?

Issue description

The change to concatenate the shard preference with _local using | results in

{"error":{"root_cause":[{"type":"number_format_exception","reason":"For input string: \"0|_local\""}],"type":"number_format_exception","reason":"For input string: \"0|_local\""},"status":400}%

Steps to reproduce

Launch spark-shell with ./bin/spark-shell --packages org.elasticsearch:elasticsearch-spark-20_2.11:5.0.0-rc1

NOTE: running the same steps with org.elasticsearch:elasticsearch-spark-20_2.11:5.0.0-beta1 works.

Code:

scala> import org.elasticsearch.spark._
import org.elasticsearch.spark._

scala> val rdd = sc.esRDD("myindex")
rdd: org.apache.spark.rdd.RDD[(String, scala.collection.Map[String,AnyRef])] = ScalaEsRDD[0] at RDD at AbstractEsRDD.scala:19

scala> rdd.count

Stack trace:

➜  spark-2.0.1-bin-hadoop2.7 ./bin/spark-shell --conf spark.es.nodes.wan.only=true --packages org.elasticsearch:elasticsearch-spark-20_2.11:5.0.0-rc1
Ivy Default Cache set to: /***/.ivy2/cache
The jars for the packages stored in: /***/.ivy2/jars
:: loading settings :: url = jar:file:/***/dev/labs/spark-2.0.1-bin-hadoop2.7/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
org.elasticsearch#elasticsearch-spark-20_2.11 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
    confs: [default]
    found org.elasticsearch#elasticsearch-spark-20_2.11;5.0.0-rc1 in central
:: resolution report :: resolve 174ms :: artifacts dl 2ms
    :: modules in use:
    org.elasticsearch#elasticsearch-spark-20_2.11;5.0.0-rc1 from central in [default]
    ---------------------------------------------------------------------
    |                  |            modules            ||   artifacts   |
    |       conf       | number| search|dwnlded|evicted|| number|dwnlded|
    ---------------------------------------------------------------------
    |      default     |   1   |   0   |   0   |   0   ||   1   |   0   |
    ---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent
    confs: [default]
    0 artifacts copied, 1 already retrieved (0kB/6ms)
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel).
16/10/20 18:25:04 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/10/20 18:25:04 WARN SparkContext: Use an existing SparkContext, some configuration may not take effect.
Spark context Web UI available at http://10.91.88.79:4040
Spark context available as 'sc' (master = local[*], app id = local-1476980704689).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.0.1
      /_/

Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_102)
Type in expressions to have them evaluated.
Type :help for more information.

scala> import org.elasticsearch.spark._
import org.elasticsearch.spark._

scala> val rdd = sc.esRDD("myindex")
rdd: org.apache.spark.rdd.RDD[(String, scala.collection.Map[String,AnyRef])] = ScalaEsRDD[0] at RDD at AbstractEsRDD.scala:19

scala> rdd.count
[Stage 0:>                                                          (0 + 0) / 3]16/10/20 18:27:02 ERROR NetworkClient: Node [localhost:9200] failed (Invalid target URI POST@null/comicbook/_search?search_type=scan&scroll=5m&size=50&preference=_shards:0|_local); no other nodes left - aborting...
16/10/20 18:27:02 ERROR NetworkClient: Node [localhost:9200] failed (Invalid target URI POST@null/comicbook/_search?search_type=scan&scroll=5m&size=50&preference=_shards:2|_local); no other nodes left - aborting...
16/10/20 18:27:02 ERROR NetworkClient: Node [localhost:9200] failed (Invalid target URI POST@null/comicbook/_search?search_type=scan&scroll=5m&size=50&preference=_shards:1|_local); no other nodes left - aborting...
16/10/20 18:27:02 ERROR Executor: Exception in task 1.0 in stage 0.0 (TID 1)
org.elasticsearch.hadoop.rest.EsHadoopNoNodesLeftException: Connection error (check network and/or proxy settings)- all nodes failed; tried [[localhost:9200]]
  at org.elasticsearch.hadoop.rest.NetworkClient.execute(NetworkClient.java:150)
  at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:444)
  at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:436)
  at org.elasticsearch.hadoop.rest.RestRepository.scroll(RestRepository.java:363)
  at org.elasticsearch.hadoop.rest.ScrollQuery.hasNext(ScrollQuery.java:92)
  at org.elasticsearch.spark.rdd.AbstractEsRDDIterator.hasNext(AbstractEsRDDIterator.scala:43)
  at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1702)
  at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1134)
  at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1134)
  at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1916)
  at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1916)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
  at org.apache.spark.scheduler.Task.run(Task.scala:86)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
  at java.lang.Thread.run(Thread.java:745)
16/10/20 18:27:02 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
org.elasticsearch.hadoop.rest.EsHadoopNoNodesLeftException: Connection error (check network and/or proxy settings)- all nodes failed; tried [[localhost:9200]]
  at org.elasticsearch.hadoop.rest.NetworkClient.execute(NetworkClient.java:150)
  at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:444)
  at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:436)
  at org.elasticsearch.hadoop.rest.RestRepository.scroll(RestRepository.java:363)
  at org.elasticsearch.hadoop.rest.ScrollQuery.hasNext(ScrollQuery.java:92)
  at org.elasticsearch.spark.rdd.AbstractEsRDDIterator.hasNext(AbstractEsRDDIterator.scala:43)
  at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1702)
  at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1134)
  at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1134)
  at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1916)
  at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1916)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
  at org.apache.spark.scheduler.Task.run(Task.scala:86)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
  at java.lang.Thread.run(Thread.java:745)
16/10/20 18:27:02 ERROR Executor: Exception in task 2.0 in stage 0.0 (TID 2)
org.elasticsearch.hadoop.rest.EsHadoopNoNodesLeftException: Connection error (check network and/or proxy settings)- all nodes failed; tried [[localhost:9200]]
  at org.elasticsearch.hadoop.rest.NetworkClient.execute(NetworkClient.java:150)
  at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:444)
  at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:436)
  at org.elasticsearch.hadoop.rest.RestRepository.scroll(RestRepository.java:363)
  at org.elasticsearch.hadoop.rest.ScrollQuery.hasNext(ScrollQuery.java:92)
  at org.elasticsearch.spark.rdd.AbstractEsRDDIterator.hasNext(AbstractEsRDDIterator.scala:43)
  at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1702)
  at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1134)
  at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1134)
  at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1916)
  at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1916)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
  at org.apache.spark.scheduler.Task.run(Task.scala:86)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
  at java.lang.Thread.run(Thread.java:745)
16/10/20 18:27:02 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, localhost): org.elasticsearch.hadoop.rest.EsHadoopNoNodesLeftException: Connection error (check network and/or proxy settings)- all nodes failed; tried [[localhost:9200]]
  at org.elasticsearch.hadoop.rest.NetworkClient.execute(NetworkClient.java:150)
  at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:444)
  at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:436)
  at org.elasticsearch.hadoop.rest.RestRepository.scroll(RestRepository.java:363)
  at org.elasticsearch.hadoop.rest.ScrollQuery.hasNext(ScrollQuery.java:92)
  at org.elasticsearch.spark.rdd.AbstractEsRDDIterator.hasNext(AbstractEsRDDIterator.scala:43)
  at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1702)
  at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1134)
  at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1134)
  at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1916)
  at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1916)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
  at org.apache.spark.scheduler.Task.run(Task.scala:86)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
  at java.lang.Thread.run(Thread.java:745)

16/10/20 18:27:02 ERROR TaskSetManager: Task 0 in stage 0.0 failed 1 times; aborting job
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): org.elasticsearch.hadoop.rest.EsHadoopNoNodesLeftException: Connection error (check network and/or proxy settings)- all nodes failed; tried [[localhost:9200]]
  at org.elasticsearch.hadoop.rest.NetworkClient.execute(NetworkClient.java:150)
  at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:444)
  at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:436)
  at org.elasticsearch.hadoop.rest.RestRepository.scroll(RestRepository.java:363)
  at org.elasticsearch.hadoop.rest.ScrollQuery.hasNext(ScrollQuery.java:92)
  at org.elasticsearch.spark.rdd.AbstractEsRDDIterator.hasNext(AbstractEsRDDIterator.scala:43)
  at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1702)
  at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1134)
  at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1134)
  at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1916)
  at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1916)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
  at org.apache.spark.scheduler.Task.run(Task.scala:86)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
  at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
  at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1454)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1442)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1441)
  at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
  at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1441)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)
  at scala.Option.foreach(Option.scala:257)
  at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:811)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1667)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1622)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1611)
  at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
  at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:632)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:1890)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:1903)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:1916)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:1930)
  at org.apache.spark.rdd.RDD.count(RDD.scala:1134)
  ... 50 elided
Caused by: org.elasticsearch.hadoop.rest.EsHadoopNoNodesLeftException: Connection error (check network and/or proxy settings)- all nodes failed; tried [[localhost:9200]]
  at org.elasticsearch.hadoop.rest.NetworkClient.execute(NetworkClient.java:150)
  at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:444)
  at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:436)
  at org.elasticsearch.hadoop.rest.RestRepository.scroll(RestRepository.java:363)
  at org.elasticsearch.hadoop.rest.ScrollQuery.hasNext(ScrollQuery.java:92)
  at org.elasticsearch.spark.rdd.AbstractEsRDDIterator.hasNext(AbstractEsRDDIterator.scala:43)
  at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1702)
  at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1134)
  at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1134)
  at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1916)
  at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1916)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
  at org.apache.spark.scheduler.Task.run(Task.scala:86)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
  at java.lang.Thread.run(Thread.java:745)

Version Info

OS: OS X Sierra
JVM: Java HotSpot(TM) 64-Bit Server VM (build 25.102-b14, mixed mode)
Hadoop/Spark: ?/2.0.1
ES-Hadoop: 5.0.0-rc1
ES: 2.4.1

jbaiera commented 8 years ago

Thanks for opening this. I have a fix ready to go into master, but I'm waiting until #877 is merged because it affects the same location in the source.

megri commented 8 years ago

Regarding issue #877, I think the fix presented there is a red herring. The problem isn't unencoded : or | characters; it is the same separator problem outlined in this issue.

The proper fix follows from the 5.x breaking changes to search preferences; see the sketch after the reference below.

edit: reference: https://www.elastic.co/guide/en/elasticsearch/reference/5.x/breaking_50_search_changes.html#_search_preferences
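
A minimal sketch of that direction, assuming the ';' to '|' separator change described in the notes above; shardPreference is a hypothetical helper, not the actual es-hadoop code:

// Hedged sketch: pick the preference separator based on the target
// cluster's major version, since ES 2.x expects ';' and ES 5.x expects '|'.
def shardPreference(shardId: Int, esMajorVersion: Int): String = {
  val sep = if (esMajorVersion >= 5) "|" else ";"
  s"_shards:$shardId${sep}_local"
}

// e.g. shardPreference(0, 2) == "_shards:0;_local"  (accepted by ES 2.4.1)
//      shardPreference(0, 5) == "_shards:0|_local"  (5.x syntax)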

I recommend not merging #877.

jbaiera commented 8 years ago

This should be fixed by this commit: https://github.com/elastic/elasticsearch-hadoop/commit/d442bb4479c2ae8478a10f25a7f0edb6b7256d87