Closed julienmarie closed 2 years ago
It's running on Harness version 1.1-SNAPSHOT with Docker Compose.
Does the name “elasticsearch” resolve to your Elasticsearch node address? Does the integration test complete correctly? If so, compare the value of “spark.es.nodes” in the integration test config to yours. That config should contain the names or addresses of all Elasticsearch nodes in your cluster.
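For a multi-node cluster, “spark.es.nodes” accepts a comma-separated list of hosts. A minimal sketch (the hostnames below are placeholders, not values from this deployment):

```json
{
  "sparkConf": {
    "spark.es.nodes": "es-node-1,es-node-2,es-node-3",
    "spark.es.nodes.wan.only": "true"
  }
}
```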
Training fails when trying to write the index to Elasticsearch.
What is your cluster config, what nodes exist and how are they named? This config looks like a single machine running multiple services, since you are using Spark in “local” mode. How do you resolve the name “elasticsearch” to a URI?
On Jun 9, 2022, at 8:52 PM, Julien Marie @.***> wrote:
Here is my configuration
{ "engineId": "recom", "engineFactory": "com.actionml.engines.ur.UREngine", "sparkConf": { "spark.master": "local", "spark.serializer": "org.apache.spark.serializer.KryoSerializer", "spark.kryo.registrator": "org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator", "spark.kryo.referenceTracking": "false", "spark.kryoserializer.buffer": "300m", "spark.executor.memory": "10g", "spark.driver.memory": "10g", "spark.es.index.auto.create": "true", "spark.es.nodes": "elasticsearch", "spark.es.nodes.wan.only": "true" }, "algorithm":{ "indicators": [ { "name": "purchase" },{ "name": "cart" },{ "name": "view" },{ "name": "searchpref" } ] } }
And the logs when I launch a training:
```
03:40:15.618 INFO  TaskSchedulerImpl - Removed TaskSet 235.0, whose tasks have all completed, from pool
03:40:15.619 INFO  TaskSchedulerImpl - Cancelling stage 235
03:40:15.619 INFO  DAGScheduler - ResultStage 235 (runJob at EsSpark.scala:108) failed in 0.212 s due to Job aborted due to stage failure: Task 0 in stage 235.0 failed 1 times, most recent failure: Lost task 0.0 in stage 235.0 (TID 88, localhost, executor driver): org.elasticsearch.hadoop.EsHadoopException: Could not write all entries for bulk operation [1/1000]. Error sample (first [5] error messages):
	org.elasticsearch.hadoop.rest.EsHadoopRemoteException: illegal_argument_exception: if _id is specified it must not be empty
	{"index":{"_id":""}}
	{"available":1.0,"id":""}
Bailing out...
	at org.elasticsearch.hadoop.rest.bulk.BulkProcessor.flush(BulkProcessor.java:519)
	at org.elasticsearch.hadoop.rest.bulk.BulkProcessor.add(BulkProcessor.java:127)
	at org.elasticsearch.hadoop.rest.RestRepository.doWriteToIndex(RestRepository.java:192)
	at org.elasticsearch.hadoop.rest.RestRepository.writeToIndex(RestRepository.java:172)
	at org.elasticsearch.spark.rdd.EsRDDWriter.write(EsRDDWriter.scala:74)
	at org.elasticsearch.spark.rdd.EsSpark$$anonfun$doSaveToEs$1.apply(EsSpark.scala:108)
	at org.elasticsearch.spark.rdd.EsSpark$$anonfun$doSaveToEs$1.apply(EsSpark.scala:108)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:109)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

Driver stacktrace:
03:40:15.619 INFO  DAGScheduler - Job 48 failed: runJob at EsSpark.scala:108, took 0.213707 s
03:40:15.620 ERROR URAlgorithm - Spark computation failed for engine recom with params {{"engineId":"recom","engineFactory":"com.actionml.engines.ur.UREngine","sparkConf":{"spark.master":"local","spark.serializer":"org.apache.spark.serializer.KryoSerializer","spark.kryo.registrator":"org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator","spark.kryo.referenceTracking":"false","spark.kryoserializer.buffer":"300m","spark.executor.memory":"10g","spark.driver.memory":"10g","spark.es.index.auto.create":"true","spark.es.nodes":"elasticsearch","spark.es.nodes.wan.only":"true"},"algorithm":{"indicators":[{"name":"purchase"},{"name":"cart"},{"name":"view"},{"name":"searchpref"}]}}}
```
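The bulk rejection above is Elasticsearch refusing documents whose `_id` is an empty string (`{"index":{"_id":""}}`). A minimal sketch of the underlying problem and one common workaround, filtering out records with an empty id before they are bulk-indexed (this is an assumption for illustration, not necessarily the reporter's actual fix):

```scala
// Records destined for the index; the second one mirrors the failing
// document from the error sample, {"available":1.0,"id":""}.
val docs = Seq(
  Map("id" -> "item-1", "available" -> "1.0"),
  Map("id" -> "",       "available" -> "1.0") // empty _id -> illegal_argument_exception
)

// Drop any record whose "id" is missing or empty before indexing.
val indexable = docs.filter(_.get("id").exists(_.nonEmpty))

println(indexable.size) // only the record with a non-empty id survives
```

In a real job this filter would be applied to the RDD before `saveToEs` is called; the root cause is usually input events whose entity id is empty, so checking the imported event data is also worthwhile.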
I'm running the docker-compose configuration. I can resolve the Elasticsearch URI from inside the Docker containers without any issue:
```
docker exec -ti harness /bin/bash
bash-4.4# curl http://elasticsearch:9200/
{
  "name" : "0f88f1801410",
  "cluster_name" : "docker-cluster",
  "cluster_uuid" : "NtUdzNoZQkKSrxnJZsWeqA",
  "version" : {
    "number" : "7.6.0",
    "build_flavor" : "default",
    "build_type" : "docker",
    "build_hash" : "7f634e9f44834fbc12724506cc1da681b0c3b1e3",
    "build_date" : "2020-02-06T00:09:00.449973Z",
    "build_snapshot" : false,
    "lucene_version" : "8.4.0",
    "minimum_wire_compatibility_version" : "6.8.0",
    "minimum_index_compatibility_version" : "6.0.0-beta1"
  },
  "tagline" : "You Know, for Search"
}
```
Found a fix.