actionml / harness

Harness is a Machine Learning/AI Server with plugins for many algorithms including the Universal Recommender
Apache License 2.0

Training fails - EsHadoopRemoteException: illegal_argument_exception: if _id is specified it must not be empty #315

Closed: julienmarie closed this issue 2 years ago

julienmarie commented 2 years ago

Here is my configuration:

{
    "engineId": "recom",
    "engineFactory": "com.actionml.engines.ur.UREngine",
    "sparkConf": {
        "spark.master": "local",
        "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
        "spark.kryo.registrator": "org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator",
        "spark.kryo.referenceTracking": "false",
        "spark.kryoserializer.buffer": "300m",
        "spark.executor.memory": "10g",
        "spark.driver.memory": "10g",
        "spark.es.index.auto.create": "true",
        "spark.es.nodes": "elasticsearch",
        "spark.es.nodes.wan.only": "true"
    },
    "algorithm":{
        "indicators": [
            {
                "name": "purchase"
            },{
                "name": "cart"
            },{
                "name": "view"
            },{
                "name": "searchpref"
            }
        ]
    }
}
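
For context, training for this engine would be launched through the Harness CLI; a minimal sketch, assuming the standard harness-cli tooling and the engine id from the config above:

harness-cli train recom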

And here are the logs when I launch a training run:


03:40:15.618 INFO  TaskSchedulerImpl - Removed TaskSet 235.0, whose tasks have all completed, from pool
03:40:15.619 INFO  TaskSchedulerImpl - Cancelling stage 235
03:40:15.619 INFO  DAGScheduler      - ResultStage 235 (runJob at EsSpark.scala:108) failed in 0.212 s due to Job aborted due to stage failure: Task 0 in stage 235.0 failed 1 times, most recent failure: Lost task 0.0 in stage 235.0 (TID 88, localhost, executor driver): org.elasticsearch.hadoop.EsHadoopException: Could not write all entries for bulk operation [1/1000]. Error sample (first [5] error messages):
    org.elasticsearch.hadoop.rest.EsHadoopRemoteException: illegal_argument_exception: if _id is specified it must not be empty
    {"index":{"_id":""}}
{"available":1.0,"id":""}

Bailing out...
    at org.elasticsearch.hadoop.rest.bulk.BulkProcessor.flush(BulkProcessor.java:519)
    at org.elasticsearch.hadoop.rest.bulk.BulkProcessor.add(BulkProcessor.java:127)
    at org.elasticsearch.hadoop.rest.RestRepository.doWriteToIndex(RestRepository.java:192)
    at org.elasticsearch.hadoop.rest.RestRepository.writeToIndex(RestRepository.java:172)
    at org.elasticsearch.spark.rdd.EsRDDWriter.write(EsRDDWriter.scala:74)
    at org.elasticsearch.spark.rdd.EsSpark$$anonfun$doSaveToEs$1.apply(EsSpark.scala:108)
    at org.elasticsearch.spark.rdd.EsSpark$$anonfun$doSaveToEs$1.apply(EsSpark.scala:108)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:109)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

Driver stacktrace:
03:40:15.619 INFO  DAGScheduler      - Job 48 failed: runJob at EsSpark.scala:108, took 0.213707 s
03:40:15.620 ERROR URAlgorithm       - Spark computation failed for engine recom with params {{"engineId":"recom","engineFactory":"com.actionml.engines.ur.UREngine","sparkConf":{"spark.master":"local","spark.serializer":"org.apache.spark.serializer.KryoSerializer","spark.kryo.registrator":"org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator","spark.kryo.referenceTracking":"false","spark.kryoserializer.buffer":"300m","spark.executor.memory":"10g","spark.driver.memory":"10g","spark.es.index.auto.create":"true","spark.es.nodes":"elasticsearch","spark.es.nodes.wan.only":"true"},"algorithm":{"indicators":[{"name":"purchase"},{"name":"cart"},{"name":"view"},{"name":"searchpref"}]}}}
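
Note that the error sample above shows a bulk index action whose _id is an empty string, built from a document whose own id field is empty ({"available":1.0,"id":""}); Elasticsearch rejects such actions at validation time. The same rejection can be reproduced directly with curl, a sketch in which the index name test is illustrative:

curl -s -H 'Content-Type: application/x-ndjson' \
  -X POST 'http://elasticsearch:9200/test/_bulk' \
  --data-binary $'{"index":{"_id":""}}\n{"available":1.0,"id":""}\n'
# Expect the same message in the response:
# illegal_argument_exception: if _id is specified it must not be empty
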
julienmarie commented 2 years ago

It's running Harness version 1.1-SNAPSHOT with docker-compose.

pferrel commented 2 years ago

Does the name “elasticsearch” resolve to your Elasticsearch node address? Does the integration test complete correctly? If so, compare the value of “spark.es.nodes” in the integration test config to yours. That config should contain the names or addresses of all Elasticsearch nodes in your cluster.

Training fails when trying to write the index to Elasticsearch.

What is your cluster config? What nodes exist, and how are they named? This config looks like a single machine running multiple services, since you are using Spark in “local” mode. How do you resolve the name “elasticsearch” to a URI?
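
In elasticsearch-hadoop, spark.es.nodes takes a comma-separated list of host names or addresses. For illustration, a three-node cluster might be configured like this (the host names are placeholders):

"sparkConf": {
    "spark.es.nodes": "es-node1,es-node2,es-node3",
    "spark.es.nodes.wan.only": "true"
}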

julienmarie commented 2 years ago

I'm running the docker-compose configuration. I can resolve the elasticsearch URI from inside the docker containers without any issue:

docker exec -ti harness /bin/bash
bash-4.4# curl http://elasticsearch:9200/
{
  "name" : "0f88f1801410",
  "cluster_name" : "docker-cluster",
  "cluster_uuid" : "NtUdzNoZQkKSrxnJZsWeqA",
  "version" : {
    "number" : "7.6.0",
    "build_flavor" : "default",
    "build_type" : "docker",
    "build_hash" : "7f634e9f44834fbc12724506cc1da681b0c3b1e3",
    "build_date" : "2020-02-06T00:09:00.449973Z",
    "build_snapshot" : false,
    "lucene_version" : "8.4.0",
    "minimum_wire_compatibility_version" : "6.8.0",
    "minimum_index_compatibility_version" : "6.0.0-beta1"
  },
  "tagline" : "You Know, for Search"
}
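
Since name resolution works, the empty _id most likely comes from the input data: an event with an empty entityId or targetEntityId would produce a model document with an empty id, matching the error sample above. A quick check, assuming the events were imported from a JSON-lines file (the file name events.json is a placeholder):

grep -c '"entityId" *: *""' events.json
grep -c '"targetEntityId" *: *""' events.json
# Any non-zero count points at the offending events.
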
julienmarie commented 2 years ago

Found a fix.