PredictionIO / template-scala-parallel-universal-recommendation

PredictionIO Template for Universal Recommender

Training fails with [ERROR] [TaskSchedulerImpl] Lost executor driver on localhost: Executor heartbeat timed out after 131705 ms #54

Open unoexperto opened 7 years ago

unoexperto commented 7 years ago

Hey guys!

It's been almost two weeks since I started trying to use the recommender :) My configuration is:

1) HBase 1.2.2 or Postgresql 9.5.4;
2) PIO 0.9.7 and Recommender 0.4.2 with Apache Mahout from master branch;
3) ElasticSearch 1.7.5;
4) spark-1.6.2-bin-hadoop2.6;
5) 212M events in the database;
6) Laptop with Ubuntu and 64G of RAM.

Up until now I thought the problem was either in HBase or the Recommender, so I tried training with version 0.3.0, which failed too. Then I thought the problem was in HBase, so today I switched to Postgresql and saw the same vague errors in the log when executing:

pio train -- --driver-memory 12G --executor-memory 12G -- --driver-class-path /home/expert/.IntelliJIdea2016.3/config/jdbc-drivers/postgresql-9.4-1201.jdbc4.jar

[WARN] [Utils] Your hostname, expert-x220 resolves to a loopback address: 127.0.1.1; using 192.168.1.2 instead (on interface wlp3s0)
[WARN] [Utils] Set SPARK_LOCAL_IP if you need to bind to another address
[INFO] [Remoting] Starting remoting
[INFO] [Remoting] Remoting started; listening on addresses :[akka.tcp://sparkDriverActorSystem@192.168.1.2:33799]
[INFO] [DataSource] 
╔════════════════════════════════════════════════════════════╗
║ Init DataSource                                            ║
║ ══════════════════════════════════════════════════════════ ║
║ App name                      case                         ║
║ Event window                  None                         ║
║ Event names                   List(get_article, view_article, like_article, save_article, share_article, category_preference) ║
╚════════════════════════════════════════════════════════════╝

[INFO] [URAlgorithm] 
╔════════════════════════════════════════════════════════════╗
║ Init URAlgorithm                                           ║
║ ══════════════════════════════════════════════════════════ ║
║ App name                      case                         ║
║ ES index name                 urindex                      ║
║ ES type name                  items                        ║
║ RecsModel                     all                          ║
║ Event names                   List(get_article, view_article, like_article, save_article, share_article, category_preference) ║
║ ══════════════════════════════════════════════════════════ ║
║ Random seed                   2081276143                   ║
║ MaxCorrelatorsPerEventType    50                           ║
║ MaxEventsPerEventType         500                          ║
║ ══════════════════════════════════════════════════════════ ║
║ User bias                     1.0                          ║
║ Item bias                     1.0                          ║
║ Max query events              100                          ║
║ Limit                         20                           ║
║ ══════════════════════════════════════════════════════════ ║
║ Rankings:                                                  ║
║ popular                       Some(popRank)                ║
╚════════════════════════════════════════════════════════════╝

[INFO] [Engine$] EngineWorkflow.train
[INFO] [Engine$] DataSource: org.template.DataSource@29170a47
[INFO] [Engine$] Preparator: org.template.Preparator@13ef7fa1
[INFO] [Engine$] AlgorithmList: List(org.template.URAlgorithm@4e8598d9)
[INFO] [Engine$] Data sanity check is on.
[Stage 0:============================================>              (3 + 1) / 4][WARN] [NettyRpcEnv] Ignored message: HeartbeatResponse(false)
[WARN] [NettyRpcEndpointRef] Error sending message [message = Heartbeat(driver,[Lscala.Tuple2;@3f84e497,BlockManagerId(driver, localhost, 38773))] in 1 attempts
[Stage 0:============================================>              (3 + 1) / 4][WARN] [transport] [Gargouille] Transport response handler not found of id [72]
[WARN] [HeartbeatReceiver] Removing executor driver with no recent heartbeats: 131705 ms exceeds timeout 120000 ms
[Stage 0:============================================>              (3 + 1) / 4][WARN] [NettyRpcEndpointRef] Error sending message [message = Heartbeat(driver,[Lscala.Tuple2;@3f84e497,BlockManagerId(driver, localhost, 38773))] in 2 attempts
[ERROR] [TaskSchedulerImpl] Lost executor driver on localhost: Executor heartbeat timed out after 131705 ms
[Stage 0:============================================>              (3 + 1) / 4][WARN] [NettyRpcEndpointRef] Error sending message [message = Heartbeat(driver,[Lscala.Tuple2;@3f84e497,BlockManagerId(driver, localhost, 38773))] in 3 attempts

My engine.json looks like this:

{
  "comment": " This config file uses default settings for all but the required values see README.md for docs",
  "id": "default",
  "description": "Default settings",
  "engineFactory": "org.template.RecommendationEngine",
  "datasource": {
    "params": {
      "name": "views",
      "appName": "case",
      "eventNames": [
        "get_article",
        "view_article",
        "like_article",
        "save_article",
        "share_article",
        "category_preference"
      ]
    }
  },
  "sparkConf": {
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
    "spark.kryo.registrator": "org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator",
    "spark.kryo.referenceTracking": "false",
    "spark.kryoserializer.buffer": "300m",
    "spark.executor.memory": "12g",
    "es.index.auto.create": "true"
  },
  "algorithms": [
    {
      "comment": "simplest setup where all values are default, popularity based backfill, must add eventsNames",
      "name": "ur",
      "params": {
        "appName": "case",
        "indexName": "urindex",
        "typeName": "items",
        "comment": "must have data for the first event or the model will not build, other events are optional",
        "eventNames": [
          "get_article",
          "view_article",
          "like_article",
          "save_article",
          "share_article",
          "category_preference"
        ]
      }
    }
  ]
}

What am I missing? How do I make it work?

pferrel commented 7 years ago

24G of RAM allocated to Spark may be too small for data of this size. That is often why you get heartbeat failures: responses get too slow when there is a lot of memory-to-disk swapping. When you export the data, how large is the entire exported JSON?
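If you have not exported yet, PredictionIO's export tool should give you that dump. Something like the following should work, where the app id and output path are placeholders (check pio export --help for the exact flags in your version):

pio export --appid <your-app-id> --output /tmp/case-events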

Try increasing the executor and driver memory together. Since Spark is only used during training, it is not ideal to reserve this much on your single machine, because most of the time it will be accepting input or responding to queries. So in a production environment you would have separate driver and executor machines: create them for training and destroy or stop them so you don't pay for them when they are not needed. Since you are experimenting on your laptop, you can also reduce the amount of data until it fits within your maximum executor and driver memory.
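For example, starting from your original command, something like this (the 24G figures are only a sketch to tune against your 64G laptop, not recommended values):

pio train -- --driver-memory 24G --executor-memory 24G -- --driver-class-path /home/expert/.IntelliJIdea2016.3/config/jdbc-drivers/postgresql-9.4-1201.jdbc4.jar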

You should also remove the executor memory setting from engine.json, since it will (IIRC) override what you put on the CLI, and anyway we are deprecating sparkConf in engine.json.
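With that line dropped, the sparkConf from your engine.json above would look roughly like this (same values as you posted, just without spark.executor.memory):

  "sparkConf": {
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
    "spark.kryo.registrator": "org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator",
    "spark.kryo.referenceTracking": "false",
    "spark.kryoserializer.buffer": "300m",
    "es.index.auto.create": "true"
  }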

BTW, please post to the Google group CCed above.


unoexperto commented 7 years ago

@pferrel Looks like GitHub cut off the address you specified in the CC field.

The JSON with the source events is ~29.7GB. Would you like me to upload it somewhere? I don't mind training taking more time, but I don't want it to crash. What configuration options would you recommend I change?

pferrel commented 7 years ago

https://groups.google.com/forum/#!forum/actionml-user
