Intel-bigdata / HiBench

HiBench is a big data benchmark suite.

kmeans & target=spark/python: storage.MemoryStore: Not enough space #98

Closed — C-Newman closed this issue 9 years ago

C-Newman commented 9 years ago

Running kmeans with the spark/python target, I see many instances of lost executors, like this:

15/05/28 09:46:09 WARN TaskSetManager: Lost task 16.0 in stage 15.0 (TID 479, anders-41): ExecutorLostFailure (executor 11 lost)
15/05/28 09:46:09 WARN TaskSetManager: Lost task 22.0 in stage 15.0 (TID 482, anders-41): ExecutorLostFailure (executor 11 lost)
15/05/28 09:46:09 WARN TaskSetManager: Lost task 24.0 in stage 15.0 (TID 484, anders-41): ExecutorLostFailure (executor 11 lost)
...
15/05/28 09:46:26 INFO YarnClientSchedulerBackend: Registered executor: Actor[akka.tcp://sparkExecutor@anders-41:47915/user/Executor#707073176] with ID 15
15/05/28 09:46:26 INFO BlockManagerMasterActor: Registering block manager anders-41:41984 with 2.1 GB RAM, BlockManagerId(15, anders-41, 41984)

I caught one stderr file (/var/log/hadoop-yarn/container/application_1432738106848_0062/container_1432738106848_0062_01_000005/stderr) with this:

15/05/28 09:29:29 INFO storage.BlockManager: Found block rdd_5_17 locally
15/05/28 09:29:30 WARN storage.MemoryStore: Not enough space to cache rdd_15_17 in memory! (computed 14.6 MB so far)
15/05/28 09:29:30 INFO storage.MemoryStore: Memory use = 1334.2 MB (blocks) + 778.6 MB (scratch space shared across 7 thread(s)) = 2.1 GB. Storage limit = 2.1 GB.
15/05/28 09:29:31 INFO executor.Executor: Finished task 17.0 in stage 6.0 (TID 190). 2474 bytes result sent to driver

I changed both spark.executor.memory and spark.driver.memory to 4G in hadoop-hdfs/HiBench/HiBench-master/conf/99-user_defined_properties.conf, and spark-submit is now run with "--executor-memory 4G --driver-memory 4G".

In hadoop-hdfs/HiBench/HiBench-master/report/kmeans/spark/python/conf/../bench.log, however, I still see 2.1 GB being used:

15/05/28 11:10:24 INFO MemoryStore: MemoryStore started with capacity 2.1 GB

(I presume the 2.1 is because 2 GiB ≈ 2.1 GB (2.147483648).)
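
For reference, a rough reconstruction of where the 2.1 GB figure comes from: in Spark 1.x (legacy memory management) the MemoryStore capacity is roughly heap × spark.storage.memoryFraction (default 0.6) × spark.storage.safetyFraction (default 0.9). The ~3.8 GiB usable-heap figure assumed below for a 4G executor is an estimate, not a measurement:

# Sketch of Spark 1.x storage-memory sizing; all numbers are approximations.
GIB = 1024 ** 3

usable_heap = 3.8 * GIB      # assumed Runtime.maxMemory() for a -Xmx4g executor
memory_fraction = 0.6        # spark.storage.memoryFraction default
safety_fraction = 0.9        # spark.storage.safetyFraction default

capacity = usable_heap * memory_fraction * safety_fraction
print("capacity ~= %.1f GB" % (capacity / GIB))   # ~2.1 GB, matching the MemoryStore log
# Raising spark.storage.memoryFraction to 0.9 with the same heap gives ~3.1 GB.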

C-Newman commented 9 years ago

I made changes in these three files:

/etc/spark/conf/spark-env.sh: export SPARK_EXECUTOR_MEMORY=4G
/etc/spark/conf.dist/spark-env.sh: export SPARK_EXECUTOR_MEMORY=4G
/etc/spark/conf/spark-defaults.conf: spark.executor.memory=3.5G

I restarted Spark, re-ran build-all.sh and run-all.sh, and I still got 2.1G.

lvsoft commented 9 years ago

Please don't set configurations in the Spark conf directory; leave it blank or at the defaults. You can set spark.* properties in 99-user_defined_properties.conf, which will override the values in the Spark conf directory.

In your case, it seems you'll need to set spark.storage.memoryFraction to 1 in 99-user_defined_properties.conf to let the Spark context use all of your memory. By default spark.storage.memoryFraction is 0.6, and it looks like your server only has about 4 GB of memory: 4 GB × 0.6 = 2.4 GB, which is very close to your 2.1 GB result (once you account for memory occupied by the Linux kernel and other processes).
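
As a standalone PySpark illustration of the same settings (HiBench itself injects these spark.* properties through 99-user_defined_properties.conf and the spark.conf it generates, so this snippet is only an equivalent sketch, not how HiBench does it):

from pyspark import SparkConf, SparkContext

# Equivalent of putting these spark.* lines in 99-user_defined_properties.conf;
# 0.9 raises the storage fraction from the 0.6 default while leaving some headroom.
conf = (SparkConf()
        .setAppName("storage-fraction-demo")
        .set("spark.executor.memory", "4g")
        .set("spark.storage.memoryFraction", "0.9"))

sc = SparkContext(conf=conf)
print(conf.get("spark.storage.memoryFraction"))
sc.stop()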

C-Newman commented 9 years ago

Thanks for your quick reply.

These servers have 32 GiB of memory. How does spark decide I have only ~4 GiB of memory?

Your suggestion of increasing spark.storage.memoryFraction worked; I set it to 0.9 and they are now getting 3.1 GB of memory:

15/05/29 08:58:49 INFO MemoryStore: MemoryStore started with capacity 3.1 GB

I am no longer getting the "Not enough space to cache" warnings in the container stderr files.

I'm still losing executors, however, and it appears to be related to Akka. The output on the terminal from run-all.sh has:

15/05/29 09:24:55 ERROR YarnScheduler: Lost executor 11 on anders-44: remote Akka client disassociated
15/05/29 09:24:55 WARN TaskSetManager: Lost task 22.0 in stage 7.0 (TID 274, anders-44): ExecutorLostFailure (executor 11 lost)
15/05/29 09:24:55 WARN TaskSetManager: Lost task 25.0 in stage 7.0 (TID 277, anders-44): ExecutorLostFailure (executor 11 lost)
15/05/29 09:24:55 WARN TaskSetManager: Lost task 14.0 in stage 7.0 (TID 267, anders-44): ExecutorLostFailure (executor 11 lost)
15/05/29 09:24:55 WARN TaskSetManager: Lost task 18.0 in stage 7.0 (TID 270, anders-44): ExecutorLostFailure (executor 11 lost)
15/05/29 09:24:55 WARN TaskSetManager: Lost task 23.1 in stage 7.0 (TID 282, anders-44): ExecutorLostFailure (executor 11 lost)
15/05/29 09:24:55 WARN TaskSetManager: Lost task 21.0 in stage 7.0 (TID 273, anders-44): ExecutorLostFailure (executor 11 lost)
15/05/29 09:24:55 WARN TaskSetManager: Lost task 17.0 in stage 7.0 (TID 269, anders-44): ExecutorLostFailure (executor 11 lost)
15/05/29 09:24:55 WARN TaskSetManager: Lost task 26.0 in stage 7.0 (TID 278, anders-44): ExecutorLostFailure (executor 11 lost)

/var/lib/hadoop-hdfs/HiBench/HiBench-master/report/kmeans/spark/python/bench.log has:

15/05/29 09:24:55 ERROR YarnScheduler: Lost executor 11 on anders-44: remote Akka client disassociated
15/05/29 09:24:55 INFO TaskSetManager: Re-queueing tasks for 11 from TaskSet 7.0
15/05/29 09:24:55 WARN TaskSetManager: Lost task 22.0 in stage 7.0 (TID 274, anders-44): ExecutorLostFailure (executor 11 lost)
15/05/29 09:24:55 WARN TaskSetManager: Lost task 25.0 in stage 7.0 (TID 277, anders-44): ExecutorLostFailure (executor 11 lost)
15/05/29 09:24:55 WARN TaskSetManager: Lost task 14.0 in stage 7.0 (TID 267, anders-44): ExecutorLostFailure (executor 11 lost)
15/05/29 09:24:55 WARN TaskSetManager: Lost task 18.0 in stage 7.0 (TID 270, anders-44): ExecutorLostFailure (executor 11 lost)
15/05/29 09:24:55 WARN TaskSetManager: Lost task 23.1 in stage 7.0 (TID 282, anders-44): ExecutorLostFailure (executor 11 lost)
15/05/29 09:24:55 WARN TaskSetManager: Lost task 21.0 in stage 7.0 (TID 273, anders-44): ExecutorLostFailure (executor 11 lost)
15/05/29 09:24:55 WARN TaskSetManager: Lost task 17.0 in stage 7.0 (TID 269, anders-44): ExecutorLostFailure (executor 11 lost)
15/05/29 09:24:55 WARN TaskSetManager: Lost task 26.0 in stage 7.0 (TID 278, anders-44): ExecutorLostFailure (executor 11 lost)
15/05/29 09:24:55 INFO DAGScheduler: Executor lost: 11 (epoch 10)
15/05/29 09:24:55 INFO BlockManagerMasterActor: Trying to remove executor 11 from BlockManagerMaster.
15/05/29 09:24:55 INFO BlockManagerMasterActor: Removing block manager BlockManagerId(11, anders-44, 51603)
15/05/29 09:24:55 INFO TaskSetManager: Starting task 26.1 in stage 7.0 (TID 290, d, NODE_LOCAL, 1950 bytes)
15/05/29 09:24:55 INFO BlockManagerMaster: Removed 11 successfully in removeExecutor
15/05/29 09:25:02 INFO YarnClientSchedulerBackend: Registered executor: Actor[akka.tcp://sparkExecutor@anders-44:47219/user/Executor#1407812609] with ID 13

lvsoft commented 9 years ago

Oh, I see. However, I don't know where the 4G memory figure comes from. Spark's default is 512 MB per executor, and HiBench sets that value to 8 GB by default.

Could you please attach your report/<workload>/spark/python/conf/sparkbench/sparkbench.conf and report/<workload>/spark/python/monitor.html for reference?

And the Akka problem is also caused by the lost executor, which may still be an OOM issue. Can you go to the executor node and look for <YARN_HOME>/logs/userLogs/<your container>/<executor>/stderr, which should contain the last logs written before the executor was lost? Also check dmesg for Linux kernel messages to make sure the executor was not killed by the kernel due to OOM.
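
As a rough sketch of that check (the dmesg patterns and the container log path are assumptions; the actual location depends on your yarn.nodemanager.log-dirs setting):

import glob
import re
import subprocess

# Look for kernel OOM-killer activity that would explain a lost executor.
dmesg = subprocess.check_output(["dmesg"], universal_newlines=True)
for line in dmesg.splitlines():
    if re.search(r"out of memory|oom-killer|killed process", line, re.IGNORECASE):
        print("kernel:", line)

# Tail of each YARN container's stderr, where the executor's last log lines land.
# Assumed log root; adjust to where your NodeManager writes container logs.
for path in glob.glob("/var/log/hadoop-yarn/container/application_*/container_*/stderr"):
    with open(path, errors="replace") as f:
        last_lines = f.readlines()[-5:]
    print(path)
    print("".join(last_lines), end="")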

C-Newman commented 9 years ago

Attached is a .tar.gz file with the files you requested.

There was nothing interesting in dmesg, but I did find interesting information in /var/log/messages:

...
May 28 22:01:51 anders-41 abrt: detected unhandled Python exception
May 29 08:58:44 anders-41 abrt: detected unhandled Python exception
May 29 10:00:38 anders-41 abrt: detected unhandled Python exception in '/var/lib/hadoop-hdfs/HiBench/HiBench-master/src/sparkbench/src/main/python/kmeans.py'
May 29 10:00:38 anders-41 abrtd: New client connected
May 29 10:00:38 anders-41 abrtd: Directory 'pyhook-2015-05-29-10:00:38-3433' creation detected
May 29 10:00:38 anders-41 abrt-server[13789]: Saved Python crash dump of pid 3433 to /var/spool/abrt/pyhook-2015-05-29-10:00:38-3433
May 29 10:00:38 anders-41 abrtd: Executable '/var/lib/hadoop-hdfs/HiBench/HiBench-master/src/sparkbench/src/main/python/kmeans.py' doesn't belong to any package and ProcessUnpackaged is set to 'no'
May 29 10:00:38 anders-41 abrtd: 'post-create' on '/var/spool/abrt/pyhook-2015-05-29-10:00:38-3433' exited with 1
May 29 10:00:38 anders-41 abrtd: Deleting problem directory '/var/spool/abrt/pyhook-2015-05-29-10:00:38-3433'
May 29 10:00:41 anders-41 abrt: detected unhandled Python exception
May 29 10:20:33 anders-41 abrt: detected unhandled Python exception
...

C-Newman commented 9 years ago

Should I be concerned about the warnings shown in the output below? I see them for most, if not all, of the HiBench jobs.

pagerank with spark/python suffers a similar problem with lost executors, but finally fails with a different error:

start HadoopPreparePagerank bench
hdfs rm -r: /usr/lib/hadoop/bin/hadoop --config /usr/lib/hadoop/etc/hadoop fs -rm -r -skipTrash /usr/lib/hadoop-hdfs/HiBench/Pagerank/Input
Deleted /usr/lib/hadoop-hdfs/HiBench/Pagerank/Input
Submit MapReduce Job: /usr/lib/hadoop/bin/hadoop --config /usr/lib/hadoop/etc/hadoop jar /var/lib/hadoop-hdfs/HiBench/HiBench-master/src/autogen/target/autogen-4.0-SNAPSHOT-jar-with-dependencies.jar HiBench.DataGen -t pagerank -b /usr/lib/hadoop-hdfs/HiBench/Pagerank -n Input -m 6 -r 2 -p 500000 -pbalance -pbalance -o text
15/05/29 16:25:44 WARN mapreduce.JobSubmitter: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
15/05/29 16:25:59 WARN mapreduce.JobSubmitter: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
15/05/29 16:26:13 INFO HiBench.PagerankData: Closing pagerank data generator...
finish HadoopPreparePagerank bench
Run pagerank/spark/python
Exec script: /var/lib/hadoop-hdfs/HiBench/HiBench-master/workloads/pagerank/spark/python/bin/run.sh
Parsing conf: /var/lib/hadoop-hdfs/HiBench/HiBench-master/conf/00-default-properties.conf
Parsing conf: /var/lib/hadoop-hdfs/HiBench/HiBench-master/conf/10-data-scale-profile.conf
Parsing conf: /var/lib/hadoop-hdfs/HiBench/HiBench-master/conf/99-user_defined_properties.conf
Parsing conf: /var/lib/hadoop-hdfs/HiBench/HiBench-master/workloads/pagerank/conf/00-pagerank-default.conf
Parsing conf: /var/lib/hadoop-hdfs/HiBench/HiBench-master/workloads/pagerank/conf/10-pagerank-userdefine.conf
Parsing conf: /var/lib/hadoop-hdfs/HiBench/HiBench-master/workloads/pagerank/conf/properties.conf
Parsing conf: /var/lib/hadoop-hdfs/HiBench/HiBench-master/workloads/pagerank/spark/python/python.conf
start PythonSparkPagerank bench
hdfs rm -r: /usr/lib/hadoop/bin/hadoop --config /usr/lib/hadoop/etc/hadoop fs -rm -r -skipTrash /usr/lib/hadoop-hdfs/HiBench/Pagerank/Output
Deleted /usr/lib/hadoop-hdfs/HiBench/Pagerank/Output
hdfs du -s: /usr/lib/hadoop/bin/hadoop --config /usr/lib/hadoop/etc/hadoop fs -du -s /usr/lib/hadoop-hdfs/HiBench/Pagerank/Input
Export env: SPARKBENCH_PROPERTIES_FILES=/var/lib/hadoop-hdfs/HiBench/HiBench-master/report/pagerank/spark/python/conf/sparkbench/sparkbench.conf
Submit Spark job: /usr/lib/spark/bin/spark-submit --jars /var/lib/hadoop-hdfs/HiBench/HiBench-master/src/sparkbench/target/sparkbench-4.0-SNAPSHOT-MR2-spark1.3-jar-with-dependencies.jar --properties-file /var/lib/hadoop-hdfs/HiBench/HiBench-master/report/pagerank/spark/python/conf/sparkbench/spark.conf --master yarn-client --num-executors 3 --executor-cores 8 --executor-memory 4G --driver-memory 4G /var/lib/hadoop-hdfs/HiBench/HiBench-master/src/sparkbench/src/main/python/pagerank.py /usr/lib/hadoop-hdfs/HiBench/Pagerank/Input/edges /usr/lib/hadoop-hdfs/HiBench/Pagerank/Output 3
15/05/29 16:29:10 ERROR YarnScheduler: Lost executor 1 on anders-44: remote Akka client disassociated
15/05/29 16:29:10 WARN TaskSetManager: Lost task 8.0 in stage 2.0 (TID 20, anders-44): ExecutorLostFailure (executor 1 lost)
15/05/29 16:29:10 WARN TaskSetManager: Lost task 2.0 in stage 2.0 (TID 14, anders-44): ExecutorLostFailure (executor 1 lost)
15/05/29 16:29:10 WARN TaskSetManager: Lost task 10.0 in stage 2.0 (TID 22, anders-44): ExecutorLostFailure (executor 1 lost)
15/05/29 16:29:10 WARN TaskSetManager: Lost task 4.0 in stage 2.0 (TID 16, anders-44): ExecutorLostFailure (executor 1 lost)
15/05/29 16:29:10 WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 12, anders-44): ExecutorLostFailure (executor 1 lost)
15/05/29 16:29:12 ERROR YarnScheduler: Lost executor 2 on anders-41: remote Akka client disassociated
15/05/29 16:29:12 WARN TaskSetManager: Lost task 5.0 in stage 2.0 (TID 17, anders-41): ExecutorLostFailure (executor 2 lost)
15/05/29 16:29:12 WARN TaskSetManager: Lost task 1.0 in stage 2.0 (TID 13, anders-41): ExecutorLostFailure (executor 2 lost)
15/05/29 16:29:12 WARN TaskSetManager: Lost task 4.1 in stage 2.0 (TID 25, anders-41): ExecutorLostFailure (executor 2 lost)
15/05/29 16:29:12 WARN TaskSetManager: Lost task 3.0 in stage 2.0 (TID 15, anders-41): ExecutorLostFailure (executor 2 lost)
15/05/29 16:29:12 WARN TaskSetManager: Lost task 0.1 in stage 2.0 (TID 24, anders-41): ExecutorLostFailure (executor 2 lost)
15/05/29 16:29:12 WARN TaskSetManager: Lost task 0.0 in stage 0.1 (TID 26, anders-41): ExecutorLostFailure (executor 2 lost)
15/05/29 16:29:12 WARN TaskSetManager: Lost task 2.0 in stage 0.1 (TID 28, anders-41): ExecutorLostFailure (executor 2 lost)
15/05/29 16:29:12 WARN TaskSetManager: Lost task 1.0 in stage 0.1 (TID 27, anders-41): ExecutorLostFailure (executor 2 lost)
15/05/29 16:29:14 WARN TaskSetManager: Lost task 4.2 in stage 2.0 (TID 34, anders-44): FetchFailed(null, shuffleId=4, mapId=-1, reduceId=4, message=
org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 4
    at org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:205)

Here's the tail of the log file:

15/05/29 16:29:14 INFO BlockManagerMasterActor: Registering block manager anders-44:32963 with 3.1 GB RAM, BlockManagerId(3, anders-44, 32963)
15/05/29 16:29:14 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on anders-44:32963 (size: 11.6 KB, free: 3.1 GB)
15/05/29 16:29:14 INFO BlockManagerInfo: Added broadcast_4_piece0 in memory on anders-44:32963 (size: 8.1 KB, free: 3.1 GB)
15/05/29 16:29:14 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on anders-44:32963 (size: 64.2 KB, free: 3.1 GB)
15/05/29 16:29:14 INFO MapOutputTrackerMasterActor: Asked to send map output locations for shuffle 4 to sparkExecutor@anders-44:42351
15/05/29 16:29:14 INFO MapOutputTrackerMaster: Size of output statuses for shuffle 4 is 86 bytes
15/05/29 16:29:14 INFO TaskSetManager: Starting task 9.1 in stage 2.0 (TID 37, anders-44, PROCESS_LOCAL, 1255 bytes)
15/05/29 16:29:14 WARN TaskSetManager: Lost task 4.2 in stage 2.0 (TID 34, anders-44): FetchFailed(null, shuffleId=4, mapId=-1, reduceId=4, message=
org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 4
    at org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$1.apply(MapOutputTracker.scala:385)
    at org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$1.apply(MapOutputTracker.scala:382)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
    at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
    at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
    at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
    at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
    at org.apache.spark.MapOutputTracker$.org$apache$spark$MapOutputTracker$$convertMapStatuses(MapOutputTracker.scala:381)
    at org.apache.spark.MapOutputTracker.getServerStatuses(MapOutputTracker.scala:177)
    at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.fetch(BlockStoreShuffleFetcher.scala:42)
    at org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:40)
    at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:92)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
    at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:243)
    at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1617)
    at org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:205)

join with spark/python fails differently:

start PythonSparkJoin bench
hdfs du -s: /usr/lib/hadoop/bin/hadoop --config /usr/lib/hadoop/etc/hadoop fs -du -s /usr/lib/hadoop-hdfs/HiBench/Join/Input
hdfs rm -r: /usr/lib/hadoop/bin/hadoop --config /usr/lib/hadoop/etc/hadoop fs -rm -r -skipTrash /usr/lib/hadoop-hdfs/HiBench/Join/Output
Deleted /usr/lib/hadoop-hdfs/HiBench/Join/Output
Export env: SPARKBENCH_PROPERTIES_FILES=/var/lib/hadoop-hdfs/HiBench/HiBench-master/report/join/spark/python/conf/sparkbench/sparkbench.conf
Submit Spark job: /usr/lib/spark/bin/spark-submit --jars /var/lib/hadoop-hdfs/HiBench/HiBench-master/src/sparkbench/target/sparkbench-4.0-SNAPSHOT-MR2-spark1.3-jar-with-dependencies.jar --properties-file /var/lib/hadoop-hdfs/HiBench/HiBench-master/report/join/spark/python/conf/sparkbench/spark.conf --master yarn-client --num-executors 3 --executor-cores 8 --executor-memory 4G --driver-memory 4G /var/lib/hadoop-hdfs/HiBench/HiBench-master/src/sparkbench/src/main/python/python_spark_sql_bench.py PythonJoin /var/lib/hadoop-hdfs/HiBench/HiBench-master/report/join/spark/python/conf/../rankings_uservisits_join.hive
15/05/29 16:25:19 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.1.0
15/05/29 16:25:19 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
15/05/29 16:25:30 ERROR KeyProviderCache: Could not find uri with key [dfs.encryption.key.provider.uri] to create a keyProvider !!
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
finish PythonSparkJoin bench

lvsoft commented 9 years ago

Sorry, could you upload the files somewhere and place a URL here? I can't see your attachments.

Besides, according to your /var/log/messages, the "abrt: detected unhandled Python exception" entries are generated by ABRT (the Automatic Bug Reporting Tool), which hooks into Python by overriding sys.excepthook.

I'm not very familiar with ABRT, but it seems it intercepted Python's exception handling and failed to produce a useful report. My suggestion is to uninstall ABRT; then the Python exception should be printed to the console or logs, which should give a clue for tracing.
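
For background on the mechanism described above: a crash reporter like ABRT installs its own sys.excepthook, so unhandled exceptions go to its reporter instead of (or before) the normal traceback. A minimal sketch of how such a hook works, not ABRT's actual code:

import sys

def reporting_hook(exc_type, exc_value, exc_tb):
    # Stand-in for what a crash reporter installs: it gets first crack
    # at the unhandled exception before (or instead of) the default handler.
    print("crash reporter captured:", exc_type.__name__, exc_value, file=sys.stderr)
    sys.__excepthook__(exc_type, exc_value, exc_tb)   # delegate so the traceback still prints

sys.excepthook = reporting_hook          # what the hook package does when loaded
# sys.excepthook = sys.__excepthook__    # restoring the default, so tracebacks print normally

raise RuntimeError("unhandled, goes through the hook")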

C-Newman commented 9 years ago

Qi Lv --

I was able to determine that I was encountering an OOM problem.

I changed hibench.scale.profile to tiny, and now run-all.sh runs to completion.

However, it looks like there is still a problem.

I get this from aggregation with spark/scala:

start ScalaSparkAggregation bench

hdfs rm -r: /usr/lib/hadoop/bin/hadoop --config /usr/lib/hadoop/etc/hadoop fs -rm -r -skipTrash /usr/lib/hadoop-hdfs/HiBench/Aggregation/Output

Deleted /usr/lib/hadoop-hdfs/HiBench/Aggregation/Output

Export env: SPARKBENCH_PROPERTIES_FILES=/var/lib/hadoop-hdfs/HiBench/HiBench-master/report/aggregation/spark/scala/conf/sparkbench/sparkbench.conf

Submit Spark job: /usr/lib/spark/bin/spark-submit --properties-file /var/lib/hadoop-hdfs/HiBench/HiBench-master/report/aggregation/spark/scala/conf/sparkbench/spark.conf --class com.intel.sparkbench.sql.ScalaSparkSQLBench --master yarn-client --num-executors 3 --executor-cores 8 --executor-memory 4G --driver-memory 4G /var/lib/hadoop-hdfs/HiBench/HiBench-master/src/sparkbench/target/sparkbench-4.0-SNAPSHOT-MR2-spark1.3-jar-with-dependencies.jar ScalaAggregation /var/lib/hadoop-hdfs/HiBench/HiBench-master/report/aggregation/spark/scala/conf/../uservisits_aggre.hive

15/06/05 16:17:47 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.1.0

15/06/05 16:17:48 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException

15/06/05 16:17:54 ERROR KeyProviderCache: Could not find uri with key [dfs.encryption.key.provider.uri] to create a keyProvider !!

SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]

hdfs du -s: /usr/lib/hadoop/bin/hadoop --config /usr/lib/hadoop/etc/hadoop fs -du -s /usr/lib/hadoop-hdfs/HiBench/Aggregation/Output

finish ScalaSparkAggregation bench

I saw these three commands in the output, so I ran them separately:

$ /var/lib/hadoop-hdfs/HiBench/HiBench-master/workloads/aggregation/prepare/prepare.sh
$ /usr/lib/hadoop/bin/hadoop --config /usr/lib/hadoop/etc/hadoop fs -rm -r -skipTrash /usr/lib/hadoop-hdfs/HiBench/Aggregation/Output
$ SPARKBENCH_PROPERTIES_FILES=/var/lib/hadoop-hdfs/HiBench/HiBench-master/report/aggregation/spark/scala/conf/sparkbench/sparkbench.conf /usr/lib/spark/bin/spark-submit --properties-file /var/lib/hadoop-hdfs/HiBench/HiBench-master/report/aggregation/spark/scala/conf/sparkbench/spark.conf --class com.intel.sparkbench.sql.ScalaSparkSQLBench --master yarn-client --num-executors 3 --executor-cores 8 --executor-memory 4G --driver-memory 4G /var/lib/hadoop-hdfs/HiBench/HiBench-master/src/sparkbench/target/sparkbench-4.0-SNAPSHOT-MR2-spark1.3-jar-with-dependencies.jar ScalaAggregation /var/lib/hadoop-hdfs/HiBench/HiBench-master/report/aggregation/spark/scala/conf/../uservisits_aggre.hive

It seemed to progress well until this point, where it complained that the uservisits table already exists:

15/06/08 10:38:20 INFO HiveMetaStore: 0: create_table: Table(tableName:uservisits, dbName:DEFAULT, owner:hdfs, createTime:1433774299, lastAccessTime:0, retention:0, sd:StorageDescriptor(cols:[FieldSchema(name:sourceip, type:string, comment:null), FieldSchema(name:desturl, type:string, comment:null), FieldSchema(name:visitdate, type:string, comment:null), FieldSchema(name:adrevenue, type:double, comment:null), FieldSchema(name:useragent, type:string, comment:null), FieldSchema(name:countrycode, type:string, comment:null), FieldSchema(name:languagecode, type:string, comment:null), FieldSchema(name:searchword, type:string, comment:null), FieldSchema(name:duration, type:int, comment:null)], location:/usr/lib/hadoop-hdfs/HiBench/Aggregation/Input/uservisits, inputFormat:org.apache.hadoop.mapred.SequenceFileInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, parameters:{field.delim=,, serialization.format=,}), bucketCols:[], sortCols:[], parameters:{}, skewedInfo:SkewedInfo(skewedColNames:[], skewedColValues:[], skewedColValueLocationMaps:{}), storedAsSubDirectories:false), partitionKeys:[], parameters:{EXTERNAL=TRUE}, viewOriginalText:null, viewExpandedText:null, tableType:EXTERNAL_TABLE, privileges:PrincipalPrivilegeSet(userPrivileges:{}, groupPrivileges:null, rolePrivileges:null), temporary:false)

15/06/08 10:38:20 INFO audit: ugi=hdfs ip=unknown-ip-addr cmd=create_table: Table(tableName:uservisits, dbName:DEFAULT, owner:hdfs, createTime:1433774299, lastAccessTime:0, retention:0, sd:StorageDescriptor(cols:[FieldSchema(name:sourceip, type:string, comment:null), FieldSchema(name:desturl, type:string, comment:null), FieldSchema(name:visitdate, type:string, comment:null), FieldSchema(name:adrevenue, type:double, comment:null), FieldSchema(name:useragent, type:string, comment:null), FieldSchema(name:countrycode, type:string, comment:null), FieldSchema(name:languagecode, type:string, comment:null), FieldSchema(name:searchword, type:string, comment:null), FieldSchema(name:duration, type:int, comment:null)], location:/usr/lib/hadoop-hdfs/HiBench/Aggregation/Input/uservisits, inputFormat:org.apache.hadoop.mapred.SequenceFileInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, parameters:{field.delim=,, serialization.format=,}), bucketCols:[], sortCols:[], parameters:{}, skewedInfo:SkewedInfo(skewedColNames:[], skewedColValues:[], skewedColValueLocationMaps:{}), storedAsSubDirectories:false), partitionKeys:[], parameters:{EXTERNAL=TRUE}, viewOriginalText:null, viewExpandedText:null, tableType:EXTERNAL_TABLE, privileges:PrincipalPrivilegeSet(userPrivileges:{}, groupPrivileges:null, rolePrivileges:null), temporary:false)

15/06/08 10:38:20 ERROR RetryingHMSHandler: AlreadyExistsException(message:Table uservisits already exists)

    at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.create_table_core(HiveMetaStore.java:1349)

    at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.create_table_with_environment_context(HiveMetaStore.java:1426)

The trace continues for quite a while, ending with this:

15/06/08 10:38:20 ERROR Driver: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. AlreadyExistsException(message:Table uservisits already exists)

15/06/08 10:38:20 INFO PerfLogger: </PERFLOG method=Driver.execute start=1433774299954 end=1433774300289 duration=335 from=org.apache.hadoop.hive.ql.Driver>

15/06/08 10:38:20 INFO PerfLogger:

15/06/08 10:38:20 INFO PerfLogger: </PERFLOG method=releaseLocks start=1433774300289 end=1433774300289 duration=0 from=org.apache.hadoop.hive.ql.Driver>

15/06/08 10:38:20 ERROR HiveContext:

HIVE FAILURE OUTPUT

SET spark.sql.shuffle.partitions=6

OK

SET hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat

SET mapreduce.job.maps=6

SET mapreduce.job.reduces=2

SET hive.stats.autogather=false

FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. AlreadyExistsException(message:Table uservisits already exists)

END HIVE FAILURE OUTPUT

Exception in thread "main" org.apache.spark.sql.execution.QueryExecutionException: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. AlreadyExistsException(message:Table uservisits already exists)

    at org.apache.spark.sql.hive.HiveContext.runHive(HiveContext.scala:312)

mckellyln commented 9 years ago

Is that you, C-Newman?

lvsoft commented 9 years ago

@C-Newman Yes, it will complain with an AlreadyExistsException if you execute the command line manually right after running the workload via HiBench's script. You'll need to remove the metastore_db folder in your current working directory before trying the command manually; it is created by the previous SparkSQL-related operation and is normally cleaned up by HiBench's workload script.
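
As a small sketch of that cleanup step (metastore_db is the embedded Derby metastore that SparkSQL/Hive leaves in the directory spark-submit was launched from; the exact location is an assumption about your launch directory):

import os
import shutil

# Leftover embedded Derby metastore from a previous manual SparkSQL run;
# HiBench's own workload scripts clean this up before each run.
metastore_dir = os.path.join(os.getcwd(), "metastore_db")

if os.path.isdir(metastore_dir):
    shutil.rmtree(metastore_dir)
    print("removed", metastore_dir)
else:
    print("no leftover metastore_db under", os.getcwd())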

C-Newman commented 9 years ago

mckellyln, yes it's me. I sent a reply to the email address I last had from you, then checked its domain name and it appears to have lapsed.

mckellyln commented 9 years ago

@C-Newman, so good to hear from you! I will send you an email offline.