apache / mxnet

Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Scala, Go, Javascript and more
https://mxnet.apache.org
Apache License 2.0

Spark Test Failure: IllegalArgumentException: requirement failed: Failed to start ps scheduler #11249

Open marcoabreu opened 6 years ago

marcoabreu commented 6 years ago

http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-11246/1/pipeline/

Exception in thread "Thread-113" java.lang.IllegalArgumentException: requirement failed: Failed to start ps scheduler process with exit code 134

18/06/12 17:16:53 INFO Utils: /work/mxnet/scala-package/assembly/linux-x86_64-cpu/target/mxnet-full_2.11-linux-x86_64-cpu-1.3.0-SNAPSHOT.jar has been previously copied to /tmp/spark-202cac2e-62d9-4ac4-a99a-27bca2aaf87f/userFiles-2f1680a1-aea4-4de0-8784-3300ad265be7/mxnet-full_2.11-linux-x86_64-cpu-1.3.0-SNAPSHOT.jar

18/06/12 17:16:53 INFO Executor: Fetching file:/work/mxnet/scala-package/spark/target/mxnet-spark_2.11-1.3.0-SNAPSHOT.jar with timestamp 1528823813012

18/06/12 17:16:53 INFO Utils: /work/mxnet/scala-package/spark/target/mxnet-spark_2.11-1.3.0-SNAPSHOT.jar has been previously copied to /tmp/spark-202cac2e-62d9-4ac4-a99a-27bca2aaf87f/userFiles-2f1680a1-aea4-4de0-8784-3300ad265be7/mxnet-spark_2.11-1.3.0-SNAPSHOT.jar

18/06/12 17:16:53 INFO MXNet: Starting server ...

18/06/12 17:16:53 INFO HadoopRDD: Input split: file:/tmp/mxnet-spark-test-15288237675593208593691818310660/train.txt:234881024+8528911

18/06/12 17:16:53 INFO HadoopRDD: Input split: file:/tmp/mxnet-spark-test-15288237675593208593691818310660/train.txt:67108864+33554432

18/06/12 17:16:53 INFO HadoopRDD: Input split: file:/tmp/mxnet-spark-test-15288237675593208593691818310660/train.txt:167772160+33554432

18/06/12 17:16:53 INFO HadoopRDD: Input split: file:/tmp/mxnet-spark-test-15288237675593208593691818310660/train.txt:201326592+33554432

18/06/12 17:16:53 INFO HadoopRDD: Input split: file:/tmp/mxnet-spark-test-15288237675593208593691818310660/train.txt:134217728+33554432

18/06/12 17:16:53 INFO HadoopRDD: Input split: file:/tmp/mxnet-spark-test-15288237675593208593691818310660/train.txt:33554432+33554432

18/06/12 17:16:53 INFO HadoopRDD: Input split: file:/tmp/mxnet-spark-test-15288237675593208593691818310660/train.txt:0+33554432

18/06/12 17:16:53 INFO HadoopRDD: Input split: file:/tmp/mxnet-spark-test-15288237675593208593691818310660/train.txt:100663296+33554432

18/06/12 17:16:53 INFO ParameterServer: Started process: java  -cp /tmp/spark-202cac2e-62d9-4ac4-a99a-27bca2aaf87f/userFiles-2f1680a1-aea4-4de0-8784-3300ad265be7/mxnet-full_2.11-linux-x86_64-cpu-1.3.0-SNAPSHOT.jar:/tmp/spark-202cac2e-62d9-4ac4-a99a-27bca2aaf87f/userFiles-2f1680a1-aea4-4de0-8784-3300ad265be7/mxnet-spark_2.11-1.3.0-SNAPSHOT.jar org.apache.mxnet.spark.ParameterServer --role=server --root-uri=172.17.0.4 --root-port=45669 --num-server=1 --num-worker=2 --timeout=300 at 172.17.0.4:45669

18/06/12 17:16:53 INFO ParameterServer: Starting InputStream-Redirecter Thread for 172.17.0.4:45669

18/06/12 17:16:53 INFO ParameterServer: Starting ErrorStream-Redirecter Thread for 172.17.0.4:45669

SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".

SLF4J: Defaulting to no-operation (NOP) logger implementation

SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.

18/06/12 17:16:53 INFO Executor: Finished task 7.0 in stage 1.0 (TID 8). 2254 bytes result sent to driver

18/06/12 17:16:53 INFO TaskSetManager: Finished task 7.0 in stage 1.0 (TID 8) in 530 ms on localhost (1/8)

18/06/12 17:16:54 INFO Executor: Finished task 0.0 in stage 1.0 (TID 1). 2254 bytes result sent to driver

18/06/12 17:16:54 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 1) in 1083 ms on localhost (2/8)

18/06/12 17:16:54 INFO Executor: Finished task 5.0 in stage 1.0 (TID 6). 2254 bytes result sent to driver

18/06/12 17:16:54 INFO TaskSetManager: Finished task 5.0 in stage 1.0 (TID 6) in 1092 ms on localhost (3/8)

18/06/12 17:16:54 INFO Executor: Finished task 2.0 in stage 1.0 (TID 3). 2254 bytes result sent to driver

18/06/12 17:16:54 INFO Executor: Finished task 1.0 in stage 1.0 (TID 2). 2254 bytes result sent to driver

18/06/12 17:16:54 INFO TaskSetManager: Finished task 2.0 in stage 1.0 (TID 3) in 1093 ms on localhost (4/8)

18/06/12 17:16:54 INFO TaskSetManager: Finished task 1.0 in stage 1.0 (TID 2) in 1094 ms on localhost (5/8)

18/06/12 17:16:54 INFO Executor: Finished task 6.0 in stage 1.0 (TID 7). 2254 bytes result sent to driver

18/06/12 17:16:54 INFO TaskSetManager: Finished task 6.0 in stage 1.0 (TID 7) in 1097 ms on localhost (6/8)

18/06/12 17:16:54 INFO Executor: Finished task 3.0 in stage 1.0 (TID 4). 2254 bytes result sent to driver

18/06/12 17:16:54 INFO TaskSetManager: Finished task 3.0 in stage 1.0 (TID 4) in 1100 ms on localhost (7/8)

18/06/12 17:16:54 INFO Executor: Finished task 4.0 in stage 1.0 (TID 5). 2254 bytes result sent to driver

18/06/12 17:16:54 INFO TaskSetManager: Finished task 4.0 in stage 1.0 (TID 5) in 1125 ms on localhost (8/8)

18/06/12 17:16:54 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool 

18/06/12 17:16:54 INFO DAGScheduler: ShuffleMapStage 1 (repartition at MXNet.scala:251) finished in 1.126 s

18/06/12 17:16:54 INFO DAGScheduler: looking for newly runnable stages

18/06/12 17:16:54 INFO DAGScheduler: running: Set(ResultStage 0)

18/06/12 17:16:54 INFO DAGScheduler: waiting: Set(ResultStage 2)

18/06/12 17:16:54 INFO DAGScheduler: failed: Set()

18/06/12 17:16:54 INFO DAGScheduler: Submitting ResultStage 2 (MapPartitionsRDD[8] at mapPartitions at MXNet.scala:209), which has no missing parents

18/06/12 17:16:54 INFO MemoryStore: Block broadcast_3 stored as values in memory (estimated size 8.6 KB, free 19.8 GB)

18/06/12 17:16:54 INFO MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 3.8 KB, free 19.8 GB)

18/06/12 17:16:54 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on localhost:38790 (size: 3.8 KB, free: 19.8 GB)

18/06/12 17:16:54 INFO SparkContext: Created broadcast 3 from broadcast at DAGScheduler.scala:1006

18/06/12 17:16:54 INFO DAGScheduler: Submitting 2 missing tasks from ResultStage 2 (MapPartitionsRDD[8] at mapPartitions at MXNet.scala:209)

18/06/12 17:16:54 INFO TaskSchedulerImpl: Adding task set 2.0 with 2 tasks

18/06/12 17:16:54 INFO TaskSetManager: Starting task 0.0 in stage 2.0 (TID 9, localhost, partition 0,NODE_LOCAL, 2296 bytes)

18/06/12 17:16:54 INFO TaskSetManager: Starting task 1.0 in stage 2.0 (TID 10, localhost, partition 1,NODE_LOCAL, 2296 bytes)

18/06/12 17:16:54 INFO Executor: Running task 0.0 in stage 2.0 (TID 9)

18/06/12 17:16:54 INFO Executor: Running task 1.0 in stage 2.0 (TID 10)

18/06/12 17:16:54 INFO CacheManager: Partition rdd_8_0 not found, computing it

18/06/12 17:16:54 INFO CacheManager: Partition rdd_8_1 not found, computing it

18/06/12 17:16:54 INFO ShuffleBlockFetcherIterator: Getting 8 non-empty blocks out of 8 blocks

18/06/12 17:16:54 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms

18/06/12 17:16:54 INFO ShuffleBlockFetcherIterator: Getting 8 non-empty blocks out of 8 blocks

18/06/12 17:16:54 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms

18/06/12 17:16:56 INFO MXNet: Launching worker ...

18/06/12 17:16:56 INFO MXNet: Batch 128

18/06/12 17:16:56 INFO MXNet: Launching worker ...

18/06/12 17:16:56 INFO MXNet: Batch 128

18/06/12 17:17:16 INFO MXNet: Start training ...

18/06/12 17:17:16 INFO MXNet: Start training ...

18/06/12 17:17:16 INFO DataParallelExecutorManager: Start training with [cpu(0),cpu(1)]

18/06/12 17:17:16 INFO DataParallelExecutorManager: Start training with [cpu(0),cpu(1)]

18/06/12 17:18:20 INFO Model: Epoch[0] Train-accuracy=0.85657054

18/06/12 17:18:20 INFO Model: Epoch[0] Time cost=63440

18/06/12 17:18:20 INFO Model: Epoch[0] Train-accuracy=0.8499599

18/06/12 17:18:20 INFO Model: Epoch[0] Time cost=63442

18/06/12 17:18:59 INFO Model: Epoch[1] Train-accuracy=0.9565972

18/06/12 17:18:59 INFO Model: Epoch[1] Time cost=38775

18/06/12 17:18:59 INFO Model: Epoch[1] Train-accuracy=0.95492786

18/06/12 17:18:59 INFO Model: Epoch[1] Time cost=38779

18/06/12 17:19:32 INFO Model: Epoch[2] Train-accuracy=0.9720219

18/06/12 17:19:32 INFO Model: Epoch[2] Time cost=33500

18/06/12 17:19:32 INFO Model: Epoch[2] Train-accuracy=0.97215545

18/06/12 17:19:32 INFO Model: Epoch[2] Time cost=33507

18/06/12 17:19:58 INFO Model: Epoch[3] Train-accuracy=0.9793002

18/06/12 17:19:58 INFO Model: Epoch[3] Time cost=25923

18/06/12 17:19:58 INFO Model: Epoch[3] Train-accuracy=0.9788996

18/06/12 17:19:58 INFO Model: Epoch[3] Time cost=26168

18/06/12 17:20:29 INFO Model: Epoch[4] Train-accuracy=0.9821715

18/06/12 17:20:29 INFO Model: Epoch[4] Time cost=30778

18/06/12 17:20:29 INFO Model: Epoch[4] Train-accuracy=0.9831063

18/06/12 17:20:29 INFO Model: Epoch[4] Time cost=30799

18/06/12 17:20:59 INFO Model: Epoch[5] Train-accuracy=0.9841079

18/06/12 17:20:59 INFO Model: Epoch[5] Time cost=29636

18/06/12 17:20:59 INFO Model: Epoch[5] Train-accuracy=0.9863782

18/06/12 17:20:59 INFO Model: Epoch[5] Time cost=29593

18/06/12 17:21:24 INFO Model: Epoch[6] Train-accuracy=0.9855769

18/06/12 17:21:24 INFO Model: Epoch[6] Time cost=25173

18/06/12 17:21:24 INFO Model: Epoch[6] Train-accuracy=0.9882479

18/06/12 17:21:24 INFO Model: Epoch[6] Time cost=25212

18/06/12 17:21:51 INFO Model: Epoch[7] Train-accuracy=0.9881143

18/06/12 17:21:51 INFO Model: Epoch[7] Time cost=26722

18/06/12 17:21:51 INFO Model: Epoch[7] Train-accuracy=0.9898504

18/06/12 17:21:51 INFO Model: Epoch[7] Time cost=26627

terminate called without an active exception

terminate called without an active exception

Exception in thread "Thread-113" java.lang.IllegalArgumentException: requirement failed: Failed to start ps scheduler process with exit code 134

    at scala.Predef$.require(Predef.scala:224)

    at org.apache.mxnet.spark.MXNet.org$apache$mxnet$spark$MXNet$$startPSSchedulerInner$1(MXNet.scala:159)

    at org.apache.mxnet.spark.MXNet$$anonfun$startPSScheduler$1.apply(MXNet.scala:162)

    at org.apache.mxnet.spark.MXNet$$anonfun$startPSScheduler$1.apply(MXNet.scala:162)

    at org.apache.mxnet.spark.MXNet$MXNetControllingThread.run(MXNet.scala:38)

18/06/12 17:21:57 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)

java.lang.IllegalArgumentException: requirement failed: ps server process quit with exit code 134

    at scala.Predef$.require(Predef.scala:224)

    at org.apache.mxnet.spark.MXNet$$anonfun$org$apache$mxnet$spark$MXNet$$startPSServersInner$1$1.apply(MXNet.scala:137)

    at org.apache.mxnet.spark.MXNet$$anonfun$org$apache$mxnet$spark$MXNet$$startPSServersInner$1$1.apply(MXNet.scala:126)

    at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$33.apply(RDD.scala:920)

    at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$33.apply(RDD.scala:920)

    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)

    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)

    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)

    at org.apache.spark.scheduler.Task.run(Task.scala:89)

    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)

    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)

    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)

    at java.lang.Thread.run(Thread.java:748)

18/06/12 17:21:57 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.lang.IllegalArgumentException: requirement failed: ps server process quit with exit code 134

    at scala.Predef$.require(Predef.scala:224)

    at org.apache.mxnet.spark.MXNet$$anonfun$org$apache$mxnet$spark$MXNet$$startPSServersInner$1$1.apply(MXNet.scala:137)

    at org.apache.mxnet.spark.MXNet$$anonfun$org$apache$mxnet$spark$MXNet$$startPSServersInner$1$1.apply(MXNet.scala:126)

    at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$33.apply(RDD.scala:920)

    at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$33.apply(RDD.scala:920)

    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)

    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)

    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)

    at org.apache.spark.scheduler.Task.run(Task.scala:89)

    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)

    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)

    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)

    at java.lang.Thread.run(Thread.java:748)

18/06/12 17:21:57 ERROR TaskSetManager: Task 0 in stage 0.0 failed 1 times; aborting job

18/06/12 17:21:57 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool 

18/06/12 17:21:57 INFO TaskSchedulerImpl: Cancelling stage 0

18/06/12 17:21:57 INFO DAGScheduler: ResultStage 0 (foreachPartition at MXNet.scala:126) failed in 304.794 s

18/06/12 17:21:57 INFO DAGScheduler: Job 0 failed: foreachPartition at MXNet.scala:126, took 304.801371 s

Exception in thread "Thread-114" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.lang.IllegalArgumentException: requirement failed: ps server process quit with exit code 134

    at scala.Predef$.require(Predef.scala:224)

    at org.apache.mxnet.spark.MXNet$$anonfun$org$apache$mxnet$spark$MXNet$$startPSServersInner$1$1.apply(MXNet.scala:137)

    at org.apache.mxnet.spark.MXNet$$anonfun$org$apache$mxnet$spark$MXNet$$startPSServersInner$1$1.apply(MXNet.scala:126)

    at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$33.apply(RDD.scala:920)

    at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$33.apply(RDD.scala:920)

    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)

    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)

    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)

    at org.apache.spark.scheduler.Task.run(Task.scala:89)

    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)

    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)

    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)

    at java.lang.Thread.run(Thread.java:748)

Driver stacktrace:

    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431)

    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419)

    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1418)

    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)

    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)

    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1418)

    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)

    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)

    at scala.Option.foreach(Option.scala:257)

    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:799)

    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1640)

    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599)

    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588)

    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)

    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:620)

    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1832)

    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1845)

    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1858)

    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1929)

    at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:920)

    at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:918)

    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)

    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)

    at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)

    at org.apache.spark.rdd.RDD.foreachPartition(RDD.scala:918)

    at org.apache.mxnet.spark.MXNet.org$apache$mxnet$spark$MXNet$$startPSServersInner$1(MXNet.scala:126)

    at org.apache.mxnet.spark.MXNet$$anonfun$startPSServers$1.apply(MXNet.scala:140)

    at org.apache.mxnet.spark.MXNet$$anonfun$startPSServers$1.apply(MXNet.scala:140)

    at org.apache.mxnet.spark.MXNet$MXNetControllingThread.run(MXNet.scala:38)

Caused by: java.lang.IllegalArgumentException: requirement failed: ps server process quit with exit code 134

    at scala.Predef$.require(Predef.scala:224)

    at org.apache.mxnet.spark.MXNet$$anonfun$org$apache$mxnet$spark$MXNet$$startPSServersInner$1$1.apply(MXNet.scala:137)

    at org.apache.mxnet.spark.MXNet$$anonfun$org$apache$mxnet$spark$MXNet$$startPSServersInner$1$1.apply(MXNet.scala:126)

    at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$33.apply(RDD.scala:920)

    at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$33.apply(RDD.scala:920)

    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)

    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)

    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)

    at org.apache.spark.scheduler.Task.run(Task.scala:89)

    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)

    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)

    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)

    at java.lang.Thread.run(Thread.java:748)

18/06/12 17:46:53 INFO BlockManagerInfo: Removed broadcast_2_piece0 on localhost:38790 in memory (size: 2.4 KB, free: 19.8 GB)

18/06/12 17:46:53 INFO BlockManagerInfo: Removed broadcast_1_piece0 on localhost:38790 in memory (size: 2.7 KB, free: 19.8 GB)

18/06/12 17:46:53 INFO ContextCleaner: Cleaned accumulator 5

Sending interrupt signal to process

build.py: 2018-06-12 19:08:38,620 Running of command in container failed (-15): docker run --rm -t --shm-size=500m -v /home/jenkins_slave/workspace/ut-scala-cpu:/work/mxnet -v /home/jenkins_slave/workspace/ut-scala-cpu/build:/work/build -v /tmp/ci_ccache:/work/ccache -u 1001:1001 -e CCACHE_MAXSIZE=10G -e CCACHE_DIR=/work/ccache mxnetci/build.ubuntu_cpu /work/runtime_functions.sh unittest_ubuntu_cpu_scala

build.py: 2018-06-12 19:08:38,620 You can try to get into the container by using the following command: docker run --rm -t --shm-size=500m -v /home/jenkins_slave/workspace/ut-scala-cpu:/work/mxnet -v /home/jenkins_slave/workspace/ut-scala-cpu/build:/work/build -v /tmp/ci_ccache:/work/ccache -u 1001:1001 -ti --entrypoint /bin/bash -e CCACHE_MAXSIZE=10G -e CCACHE_DIR=/work/ccache mxnetci/build.ubuntu_cpu /work/runtime_functions.sh unittest_ubuntu_cpu_scala

Terminated

script returned exit code 143
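
For reference, the exit codes in this log look signal-related: 134 is 128 + 6, i.e. SIGABRT, which matches the "terminate called without an active exception" abort from the native ps process, while the final 143 is 128 + 15, i.e. SIGTERM from the CI harness interrupting the job. A minimal, purely illustrative Scala sketch of that interpretation (not taken from the MXNet code base):

```scala
// Illustrative only: mapping the exit codes seen in this log to signals.
// Exit codes above 128 conventionally mean "killed by signal (code - 128)".
object ExitCodeNotes {
  def describe(code: Int): String = code match {
    case 134 => "128 + 6 (SIGABRT): native process aborted, matching 'terminate called without an active exception'"
    case 143 => "128 + 15 (SIGTERM): the CI harness sent an interrupt and the job was terminated"
    case c if c > 128 => s"killed by signal ${c - 128}"
    case c => s"normal exit with status $c"
  }

  def main(args: Array[String]): Unit =
    Seq(134, 143).foreach(c => println(s"$c -> ${describe(c)}"))
}
```
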
anirudh2290 commented 6 years ago

Happens here too: http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-11127/11/pipeline . Please prioritize fixing this; it may become a blocker for the 1.2.1 release.

lanking520 commented 6 years ago

The test is broken on my Mac as well; currently working on this...

nswamy commented 6 years ago

I see this on the master branch; is it breaking on the 1.2.0 branch as well?

anirudh2290 commented 6 years ago

@nswamy I haven't seen this yet on 1.2, but we need to merge PRs on master before we cherry-pick for 1.2.

nswamy commented 6 years ago

This is due to the Spark tests that were newly merged in https://github.com/apache/incubator-mxnet/pull/10462. I would like to disable them so that we don't block the pipeline; we will reactivate these tests later. @CodingCat hope it's ok with you.

CodingCat commented 6 years ago

please go ahead and merge

But according to "terminate called without an active exception", it looks like the Docker instance is assigned too little memory and is killed in the middle (does anyone know how much memory we allocate to Docker?).
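
One way to answer that question would be a small diagnostic run inside the CI container; a hedged sketch, assuming the CI image uses cgroup v1 (the object name and file path here are illustrative, not part of the test suite):

```scala
// Hypothetical diagnostic: print the JVM's memory view and the container's
// cgroup memory limit to check the "too little memory" hypothesis.
import scala.io.Source
import scala.util.Try

object MemoryCheck {
  def main(args: Array[String]): Unit = {
    val jvmMaxMb = Runtime.getRuntime.maxMemory / (1024L * 1024L)
    println(s"JVM max heap: $jvmMaxMb MB")

    // cgroup v1 path; an extremely large value here means no limit was set on the container.
    val limit = Try(Source.fromFile("/sys/fs/cgroup/memory/memory.limit_in_bytes").mkString.trim)
    println(s"cgroup memory limit (bytes): ${limit.getOrElse("unavailable")}")
  }
}
```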

CodingCat commented 6 years ago

@nswamy when you disable the test, would you please only disable https://github.com/apache/incubator-mxnet/pull/10462/files#diff-b9835287795c57e48fb9da87d39d06a8R64 (LeNet) for now (I believe LeNet is much more memory-intensive) and leave MLP in place to guard correctness on the Spark side?

If MLP also turns out to be flaky, let's go ahead and kill it too (I don't think it will).
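
If we go the disabling route, a minimal sketch of what it could look like, assuming the Spark test is a ScalaTest FunSuite (the suite and test names below are assumptions, not the actual ones from #10462):

```scala
// Hypothetical sketch: disable only the LeNet Spark test, keep MLP as the guard.
// In ScalaTest, swapping `test` for `ignore` skips the case but keeps it listed in reports.
import org.scalatest.FunSuite

class MXNetSparkSuite extends FunSuite {
  test("run spark with MLP") {
    // kept active to guard correctness of the Spark integration
  }

  ignore("run spark with LeNet") {
    // temporarily disabled: appears memory-intensive and flaky on the CPU CI runs
  }
}
```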

marcoabreu commented 6 years ago

I don't think we assign any limits to Docker. But why does a unit test require so many resources?

larroy commented 6 years ago

http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-11055/26/pipeline/751

larroy commented 6 years ago

@nswamy wouldn't these kinds of heavy tests be more suitable for the nightly runs?

nswamy commented 6 years ago

Maybe; I haven't looked at how long it takes to run these tests. We need to have some training tests in the regular pipeline as well; maybe we can run them on GPUs and keep only MNIST in the CPU tests.

lanking520 commented 6 years ago

Currently, all the failures I can see from the CI builds come from the CPU tests. There is no difference in configuration between running the Spark test on CPU and on GPU, so I think the failure might come from the difference between the CPU and GPU builds. Please correct me if this assumption is wrong.

lanking520 commented 6 years ago

@yzhliu

lanking520 commented 6 years ago

Since the PR was merged yesterday, let's keep watching whether more CI runs die because of Scala. Please close this issue after a few days if there are no further crashes.

nswamy commented 6 years ago

This is resolved; let's open a new issue if something comes up again.

nswamy commented 6 years ago

I meant this is not blocking the pipeline; I'll keep it open and edit the issue title / remove the Flaky label.