amplab / SparkNet

Distributed Neural Networks for Spark
MIT License

ImageNet running on YARN, NodeManager memory keeps increasing #123

Closed nhe150 closed 8 years ago

nhe150 commented 8 years ago

I ran the ImageNet app in YARN cluster mode and noticed that the NodeManager-reported memory keeps increasing. It looks like a memory leak in the C++/JNI code, since the CoarseGrainedExecutorBackend JVM memory is very stable.

See the two processes below (1127 keeps growing, while 1130 is very stable):

0 S yarn 1127 1125 0 80 0 - 2910 wait 13:15 ? 00:00:00 /bin/bash -c LD_LIBRARY_PATH=/opt/cloudera/parcels/CDH-5.7.0-1.cdh5.7.0.p0.45/lib/hadoop/../../../CDH-5.7.0-1.cdh5.7.0.p0.45/lib/hadoop/lib/native:/opt/gpu/cuda/lib64:/data02/nhe/SparkNet/lib:/data02/nhe/cuda-7.0::/opt/cloudera/parcels/CDH-5.7.0-1.cdh5.7.0.p0.45/lib/hadoop/lib/native /usr/lib/jvm/java-7-oracle-cloudera/bin/java -server -XX:OnOutOfMemoryError='kill %p' -Xms22528m -Xmx22528m -Djava.io.tmpdir=/data02/yarn/nm/usercache/hdfs/appcache/application_1461609406099_0001/container_1461609406099_0001_02_000002/tmp '-Dspark.authenticate=false' '-Dspark.driver.port=56487' '-Dspark.shuffle.service.port=7337' '-Dspark.ui.port=0' -Dspark.yarn.app.container.log.dir=/data02/yarn/container-logs/application_1461609406099_0001/container_1461609406099_0001_02_000002 -XX:MaxPermSize=256m org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url spark://CoarseGrainedScheduler@105.144.47.43:56487 --executor-id 1 --hostname bdalab12.samsungsdsra.com --cores 16 --app-id application_1461609406099_0001 --user-class-path file:/data02/yarn/nm/usercache/hdfs/appcache/application_1461609406099_0001/container_1461609406099_0001_02_000002/app.jar 1> /data02/yarn/container-logs/application_1461609406099_0001/container_1461609406099_0001_02_000002/stdout 2> /data02/yarn/container-logs/application_1461609406099_0001/container_1461609406099_0001_02_000002/stderr


0 S yarn 1130 1127 99 80 0 - 56878287 futex_ 13:15 ? 01:25:40 /usr/lib/jvm/java-7-oracle-cloudera/bin/java -server -XX:OnOutOfMemoryError=kill %p -Xms22528m -Xmx22528m -Djava.io.tmpdir=/data02/yarn/nm/usercache/hdfs/appcache/application_1461609406099_0001/container_1461609406099_0001_02_000002/tmp -Dspark.authenticate=false -Dspark.driver.port=56487 -Dspark.shuffle.service.port=7337 -Dspark.ui.port=0 -Dspark.yarn.app.container.log.dir=/data02/yarn/container-logs/application_1461609406099_0001/container_1461609406099_0001_02_000002 -XX:MaxPermSize=256m org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url spark://CoarseGrainedScheduler@105.144.47.43:56487 --executor-id 1 --hostname bdalab12.samsungsdsra.com --cores 16 --app-id application_1461609406099_0001 --user-class-path file:/data02/yarn/nm/usercache/hdfs/appcache/application_1461609406099_0001/container_1461609406099_0001_02_000002/app.jar

robertnishihara commented 8 years ago

Thanks for posting the issue! Is this using SparkNet with Caffe (or TensorFlow)? We're trying to reproduce it at the moment.

nhe150 commented 8 years ago

This is using SparkNet with Caffe. Here is some more information: I tried to correlate the times at which the memory leak happens. Here are the logs.

The driver logs:

Tue Apr 26 10:25:09 PDT 2016 232.468, i = 1: collecting weights
Tue Apr 26 10:25:26 PDT 2016 249.743: collect took 17.274 s
Tue Apr 26 10:25:26 PDT 2016 250.293, i = 1: weight = 0.008250123
Tue Apr 26 10:25:26 PDT 2016 250.293, i = 2: broadcasting weights
Tue Apr 26 10:25:27 PDT 2016 250.763: broadcast took 0.47 s
Tue Apr 26 10:25:27 PDT 2016 250.763, i = 2: setting weights on workers
Tue Apr 26 10:25:29 PDT 2016 252.46: setweight took 1.697 s
Tue Apr 26 10:25:29 PDT 2016 252.46, i = 2: training
Tue Apr 26 10:25:44 PDT 2016 267.882, i = 2: collecting weights
Tue Apr 26 10:26:00 PDT 2016 283.696: collect took 15.814 s
Tue Apr 26 10:26:00 PDT 2016 284.225, i = 2: weight = 0.008250123
Tue Apr 26 10:26:00 PDT 2016 284.225, i = 3: broadcasting weights
Tue Apr 26 10:26:01 PDT 2016 284.668: broadcast took 0.443 s
Tue Apr 26 10:26:01 PDT 2016 284.668, i = 3: setting weights on workers
Tue Apr 26 10:26:03 PDT 2016 286.562: setweight took 1.894 s
Tue Apr 26 10:26:03 PDT 2016 286.562, i = 3: training
Tue Apr 26 10:26:18 PDT 2016 301.765, i = 3: collecting weights
Tue Apr 26 10:26:35 PDT 2016 319.236: collect took 17.406 s
Tue Apr 26 10:26:36 PDT 2016 319.761, i = 3: weight = 0.008250123
Tue Apr 26 10:26:36 PDT 2016 319.761, i = 4: broadcasting weights
Tue Apr 26 10:26:36 PDT 2016 320.2: broadcast took 0.439 s
Tue Apr 26 10:26:36 PDT 2016 320.2, i = 4: setting weights on workers
Tue Apr 26 10:26:38 PDT 2016 321.874: setweight took 1.674 s
Tue Apr 26 10:26:38 PDT 2016 321.874, i = 4: training
Tue Apr 26 10:26:54 PDT 2016 337.492, i = 4: collecting weights
Tue Apr 26 10:27:10 PDT 2016 353.704: collect took 16.211 s
Tue Apr 26 10:27:10 PDT 2016 354.226, i = 4: weight = 0.008250123
Tue Apr 26 10:27:10 PDT 2016 354.226, i = 5: broadcasting weights
Tue Apr 26 10:27:11 PDT 2016 354.768: broadcast took 0.542 s
Tue Apr 26 10:27:11 PDT 2016 354.768, i = 5: setting weights on workers
Tue Apr 26 10:27:13 PDT 2016 356.384: setweight took 1.616 s
Tue Apr 26 10:27:13 PDT 2016 356.384, i = 5: training
Tue Apr 26 10:27:28 PDT 2016 371.715, i = 5: collecting weights

The NodeManager logs where memory keeps growing:

2016-04-26 10:25:28,265 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 42178 for container-id container_1461609406099_0007_01_000002: 9.1 GB of 13 GB physical memory used; 202.4 GB of 27.3 GB virtual memory used
2016-04-26 10:25:31,312 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 42178 for container-id container_1461609406099_0007_01_000002: 9.1 GB of 13 GB physical memory used; 202.4 GB of 27.3 GB virtual memory used
2016-04-26 10:25:34,360 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 42178 for container-id container_1461609406099_0007_01_000002: 9.1 GB of 13 GB physical memory used; 202.4 GB of 27.3 GB virtual memory used
2016-04-26 10:25:37,404 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 42178 for container-id container_1461609406099_0007_01_000002: 9.1 GB of 13 GB physical memory used; 202.4 GB of 27.3 GB virtual memory used
2016-04-26 10:25:40,451 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 42178 for container-id container_1461609406099_0007_01_000002: 9.1 GB of 13 GB physical memory used; 202.4 GB of 27.3 GB virtual memory used
2016-04-26 10:25:43,498 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 42178 for container-id container_1461609406099_0007_01_000002: 9.1 GB of 13 GB physical memory used; 202.4 GB of 27.3 GB virtual memory used
2016-04-26 10:25:46,525 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 42178 for container-id container_1461609406099_0007_01_000002: 9.2 GB of 13 GB physical memory used; 202.4 GB of 27.3 GB virtual memory used
2016-04-26 10:25:49,549 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 42178 for container-id container_1461609406099_0007_01_000002: 9.2 GB of 13 GB physical memory used; 202.4 GB of 27.3 GB virtual memory used
2016-04-26 10:25:52,595 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 42178 for container-id container_1461609406099_0007_01_000002: 9.2 GB of 13 GB physical memory used; 202.4 GB of 27.3 GB virtual memory used
2016-04-26 10:25:55,640 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 42178 for container-id container_1461609406099_0007_01_000002: 9.2 GB of 13 GB physical memory used; 202.4 GB of 27.3 GB virtual memory used
2016-04-26 10:25:58,685 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 42178 for container-id container_1461609406099_0007_01_000002: 9.2 GB of 13 GB physical memory used; 202.4 GB of 27.3 GB virtual memory used
2016-04-26 10:26:01,732 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 42178 for container-id container_1461609406099_0007_01_000002: 9.6 GB of 13 GB physical memory used; 202.4 GB of 27.3 GB virtual memory used
2016-04-26 10:26:04,779 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 42178 for container-id container_1461609406099_0007_01_000002: 9.6 GB of 13 GB physical memory used; 202.4 GB of 27.3 GB virtual memory used
2016-04-26 10:26:07,826 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 42178 for container-id container_1461609406099_0007_01_000002: 9.6 GB of 13 GB physical memory used; 202.4 GB of 27.3 GB virtual memory used
2016-04-26 10:26:10,870 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 42178 for container-id container_1461609406099_0007_01_000002: 9.6 GB of 13 GB physical memory used; 202.4 GB of 27.3 GB virtual memory used
2016-04-26 10:26:13,917 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 42178 for container-id container_1461609406099_0007_01_000002: 9.6 GB of 13 GB physical memory used; 202.4 GB of 27.3 GB virtual memory used
2016-04-26 10:26:16,963 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 42178 for container-id container_1461609406099_0007_01_000002: 9.6 GB of 13 GB physical memory used; 202.4 GB of 27.3 GB virtual memory used
2016-04-26 10:26:20,012 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 42178 for container-id container_1461609406099_0007_01_000002: 10.3 GB of 13 GB physical memory used; 202.4 GB of 27.3 GB virtual memory used
2016-04-26 10:26:23,057 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 42178 for container-id container_1461609406099_0007_01_000002: 10.3 GB of 13 GB physical memory used; 202.4 GB of 27.3 GB virtual memory used
2016-04-26 10:26:26,103 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 42178 for container-id container_1461609406099_0007_01_000002: 10.3 GB of 13 GB physical memory used; 202.4 GB of 27.3 GB virtual memory used
2016-04-26 10:26:29,148 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 42178 for container-id container_1461609406099_0007_01_000002: 10.3 GB of 13 GB physical memory used; 202.4 GB of 27.3 GB virtual memory used
2016-04-26 10:26:32,194 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 42178 for container-id container_1461609406099_0007_01_000002: 10.3 GB of 13 GB physical memory used; 202.4 GB of 27.3 GB virtual memory used
2016-04-26 10:26:35,239 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 42178 for container-id container_1461609406099_0007_01_000002: 10.3 GB of 13 GB physical memory used; 202.4 GB of 27.3 GB virtual memory used
2016-04-26 10:26:38,285 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 42178 for container-id container_1461609406099_0007_01_000002: 10.3 GB of 13 GB physical memory used; 202.4 GB of 27.3 GB virtual memory used
2016-04-26 10:26:41,329 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 42178 for container-id container_1461609406099_0007_01_000002: 10.3 GB of 13 GB physical memory used; 202.4 GB of 27.3 GB virtual memory used
2016-04-26 10:26:44,377 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 42178 for container-id container_1461609406099_0007_01_000002: 10.5 GB of 13 GB physical memory used; 202.4 GB of 27.3 GB virtual memory used
2016-04-26 10:26:47,424 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 42178 for container-id container_1461609406099_0007_01_000002: 10.5 GB of 13 GB physical memory used; 202.4 GB of 27.3 GB virtual memory used
2016-04-26 10:26:50,471 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 42178 for container-id container_1461609406099_0007_01_000002: 10.5 GB of 13 GB physical memory used; 202.4 GB of 27.3 GB virtual memory used
2016-04-26 10:26:53,532 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 42178 for container-id container_1461609406099_0007_01_000002: 10.5 GB of 13 GB physical memory used; 202.4 GB of 27.3 GB virtual memory used
2016-04-26 10:26:56,578 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 42178 for container-id container_1461609406099_0007_01_000002: 11.0 GB of 13 GB physical memory used; 202.4 GB of 27.3 GB virtual memory used
2016-04-26 10:26:59,624 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 42178 for container-id container_1461609406099_0007_01_000002: 11.0 GB of 13 GB physical memory used; 202.4 GB of 27.3 GB virtual memory used
2016-04-26 10:27:02,669 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 42178 for container-id container_1461609406099_0007_01_000002: 11.0 GB of 13 GB physical memory used; 202.4 GB of 27.3 GB virtual memory used
2016-04-26 10:27:05,715 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 42178 for container-id container_1461609406099_0007_01_000002: 11.0 GB of 13 GB physical memory used; 202.4 GB of 27.3 GB virtual memory used
2016-04-26 10:27:08,760 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 42178 for container-id container_1461609406099_0007_01_000002: 11.0 GB of 13 GB physical memory used; 202.4 GB of 27.3 GB virtual memory used
2016-04-26 10:27:11,813 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 42178 for container-id container_1461609406099_0007_01_000002: 11.0 GB of 13 GB physical memory used; 202.4 GB of 27.3 GB virtual memory used
2016-04-26 10:27:14,859 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 42178 for container-id container_1461609406099_0007_01_000002: 11.0 GB of 13 GB physical memory used; 202.4 GB of 27.3 GB virtual memory used
2016-04-26 10:27:17,905 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 42178 for container-id container_1461609406099_0007_01_000002: 11.0 GB of 13 GB physical memory used; 202.4 GB of 27.3 GB virtual memory used
2016-04-26 10:27:20,951 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 42178 for container-id container_1461609406099_0007_01_000002: 11.0 GB of 13 GB physical memory used; 202.4 GB of 27.3 GB virtual memory used
2016-04-26 10:27:23,998 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 42178 for container-id container_1461609406099_0007_01_000002: 11.0 GB of 13 GB physical memory used; 202.4 GB of 27.3 GB virtual memory used
2016-04-26 10:27:27,045 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 42178 for container-id container_1461609406099_0007_01_000002: 11.0 GB of 13 GB physical memory used; 202.4 GB of 27.3 GB virtual memory used
2016-04-26 10:27:30,093 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 42178 for container-id container_1461609406099_0007_01_000002: 11.4 GB of 13 GB physical memory used; 202.4 GB of 27.3 GB virtual memory used
2016-04-26 10:27:33,139 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 42178 for container-id container_1461609406099_0007_01_000002: 11.4 GB of 13 GB physical memory used; 202.4 GB of 27.3 GB virtual memory used
2016-04-26 10:27:36,200 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 42178 for container-id container_1461609406099_0007_01_000002: 11.4 GB of 13 GB physical memory used; 202.4 GB of 27.3 GB virtual memory used
2016-04-26 10:27:39,246 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 42178 for container-id container_1461609406099_0007_01_000002: 11.4 GB of 13 GB physical memory used; 202.4 GB of 27.3 GB virtual memory used
2016-04-26 10:27:42,291 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 42178 for container-id container_1461609406099_0007_01_000002: 11.4 GB of 13 GB physical memory used; 202.4 GB of 27.3 GB virtual memory used
2016-04-26 10:27:45,337 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 42178 for container-id container_1461609406099_0007_01_000002: 11.4 GB of 13 GB physical memory used; 202.4 GB of 27.3 GB virtual memory used
2016-04-26 10:27:48,383 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 42178 for container-id container_1461609406099_0007_01_000002: 11.4 GB of 13 GB physical memory used; 202.4 GB of 27.3 GB virtual memory used
2016-04-26 10:27:51,416 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 42178 for container-id container_1461609406099_0007_01_000002: 11.4 GB of 13 GB physical memory used; 202.4 GB of 27.3 GB virtual memory used
2016-04-26 10:27:54,450 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 42178 for container-id container_1461609406099_0007_01_000002: 11.4 GB of 13 GB physical memory used; 202.4 GB of 27.3 GB virtual memory used
2016-04-26 10:27:57,482 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 42178 for container-id container_1461609406099_0007_01_000002: 11.4 GB of 13 GB physical memory used; 202.4 GB of 27.3 GB virtual memory used
2016-04-26 10:28:00,530 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 42178 for container-id container_1461609406099_0007_01_000002: 11.4 GB of 13 GB physical memory used; 202.4 GB of 27.3 GB virtual memory used
2016-04-26 10:28:03,578 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 42178 for container-id container_1461609406099_0007_01_000002: 11.4 GB of 13 GB physical memory used; 202.4 GB of 27.3 GB virtual memory used
2016-04-26 10:28:06,624 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 42178 for container-id container_1461609406099_0007_01_000002: 11.9 GB of 13 GB physical memory used; 202.4 GB of 27.3 GB virtual memory used
2016-04-26 10:28:09,670 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 42178 for container-id container_1461609406099_0007_01_000002: 11.9 GB of 13 GB physical memory used; 202.4 GB of 27.3 GB virtual memory used
2016-04-26 10:28:12,716 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 42178 for container-id container_1461609406099_0007_01_000002: 11.9 GB of 13 GB physical memory used; 202.4 GB of 27.3 GB virtual memory used
2016-04-26 10:28:15,760 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 42178 for container-id container_1461609406099_0007_01_000002: 11.9 GB of 13 GB physical memory used; 202.4 GB of 27.3 GB virtual memory used
2016-04-26 10:28:18,822 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 42178 for container-id container_1461609406099_0007_01_000002: 11.9 GB of 13 GB physical memory used; 202.4 GB of 27.3 GB virtual memory used
2016-04-26 10:28:21,868 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 42178 for container-id container_1461609406099_0007_01_000002: 11.9 GB of 13 GB physical memory used; 202.4 GB of 27.3 GB virtual memory used
2016-04-26 10:28:24,916 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 42178 for container-id container_1461609406099_0007_01_000002: 11.9 GB of 13 GB physical memory used; 202.4 GB of 27.3 GB virtual memory used
2016-04-26 10:28:27,965 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 42178 for container-id container_1461609406099_0007_01_000002: 11.9 GB of 13 GB physical memory used; 202.4 GB of 27.3 GB virtual memory used
2016-04-26 10:28:31,012 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 42178 for container-id container_1461609406099_0007_01_000002: 11.9 GB of 13 GB physical memory used; 202.4 GB of 27.3 GB virtual memory used
2016-04-26 10:28:34,059 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 42178 for container-id container_1461609406099_0007_01_000002: 11.9 GB of 13 GB physical memory used; 202.4 GB of 27.3 GB virtual memory used
2016-04-26 10:28:37,106 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 42178 for container-id container_1461609406099_0007_01_000002: 11.9 GB of 13 GB physical memory used; 202.4 GB of 27.3 GB virtual memory used
2016-04-26 10:28:40,148 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 42178 for container-id container_1461609406099_0007_01_000002: 12.3 GB of 13 GB physical memory used; 202.4 GB of 27.3 GB virtual memory used
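Since the growth shows up in the container's process tree but not in the executor JVM heap, one way to confirm that the extra memory is native (JNI/Caffe) rather than on-heap is to log both numbers from inside the executor between iterations. Below is only a minimal sketch, assuming Linux executors with /proc available; the object and method names are illustrative, not SparkNet code.

```scala
import scala.io.Source

// Sketch: compare JVM heap usage with the whole-process resident set size.
// Native (JNI/Caffe) allocations never appear in the -Xmx-bounded heap numbers,
// but they do count against the YARN container's physical memory limit.
object MemoryProbe {
  def logMemory(tag: String): Unit = {
    val rt = Runtime.getRuntime
    val heapUsedMb = (rt.totalMemory - rt.freeMemory) / (1024L * 1024L)
    val src = Source.fromFile("/proc/self/status")
    val rssKb =
      try src.getLines().find(_.startsWith("VmRSS:")).map(_.replaceAll("[^0-9]", "").toLong).getOrElse(-1L)
      finally src.close()
    println(s"$tag heapUsedMB=$heapUsedMb rssMB=${rssKb / 1024}")
  }
}
```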

nhe150 commented 8 years ago

Reading the mean image from a file for ImageNet will speed up the process. Here is the code:

val fileName = sparkNetHome + "/imagenet.mean"
val in: ObjectInputStream = new ObjectInputStream(new FileInputStream(fileName))
val meanImage: Array[Float] = in.readObject().asInstanceOf[Array[Float]]
logger.log("reading mean ")

Here is the mean image for ImageNet: imagenet.mean.zip
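For completeness, a minimal sketch of how such a mean file could be produced in the first place, assuming the mean is already available as an Array[Float] (the writer side is not shown in this issue; sparkNetHome and meanImage are taken from the snippet above):

```scala
import java.io.{FileOutputStream, ObjectOutputStream}

// Sketch only: serialize the mean image once so later runs can deserialize it
// with ObjectInputStream instead of recomputing it from the dataset.
val out = new ObjectOutputStream(new FileOutputStream(sparkNetHome + "/imagenet.mean"))
try {
  out.writeObject(meanImage) // meanImage: Array[Float], computed elsewhere
} finally {
  out.close()
}
```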

pcmoritz commented 8 years ago

Thanks a lot! With the CifarApp in local mode, the problem does not seem to occur; we are trying ImageNet now. We need to figure out whether we can reproduce the memory leak without YARN and then use a memory profiler to track it down. If it is easy for you to run a memory profiler with your current setup, that might help us diagnose and reproduce the bug.

nhe150 commented 8 years ago

I will run it on a Spark standalone cluster.
I ran jvisualvm on YARN: the coarse-grained executor memory is stable, but the memory of the YARN container that spawns the coarse-grained executor (via a shell script) keeps growing, as posted above (I have not profiled that process).

nhe150 commented 8 years ago

I compared two heap dumps that have a 1 GB memory difference.

float[] contributes about 300 MB, which points to the data field of JavaNDArray; byte[] contributes about 600 MB, which points to the buf field of ByteArrayOutputStream.

Each heap dump is 10 GB or more and hard to load. Hopefully this will help. Here is all the code related to data in the Scala/Java part (I am going through it now):

java/libs/JavaNDUtils.java: public static final int[] copyOf(int[] data) {
java/libs/JavaNDUtils.java: return Arrays.copyOf(data, data.length);
java/libs/JavaNDUtils.java: // Remove element from position index in data, return deep copy
java/libs/JavaNDUtils.java: public static int[] removeIndex(int[] data, int index) {
java/libs/JavaNDUtils.java: assert(index < data.length);
java/libs/JavaNDUtils.java: int len = data.length;
java/libs/JavaNDUtils.java: System.arraycopy(data, 0, result, 0, index);
java/libs/JavaNDUtils.java: System.arraycopy(data, index + 1, result, index, len - index - 1);
java/libs/JavaNDArray.java: protected final float[] data;
java/libs/JavaNDArray.java: public JavaNDArray(float[] data, int dim, int[] shape, int offset, int[] strides) {
java/libs/JavaNDArray.java: this.data = data;
java/libs/JavaNDArray.java: public JavaNDArray(float[] data, int... shape) {
java/libs/JavaNDArray.java: this(data, shape.length, shape, 0, JavaNDUtils.calcDefaultStrides(shape));
java/libs/JavaNDArray.java: return new JavaNDArray(data, dim - 1, JavaNDUtils.removeIndex(shape, axis), offset + index * strides[axis], JavaNDUtils.removeIndex(strides, axis));
java/libs/JavaNDArray.java: return new JavaNDArray(data, dim, JavaNDUtils.copyOf(newShape), offset + JavaNDUtils.dot(lowerOffsets, strides), strides); // todo: why copy shape?
java/libs/JavaNDArray.java: data[ix] = value;
java/libs/JavaNDArray.java: return data[ix];
java/libs/JavaNDArray.java: System.arraycopy(data, offset, result, flatIndex, shape[dim - 1]);
java/libs/JavaNDArray.java: result[flatIndex] = data[offset + i * strides[dim - 1]];
java/libs/JavaNDArray.java: result[0] = data[offset];
java/libs/JavaNDArray.java: return new JavaNDArray(data, flatShape.length, flatShape, 0, JavaNDUtils.calcDefaultStrides(flatShape));
java/libs/JavaNDArray.java: return data;
scala/libs/JavaCPPUtils.scala: val data = new Array[Float](shape.product)
scala/libs/JavaCPPUtils.scala: val pointer = floatBlob.cpu_data
scala/libs/JavaCPPUtils.scala: data(i) = pointer.get(i)
scala/libs/JavaCPPUtils.scala: NDArray(data, shape)
scala/libs/JavaCPPUtils.scala: val buffer = blob.mutable_cpu_data()
scala/libs/JavaCPPUtils.scala: val buffer = blob.cpu_data()
scala/libs/Preprocessor.scala: // The Preprocessor is provides a function for reading data from a dataframe row
scala/libs/Preprocessor.scala: // The convert method in DefaultPreprocessor is used to convert data extracted
scala/libs/Preprocessor.scala: // from a dataframe into an NDArray, which can then be passed into a net. The
scala/libs/Preprocessor.scala: schema(name).dataType match {
scala/libs/Preprocessor.scala: schema(name).dataType match {
scala/libs/Preprocessor.scala: schema(name).dataType match {
scala/libs/Preprocessor.scala: } else if (name == "data") {
scala/libs/Preprocessor.scala: throw new Exception("The name is not label or data, name = " + name + "\n")
scala/libs/NDArray.scala: def apply(data: Array[Float], shape: Array[Int]) = {
scala/libs/NDArray.scala: if (data.length != shape.product) {
scala/libs/NDArray.scala: throw new IllegalArgumentException("The data and shape arguments are not compatible, data.length = " + data.length.toString + " and shape = " + shape.deep + ".\n")
scala/libs/NDArray.scala: new NDArray(new JavaNDArray(data, shape: _*))
scala/libs/CaffeNet.scala: // Preallocate a buffer for data input into the net
scala/libs/CaffeNet.scala: // data
scala/libs/CaffeNet.scala: def forward(rowIt: Iterator[Row], dataBlobNames: List[String] = List[String]()): Map[String, NDArray] = {
scala/libs/CaffeNet.scala: for (name <- dataBlobNames) {
scala/libs/CaffeNet.scala: val data = new Array[Float](shape.product)
scala/libs/CaffeNet.scala: blob.cpu_data.get(data, 0, data.length)
scala/libs/CaffeNet.scala: weightList += NDArray(data, shape)
scala/libs/CaffeNet.scala: blob.mutable_cpu_data.put(flatWeights, 0, flatWeights.length)
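Reading those matches together, the float[] instances in the heap dump most likely come from the copy path that pulls a Caffe blob into a fresh JVM array every time weights are collected. A simplified sketch of that pattern follows (NDArray comes from the grep output above; the cpuData accessor is an assumption standing in for the JavaCPP FloatBlob API):

```scala
// Sketch of the copy-out pattern visible in JavaCPPUtils.scala / CaffeNet.scala:
// each call allocates a new float[] on the JVM heap and wraps it in an NDArray,
// which stays live for as long as the surrounding weight collection is referenced.
def copyBlobToNDArray(cpuData: Int => Float, shape: Array[Int]): NDArray = {
  val data = new Array[Float](shape.product)
  var i = 0
  while (i < data.length) {
    data(i) = cpuData(i) // e.g. floatBlob.cpu_data.get(i) in the real code
    i += 1
  }
  NDArray(data, shape)
}
```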

For buf, the only relevant code is in ScaleAndConverter:

val im = ImageIO.read(new ByteArrayInputStream(compressedImage))
val resizedImage = Thumbnails.of(im).forceSize(width, height).asBufferedImage()
Some(BufferedImageToByteArray(resizedImage))
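BufferedImageToByteArray itself is not shown here, but it is presumably something along these lines, which would explain why ByteArrayOutputStream's internal buf shows up in the heap dump (a hedged sketch, not the actual SparkNet implementation; the "jpg" format is an assumption):

```scala
import java.awt.image.BufferedImage
import java.io.ByteArrayOutputStream
import javax.imageio.ImageIO

// Sketch only: encode a BufferedImage into a byte[]. The ByteArrayOutputStream's
// internal buf grows geometrically and is retained until the stream is unreachable.
def bufferedImageToByteArray(image: BufferedImage, format: String = "jpg"): Array[Byte] = {
  val baos = new ByteArrayOutputStream()
  ImageIO.write(image, format, baos)
  baos.toByteArray
}
```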

Let us kill this bug so that we can benchmark ImageNet. :)

nhe150 commented 8 years ago

I found the issue. It is caused by broadcast variables accumulating.

Here is the solution (unpersist and destroy the broadcast variable; I am using Spark 1.6.0):

logger.log("setting weights on workers", i)
workers.foreach(_ => workerStore.get[CaffeSolver]("solver").trainNet.setWeights(broadcastWeights.value))
broadcastWeights.unpersist()
broadcastWeights.destroy()
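To make the fix concrete, here is a minimal standalone sketch of the pattern using only the public Spark 1.6 Broadcast API (the loop body and variable names are illustrative, not the exact SparkNet code):

```scala
import org.apache.spark.SparkContext

// Sketch: re-broadcast the weights every iteration and release the previous
// broadcast explicitly, so old copies do not pile up on the driver and executors.
def trainLoop(sc: SparkContext, iterations: Int): Unit = {
  var weights: Array[Float] = Array.fill(1000)(0.0f) // placeholder model weights
  for (i <- 0 until iterations) {
    val broadcastWeights = sc.broadcast(weights)
    // ... workers read broadcastWeights.value, train, and new weights are collected ...
    weights = broadcastWeights.value // placeholder for the collected/averaged weights
    broadcastWeights.unpersist()     // drop cached copies on the executors
    broadcastWeights.destroy()       // release all state associated with this broadcast
  }
}
```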

pcmoritz commented 8 years ago

That's really great! Would you be interested in submitting a PR that fixes it (otherwise we'll do it and reference this issue)?

nhe150 commented 8 years ago

Would you please do it this time? I will submit a PR for other enhancements later. Thanks.

pcmoritz commented 8 years ago

Ok, I created PR #125. Thanks again for finding and fixing the problem!

We are doing some more testing before merging it. Please also let us know about your experience on YARN (we are not running on YARN ourselves).