Closed: evanthomas closed this issue 6 years ago
@evanthomas I don't know what this `MLNumericArray` class is, but is there a way to get the contents of the array as a `ByteBuffer` directly?
@eaplatanios The problem is not getting it out of the `MLNumericArray` but getting it into the `Tensor`. As you can see from the code fragment, I pull it out a single value at a time; I can just as easily stuff it into a `ByteBuffer` rather than an N-dimensional array and pass the buffer to `Tensor`. I'll let you know how it goes.
@evanthomas The fastest way to get it into the `Tensor` is through a byte buffer with `Tensor.fromBuffer` (you can look at the MNIST data loader for an example). Also note that pulling the values out one by one is not the best choice if you can obtain the buffer directly.
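The `Tensor.fromBuffer` path the comment describes can be sketched roughly as follows. This is a sketch, not the MNIST loader itself; the helper name `tensorFromFloats` is made up here, and the call mirrors the `fromBuffer` usage shown later in this thread (check the signature against the tensorflow_scala version you are using):

```scala
import java.nio.{ByteBuffer, ByteOrder}
import org.platanios.tensorflow.api._

// Sketch: pack raw float data into one ByteBuffer and hand it to
// Tensor.fromBuffer in a single call, instead of setting elements one by one.
def tensorFromFloats(data: Array[Float], shape: Shape): Tensor = {
  val buffer = ByteBuffer
    .allocateDirect(data.length * 4)   // direct buffer: native code can read it in place
    .order(ByteOrder.nativeOrder())    // match the platform's native byte order
  buffer.asFloatBuffer().put(data)
  // Size argument follows the usage elsewhere in this thread; verify whether
  // your tensorflow_scala version expects an element count or a byte count.
  Tensor.fromBuffer(FLOAT32, shape, data.length.toLong, buffer)
}
```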
@eaplatanios `Tensor.fromBuffer` is much, much faster than arrays (arrays are essentially unusable). Here is a little test:
```scala
package co.mumbler.imageai.server.launch

import java.nio.{ByteBuffer, FloatBuffer}
import org.platanios.tensorflow.api._
import scala.util.Random

object TensorTest {
  def main(args: Array[String]): Unit = {
    val shape = Shape(50, 50, 50, 50)
    val array = Array.ofDim[Float](shape(0), shape(1), shape(2), shape(3))
    val floats = FloatBuffer.allocate(shape.numElements.toInt)
    val r = new Random()
    // Fill both the nested array and the flat float buffer with the same data.
    for {
      i1 <- 0 until shape(0)
      i2 <- 0 until shape(1)
      i3 <- 0 until shape(2)
      i4 <- 0 until shape(3)
    } {
      val x = r.nextFloat()
      array(i1)(i2)(i3)(i4) = x
      floats.put(x)
    }
    floats.rewind() // reset the position so the copy below sees the data just written
    val buffer = ByteBuffer.allocate(floats.capacity() * 4)
    buffer.asFloatBuffer().put(floats)
    time(buffer, shape)
    time(array)
  }

  private def time(array: Array[Array[Array[Array[Float]]]]): Unit = {
    val start = System.currentTimeMillis()
    val t = Tensor(array)
    println("array load: " + (System.currentTimeMillis() - start))
  }

  private def time(buffer: ByteBuffer, shape: Shape): Unit = {
    val l = shape.numElements
    val start = System.currentTimeMillis()
    val t = Tensor.fromBuffer(FLOAT32, shape, l, buffer)
    println("buffer load: " + (System.currentTimeMillis() - start))
  }
}
```
Here are the results (times in milliseconds):
buffer load: 51
array load: 27095
@evanthomas I'm glad to see you resolved your issue. This result makes sense for several reasons. One is that your array creation itself might be slow (you can actually profile that). You use a for-comprehension over the four indices; even though it looks like a for-loop, it is not equivalent to four nested tight loops, and the indices may be boxed and unboxed on every iteration, which might also be slow (I'm not sure about that, but profiling the code would tell you). In any case, the `fromBuffer` call is much more efficient: it involves a single JNI call with shared memory, rather than many JNI calls in which the array elements are copied one by one. :)
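The shared-memory point rests on a plain `java.nio` distinction, independent of TensorFlow: heap buffers are backed by a Java array that native code generally has to copy, while direct buffers are allocated outside the JVM heap and can typically be read by native code in place. A minimal illustration:

```scala
import java.nio.ByteBuffer

object BufferKinds {
  def main(args: Array[String]): Unit = {
    // Heap buffer: backed by a Java byte[]; JNI code usually copies it.
    val heap = ByteBuffer.allocate(1024)
    println(heap.isDirect)   // false

    // Direct buffer: lives outside the JVM heap, so native code can
    // share the memory without an extra copy.
    val direct = ByteBuffer.allocateDirect(1024)
    println(direct.isDirect) // true
  }
}
```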
The `MLNumericArray` does have direct access to the `ByteBuffer`, and the load into `Tensor`s is now blindingly fast, significantly faster than the numpy equivalent.
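The thread doesn't show the final code, but the load might look something like the sketch below. `getRealByteBuffer` is the accessor name from JMatIO-style MAT-file readers and is an assumption here; substitute whatever buffer accessor your library actually exposes:

```scala
import java.nio.ByteBuffer
import org.platanios.tensorflow.api._

// Hypothetical: MLNumericArray from a MAT-file reader (e.g. JMatIO).
// The import path and getRealByteBuffer accessor are assumptions.
import com.jmatio.types.MLNumericArray

def tensorFromMatArray(ml: MLNumericArray[java.lang.Double], shape: Shape): Tensor = {
  val buffer: ByteBuffer = ml.getRealByteBuffer // raw bytes backing the MAT array
  // Size argument follows the fromBuffer usage shown earlier in this thread.
  // Note: MATLAB stores data column-major, so a transpose may be needed after loading.
  Tensor.fromBuffer(FLOAT64, shape, shape.numElements, buffer)
}
```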
That's great to hear! :) I'm particularly happy about the numpy comparison. :) Things should also be faster than the Python equivalent when you feed tensors into TensorFlow sessions, as I've made the memory shared so that no copying is performed (except for string tensors, which require a copy). In Python, tensors are copied around more frequently.
I'm trying to load the MATLAB version of VGG19. I'm using a nice Java library to read the data out of the .mat file and into high-dimensional arrays in Scala land. This works well and takes a minute or two to load the data into memory. However, when I convert the arrays into `Tensor` objects, the code burns ~1300% CPU for tens of minutes before I kill it. Thread dumps (below) show a single active Java thread in JNI land.
How can I improve the performance of the `Tensor` creation?
My `Tensor` creation looks like:
Here are a couple of thread dumps:
Another one: