Closed: evanthomas closed this issue 6 years ago
@evanthomas I don't know what this `MLNumericArray` class is, but is there a way to get the contents of the array as a `ByteBuffer` directly?
@eaplatanios The problem is not getting it out of the `MLNumericArray` but getting it into the `Tensor`. As you can see from the code fragment, I pull it out a single value at a time; I can just as easily stuff it into a `ByteBuffer` rather than an N-dimensional array and pass the buffer to `Tensor`. I'll let you know how it goes.
@evanthomas The fastest way to get it into the `Tensor` is through a byte buffer with `Tensor.fromBuffer` (you can look at the MNIST data loader for an example). Also note that pulling the values out one by one is not the best choice if you can obtain the buffer directly.
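The `Tensor.fromBuffer` path the comment describes can be sketched roughly as follows. This is a sketch, not the MNIST loader itself; the helper name `tensorFromFloats` is made up here, and the call mirrors the `fromBuffer` usage shown later in this thread (check the signature against the tensorflow_scala version you are using):

```scala
import java.nio.{ByteBuffer, ByteOrder}
import org.platanios.tensorflow.api._

// Sketch: pack raw float data into one ByteBuffer and hand it to
// Tensor.fromBuffer in a single call, instead of setting elements one by one.
def tensorFromFloats(data: Array[Float], shape: Shape): Tensor = {
  val buffer = ByteBuffer
    .allocateDirect(data.length * 4)   // direct buffer: native code can read it in place
    .order(ByteOrder.nativeOrder())    // match the platform's native byte order
  buffer.asFloatBuffer().put(data)
  // Size argument follows the usage elsewhere in this thread; verify whether
  // your tensorflow_scala version expects an element count or a byte count.
  Tensor.fromBuffer(FLOAT32, shape, data.length.toLong, buffer)
}
```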
@eaplatanios `Tensor.fromBuffer` is much, much faster than arrays (arrays are essentially unusable). Here is a little test:
```scala
package co.mumbler.imageai.server.launch

import java.nio.{ByteBuffer, FloatBuffer}
import org.platanios.tensorflow.api._
import scala.util.Random

object TensorTest {
  def main(args: Array[String]): Unit = {
    val shape = Shape(50, 50, 50, 50)
    val array = Array.ofDim[Float](shape(0), shape(1), shape(2), shape(3))
    val floats = FloatBuffer.allocate(shape.numElements.toInt)
    val r = new Random()
    // Fill both the nested array and the flat float buffer with the same data.
    for {
      i1 <- 0 until shape(0)
      i2 <- 0 until shape(1)
      i3 <- 0 until shape(2)
      i4 <- 0 until shape(3)
    } {
      val x = r.nextFloat()
      array(i1)(i2)(i3)(i4) = x
      floats.put(x)
    }
    floats.rewind() // reset the position so the copy below sees the data just written
    val buffer = ByteBuffer.allocate(floats.capacity() * 4)
    buffer.asFloatBuffer().put(floats)
    time(buffer, shape)
    time(array)
  }

  private def time(array: Array[Array[Array[Array[Float]]]]): Unit = {
    val start = System.currentTimeMillis()
    val t = Tensor(array)
    println("array load: " + (System.currentTimeMillis() - start))
  }

  private def time(buffer: ByteBuffer, shape: Shape): Unit = {
    val l = shape.numElements
    val start = System.currentTimeMillis()
    val t = Tensor.fromBuffer(FLOAT32, shape, l, buffer)
    println("buffer load: " + (System.currentTimeMillis() - start))
  }
}
```
Here are the results (times in milliseconds):
buffer load: 51
array load: 27095
@evanthomas I'm glad to see you resolved your issue. This result makes sense for several reasons. One is that your array creation itself might be slow (you can actually profile that). You use a for-comprehension over the four indices; even though it looks like a for-loop, it is not equivalent to four nested tight loops, and the indices may be boxed and unboxed on every iteration, which might also be slow (I'm not sure about that, but profiling the code would tell you). In any case, the `fromBuffer` call is much more efficient: it involves a single JNI call with shared memory, rather than many JNI calls in which the array elements are copied one by one. :)
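The shared-memory point rests on a plain `java.nio` distinction, independent of TensorFlow: heap buffers are backed by a Java array that native code generally has to copy, while direct buffers are allocated outside the JVM heap and can typically be read by native code in place. A minimal illustration:

```scala
import java.nio.ByteBuffer

object BufferKinds {
  def main(args: Array[String]): Unit = {
    // Heap buffer: backed by a Java byte[]; JNI code usually copies it.
    val heap = ByteBuffer.allocate(1024)
    println(heap.isDirect)   // false

    // Direct buffer: lives outside the JVM heap, so native code can
    // share the memory without an extra copy.
    val direct = ByteBuffer.allocateDirect(1024)
    println(direct.isDirect) // true
  }
}
```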
The `MLNumericArray` does have direct access to the `ByteBuffer`, and the load into `Tensor`s is now blindingly fast, significantly faster than the numpy equivalent.
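The thread doesn't show the final code, but the load might look something like the sketch below. `getRealByteBuffer` is the accessor name from JMatIO-style MAT-file readers and is an assumption here; substitute whatever buffer accessor your library actually exposes:

```scala
import java.nio.ByteBuffer
import org.platanios.tensorflow.api._

// Hypothetical: MLNumericArray from a MAT-file reader (e.g. JMatIO).
// The import path and getRealByteBuffer accessor are assumptions.
import com.jmatio.types.MLNumericArray

def tensorFromMatArray(ml: MLNumericArray[java.lang.Double], shape: Shape): Tensor = {
  val buffer: ByteBuffer = ml.getRealByteBuffer // raw bytes backing the MAT array
  // Size argument follows the fromBuffer usage shown earlier in this thread.
  // Note: MATLAB stores data column-major, so a transpose may be needed after loading.
  Tensor.fromBuffer(FLOAT64, shape, shape.numElements, buffer)
}
```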
That's great to hear! :) I'm particularly happy about the numpy comparison. :) Things should also be faster than the Python equivalent when you feed tensors into TensorFlow sessions, as I've made the memory shared so that no copying is performed (except for string tensors, which require a copy). In Python, tensors are copied around more frequently.
I'm trying to load the MATLAB version of VGG19. I'm using a nice Java library to read the data out of the .mat file and into high-dimensional arrays in Scala land. This works well and takes a minute or two to load the data into memory. However, when I convert the arrays into `Tensor` objects, the code burns ~1300% CPU for tens of minutes before I kill it. Thread dumps (below) show a single active Java thread in JNI land.
How can I improve the performance of the `Tensor` creation?
My `Tensor` creation looks like:
Here are a couple of thread dumps:
Another one: