Closed DirkToewe closed 5 years ago
@DirkToewe That looks like a memory bug which is more likely to not be due to TF Scala, but there is a chance that multi-threading is causing it. Did you manage to run and collect more debugging information? Because I can't reproduce this easily and there is not enough information here to figure out what's going on.
All I managed to find out is that there is a memory leak and that it seems to happen during the creation of new Tensors. The following example blows up on my 32 GB machine after run 67:
val in_x  = tf.placeholder(FLOAT32, Shape(-1, -1))
val op_2x = 2 * in_x
val session = Session()
val rng = new Random(1337)
try {
  for (run <- 0 to 1024*1024) {
    println(f"Run$run%4d")
    val nRows = rng.nextInt(256*1024) + 1
    val nCols = 15
    val arr = Array.fill(nRows*nCols)(rng.nextFloat)
    val ten = (arr: Tensor) reshape Shape(-1, 15)
    val Seq(x2: Tensor) = session.run(in_x -> ten, Seq(op_2x))
    assert(x2.entriesIterator.zipWithIndex.forall { case (f, i) => f == 2*arr(i) })
  }
} finally { session.close() }
Tensors are not added to the computation graph, right? But even then, each tensor should only be around 16 MB...
Edit: Digging through the code, I found that Tensor creation via TensorConvertible is a little sub-optimal. If, for example, an Array is converted to a Tensor, each entry of the Array is first converted into its own Tensor, and only then are they stacked into a single Tensor. While this is fine on the JVM with its awesome garbage collection, escape analysis and what not, the problem is that each Tensor allocates some resources on the native heap. My suspicion is that this fragments the native heap. To test the hypothesis, I added the following TensorConvertible to the example above:
implicit val arr2ten: TensorConvertible[Array[Float]] = arr => {
  val buf = ByteBuffer.allocateDirect(FLOAT32.byteSize * arr.length)
  buf.order(ByteOrder.nativeOrder())
  arr.foreach(buf.putFloat)
  buf.position(0)
  Tensor.fromBuffer(FLOAT32, Shape(arr.length), buf.capacity(), buf)
}
And that seems to do the trick. So a possible fix would be to change TensorConvertible, e.g. to return a shape and an Iterator instead of a Tensor.
Edit2: @eaplatanios At second glance: there is no finalizer and there are no PhantomReferences to release the native resources, right? Then it's my bad, because I did not know that I needed to call close() on each Tensor manually.
@DirkToewe I do use a disposing thread in the background to dispose of unused resources. It's using phantom references. I wonder if the way in which these implicits interact interferes with the garbage collector somehow and prevents it from collecting the intermediate tensors that are created. I'll try to look into this over the weekend.
It could also be that the Disposer is crashing, but I won't be able to check until Monday.
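For readers unfamiliar with the pattern: a phantom-reference disposer registers one PhantomReference per native resource, and a background daemon thread polls the associated ReferenceQueue, freeing the native memory once the owning object becomes unreachable. A minimal self-contained sketch (the names Disposer, register and the free callback are illustrative, not TF Scala's actual API):

```scala
import java.lang.ref.{PhantomReference, ReferenceQueue}
import scala.collection.mutable

object Disposer {
  private val queue = new ReferenceQueue[AnyRef]

  // The references themselves must stay strongly reachable until dequeued,
  // otherwise the PhantomReference could be collected before it is enqueued.
  private val pending = mutable.Set.empty[Ref]

  private final class Ref(owner: AnyRef, val free: () => Unit)
      extends PhantomReference[AnyRef](owner, queue)

  def register(owner: AnyRef, free: () => Unit): Unit =
    pending.synchronized { pending += new Ref(owner, free) }

  private val thread = new Thread(() => {
    while (true) {
      val ref = queue.remove().asInstanceOf[Ref] // blocks until GC enqueues one
      pending.synchronized { pending -= ref }
      ref.free() // release the native resource
    }
  })
  thread.setDaemon(true)
  thread.start()
}
```

The key failure modes are exactly the ones discussed in this thread: if anything (e.g. an iterator) keeps a strong reference to the owner, the phantom reference is never enqueued and the native memory is never freed.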
@DirkToewe Sorry for the super late followup, but did you end up looking further into this?
Nothing much new to report, I'm afraid. The disposer thread is running all the time. With Scala 2.11 in a Jupyter notebook (jupyter-scala), I just discovered an issue that may be related:
Could it be that the JVM enqueues the reference before the native code is done with it?
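That hazard is real on the JVM: if the last use of an object is reading a handle field that native code then works on, the object can become unreachable (and its phantom reference enqueued) while the native call is still running. Since Java 9, java.lang.ref.Reference.reachabilityFence exists exactly for this. A hedged sketch (the nativeHandle field and nativeOp call are illustrative stand-ins for a JNI binding):

```scala
import java.lang.ref.Reference

final class NativeTensor {
  // Illustrative stand-in for a pointer into the native heap.
  val nativeHandle: Long = 42L

  def run(): Long =
    try {
      // Without a fence, `this` may be unreachable from here on, so a
      // disposer could free nativeHandle while nativeOp is still using it.
      nativeOp(nativeHandle)
    } finally {
      Reference.reachabilityFence(this) // keep `this` alive past the native call
    }

  private def nativeOp(handle: Long): Long = handle * 2 // placeholder for a JNI call
}
```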
Edit: Found the bug and issued a pull request: Tensor::entriesIterator did not release its native resources and held a reference to the Tensor that created it. The other issue in the Jupyter notebook above was my mistake: println( ( -floatTensor ).sqrt().summarize() ) is equivalent to println( ( -floatTensor ).sqrt.apply().summarize() ). Still, this is a nasty pitfall, and maybe a warning and/or preventing an empty slicing vararg would be helpful.
Thanks for catching that memory leak. Regarding the apply method issue, I think that's a common ambiguity in Scala and may be resolved in Dotty (I'm not 100% sure). I'll look into maybe throwing a more informative error.
The apply issue is, however, especially confusing because most operations like argmax, max, ... do require you to write empty parentheses. Maybe it would be better to define def sqrt as def sqrt(), just in case you ever have to add arguments to it, e.g. for it to output a complex result that would also be defined for negative values.
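The pitfall itself is plain Scala: when a parameterless method returns an object that defines a vararg apply, adding empty parentheses to the method call silently selects apply on its result instead of raising an error. A minimal self-contained sketch (Wrapped and its methods are hypothetical, mimicking a tensor with slicing):

```scala
final case class Wrapped(value: Double) {
  // Parameterless method, analogous to TF Scala's sqrt.
  def sqrt: Wrapped = Wrapped(math.sqrt(value))

  // Vararg apply, analogous to tensor slicing with indices.
  def apply(indices: Int*): Wrapped = this // an empty index list is accepted
}

object Pitfall extends App {
  val t = Wrapped(16.0)
  val a = t.sqrt   // calls sqrt directly
  val b = t.sqrt() // parses as t.sqrt.apply() -- an empty slice, NOT a compile error
  assert(a.value == 4.0)
  assert(b.value == 4.0) // same value here, but apply() was silently invoked
}
```

If apply required at least one index (or a dedicated slice type instead of a bare vararg), the stray empty parentheses would fail to compile instead of silently slicing nothing.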
This is likely an issue in the native libraries, but since it might affect others as well, I'm posting it here. It could also be a concurrency issue.
This bug can be reliably reproduced on 2 PCs:
After reading up on the corrupted size vs. prev_size error, I tried it with "0.2.4" classifier "linux-gpu-x86_64". I have not yet encountered this issue with "0.2.4" classifier "linux-cpu-x86_64". But the problem is that the CPU build is ~25x slower, so it may still fail after a day or so. In the meantime, I'm going to make a debug build and see if GDB can tell me more.