Closed DirkToewe closed 5 years ago
@DirkToewe That looks like a memory bug which is more likely to not be due to TF Scala, but there is a chance that multi-threading is causing it. Did you manage to run and collect more debugging information? Because I can't reproduce this easily and there is not enough information here to figure out what's going on.
All I managed to find out is that there is a memory leak and that it seems to happen during the creation of new Tensors. The following example blows up on my 32 GB machine after run 67:
val in_x  = tf.placeholder(FLOAT32, Shape(-1, -1))
val op_2x = 2 * in_x
val session = Session()
val rng = new Random(1337)
try {
  for (run <- 0 to 1024*1024) {
    println(f"Run$run%4d")
    val nRows = rng.nextInt(256*1024) + 1
    val nCols = 15
    val arr = Array.fill(nRows*nCols)(rng.nextFloat)
    val ten = (arr: Tensor) reshape Shape(-1, 15)
    val Seq(x2: Tensor) = session.run(in_x -> ten, Seq(op_2x))
    assert(x2.entriesIterator.zipWithIndex.forall { case (f, i) => f == 2*arr(i) })
  }
} finally { session.close() }
Tensors are not added to the computation graph, right? But even then, each tensor should only be around 16 MB...
Edit: Digging through the code, I found that Tensor creation via TensorConvertible is a little sub-optimal. If, for example, an Array is converted to a Tensor, each entry of the Array is first converted into its own Tensor, and only then are they stacked into a single Tensor. While this is fine on the JVM with its awesome garbage collection, escape analysis and what not, the problem is that each Tensor allocates some resources on the native heap. My suspicion is that this fragments the native heap. To test the hypothesis, I added the following TensorConvertible to the example above:
implicit val arr2ten: TensorConvertible[Array[Float]] = arr => {
  val buf = ByteBuffer.allocateDirect(FLOAT32.byteSize * arr.length)
  buf.order(ByteOrder.nativeOrder())
  arr.foreach(buf.putFloat)
  buf.position(0)
  Tensor.fromBuffer(FLOAT32, Shape(arr.length), buf.capacity(), buf)
}
And that seems to do the trick. So a possible fix would be to change TensorConvertible, e.g. to return a shape and an Iterator instead of a Tensor.
Edit2: @eaplatanios At second glance: there is no finalizer and there are no PhantomReferences to release the native resources, right? Then it's my bad, because I did not know that I needed to call close() on each Tensor manually.
@DirkToewe I do use a disposing thread in the background to dispose of unused resources. It's using phantom references. I wonder if the way in which these implicits interact interferes with the garbage collector somehow and prevents it from collecting the intermediate tensors that are created. I'll try to look into this over the weekend.
It could also be that the Disposer is crashing, but I won't be able to check until Monday.
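For readers unfamiliar with the pattern: a phantom-reference disposer registers one PhantomReference per native resource, and a background daemon thread polls the associated ReferenceQueue, freeing the native memory once the owning object becomes unreachable. A minimal self-contained sketch (the names Disposer, register and the free callback are illustrative, not TF Scala's actual API):

```scala
import java.lang.ref.{PhantomReference, ReferenceQueue}
import scala.collection.mutable

object Disposer {
  private val queue = new ReferenceQueue[AnyRef]

  // The references themselves must stay strongly reachable until dequeued,
  // otherwise the PhantomReference could be collected before it is enqueued.
  private val pending = mutable.Set.empty[Ref]

  private final class Ref(owner: AnyRef, val free: () => Unit)
      extends PhantomReference[AnyRef](owner, queue)

  def register(owner: AnyRef, free: () => Unit): Unit =
    pending.synchronized { pending += new Ref(owner, free) }

  private val thread = new Thread(() => {
    while (true) {
      val ref = queue.remove().asInstanceOf[Ref] // blocks until GC enqueues one
      pending.synchronized { pending -= ref }
      ref.free() // release the native resource
    }
  })
  thread.setDaemon(true)
  thread.start()
}
```

The key failure modes are exactly the ones discussed in this thread: if anything (e.g. an iterator) keeps a strong reference to the owner, the phantom reference is never enqueued and the native memory is never freed.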
@DirkToewe Sorry for the super late followup, but did you end up looking further into this?
Nothing much new to report, I'm afraid. The disposer thread is running all the time. With Scala 2.11 in a Jupyter notebook (jupyter-scala), I just discovered an issue that may be related:
Could it be that the JVM enqueues the reference before the native code is done with it?
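That hazard is real on the JVM: if the last use of an object is reading a handle field that native code then works on, the object can become unreachable (and its phantom reference enqueued) while the native call is still running. Since Java 9, java.lang.ref.Reference.reachabilityFence exists exactly for this. A hedged sketch (the nativeHandle field and nativeOp call are illustrative stand-ins for a JNI binding):

```scala
import java.lang.ref.Reference

final class NativeTensor {
  // Illustrative stand-in for a pointer into the native heap.
  val nativeHandle: Long = 42L

  def run(): Long =
    try {
      // Without a fence, `this` may be unreachable from here on, so a
      // disposer could free nativeHandle while nativeOp is still using it.
      nativeOp(nativeHandle)
    } finally {
      Reference.reachabilityFence(this) // keep `this` alive past the native call
    }

  private def nativeOp(handle: Long): Long = handle * 2 // placeholder for a JNI call
}
```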
Edit: Found the bug and issued a pull request: Tensor::entriesIterator did not release its native resources and held a reference to the Tensor that created it. The other issue in the Jupyter notebook above was my mistake: println( ( -floatTensor ).sqrt().summarize() ) is equivalent to println( ( -floatTensor ).sqrt.apply().summarize() ). Still, this is a nasty pitfall, and maybe a warning and/or preventing an empty slicing vararg would be helpful.
Thanks for catching that memory leak. Regarding the apply method issue, I think that's a common ambiguity in Scala and may be resolved in Dotty (I'm not 100% sure). I'll look into maybe throwing a more informative error.
The apply issue is, however, especially confusing because most operations like argmax, max, ... do require you to write empty parentheses. Maybe it would be better to define def sqrt as def sqrt(), just in case you ever have to add arguments to it, e.g. for it to output a complex result that would also be defined for negative values.
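The pitfall itself is plain Scala: when a parameterless method returns an object that defines a vararg apply, adding empty parentheses to the method call silently selects apply on its result instead of raising an error. A minimal self-contained sketch (Wrapped and its methods are hypothetical, mimicking a tensor with slicing):

```scala
final case class Wrapped(value: Double) {
  // Parameterless method, analogous to TF Scala's sqrt.
  def sqrt: Wrapped = Wrapped(math.sqrt(value))

  // Vararg apply, analogous to tensor slicing with indices.
  def apply(indices: Int*): Wrapped = this // an empty index list is accepted
}

object Pitfall extends App {
  val t = Wrapped(16.0)
  val a = t.sqrt   // calls sqrt directly
  val b = t.sqrt() // parses as t.sqrt.apply() -- an empty slice, NOT a compile error
  assert(a.value == 4.0)
  assert(b.value == 4.0) // same value here, but apply() was silently invoked
}
```

If apply required at least one index (or a dedicated slice type instead of a bare vararg), the stray empty parentheses would fail to compile instead of silently slicing nothing.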
This is likely an issue in the native libraries, but since it might affect others as well, I'm posting it here. It could also be a concurrency issue.
This bug can be reliably reproduced on 2 PCs:
After reading up on the corrupted size vs. prev_size error, I tried it with "0.2.4" classifier "linux-gpu-x86_64". I have not yet encountered this issue with "0.2.4" classifier "linux-cpu-x86_64". But the problem is that the CPU build is ~25x slower, so it may still fail after a day or so. In the meantime, I'm going to make a debug build and see if GDB can tell me more.