eaplatanios / tensorflow_scala

TensorFlow API for the Scala Programming Language
http://platanios.org/tensorflow_scala/
Apache License 2.0

SIGSEGV in UpdateEdge native code #59

Closed: carlo-veezoo closed this issue 6 years ago

carlo-veezoo commented 6 years ago

Disclaimer: I'm not able to reproduce this bug in a meaningful way, but I'll just add an issue anyways.

Since the commit "Updated the Python API C++ file.", I get a SIGSEGV in the native code while building my graph. This is the error report (hs_err_pidxxxxx.log):

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007f8724d65feb, pid=2371, tid=0x00007f8757bff700
#
# JRE version: OpenJDK Runtime Environment (8.0_151-b12) (build 1.8.0_151-8u151-b12-0ubuntu0.16.04.2-b12)
# Java VM: OpenJDK 64-Bit Server VM (25.151-b12 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# C  [libtensorflow_jni.so+0x68feb]  tensorflow::UpdateEdge(TF_Graph*, TF_Output, TF_Input, TF_Status*)+0x4b
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# If you would like to submit a bug report, please visit:
#   http://bugreport.java.com/bugreport/crash.jsp
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
#

---------------  T H R E A D  ---------------

Current thread (0x00007f876c13a000):  JavaThread "run-main-0" [_thread_in_native, id=2858, stack(0x00007f8757aff000,0x00007f8757c00000)]

siginfo: si_signo: 11 (SIGSEGV), si_code: 1 (SEGV_MAPERR), si_addr: 0x00000001748ce647

Registers:
RAX=0x0000000000000001, RBX=0x00007f86efb623b0, RCX=0x00007f86fee21078, RDX=0x000000002e919cb0
RSP=0x00007f8757bfa430, RBP=0x00007f86d0505240, RSI=0x00000000000000c7, RDI=0x00007f86efb623b0
R8 =0x0000000000000000, R9 =0x00007f86d0507460, R10=0x000000000000018a, R11=0x00007f8724d65fa0
R12=0x0000000000000000, R13=0x00007f86fee21110, R14=0x0000000000000001, R15=0x00007f86fee21078
RIP=0x00007f8724d65feb, EFLAGS=0x0000000000010246, CSGSFS=0xc715000000000033, ERR=0x0000000000000004
  TRAPNO=0x000000000000000e

Top of Stack: (sp=0x00007f8757bfa430)
0x00007f8757bfa430:   0000000000000008 00007f86fee21078
0x00007f8757bfa440:   00007f86d0505240 00007f86fee21078
0x00007f8757bfa450:   0000000000000001 c7155dc556bda100
0x00007f8757bfa460:   00007f86efb623b0 00007f86efb623b0
0x00007f8757bfa470:   00007f86d0505240 00007f86fee21078
0x00007f8757bfa480:   00007f876c13a1e0 00007f86fee21110
0x00007f8757bfa490:   0000000000000000 00007f8724d170c3
0x00007f8757bfa4a0:   00007f876c13a000 0000000000000001
0x00007f8757bfa4b0:   00007f8757bfa4f0 00007f8724ffb610
0x00007f8757bfa4c0:   00007f8757bfa580 0000000000000000
0x00007f8757bfa4d0:   00007f8724ffb5f8 00007f8757bfa5e0
0x00007f8757bfa4e0:   00007f876c13a000 00007f87e8017774
0x00007f8757bfa4f0:   00007f8700000001 00007f880e663e03
0x00007f8757bfa500:   00007f8724ffb610 0000000000000000
0x00007f8757bfa510:   00007f8724ffb5f8 00007f8757bfa5e0
0x00007f8757bfa520:   00007f8757bfa580 00007f87e80174f9
0x00007f8757bfa530:   fffffffe00000000 00007f87e80174c2
0x00007f8757bfa540:   00007f8757bfa540 00007f8724ffb5f8
0x00007f8757bfa550:   00007f8757bfa5e0 00007f8724ffc9e8
0x00007f8757bfa560:   0000000000000000 00007f8724ffb610
0x00007f8757bfa570:   0000000000000000 00007f8757bfa5a0
0x00007f8757bfa580:   00007f8757bfa628 00007f87e8007ffd
0x00007f8757bfa590:   0000000000000000 00007f87e8011278
0x00007f8757bfa5a0:   0000000000000001 00007f86fee21110
0x00007f8757bfa5b0:   000000062ee85428 0000000000000000
0x00007f8757bfa5c0:   00007f86fee21078 000000062ee69ba0
0x00007f8757bfa5d0:   00007f86efb623b0 000000062ee86478
0x00007f8757bfa5e0:   0000000623a61690 00007f8757bfa5e8
0x00007f8757bfa5f0:   00007f86f2a9b6ff 00007f8757bfa650
0x00007f8757bfa600:   00007f86f2a9d858 0000000000000000
0x00007f8757bfa610:   00007f86f2a9b750 00007f8757bfa5a0
0x00007f8757bfa620:   00007f8757bfa638 00007f8757bfa698 

Instructions: (pc=0x00007f8724d65feb)
0x00007f8724d65fcb:   44 24 28 31 c0 e8 eb 87 fa ff 4c 8b 8b f0 02 00
0x00007f8724d65fdb:   00 31 d2 4c 89 e8 48 8b b3 e8 02 00 00 49 f7 f1
0x00007f8724d65feb:   48 8b 04 d6 48 85 c0 74 28 48 8b 38 48 89 d1 4c
0x00007f8724d65ffb:   8b 47 08 4d 39 c5 74 2d 48 8b 3f 48 85 ff 74 11 

Register to memory mapping:

RAX=0x0000000000000001 is an unknown value
RBX=0x00007f86efb623b0 is an unknown value
RCX=0x00007f86fee21078 is an unknown value
RDX=0x000000002e919cb0 is an unknown value
RSP=0x00007f8757bfa430 is pointing into the stack for thread: 0x00007f876c13a000
RBP=0x00007f86d0505240 is an unknown value
RSI=0x00000000000000c7 is an unknown value
RDI=0x00007f86efb623b0 is an unknown value
R8 =0x0000000000000000 is an unknown value
R9 =0x00007f86d0507460 is an unknown value
R10=0x000000000000018a is an unknown value
R11=0x00007f8724d65fa0: _ZN10tensorflow10UpdateEdgeEP8TF_Graph9TF_Output8TF_InputP9TF_Status+0 in /tmp/tensorflow_scala_native_libraries641339624071488821/libtensorflow_jni.so at 0x00007f8724cfd000
R12=0x0000000000000000 is an unknown value
R13=0x00007f86fee21110 is an unknown value
R14=0x0000000000000001 is an unknown value
R15=0x00007f86fee21078 is an unknown value

Stack: [0x00007f8757aff000,0x00007f8757c00000],  sp=0x00007f8757bfa430,  free space=1005k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
C  [libtensorflow_jni.so+0x68feb]  tensorflow::UpdateEdge(TF_Graph*, TF_Output, TF_Input, TF_Status*)+0x4b

Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
j  org.platanios.tensorflow.jni.TensorFlow$.updateInput(JJIJI)V+0
j  org.platanios.tensorflow.api.ops.control_flow.ControlFlow$.$anonfun$updateInput$1(Lorg/platanios/tensorflow/api/ops/Op;ILorg/platanios/tensorflow/api/ops/Output;Lorg/platanios/tensorflow/api/core/Graph$Reference;)V+23
j  org.platanios.tensorflow.api.ops.control_flow.ControlFlow$.$anonfun$updateInput$1$adapted(Lorg/platanios/tensorflow/api/ops/Op;ILorg/platanios/tensorflow/api/ops/Output;Lorg/platanios/tensorflow/api/core/Graph$Reference;)Ljava/lang/Object;+4
j  org.platanios.tensorflow.api.ops.control_flow.ControlFlow$$$Lambda$9291.apply(Ljava/lang/Object;)Ljava/lang/Object;+16
J 56979 C1 org.platanios.tensorflow.api.utilities.package$.using(Lorg/platanios/tensorflow/api/utilities/package$Closeable;Lscala/Function1;)Ljava/lang/Object; (40 bytes) @ 0x00007f87f0aded34 [0x00007f87f0adec20+0x114]
j  org.platanios.tensorflow.api.ops.control_flow.ControlFlow$.updateInput(Lorg/platanios/tensorflow/api/ops/Op;ILorg/platanios/tensorflow/api/ops/Output;)V+18
j  org.platanios.tensorflow.api.ops.control_flow.Context.$anonfun$addInternal$3(Lorg/platanios/tensorflow/api/ops/control_flow/Context;Lorg/platanios/tensorflow/api/ops/Op;Lscala/Tuple2;)V+68
j  org.platanios.tensorflow.api.ops.control_flow.Context.$anonfun$addInternal$3$adapted(Lorg/platanios/tensorflow/api/ops/control_flow/Context;Lorg/platanios/tensorflow/api/ops/Op;Lscala/Tuple2;)Ljava/lang/Object;+3
j  org.platanios.tensorflow.api.ops.control_flow.Context$$Lambda$9282.apply(Ljava/lang/Object;)Ljava/lang/Object;+12
J 45297 C2 scala.collection.IndexedSeqOptimized.foreach(Lscala/Function1;)V (36 bytes) @ 0x00007f87ef81f574 [0x00007f87ef81f440+0x134]
J 56962 C1 scala.collection.mutable.ArrayOps$ofRef.foreach(Lscala/Function1;)V (6 bytes) @ 0x00007f87f0ac0abc [0x00007f87f0ac0a00+0xbc]
j  org.platanios.tensorflow.api.ops.control_flow.Context.addInternal(Lorg/platanios/tensorflow/api/ops/Op;)V+131
j  org.platanios.tensorflow.api.ops.control_flow.Context.add(Lorg/platanios/tensorflow/api/ops/Op;)V+2
j  org.platanios.tensorflow.api.ops.Op$Builder.$anonfun$build$9(Lorg/platanios/tensorflow/api/ops/Op;Lorg/platanios/tensorflow/api/ops/control_flow/Context;)V+2
j  org.platanios.tensorflow.api.ops.Op$Builder.$anonfun$build$9$adapted(Lorg/platanios/tensorflow/api/ops/Op;Lorg/platanios/tensorflow/api/ops/control_flow/Context;)Ljava/lang/Object;+2
j  org.platanios.tensorflow.api.ops.Op$Builder$$Lambda$9163.apply(Ljava/lang/Object;)Ljava/lang/Object;+8
J 51578 C1 scala.Option.foreach(Lscala/Function1;)V (19 bytes) @ 0x00007f87ee638054 [0x00007f87ee637fa0+0xb4]
j  org.platanios.tensorflow.api.ops.Op$Builder.$anonfun$build$1(Lorg/platanios/tensorflow/api/ops/Op$Builder;Lorg/platanios/tensorflow/api/core/Graph$Reference;)Lorg/platanios/tensorflow/api/ops/Op;+338
j  org.platanios.tensorflow.api.ops.Op$Builder$$Lambda$9147.apply(Ljava/lang/Object;)Ljava/lang/Object;+8
J 56979 C1 org.platanios.tensorflow.api.utilities.package$.using(Lorg/platanios/tensorflow/api/utilities/package$Closeable;Lscala/Function1;)Ljava/lang/Object; (40 bytes) @ 0x00007f87f0aded34 [0x00007f87f0adec20+0x114]
j  org.platanios.tensorflow.api.ops.Op$Builder.build()Lorg/platanios/tensorflow/api/ops/Op;+23
j  org.platanios.tensorflow.api.ops.Basic.expandDims(Lorg/platanios/tensorflow/api/ops/Output;Lorg/platanios/tensorflow/api/ops/Output;Ljava/lang/String;)Lorg/platanios/tensorflow/api/ops/Output;+25
j  org.platanios.tensorflow.api.ops.Basic.expandDims$(Lorg/platanios/tensorflow/api/ops/Basic;Lorg/platanios/tensorflow/api/ops/Output;Lorg/platanios/tensorflow/api/ops/Output;Ljava/lang/String;)Lorg/platanios/tensorflow/api/ops/Output;+4
j  org.platanios.tensorflow.api.package$tf$.expandDims(Lorg/platanios/tensorflow/api/ops/Output;Lorg/platanios/tensorflow/api/ops/Output;Ljava/lang/String;)Lorg/platanios/tensorflow/api/ops/Output;+4
j  com.veezoo.parser.ml.base.Attention$.$anonfun$addTimingSignal$1(FFLorg/platanios/tensorflow/api/ops/Output;Lorg/platanios/tensorflow/api/ops/Output;)Lorg/platanios/tensorflow/api/ops/Output;+263
Rest of the stack trace omitted... (contains a tf.cond!)

This error only occurs on an AWS instance, not locally, which gives me a lot of headaches. The tensorflow-scala version and the native libraries are the same in both environments; however, the libtensorflow_jni.so that gets built is slightly different.

The offending expandDims is inside a tf.cond; when I remove the conditional, the error disappears.

I took a look at the assembly code:

0000000000068fa0 <_ZN10tensorflow10UpdateEdgeEP8TF_Graph9TF_Output8TF_InputP9TF_Status>:
   68fa0:   41 57                   push   %r15
   68fa2:   41 56                   push   %r14
   68fa4:   49 89 cf                mov    %rcx,%r15
   68fa7:   41 55                   push   %r13
   68fa9:   41 54                   push   %r12
   68fab:   49 89 f5                mov    %rsi,%r13
   68fae:   55                      push   %rbp
   68faf:   53                      push   %rbx
   68fb0:   48 89 fb                mov    %rdi,%rbx
   68fb3:   49 89 d6                mov    %rdx,%r14
   68fb6:   4c 89 cd                mov    %r9,%rbp
   68fb9:   4d 89 c4                mov    %r8,%r12
   68fbc:   48 83 ec 38             sub    $0x38,%rsp
   68fc0:   64 48 8b 04 25 28 00    mov    %fs:0x28,%rax
   68fc7:   00 00 
   68fc9:   48 89 44 24 28          mov    %rax,0x28(%rsp)
   68fce:   31 c0                   xor    %eax,%eax
   68fd0:   e8 3b 8c fa ff          callq  11c10 <_ZN5nsync13nsync_mu_lockEPNS_11nsync_mu_s_E@plt>
   68fd5:   4c 8b 8b f0 02 00 00    mov    0x2f0(%rbx),%r9
   68fdc:   31 d2                   xor    %edx,%edx
   68fde:   4c 89 e8                mov    %r13,%rax
   68fe1:   48 8b b3 e8 02 00 00    mov    0x2e8(%rbx),%rsi
   68fe8:   49 f7 f1                div    %r9
   68feb:   48 8b 04 d6             mov    (%rsi,%rdx,8),%rax  # SIGSEGV here

Relevant register values:

RSI=0x00000000000000c7
RDX=0x000000002e919cb0

The fault seems to lie in RSI, which is the base address of the faulting load and holds 0xc7, clearly not a valid pointer. Like I said, this only occurs in some environments and only under a tf.cond. Also, the input of expandDims comes from outside the conditional, and is thus a switch. Sorry that I can't give you a way to reproduce this; I have no idea how to reproduce it myself...

If you know a way to solve the problem, that would be great. If not, I will just revert that commit locally. Thanks a lot!
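The register dump can be cross-checked against the disassembly with a bit of arithmetic. Below is a minimal sketch in Python that uses only values copied from the log above. The reading in the comments (that the fields at 0x2e8 and 0x2f0 were meant to be a bucket array and a bucket count, as in a hash table lookup) is an assumption on my part and is not confirmed anywhere in this thread.

```python
# Register values copied verbatim from the hs_err log.
R13 = 0x00007F86FEE21110      # dividend: moved into RAX right before `div %r9`
R9 = 0x00007F86D0507460       # divisor: loaded from 0x2f0(%rbx)
RSI = 0x00000000000000C7      # base address: loaded from 0x2e8(%rbx)
RAX = 0x0000000000000001      # quotient left behind by `div`
RDX = 0x000000002E919CB0      # remainder left behind by `div`
SI_ADDR = 0x00000001748CE647  # faulting address reported in siginfo

# `xor %edx,%edx; mov %r13,%rax; div %r9` computes R13 / R9:
# the quotient goes to RAX and the remainder to RDX.
quotient, remainder = divmod(R13, R9)

# `mov (%rsi,%rdx,8),%rax` reads 8 bytes from RSI + RDX * 8.
fault_address = RSI + RDX * 8

print(f"quotient  = {quotient:#x} (RAX = {RAX:#x})")
print(f"remainder = {remainder:#x} (RDX = {RDX:#x})")
print(f"address   = {fault_address:#x} (si_addr = {SI_ADDR:#x})")

# Assumption: if 0x2f0(%rbx) were really a bucket count, it would be a small
# integer. Here it looks like a heap pointer, and the "array base" is 0xc7.
print("divisor looks like a pointer:", R9 > 2**40)
```

All three values match the log, so the crash is fully explained by the two fields read from the graph object. Both look like the wrong data for their apparent role, which would fit the compiled code and the TensorFlow library disagreeing about the object's layout, although that remains a guess.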

mandar2812 commented 6 years ago

@csaladin94 I can confirm getting the exact same error when running the RNN PTB example on my machine.

eaplatanios commented 6 years ago

@mandar2812 @csaladin94 Which version of TensorFlow are you using? Are you using the pre-compiled binaries that I package with the library JARs? I think I forgot to update those, so I will do so ASAP. I'm currently on a trip for Thanksgiving, but I'll do it once I get home tomorrow evening.

carlo-veezoo commented 6 years ago

@eaplatanios I'm compiling the JNI and ops libraries myself. Also, I use a pre-compiled TensorFlow library from a nightly Jenkins build that I downloaded about a week ago. If I revert just the commit "Updated the Python API C++ file", the error disappears.

mandar2812 commented 6 years ago

@eaplatanios In my case it's the pre-compiled packaged binaries that you upload.

eaplatanios commented 6 years ago

@csaladin94 @mandar2812 I'm trying to reproduce this but I can't. Could you try re-running with the updated snapshot artifacts? It may be helpful to create a docker container that reproduces it so I can test.

carlo-veezoo commented 6 years ago

@eaplatanios Unfortunately, I can't give you a docker container. As I said, I'm not able to reproduce it reliably myself, so for the time being I'll just work with the commit in question reverted. With that, everything runs fine for me.

eaplatanios commented 6 years ago

@csaladin94 Sounds good. FYI, I tried running on a few different machines and the error doesn't seem to reappear. If @mandar2812 can create a reproducible example, I will look into it more. Otherwise, let's wait, and if it doesn't reappear, ignore it. It may be that some version of the main TensorFlow repo code was broken at some point and has since been fixed. Nightlies are like that sometimes.

eaplatanios commented 6 years ago

@csaladin94 @mandar2812 I managed to reproduce this on an Ubuntu machine and I'm working on it now. :)

eaplatanios commented 6 years ago

@csaladin94 @mandar2812 It turns out that this is due to a bug in TensorFlow, so I removed the change for now. I'll re-add the checks once it's fixed in the core library. I'll update the artifacts soon.