SciSharp / TensorFlow.NET

.NET Standard bindings for Google's TensorFlow for developing, training and deploying Machine Learning models in C# and F#.
https://scisharp.github.io/tensorflow-net-docs
Apache License 2.0
3.22k stars 517 forks source link

SEGV in TF_SessionRun #507

Open yvt opened 4 years ago

yvt commented 4 years ago

My app is crashing with SEGV during a call to TF_SessionRun. The occurrences are rather sporadic and unpredictable - At first it happened once an hour but today I barely managed to reproduce it.

Analysis on the problem suggested that it may be caused by TF_Tensor * passed to TF_SessionRun being deleted prematurely, probably during or just right before the call to TF_SessionRun.

Operating system: NixOS (unstable channel) libtensorflow version: A custom build (because of #505) based on the TensorFlow commit fd05051846fd9ceb090206600afd1a71ba852e20 TensorFlow.NET version: 0.14.2

Analysis

Case 1

This is the first occurrence recorded by me. TF_Tensor::tensor_ is never explicitly assigned nullptr by the code, so this result suggests the existence of a memory error.

Thread 1 "dotnet" received signal SIGSEGV, Segmentation fault.
0x00007ffebdd3bfdc in tensorflow::TensorInterface::ToTensor(tensorflow::Tensor*) const () from /nix/store/qm2v25z1r517ggyqk84z3ss5zxflfzyf-libtensorflow-1.15.2/lib/libtensorflow.so
(gdb) bt
#0  0x00007ffebdd3bfdc in tensorflow::TensorInterface::ToTensor(tensorflow::Tensor*) const () from /nix/store/qm2v25z1r517ggyqk84z3ss5zxflfzyf-libtensorflow-1.15.2/lib/libtensorflow.so
#1  0x00007ffebdd3cb14 in tensorflow::TF_TensorToTensor(TF_Tensor const*, tensorflow::Tensor*) () from /nix/store/qm2v25z1r517ggyqk84z3ss5zxflfzyf-libtensorflow-1.15.2/lib/libtensorflow.so
#2  0x00007ffeb72d1491 in TF_Run_Inputs(TF_Tensor* const*, std::vector<std::pair<std::string, tensorflow::Tensor>, std::allocator<std::pair<std::string, tensorflow::Tensor> > >*, TF_Status*) [clone .constprop.647] ()
   from /nix/store/qm2v25z1r517ggyqk84z3ss5zxflfzyf-libtensorflow-1.15.2/lib/libtensorflow.so
#3  0x00007ffeb72e1dbd in TF_SessionRun () from /nix/store/qm2v25z1r517ggyqk84z3ss5zxflfzyf-libtensorflow-1.15.2/lib/libtensorflow.so
#4  0x00007fff7f6535f9 in ?? ()
#5  0x00007fff5a3eeb08 in ?? ()
#6  0x0000000000000001 in ?? ()
#7  0x00007fff5a3eec50 in ?? ()
#8  0x0000000000000001 in ?? ()
#9  0x0000000000000000 in ?? ()

(gdb) disassemble
   0x00007ffebdd3bfbd <+13>:    mov    %rdi,%r13
   0x00007ffebdd3bfc0 <+16>:    push   %r12
   0x00007ffebdd3bfc2 <+18>:    push   %rbx
   0x00007ffebdd3bfc3 <+19>:    mov    %rdx,%r12
   0x00007ffebdd3bfc6 <+22>:    sub    $0x138,%rsp
   0x00007ffebdd3bfcd <+29>:    mov    %fs:0x28,%rax
   0x00007ffebdd3bfd6 <+38>:    mov    %rax,-0x38(%rbp)
   0x00007ffebdd3bfda <+42>:    xor    %eax,%eax
=> 0x00007ffebdd3bfdc <+44>:    movzbl 0x15(%rsi),%eax   // read this->tensor_.dtype()
   0x00007ffebdd3bfe0 <+48>:    cmp    $0x14,%eax
   0x00007ffebdd3bfe3 <+51>:    je     0x7ffebdd3c400 <_ZNK10tensorflow15TensorInterface8ToTensorEPNS_6TensorE+1104>
   0x00007ffebdd3bfe9 <+57>:    cmp    $0x7,%eax
   0x00007ffebdd3bfec <+60>:    je     0x7ffebdd3c030 <_ZNK10tensorflow15TensorInterface8ToTensorEPNS_6TensorE+128>
   0x00007ffebdd3bfee <+62>:    lea    0x8(%rsi),%rsi
   0x00007ffebdd3bff2 <+66>:    mov    %r12,%rdi
   0x00007ffebdd3bff5 <+69>:    mov    %rsi,%rdx
   0x00007ffebdd3bff8 <+72>:    callq  0x7ffeb6ff91c0 <_ZN10tensorflow6Tensor16CopyFromInternalERKS0_RKNS_11TensorShapeE@plt>
   0x00007ffebdd3bffd <+77>:    movq   $0x0,0x0(%r13)
   0x00007ffebdd3c005 <+85>:    mov    -0x38(%rbp),%rcx
   0x00007ffebdd3c009 <+89>:    xor    %fs:0x28,%rcx

(gdb) info registers
rax            0x0                 0
rbx            0x7fffffffa450      140737488331856
rcx            0x7fffffffa44f      140737488331855
rdx            0x35db478           56472696
rsi            0x0                 0
rdi            0x7fffffffa450      140737488331856

Case 2

I suspected that the problem might have been caused by TF_Tensor being deleted prematurely. In order to validate this hypothesis, I set a breakpoint at TF_DeleteTensor and configured it to display the value of (AbstractTensorInterface *)&*this->_tensor.

Thread 5 "dotnet" hit Breakpoint 1, 0x00007ffec2d51930 in TF_DeleteTensor () from /nix/store/qm2v25z1r517ggyqk84z3ss5zxflfzyf-libtensorflow-1.15.2/lib/libtensorflow.so
0x2f07880:      0x28fb2d0
2020-02-18 12:26:29.890280: F tensorflow/core/framework/tensor.cc:972] Check failed: buf_ null buf_ with non-zero shape size 22417320

Thread 1 "dotnet" received signal SIGABRT, Aborted.
[Switching to Thread 0x7ffff7a5f740 (LWP 31079)]
0x00007ffff7a97bb0 in raise () from /nix/store/y9zg6ryffgc5c9y67fcmfdkyyiivjzpj-glibc-2.27/lib/libc.so.6

(gdb) bt
#0  0x00007ffff7a97bb0 in raise () from /nix/store/y9zg6ryffgc5c9y67fcmfdkyyiivjzpj-glibc-2.27/lib/libc.so.6
#1  0x00007ffff7a98d91 in abort () from /nix/store/y9zg6ryffgc5c9y67fcmfdkyyiivjzpj-glibc-2.27/lib/libc.so.6
#2  0x00007ffebaf6cfd4 in tensorflow::internal::LogMessageFatal::~LogMessageFatal() () from /nix/store/qm2v25z1r517ggyqk84z3ss5zxflfzyf-libtensorflow-1.15.2/lib/libtensorflow_framework.so.2
#3  0x00007ffebaa9088e in tensorflow::Tensor::TotalBytes() const () from /nix/store/qm2v25z1r517ggyqk84z3ss5zxflfzyf-libtensorflow-1.15.2/lib/libtensorflow_framework.so.2
#4  0x00007ffebaa909db in tensorflow::Tensor::AllocatedBytes() const () from /nix/store/qm2v25z1r517ggyqk84z3ss5zxflfzyf-libtensorflow-1.15.2/lib/libtensorflow_framework.so.2
#5  0x00007ffec2cb1f03 in tensorflow::DirectSession::Run(tensorflow::RunOptions const&, std::vector<std::pair<std::string, tensorflow::Tensor>, std::allocator<std::pair<std::string, tensorflow::Tensor> > > const&, std::vector<std::string, std::allocator<std::string> > const&, std::vector<std::string, std::allocator<std::string> > const&, std::vector<tensorflow::Tensor, std::allocator<tensorflow::Tensor> >*, tensorflow::RunMetadata*, tensorflow::thread::ThreadPoolOptions const&) ()
   from /nix/store/qm2v25z1r517ggyqk84z3ss5zxflfzyf-libtensorflow-1.15.2/lib/libtensorflow.so
#6  0x00007ffec2c9e983 in tensorflow::DirectSession::Run(tensorflow::RunOptions const&, std::vector<std::pair<std::string, tensorflow::Tensor>, std::allocator<std::pair<std::string, tensorflow::Tensor> > > const&, std::vector<std::string, std::allocator<std::string> > const&, std::vector<std::string, std::allocator<std::string> > const&, std::vector<tensorflow::Tensor, std::allocator<tensorflow::Tensor> >*, tensorflow::RunMetadata*) () from /nix/store/qm2v25z1r517ggyqk84z3ss5zxflfzyf-libtensorflow-1.15.2/lib/libtensorflow.so
#7  0x00007ffebc2f77f7 in TF_Run_Helper(tensorflow::Session*, char const*, TF_Buffer const*, std::vector<std::pair<std::string, tensorflow::Tensor>, std::allocator<std::pair<std::string, tensorflow::Tensor> > > const&, std::vector<std::string, std::allocator<std::string> > const&, TF_Tensor**, std::vector<std::string, std::allocator<std::string> > const&, TF_Buffer*, TF_Status*) [clone .constprop.646] () from /nix/store/qm2v25z1r517ggyqk84z3ss5zxflfzyf-libtensorflow-1.15.2/lib/libtensorflow.so
#8  0x00007ffebc2f7fa9 in TF_SessionRun () from /nix/store/qm2v25z1r517ggyqk84z3ss5zxflfzyf-libtensorflow-1.15.2/lib/libtensorflow.so
#9  0x00007fff7f640d99 in ?? ()
#10 0x00007fff59eecda0 in ?? ()
#11 0x0000000000000001 in ?? ()
#12 0x00007fff59ef8640 in ?? ()
#13 0x0000000000000001 in ?? ()
#14 0x0000000000000000 in ?? ()

(gdb) f 4
(gdb) x/a $r14 <---- &((it + 1).first)
0x26b0968:      0x28fb2e8
(gdb) x/s 0x28fb2e8
0x28fb2e8:      "Input/Target:0"

Frame #9 is where the .NET runtime calls TF_SessionRun:

(gdb) f 9
(gdb) x/20i $pc-71
   0x7fff7f640d52:      mov    %r14,%rdi   <--- param[0]: session
   0x7fff7f640d55:      mov    %r15,%rsi   <---- param[1]: run_options
   0x7fff7f640d58:      mov    -0x98(%rbp),%rdx <--- param[2]: inputs
   0x7fff7f640d5f:      mov    -0xa0(%rbp),%rcx <--- param[3]: input_values
   0x7fff7f640d66:      mov    %rbx,%r8         <--- param[4]: ninputs
   0x7fff7f640d69:      mov    -0xb8(%rbp),%r9  <--- param[5]: outputs
   0x7fff7f640d70:      mov    $0x30,%r11d
   0x7fff7f640d76:      mov    -0xf0(%rbp),%r10
   0x7fff7f640d7d:      mov    %r10,-0x120(%rbp)
   0x7fff7f640d84:      lea    0xe(%rip),%r10        # 0x7fff7f640d99
   0x7fff7f640d8b:      mov    %r10,-0x108(%rbp)
   0x7fff7f640d92:      movb   $0x0,0xc(%r13)
   0x7fff7f640d97:      callq  *%rax    <----- TF_SessionRun
=> 0x7fff7f640d99:      movb   $0x1,0xc(%r13)
   0x7fff7f640d9e:      movabs $0x7ffff756ef60,%rdi
   0x7fff7f640da8:      cmpl   $0x0,(%rdi)
   0x7fff7f640dab:      je     0x7fff7f640db9
   0x7fff7f640dad:      movabs $0x7ffff7561188,%rsi
   0x7fff7f640db7:      callq  *(%rsi)
   0x7fff7f640db9:      nop

(gdb) x/a $rbp-0xa0
0x7fffffffa570: 0x7fff59eece08
(gdb) x/4a 0x7fff59eece08         (input_values)
0x7fff59eece08: 0x2f07880       0x2ac5e30
0x7fff59eece18: 0x2b21c60       0x0

Notice the pointer value 0x2f07880 (of type TF_Tensor *) here. As shown earlier, 0x2f07880 had already been deleted by TF_DeleteTensor. Thus, this result confirms my suspicion that Tensors are garbage-collected too early, causing SEGV and assertion failures in TF_SessionRun when accessing the deleted TF_Tensor.

Oceania2018 commented 4 years ago

Do you call TF_SessionRun directly from your code?

yvt commented 4 years ago

No, my code uses the method Session.run.

Oceania2018 commented 4 years ago

Is it possible to write a UnitTest to run again an again until it reproduce this issue.

Do you use new Tensor() to initialize a tensor? As I know, this operation may cause some memory issue.

BTW, how do you debug this kind of issue, what tools are yoI using?

@yvt Can we talk in https://gitter.im/sci-sharp/community ?

yvt commented 4 years ago

Is it possible to write a UnitTest to run again an again until it reproduce this issue.

Here you go: https://gist.github.com/yvt/2156547616e2a035ab6be196d6e1e6e3 This test program reliably (10 out of 10 test runs) crashes in 2.46 seconds on average on my machine. Most of time it crashes in a similar manner to this issue. In other cases it crashes for other reasons such as the following:

Fatal error. System.AccessViolationException: Attempted to read or write protected memory. This is often an indication that other memory is corrupt.
   at NumSharp.NDArray+<>c__DisplayClass312_2`1[[System.Single, System.Private.CoreLib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e]].<FetchIn
dices>b__3(Int32)
   at System.Threading.Tasks.Parallel+<>c__DisplayClass19_0`1[[System.__Canon, System.Private.CoreLib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7
798e]].<ForWorker>b__1(System.Threading.Tasks.RangeWorker ByRef, Int32, Boolean ByRef)
   at System.Threading.Tasks.TaskReplicator+Replica`1[[System.Threading.Tasks.RangeWorker, System.Threading.Tasks.Parallel, Version=4.0.4.0, Culture=neutral, Public
KeyToken=b03f5f7f11d50a3a]].ExecuteAction(Boolean ByRef)

Do you use new Tensor() to initialize a tensor? As I know, this operation may cause some memory issue.

No, it's not used anywhere in my code.

BTW, how do you debug this kind of issue, what tools are yoI using?

Just gdb. Visual Studio would be more useful since it can debug managed/unmanaged-mixed code (IIRC). I wish there was a better way to deal with tracing GC's indeterminism.