Open yvt opened 4 years ago
Do you call TF_SessionRun directly from your code?
No, my code uses the method Session.run.
Is it possible to write a unit test that runs again and again until it reproduces this issue?
Do you use new Tensor() to initialize a tensor? As far as I know, this operation may cause memory issues.
BTW, how do you debug this kind of issue, and what tools are you using?
@yvt Can we talk in https://gitter.im/sci-sharp/community ?
Is it possible to write a unit test that runs again and again until it reproduces this issue?
Here you go: https://gist.github.com/yvt/2156547616e2a035ab6be196d6e1e6e3 This test program reliably (10 out of 10 test runs) crashes in 2.46 seconds on average on my machine. Most of the time it crashes in a manner similar to this issue. In other cases it crashes for other reasons, such as the following:
Fatal error. System.AccessViolationException: Attempted to read or write protected memory. This is often an indication that other memory is corrupt.
at NumSharp.NDArray+<>c__DisplayClass312_2`1[[System.Single, System.Private.CoreLib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e]].<FetchIndices>b__3(Int32)
at System.Threading.Tasks.Parallel+<>c__DisplayClass19_0`1[[System.__Canon, System.Private.CoreLib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e]].<ForWorker>b__1(System.Threading.Tasks.RangeWorker ByRef, Int32, Boolean ByRef)
at System.Threading.Tasks.TaskReplicator+Replica`1[[System.Threading.Tasks.RangeWorker, System.Threading.Tasks.Parallel, Version=4.0.4.0, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a]].ExecuteAction(Boolean ByRef)
Do you use new Tensor() to initialize a tensor? As far as I know, this operation may cause memory issues.
No, it's not used anywhere in my code.
BTW, how do you debug this kind of issue, and what tools are you using?
Just gdb. Visual Studio would be more useful since it can debug mixed managed/unmanaged code (IIRC). I wish there were a better way to deal with tracing the GC's nondeterminism.
My app is crashing with SEGV during a call to TF_SessionRun. The occurrences are rather sporadic and unpredictable: at first it happened once an hour, but today I barely managed to reproduce it.
Analysis of the problem suggested that it may be caused by a TF_Tensor * passed to TF_SessionRun being deleted prematurely, probably during or just before the call to TF_SessionRun.
Operating system: NixOS (unstable channel)
libtensorflow version: A custom build (because of #505) based on the TensorFlow commit fd05051846fd9ceb090206600afd1a71ba852e20
TensorFlow.NET version: 0.14.2
Analysis
Case 1
This is the first occurrence recorded by me.
TF_Tensor::tensor_ is never explicitly assigned nullptr by the code, so this result suggests the existence of a memory error.
Case 2
I suspected that the problem might have been caused by TF_Tensor being deleted prematurely. To validate this hypothesis, I set a breakpoint at TF_DeleteTensor and configured it to display the value of (AbstractTensorInterface *)&*this->_tensor.
Frame #9 is where the .NET runtime calls TF_SessionRun:
Notice the pointer value 0x2f07880 (of type TF_Tensor *) here. As shown earlier, 0x2f07880 had already been deleted by TF_DeleteTensor. Thus, this result confirms my suspicion that Tensors are garbage-collected too early, causing SEGV and assertion failures in TF_SessionRun when accessing the deleted TF_Tensor.
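For readers unfamiliar with this class of bug, here is a minimal sketch of the general pattern and one common mitigation. It is not TensorFlow.NET's actual code: NativeBuffer, UseRawPointer, and RunSafely are hypothetical names, the finalizer stands in for TF_DeleteTensor, and UseRawPointer stands in for a native call such as TF_SessionRun. The assumption is that a managed wrapper frees its native memory in a finalizer, so the wrapper has to stay reachable until the native call that uses the raw pointer has returned.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical managed wrapper around native memory (not the real Tensor class).
public sealed class NativeBuffer
{
    public IntPtr Handle { get; private set; }

    public NativeBuffer(int size)
    {
        Handle = Marshal.AllocHGlobal(size);
    }

    // Plays the role of TF_DeleteTensor: releases the native memory
    // once the GC decides the wrapper is unreachable.
    ~NativeBuffer()
    {
        Marshal.FreeHGlobal(Handle);
        Handle = IntPtr.Zero;
    }
}

public static class Demo
{
    // Stand-in for a native call that dereferences the raw pointer.
    static int UseRawPointer(IntPtr p)
    {
        Marshal.WriteInt32(p, 42);
        return Marshal.ReadInt32(p);
    }

    public static int RunSafely()
    {
        var buffer = new NativeBuffer(sizeof(int));
        IntPtr raw = buffer.Handle;

        // Only 'raw' is used from here on. Without the KeepAlive call below,
        // the runtime is allowed to treat 'buffer' as dead at this point and
        // run its finalizer while UseRawPointer is still executing.
        int result = UseRawPointer(raw);

        // Extends the wrapper's lifetime to at least this point.
        GC.KeepAlive(buffer);
        return result;
    }

    public static void Main()
    {
        Console.WriteLine(RunSafely());
    }
}
```

GC.KeepAlive is the smallest change that expresses the lifetime requirement. Deriving the wrapper from SafeHandle and passing it to the P/Invoke directly is another option, since the runtime then keeps the handle alive for the duration of the call. Whether either approach matches the eventual fix in TensorFlow.NET is not established by this thread.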