Closed carlo-veezoo closed 6 years ago
@csaladin94 I can confirm getting the exact same error when running the RNN PTB example on my machine.
@mandar2812 @csaladin94 Which version of TensorFlow are you using? Are you using the pre-compiled binaries that I package with the library JARs? I think I forgot to update those, so I will do so ASAP. I'm currently on a trip for Thanksgiving, but I'll do it once I get home tomorrow evening.
@eaplatanios I'm compiling the JNI and ops libraries myself. Also, I use a pre-compiled TensorFlow library from a nightly Jenkins build that I downloaded about a week ago. If I revert just the commit "Updated the Python API C++ file", the error disappears.
@eaplatanios In my case it's the pre-compiled packaged binaries that you upload.
@csaladin94 @mandar2812 I'm trying to reproduce this but I can't. Could you try re-running with the updated snapshot artifacts? It may be helpful to create a docker container that reproduces it so I can test.
@eaplatanios Unfortunately, I can't give you a docker container from my side. As I said, I'm not able to reproduce it reliably myself, so for the time being I'll just work with the commit in question reverted; with that, everything runs fine for me.
@csaladin94 Sounds good. FYI, I tried running on a few different machines and it doesn't seem to reappear. If @mandar2812 can create a reproducible example, I will look into it more. Otherwise, let's wait and, if it doesn't reappear, ignore it. It may be that something was broken in some version of the main TensorFlow repo and has since been fixed. Nightlies are like that sometimes.
@csaladin94 @mandar2812 I managed to reproduce this on an Ubuntu machine and I'm working on it now. :)
@csaladin94 @mandar2812 It turns out that's due to a bug in TensorFlow, so I removed the change for now. I'll re-add the checks once it's fixed in the core library. I'll update the artifacts soon.
Disclaimer: I'm not able to reproduce this bug in a meaningful way, but I'll open an issue anyway.
Since the commit "Updated the Python API C++ file.", I get the following error: when building my graph, I get a SIGSEGV in the native code (hs_err_pidxxxxx.log):
This error only occurs on an AWS instance, not when running locally, which gives me a lot of headaches. The tensorflow-scala version and the native libraries are the same; however, the libtensorflow_jni.so that gets built is slightly different.
The offending expandDims is found in a tf.cond; when I remove the conditional, the error disappears. I took a look at the assembly code:
Relevant register values:
The fault seems to lie in RSI, which points to a protected location (it being the base pointer). Like I said, this only occurs in some environments and only under a tf.cond. Also, the input of expandDims comes from outside the conditional, and is thus a switch.

Sorry that I can't give you a way to reproduce it; I have no idea how to reproduce it myself. If you know a way to solve the problem, that would be great; if not, I will just revert that commit locally. Thanks a lot!
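
For readers unfamiliar with why an input from outside the conditional "is a switch": in graph-mode TensorFlow, a cond routes each outside tensor through a Switch op, and the branch results are joined by a Merge op. The following is a toy, plain-Python sketch of that routing only. It is not TensorFlow or tensorflow-scala code, it does not reproduce the crash, and the names `DEAD`, `switch`, `merge`, `expand_dims`, and `cond` are hypothetical stand-ins chosen for illustration.

```python
# Toy model of Switch/Merge routing inside a conditional.
# Assumption: this only mimics the dataflow; it is not the real implementation.

DEAD = object()  # sentinel marking the output of the branch that is not taken


def switch(data, pred):
    """Return (output_false, output_true); only one of them carries the data."""
    return (DEAD, data) if pred else (data, DEAD)


def merge(*inputs):
    """Forward the first input that is not dead."""
    for value in inputs:
        if value is not DEAD:
            return value
    raise ValueError("all inputs are dead")


def expand_dims(value):
    """Stand-in for expandDims on nested lists (axis 0 only)."""
    if value is DEAD:
        return DEAD  # an op fed a dead input is skipped and stays dead
    return [value]


def cond(pred, outside_value):
    """True branch expands dims of the outside value; false branch passes it through."""
    out_false, out_true = switch(outside_value, pred)
    true_result = expand_dims(out_true)  # input arrives via the switch
    false_result = out_false
    return merge(true_result, false_result)
```

In this sketch, `cond(True, [1, 2])` wraps the value in an extra dimension, while `cond(False, [1, 2])` returns it unchanged; in both cases the op inside the branch never sees the outside value directly, only a switch output.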