eaplatanios / tensorflow_scala

TensorFlow API for the Scala Programming Language
http://platanios.org/tensorflow_scala/
Apache License 2.0
937 stars 95 forks source link

After fix issue #33, error during save model #34

Closed lucataglia closed 6 years ago

lucataglia commented 6 years ago

After the fix due to the issue #33 ...

...
val savePath = Paths.get("./model-store/my-model")
val saver = tf.saver()
saver.save(session, savePath)
...

...when I run the code above at the end of the LinearRegression example I get this error:

Exception in thread "main" org.platanios.tensorflow.jni.NotFoundException: ~/Desktop/tensorflowHome/ScalaTF/./model-store/checkpoint.tmp060d4452-38ab-4fe3-923f-40f3653ef996; No such file or directory
    at org.platanios.tensorflow.jni.FileIO$.deleteFile(Native Method)
    at org.platanios.tensorflow.api.io.FileIO$.deleteFile(FileIO.scala:251)
    at org.platanios.tensorflow.api.io.FileIO$.writeStringToFileAtomic(FileIO.scala:480)
    at org.platanios.tensorflow.api.ops.variables.Saver$.org$platanios$tensorflow$api$ops$variables$Saver$$updateCheckpointStateFile(Saver.scala:646)
    at org.platanios.tensorflow.api.ops.variables.Saver.save(Saver.scala:187)
    at linearRegression.LinearRegression$.main(LinearRegression.scala:61)
    at linearRegression.LinearRegression.main(LinearRegression.scala)

I also try on another, not before used for this project, MacSierra to check if was me that I make a mess with the environment and I got a complete different error just at the first line of the main of the LinearRegression example:

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGILL (0x4) at pc=0x000000011df1c1da, pid=54720, tid=0x0000000000001c03
#
# JRE version: Java(TM) SE Runtime Environment (8.0_144-b01) (build 1.8.0_144-b01)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.144-b01 mixed mode bsd-amd64 compressed oops)
# Problematic frame:
# C  [libtensorflow.so+0x181da]  TF_GraphNextOperation+0x1a
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# If you would like to submit a bug report, please visit:
#   http://bugreport.java.com/bugreport/crash.jsp
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
eaplatanios commented 6 years ago

@lucaRadicalbit Sorry that was introduced by some other changes I've been making but should be fixed now.

Regarding your Mac Sierra error, that's something else and I think it relates to one of the issues you posted earlier?

lucataglia commented 6 years ago

I don't know, also I got this error quite randomly, because now I do not get the error in the code where I save the model, but I get the error when I run the code that restore the model

val session = Session()

val metaGraphDef = MetaGraphDef.parseFrom(new BufferedInputStream(new FileInputStream(metaPath.toFile)))
val saver = tf.Saver.fromMetaGraphDef(metaGraphDef)
saver.restore(session, checkpointPath)

metaPath and checkpointPath are defined above in my code as Paths.get("...")

2017-10-26 23:59:44.644 [main] INFO  Examples / Linear Regression - Building linear regression model.
2017-10-26 23:59:44.972 [main] INFO  TensorFlow Native - Extracting the 'tensorflow_framework' native library to /var/folders/60/4vxzt5fs3md_nx8pq4gw6nph0000gn/T/tensorflow_scala_native_libraries7531236854401793745/libtensorflow_framework.so.
2017-10-26 23:59:45.094 [main] INFO  TensorFlow Native - Copied 11419400 bytes to /var/folders/60/4vxzt5fs3md_nx8pq4gw6nph0000gn/T/tensorflow_scala_native_libraries7531236854401793745/libtensorflow_framework.so.
2017-10-26 23:59:45.104 [main] INFO  TensorFlow Native - Extracting the 'tensorflow' native library to /var/folders/60/4vxzt5fs3md_nx8pq4gw6nph0000gn/T/tensorflow_scala_native_libraries7531236854401793745/libtensorflow.so.
2017-10-26 23:59:45.780 [main] INFO  TensorFlow Native - Copied 102600676 bytes to /var/folders/60/4vxzt5fs3md_nx8pq4gw6nph0000gn/T/tensorflow_scala_native_libraries7531236854401793745/libtensorflow.so.
2017-10-26 23:59:45.873 [main] INFO  TensorFlow Native - Extracting the 'tensorflow_jni' native library to /var/folders/60/4vxzt5fs3md_nx8pq4gw6nph0000gn/T/tensorflow_scala_native_libraries7531236854401793745/libtensorflow_jni.so.
2017-10-26 23:59:45.878 [main] INFO  TensorFlow Native - Copied 594544 bytes to /var/folders/60/4vxzt5fs3md_nx8pq4gw6nph0000gn/T/tensorflow_scala_native_libraries7531236854401793745/libtensorflow_jni.so.
2017-10-26 23:59:45.881 [main] INFO  TensorFlow Native - Extracting the 'tensorflow_ops' native library to /var/folders/60/4vxzt5fs3md_nx8pq4gw6nph0000gn/T/tensorflow_scala_native_libraries7531236854401793745/libtensorflow_ops.so.
2017-10-26 23:59:45.882 [main] INFO  TensorFlow Native - Copied 51580 bytes to /var/folders/60/4vxzt5fs3md_nx8pq4gw6nph0000gn/T/tensorflow_scala_native_libraries7531236854401793745/libtensorflow_ops.so.
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGILL (0x4) at pc=0x000000011e0621da, pid=2497, tid=0x0000000000001b03
#
# JRE version: Java(TM) SE Runtime Environment (8.0_144-b01) (build 1.8.0_144-b01)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.144-b01 mixed mode bsd-amd64 compressed oops)
# Problematic frame:
# C  [libtensorflow.so+0x181da]  TF_GraphNextOperation+0x1a
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# ~/Desktop/tensorflowHome/ScalaTF/hs_err_pid2497.log
#
# If you would like to submit a bug report, please visit:
#   http://bugreport.java.com/bugreport/crash.jsp
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
#

Process finished with exit code 134 (interrupted by signal 6: SIGABRT)
lucataglia commented 6 years ago

I try to comment all the code inside my main method and then remove the comment one line at the time and it seem that the line that generate this strange error is val saver = tf.Saver.fromMetaGraphDef(metaGraphDef). When I remove the comment to that line I get the fatal error. Did I correctly implement the procedure to retrieve a previous train TF model save by tf.saver() ?

val session = Session()

val metaGraphDef = MetaGraphDef.parseFrom(new BufferedInputStream(new FileInputStream(metaPath.toFile)))
val saver = tf.Saver.fromMetaGraphDef(metaGraphDef)
saver.restore(session, checkpointPath)
eaplatanios commented 6 years ago

@lucaRadicalbit I have to run right now but I'll respond as soon as I get home and sort this out. Do you get this error on multiple machines or just that one Mac Sierra machine?

lucataglia commented 6 years ago

@eaplatanios I got this error on two machine but boot Mac Sierra 10.12.6. The strange thing is that I ask my colleague that never work on TF or Scala API (so his Mac was completely clean) to run the same code (he has also a Mac Sierra) and he didn't get the error. Maybe is something related to the fact that I am working with the API a lot in the last month and maybe the libtensorflow.so and the API on my machine/project that I retrieve by libraryDependencies += "org.platanios" %% "tensorflow" % "0.1.0-SNAPSHOT" classifier "darwin-cpu-x86_64" got damaged in some strange way (I don't know a lot about, it's just a suggestion) ?

eaplatanios commented 6 years ago

@lucaRadicalbit Oh I see. No actually I think I know what's happening. I keep updating the published JAR files in the Sonatype repository. That's where SBT pulls that dependency from. In your case, I think that it has cached an old version and is not pulling the newer ones because the version number is the same. There is a couple of things you can try. One is to change the dependency to libraryDependencies += "org.platanios" %% "tensorflow" % "0.1.0-SNAPSHOT" classifier "darwin-cpu-x86_64" changing(). This should pull the dependency every time you run I think, but I'm not 100% of the expected behavior. Another is to delete the cached version from ~/.ivy2/cache or wherever else it may be stored on your machine. I hope this helps. :)

lucataglia commented 6 years ago

I even try to create a new user on my Mac and reinstall all the tool required to run the project but the Java native code error still remain. Can I ask you try add this line on you LinearRegression.scala:

val savePath = Paths.get("./model-store/my-model")
val saver = tf.saver()
saver.save(session, savePath)

and create a new Scala file:

object LinearRegression_restoreModel {
  def main(args: Array[String]): Unit = {
      val metaPath = Paths.get("./model-store/my-model.meta")
      val metaGraphDef = MetaGraphDef.parseFrom(new BufferedInputStream(new FileInputStream(metaPath.toFile)))
      tf.Saver.fromMetaGraphDef(metaGraphDef)
  }
}

And try to run both and tell me if you get the Java native code error ? I try to understand what does mean Process finished with exit code 134 (interrupted by signal 6: SIGABRT) and it seem that some C function like malloc() or free() does some invalid operation.

Here below I report the error I get:

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGILL (0x4) at pc=0x00000001207a61da, pid=5397, tid=0x0000000000000e03
#
# JRE version: Java(TM) SE Runtime Environment (8.0_144-b01) (build 1.8.0_144-b01)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.144-b01 mixed mode bsd-amd64 compressed oops)
# Problematic frame:
# C  [libtensorflow.so+0x181da]  TF_GraphNextOperation+0x1a
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /Users/databiz/Desktop/tensorflowHome/STF/hs_err_pid5397.log
#
# If you would like to submit a bug report, please visit:
#   http://bugreport.java.com/bugreport/crash.jsp
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
#

Process finished with exit code 134 (interrupted by signal 6: SIGABRT)
lucataglia commented 6 years ago

@eaplatanios I read this issue you opened some month ago: https://github.com/tensorflow/tensorflow/issues/9842

Is possible that some option you use to compile the tensor flow dynamic library that you gave us are problematic with my architecture ? In other words, is possible that is an hardware problem and the only way that I can use you Scala API is manually compile the tensor flow library by myself on my Mac ?

eaplatanios commented 6 years ago

@lucaRadicalbit Actually there is definitely a problem somewhere as I can also currently not load it correctly on a Linux server I'm currently testing on. I'm trying to resolve it right now so I'll update you on this with what I find.

eaplatanios commented 6 years ago

@lucaRadicalbit I think it's fixed now. Could you please check again?

eaplatanios commented 6 years ago

Did you clear the Ivy cache before rerunning so that sbt picks up the new jars?

On Nov 2, 2017, 7:24 PM -0400, Luca Tagliabue notifications@github.com, wrote:

I still get the same error on the same instruction TF_GraphNextOperation+0x1a 2017-11-03 00:22:28.748 [main] INFO TensorFlow Native - Extracting the 'tensorflow_framework' native library to /var/folders/60/4vxzt5fs3md_nx8pq4gw6nph0000gn/T/tensorflow_scala_native_libraries5993711855733396499/libtensorflow_framework.so. 2017-11-03 00:22:28.898 [main] INFO TensorFlow Native - Copied 11419400 bytes to /var/folders/60/4vxzt5fs3md_nx8pq4gw6nph0000gn/T/tensorflow_scala_native_libraries5993711855733396499/libtensorflow_framework.so. 2017-11-03 00:22:28.916 [main] INFO TensorFlow Native - Extracting the 'tensorflow' native library to /var/folders/60/4vxzt5fs3md_nx8pq4gw6nph0000gn/T/tensorflow_scala_native_libraries5993711855733396499/libtensorflow.so. 2017-11-03 00:22:29.580 [main] INFO TensorFlow Native - Copied 102600676 bytes to /var/folders/60/4vxzt5fs3md_nx8pq4gw6nph0000gn/T/tensorflow_scala_native_libraries5993711855733396499/libtensorflow.so. 2017-11-03 00:22:29.668 [main] INFO TensorFlow Native - Extracting the 'tensorflow_jni' native library to /var/folders/60/4vxzt5fs3md_nx8pq4gw6nph0000gn/T/tensorflow_scala_native_libraries5993711855733396499/libtensorflow_jni.so. 2017-11-03 00:22:29.673 [main] INFO TensorFlow Native - Copied 594544 bytes to /var/folders/60/4vxzt5fs3md_nx8pq4gw6nph0000gn/T/tensorflow_scala_native_libraries5993711855733396499/libtensorflow_jni.so. 2017-11-03 00:22:29.676 [main] INFO TensorFlow Native - Extracting the 'tensorflow_ops' native library to /var/folders/60/4vxzt5fs3md_nx8pq4gw6nph0000gn/T/tensorflow_scala_native_libraries5993711855733396499/libtensorflow_ops.so. 2017-11-03 00:22:29.677 [main] INFO TensorFlow Native - Copied 51580 bytes to /var/folders/60/4vxzt5fs3md_nx8pq4gw6nph0000gn/T/tensorflow_scala_native_libraries5993711855733396499/libtensorflow_ops.so. #

A fatal error has been detected by the Java Runtime Environment:

#

SIGILL (0x4) at pc=0x000000011cb1a1da, pid=1123, tid=0x0000000000001c03

#

JRE version: Java(TM) SE Runtime Environment (8.0_151-b12) (build 1.8.0_151-b12)

Java VM: Java HotSpot(TM) 64-Bit Server VM (25.151-b12 mixed mode bsd-amd64 compressed oops)

Problematic frame:

C [libtensorflow.so+0x181da] TF_GraphNextOperation+0x1a

#

Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again

#

An error report file with more information is saved as:

/Users/databiz/Desktop/tensorflowHome/STF/hs_err_pid1123.log

#

If you would like to submit a bug report, please visit:

http://bugreport.java.com/bugreport/crash.jsp

The crash happened outside the Java Virtual Machine in native code.

See problematic frame for where to report the bug.

#

Process finished with exit code 134 (interrupted by signal 6: SIGABRT) If you think I can do some trying in some way tell me so I can help you find where the problem is — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or mute the thread.

lucataglia commented 6 years ago

Now I try, I let you know

lucataglia commented 6 years ago

@eaplatanios It works !!!!!!!

2017-11-03 00:41:11.294 [main] INFO  TensorFlow Native - Extracting the 'tensorflow_framework' native library to /var/folders/60/4vxzt5fs3md_nx8pq4gw6nph0000gn/T/tensorflow_scala_native_libraries5665878545241215796/libtensorflow_framework.so.
2017-11-03 00:41:11.611 [main] INFO  TensorFlow Native - Copied 11295472 bytes to /var/folders/60/4vxzt5fs3md_nx8pq4gw6nph0000gn/T/tensorflow_scala_native_libraries5665878545241215796/libtensorflow_framework.so.
2017-11-03 00:41:11.624 [main] INFO  TensorFlow Native - Extracting the 'tensorflow' native library to /var/folders/60/4vxzt5fs3md_nx8pq4gw6nph0000gn/T/tensorflow_scala_native_libraries5665878545241215796/libtensorflow.so.
2017-11-03 00:41:12.455 [main] INFO  TensorFlow Native - Copied 109804152 bytes to /var/folders/60/4vxzt5fs3md_nx8pq4gw6nph0000gn/T/tensorflow_scala_native_libraries5665878545241215796/libtensorflow.so.
2017-11-03 00:41:12.558 [main] INFO  TensorFlow Native - Extracting the 'tensorflow_jni' native library to /var/folders/60/4vxzt5fs3md_nx8pq4gw6nph0000gn/T/tensorflow_scala_native_libraries5665878545241215796/libtensorflow_jni.so.
2017-11-03 00:41:12.574 [main] INFO  TensorFlow Native - Copied 594544 bytes to /var/folders/60/4vxzt5fs3md_nx8pq4gw6nph0000gn/T/tensorflow_scala_native_libraries5665878545241215796/libtensorflow_jni.so.
2017-11-03 00:41:12.577 [main] INFO  TensorFlow Native - Extracting the 'tensorflow_ops' native library to /var/folders/60/4vxzt5fs3md_nx8pq4gw6nph0000gn/T/tensorflow_scala_native_libraries5665878545241215796/libtensorflow_ops.so.
2017-11-03 00:41:12.577 [main] INFO  TensorFlow Native - Copied 51580 bytes to /var/folders/60/4vxzt5fs3md_nx8pq4gw6nph0000gn/T/tensorflow_scala_native_libraries5665878545241215796/libtensorflow_ops.so.
2017-11-03 00:41:12.598214: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2017-11-03 00:41:13.071 [main] INFO  Variables / Saver - Restoring parameters from './model-store/my-model'.

Process finished with exit code 0

Can I ask you which was the problem ?

eaplatanios commented 6 years ago

@lucaRadicalbit That's great to hear! :))

So, I found two issues and I'm not sure which one was affecting your case, but I fixed both:

  1. When loading the TensorFlow dynamic libraries in Scala, the libraries are loaded by default using dlopen with the RTLD_LOCAL set. Then, when I load my JNI bindings dynamic library, which depends on them, they get reloaded again by dlopen, because it's not aware that the Scala side loaded them earlier. That can cause issues. In order to load them only once across the whole system, they need to be loaded using the RTLD_GLOBAL flag, which is (surprisingly to me) not supported at all by the JDK. The only way around that is to create a separate library with JNI bindings to dlopen and load the libraries using that instead of System.load(...) that the JDK provides. Instead of doing that which is a bit messy, I ended up using the Java Native Access (JNA) project which provides something like that. We can avoid that dependency by implementing bindings to dlopen, but I thought it's not worth the effort.
  2. There was one case in my JNI bindings where I was using the JNI NewStringUTF method with a character array that was missing the null character at the end. This caused some unexpected crashes on Linux machines. It should also cause problems on Mac machines; I was simply lucky enough to not experience them so far.

I hope this helps, but in any case, I'm happy we got this resolved. :)