eaplatanios / tensorflow_scala

TensorFlow API for the Scala Programming Language
http://platanios.org/tensorflow_scala/
Apache License 2.0

Error when loading TF-Scala on GPU #64

Closed mandar2812 closed 6 years ago

mandar2812 commented 6 years ago

I am trying to run the CIFAR example after loading the latest snapshot. I upgraded CUDA to v9.0 and installed libcudnn7-cuda9.0, and I get the following error.

Compiling /home/dynaml/code/DynaML/scripts/cifar.sc
2017-12-14 16:30:09.620 [main] INFO  CIFAR Data Loader - Extracting data from file '/users/ao/mandar/tmp/cifar-10-binary.tar.gz'.
2017-12-14 16:30:10.210 [main] INFO  TensorFlow Native - Extracting the 'tensorflow_framework' native library to /tmp/tensorflow_scala_native_libraries3743767381021945609/libtensorflow_framework.so.
2017-12-14 16:30:10.345 [main] INFO  TensorFlow Native - Copied 16650560 bytes to /tmp/tensorflow_scala_native_libraries3743767381021945609/libtensorflow_framework.so.
2017-12-14 16:30:10.346 [main] INFO  TensorFlow Native - Extracting the 'tensorflow' native library to /tmp/tensorflow_scala_native_libraries3743767381021945609/libtensorflow.so.
2017-12-14 16:30:11.213 [main] INFO  TensorFlow Native - Copied 128195352 bytes to /tmp/tensorflow_scala_native_libraries3743767381021945609/libtensorflow.so.
2017-12-14 16:30:11.215 [main] INFO  TensorFlow Native - Extracting the 'tensorflow_jni' native library to /tmp/tensorflow_scala_native_libraries3743767381021945609/libtensorflow_jni.so.
2017-12-14 16:30:11.221 [main] INFO  TensorFlow Native - Copied 678224 bytes to /tmp/tensorflow_scala_native_libraries3743767381021945609/libtensorflow_jni.so.
2017-12-14 16:30:11.301 [main] INFO  TensorFlow Native - Extracting the 'tensorflow_ops' native library to /tmp/tensorflow_scala_native_libraries3743767381021945609/libtensorflow_ops.so.
2017-12-14 16:30:11.302 [main] INFO  TensorFlow Native - Copied 74824 bytes to /tmp/tensorflow_scala_native_libraries3743767381021945609/libtensorflow_ops.so.
org.platanios.tensorflow.jni.NotFoundException: /tmp/tensorflow_scala_native_libraries3743767381021945609/libtensorflow_ops.so: undefined symbol: _ZN10tensorflow7strings8internal9CatPiecesESt16initializer_listINS_11StringPieceEE
  org.platanios.tensorflow.jni.TensorFlow$.loadOpLibrary(Native Method)
  org.platanios.tensorflow.jni.TensorFlow$$anonfun$load$4.apply(TensorFlow.scala:107)
  org.platanios.tensorflow.jni.TensorFlow$$anonfun$load$4.apply(TensorFlow.scala:107)
  scala.Option.foreach(Option.scala:257)
  org.platanios.tensorflow.jni.TensorFlow$.load(TensorFlow.scala:107)
  org.platanios.tensorflow.jni.TensorFlow$.<init>(TensorFlow.scala:155)
  org.platanios.tensorflow.jni.TensorFlow$.<clinit>(TensorFlow.scala)
  org.platanios.tensorflow.jni.Tensor$.<init>(Tensor.scala:24)
  org.platanios.tensorflow.jni.Tensor$.<clinit>(Tensor.scala)
  org.platanios.tensorflow.api.tensors.Context$.apply(Context.scala:50)
  org.platanios.tensorflow.api.package$.<init>(package.scala:89)
  org.platanios.tensorflow.api.package$.<clinit>(package.scala)
  org.platanios.tensorflow.data.image.CIFARLoader$.readImagesAndLabels(CIFARLoader.scala:119)
  org.platanios.tensorflow.data.image.CIFARLoader$.extractFiles(CIFARLoader.scala:86)
  org.platanios.tensorflow.data.image.CIFARLoader$.load(CIFARLoader.scala:73)
  ammonite.$file.$up.DynaML.scripts.cifar$.<init>(cifar.sc:12)
  ammonite.$file.$up.DynaML.scripts.cifar$.<clinit>(cifar.sc)

DynaML>

mandar2812 commented 6 years ago

I don't understand what this weird symbol _ZN10tensorflow7strings8internal9CatPiecesESt16initializer_listINS_11StringPieceEE is.

eaplatanios commented 6 years ago

@mandar2812 So that symbol should be in the native TF lib. Could you try running the CPU-only version and check if it works?
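
For context, the name follows the Itanium C++ ABI mangling scheme that GCC and Clang use on Linux; a demangler such as `c++filt` would print it as `tensorflow::strings::internal::CatPieces(std::initializer_list<tensorflow::StringPiece>)`. The following is only a minimal Scala sketch that decodes the namespace-qualified function name (the `_ZN...E` prefix). `NestedNameSketch` and `qualifiedName` are hypothetical names, not part of TF-Scala, and the parameter types are deliberately not decoded.

```scala
// Minimal sketch: decode only the nested-name prefix of an Itanium C++ ABI
// mangled symbol, i.e. "_ZN" + (<length><identifier>)+ + "E".
// Hypothetical helper for illustration; it does NOT decode parameter types,
// templates, substitutions, or ABI tags.
object NestedNameSketch {
  def qualifiedName(mangled: String): Option[String] = {
    if (!mangled.startsWith("_ZN")) return None
    val parts = scala.collection.mutable.ArrayBuffer.empty[String]
    var i = 3
    while (i < mangled.length && mangled(i).isDigit) {
      // Read the decimal length prefix, e.g. "10" in "10tensorflow".
      var j = i
      while (j < mangled.length && mangled(j).isDigit) j += 1
      val len = mangled.substring(i, j).toInt
      if (j + len > mangled.length) return None
      // Take exactly `len` characters as the next name component.
      parts += mangled.substring(j, j + len)
      i = j + len
    }
    // A nested name has to be closed by 'E'; anything else is unsupported here.
    if (parts.nonEmpty && i < mangled.length && mangled(i) == 'E')
      Some(parts.mkString("::"))
    else None
  }
}
```

Calling `NestedNameSketch.qualifiedName` on the symbol from the error message returns `Some("tensorflow::strings::internal::CatPieces")`, which shows that the loader is looking for a function from TensorFlow's internal string utilities rather than anything CUDA-specific.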

mandar2812 commented 6 years ago

@eaplatanios Yes, the CPU version works just fine on my laptop.

mandar2812 commented 6 years ago

In fact, on the GPU machine, after rebuilding with gpuFlag = false (in effect, CPU-only mode), the examples run just fine.

mandar2812 commented 6 years ago

@eaplatanios Is the latest snapshot working fine on your GPU machine? I think it might be a problem with my setup: I had to uninstall CUDA 8 and install CUDA 9, so there might be some stale references or whatnot. On the other hand, the code does not complain about finding libcublas.so.9.0, so I'm not confident that this is solely a problem with my setup. In any case, let me know what you think :)

eaplatanios commented 6 years ago

@mandar2812 It looks like it can't load a TensorFlow symbol so let me look into it. I think it may be an issue with my cross-compilation setup for the GPU version.

eaplatanios commented 6 years ago

@mandar2812 If you rebuild the native TensorFlow library locally with GPU support and add it to your LD_LIBRARY_PATH, does it work fine?

mandar2812 commented 6 years ago

@eaplatanios I compiled TensorFlow with GPU support from source, but TF-Scala still gives me the same error when I try to run CIFAR on GPU.

DynaML>import org.platanios.tensorflow.api._ 
import org.platanios.tensorflow.api._

DynaML>val tensor = Tensor.zeros(INT32, Shape(2, 5)) 
2017-12-17 21:31:29.693 [main] INFO  TensorFlow Native - Extracting the 'tensorflow_jni' native library to /tmp/tensorflow_scala_native_libraries3692308603386062965/libtensorflow_jni.so.
2017-12-17 21:31:29.706 [main] INFO  TensorFlow Native - Copied 673416 bytes to /tmp/tensorflow_scala_native_libraries3692308603386062965/libtensorflow_jni.so.
2017-12-17 21:31:29.788 [main] INFO  TensorFlow Native - Extracting the 'tensorflow_ops' native library to /tmp/tensorflow_scala_native_libraries3692308603386062965/libtensorflow_ops.so.
2017-12-17 21:31:29.789 [main] INFO  TensorFlow Native - Copied 74824 bytes to /tmp/tensorflow_scala_native_libraries3692308603386062965/libtensorflow_ops.so.
org.platanios.tensorflow.jni.NotFoundException: /tmp/tensorflow_scala_native_libraries3692308603386062965/libtensorflow_ops.so: undefined symbol: _ZN10tensorflow7strings8internal9CatPiecesESt16initializer_listINS_11StringPieceEE
  org.platanios.tensorflow.jni.TensorFlow$.loadOpLibrary(Native Method)
  org.platanios.tensorflow.jni.TensorFlow$$anonfun$load$4.apply(TensorFlow.scala:107)
  org.platanios.tensorflow.jni.TensorFlow$$anonfun$load$4.apply(TensorFlow.scala:107)
  scala.Option.foreach(Option.scala:257)
  org.platanios.tensorflow.jni.TensorFlow$.load(TensorFlow.scala:107)
  org.platanios.tensorflow.jni.TensorFlow$.<init>(TensorFlow.scala:155)
  org.platanios.tensorflow.jni.TensorFlow$.<clinit>(TensorFlow.scala)
  org.platanios.tensorflow.jni.Tensor$.<init>(Tensor.scala:24)
  org.platanios.tensorflow.jni.Tensor$.<clinit>(Tensor.scala)
  org.platanios.tensorflow.api.tensors.Context$.apply(Context.scala:50)
  org.platanios.tensorflow.api.package$.<init>(package.scala:89)
  org.platanios.tensorflow.api.package$.<clinit>(package.scala)
  ammonite.$sess.cmd1$.<init>(cmd1.sc:1)
  ammonite.$sess.cmd1$.<clinit>(cmd1.sc)

eaplatanios commented 6 years ago

@mandar2812 @sbrunk Can you guys check if the newly released artifacts work fine?

sbrunk commented 6 years ago

Unfortunately, the same error still occurs with the new artifacts. Works fine on CPU as before.

mandar2812 commented 6 years ago

I'm heading to work, will be able to try this out in an hour or two. I will clean my coursier cache and give the new artifacts a try in CPU and GPU mode.

mandar2812 commented 6 years ago

@eaplatanios @sbrunk I can confirm on my end that the CPU version works well, while the GPU version still does not work. I get the same symbol-not-found error. I have tested the GPU version using both the packaged tensorflow_2.11-0.1.0-SNAPSHOT-linux-gpu-x86_64.jar and a locally compiled TensorFlow.

mandar2812 commented 6 years ago

@eaplatanios This symbol _ZN10tensorflow7strings8internal9CatPiecesESt16initializer_listINS_11StringPieceEE seems to exist in the CPU-only version of the native TensorFlow lib (or it's not required at all).

In the CUDA-enabled native TensorFlow lib, on the other hand, it is not found. The only real change, as far as I can tell, is the move from CUDA 8 to the CUDA 9.x series.
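
One way to check this kind of claim directly is to list the dynamic symbols a shared library defines and look for the mangled name. The sketch below assumes GNU `nm` is on the `PATH`; `SymbolCheckSketch`, `exports`, and `libraryExports` are hypothetical names, and the library path in the usage note is a placeholder rather than a real location.

```scala
import scala.sys.process._
import scala.util.Try

// Minimal sketch, assuming GNU binutils' `nm` is installed.
// Hypothetical helper for illustration; not part of TF-Scala.
object SymbolCheckSketch {
  // Pure helper: true if some line of `nm` output has `symbol` as its last column.
  def exports(nmOutput: String, symbol: String): Boolean =
    nmOutput.linesIterator.exists { line =>
      line.trim.split("\\s+").lastOption.contains(symbol)
    }

  // Runs `nm -D --defined-only <libraryPath>` so that only symbols the library
  // itself defines are listed. Returns None if `nm` is missing or exits with an error.
  def libraryExports(libraryPath: String, symbol: String): Option[Boolean] =
    Try(Seq("nm", "-D", "--defined-only", libraryPath).!!).toOption
      .map(output => exports(output, symbol))
}
```

For example, `SymbolCheckSketch.libraryExports("/path/to/libtensorflow_framework.so", "_ZN10tensorflow7strings8internal9CatPiecesESt16initializer_listINS_11StringPieceEE")` could be run once against the CPU build and once against the GPU build to compare them.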

Is this issue also encountered in the TensorFlow Python community? If there are breaking changes between CUDA 8 and CUDA 9.x, I would expect them to show up in the larger TF community as well.

Anyways, I hope all of this verbiage helps! I think you have a much better idea of what is going wrong!

eaplatanios commented 6 years ago

@sbrunk @mandar2812 Sorry I was traveling back home the last 3 days. So this is weird because right now I'm not even cross-compiling the GPU binaries myself. I'm rather using the precompiled TensorFlow ones. It may be a bug with TensorFlow native itself, but I will investigate a bit further and report back.

mandar2812 commented 6 years ago

@eaplatanios @sbrunk : How about we pick up where we left off before the holidays? I can start by testing the latest tensorflow-scala_2.11 snapshot with packaged and self compiled tensorflow binaries on my GPU machine.

@eaplatanios Once you get a chance to update the 2.11 snapshots, give me the go ahead and I can begin.

Also, let's try to find any matching issue reports or bugs filed on the TensorFlow repository; this seems like something that might already have been resolved in the meantime.

eaplatanios commented 6 years ago

@mandar2812 That sounds good! Sorry for the delayed response but I just got back. I had a few flights cancelled. I'll look into resolving this issue ASAP and updating the binaries.

mandar2812 commented 6 years ago

@eaplatanios @sbrunk : So in the tensorflow v1.5.0-rc.1 release notes, they mention that pre-built binaries are now built against CUDA 9 and cuDNN 7. Does tf-scala use the tensorflow 1.4 series or 1.5?

eaplatanios commented 6 years ago

@mandar2812 I'm still waiting on a server upgrade by the CMU tech support that manages our servers and cannot test with CUDA 9 yet. I pinged them today so I hope this will be resolved soon and I'll be able to test and release new binaries.

eaplatanios commented 6 years ago

@mandar2812 @sbrunk I just released new artifacts that have been compiled with CUDA 9 and cuDNN 7. Could you please check if you still have the same problem?

sbrunk commented 6 years ago

Just cleared my Ivy cache and did a quick test. Unfortunately the error is still there. And now it also appears when running on the CPU, at least on Linux.

mandar2812 commented 6 years ago

I am also still facing the same issue on GPU, although the CPU-only code works fine on my laptop. @eaplatanios Are you able to run GPU code on your end?

mandar2812 commented 6 years ago

@eaplatanios On Mac the CPU code works fine. On Linux x64, GPU and CPU codes give errors as @sbrunk described.

eaplatanios commented 6 years ago

@mandar2812 @sbrunk Ok so I think it's fixed now. I also submitted a PR in the main TensorFlow repository. Sorry for the back and forth but it was not easy to find a way to reproduce (I had to use an older GCC version). Could you please test again so we can eventually close this issue? :)

sbrunk commented 6 years ago

I wish I could report better news but I'm still getting the error using the new artifacts. :(

eaplatanios commented 6 years ago

@sbrunk That's unfortunate! Could you please send me the error message you're receiving?

sbrunk commented 6 years ago

Tested on an Ubuntu 16.04 machine and one running 17.10. Seems to happen on GPU and CPU.

2018-01-23 23:26:11.677 [run-main-0] INFO  TensorFlow Native - Extracting the 'tensorflow_framework' native library to /tmp/tensorflow_scala_native_libraries7103045852428031493/libtensorflow_framework.so.
2018-01-23 23:26:11.822 [run-main-0] INFO  TensorFlow Native - Copied 15846248 bytes to /tmp/tensorflow_scala_native_libraries7103045852428031493/libtensorflow_framework.so.
2018-01-23 23:26:11.822 [run-main-0] INFO  TensorFlow Native - Extracting the 'tensorflow' native library to /tmp/tensorflow_scala_native_libraries7103045852428031493/libtensorflow.so.
2018-01-23 23:26:12.087 [run-main-0] INFO  TensorFlow Native - Copied 50752464 bytes to /tmp/tensorflow_scala_native_libraries7103045852428031493/libtensorflow.so.
2018-01-23 23:26:12.089 [run-main-0] INFO  TensorFlow Native - Extracting the 'tensorflow_jni' native library to /tmp/tensorflow_scala_native_libraries7103045852428031493/libtensorflow_jni.so.
2018-01-23 23:26:12.093 [run-main-0] INFO  TensorFlow Native - Copied 633640 bytes to /tmp/tensorflow_scala_native_libraries7103045852428031493/libtensorflow_jni.so.
2018-01-23 23:26:12.140 [run-main-0] INFO  TensorFlow Native - Extracting the 'tensorflow_ops' native library to /tmp/tensorflow_scala_native_libraries7103045852428031493/libtensorflow_ops.so.
2018-01-23 23:26:12.141 [run-main-0] INFO  TensorFlow Native - Copied 76744 bytes to /tmp/tensorflow_scala_native_libraries7103045852428031493/libtensorflow_ops.so.
[error] (run-main-0) java.lang.ExceptionInInitializerError
[error] java.lang.ExceptionInInitializerError
[error]         at org.platanios.tensorflow.jni.Tensor$.<init>(Tensor.scala:24)
[error]         at org.platanios.tensorflow.jni.Tensor$.<clinit>(Tensor.scala)
[error]         at org.platanios.tensorflow.api.tensors.Context$.apply(Context.scala:50)
[error]         at org.platanios.tensorflow.api.package$.<init>(package.scala:89)
[error]         at org.platanios.tensorflow.api.package$.<clinit>(package.scala)
[error]         at Main$.main(Main.scala:35)
[error]         at Main.main(Main.scala)
[error]         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
[error]         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
[error]         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
[error]         at java.lang.reflect.Method.invoke(Method.java:498)
[error]         at sbt.Run.invokeMain(Run.scala:89)
[error]         at sbt.Run.run0(Run.scala:83)
[error]         at sbt.Run.execute$1(Run.scala:61)
[error]         at sbt.Run.$anonfun$run$4(Run.scala:73)
[error]         at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:12)
[error]         at sbt.util.InterfaceUtil$$anon$1.get(InterfaceUtil.scala:10)
[error]         at sbt.TrapExit$App.run(TrapExit.scala:252)
[error]         at java.lang.Thread.run(Thread.java:748)
[error] Caused by: org.platanios.tensorflow.jni.NotFoundException: /tmp/tensorflow_scala_native_libraries7103045852428031493/libtensorflow_ops.so: undefined symbol: _ZN10tensorflow7strings8internal9CatPiecesESt16initializer_listINS_11StringPieceEE
[error]         at org.platanios.tensorflow.jni.TensorFlow$.loadOpLibrary(Native Method)
[error]         at org.platanios.tensorflow.jni.TensorFlow$.$anonfun$load$6(TensorFlow.scala:107)
[error]         at scala.Option.foreach(Option.scala:257)
[error]         at org.platanios.tensorflow.jni.TensorFlow$.load(TensorFlow.scala:107)
[error]         at org.platanios.tensorflow.jni.TensorFlow$.<init>(TensorFlow.scala:155)
[error]         at org.platanios.tensorflow.jni.TensorFlow$.<clinit>(TensorFlow.scala)
[error]         at org.platanios.tensorflow.jni.Tensor$.<init>(Tensor.scala:24)
[error]         at org.platanios.tensorflow.jni.Tensor$.<clinit>(Tensor.scala)
[error]         at org.platanios.tensorflow.api.tensors.Context$.apply(Context.scala:50)
[error]         at org.platanios.tensorflow.api.package$.<init>(package.scala:89)
[error]         at org.platanios.tensorflow.api.package$.<clinit>(package.scala)
[error]         at Main$.main(Main.scala:35)
[error]         at Main.main(Main.scala)

eaplatanios commented 6 years ago

@sbrunk I think the JAR files in your case might not have been updated. Are you sure the Ivy cache is cleared and the artifacts re-downloaded? If you're using the SBT coursier plugin could you temporarily disable it?

eaplatanios commented 6 years ago

@sbrunk You could try using version 0.1.0 (without -SNAPSHOT) as I just made a full release to test my release pipeline. That should pull in the new artifacts.
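
In sbt, that would look roughly like the fragment below. This is a sketch that assumes the `org.platanios` group ID and the `linux-gpu-x86_64` classifier seen in the snapshot JAR name earlier in this thread; swap in `linux-cpu-x86_64` for a CPU-only setup.

```scala
// build.sbt fragment (sketch): depend on the full 0.1.0 release instead of the snapshot.
// The classifier selects the JAR that bundles the precompiled native libraries.
libraryDependencies += "org.platanios" %% "tensorflow" % "0.1.0" classifier "linux-gpu-x86_64"
```

Because a release version is immutable, resolving it avoids any doubt about whether a stale `-SNAPSHOT` JAR is still being served from a local cache.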

sbrunk commented 6 years ago

Disabled coursier, and checked that the Ivy cache was empty. It just downloaded the artifacts. Still no luck.

sbrunk commented 6 years ago

@eaplatanios ok I'll try that now.

sbrunk commented 6 years ago

Still the same error with the release artifacts.

sbrunk commented 6 years ago

I have to go now, but I can continue tomorrow if there's anything else I could try.

mandar2812 commented 6 years ago

I am also still getting the same errors as @sbrunk. @eaplatanios: It would be helpful if you could confirm that you are able to run GPU code; then we can narrow the problem down to one of two possibilities:

  1. The properly updated artifacts not having been uploaded to Sonatype
  2. A configuration issue on my and @sbrunk's systems (coursier/ivy2, etc.)

P.S. I am still available to test any changes pushed in the next few minutes.

eaplatanios commented 6 years ago

@sbrunk @mandar2812 I am able to run GPU code on my servers. The problem is hard to reproduce. I was able to do so by compiling my binaries with a different version of GCC. I'll try to reproduce again and update binaries on 0.1.1-SNAPSHOT. My PR was merged in the main TensorFlow repository, so after tonight the default binaries they distribute should also be fixed and I'll not be required to cross-compile everything. In either case, I'll keep you posted.

eaplatanios commented 6 years ago

@sbrunk @mandar2812 Sorry I actually figured out what happened. The issue has been fixed but I didn't update the published artifacts correctly. I'm releasing new ones very soon. I'll post here when done.

mandar2812 commented 6 years ago

@eaplatanios: Perhaps you can also describe how you got your GPU setup to work. Assuming I compile TensorFlow with CUDA support from source, what GCC version and compiler flags would I need to reproduce your working setup, more or less? This might also help as a temporary workaround in case of future issues like this.

eaplatanios commented 6 years ago

@mandar2812 If you compile both TensorFlow and my code with GCC 6.3 and provide the -D_GLIBCXX_USE_CXX11_ABI=0 compiler flag, all should be good. :)

eaplatanios commented 6 years ago

Generally, if you compile TF on your system things should work. The issue comes up with the pre-compiled binaries.

eaplatanios commented 6 years ago

@sbrunk @mandar2812 Could you please check again with 0.1.1-SNAPSHOT?

sbrunk commented 6 years ago

Works for me (CPU and GPU)! 😄

eaplatanios commented 6 years ago

@sbrunk You cannot imagine how good that sentence just sounded :P Once @mandar2812 confirms it all works fine for him too I'll close this issue and release version 0.1.1.

mandar2812 commented 6 years ago

@eaplatanios I'm on my way to the office, will try the new snapshot as soon as I get there!

eaplatanios commented 6 years ago

@mandar2812 Sounds good! Thanks a lot! :)

mandar2812 commented 6 years ago

@eaplatanios I can verify that the GPU code now works with the prebuilt TF binaries! I am not verifying CPU on Linux as of now and will take @sbrunk's word for it! Let's close this!

LorenzBuehmann commented 6 years ago

Latest version 0.1.2-SNAPSHOT leads to the following error when running MNIST example:

Extracting images from file 'datasets/MNIST/train-images-idx3-ubyte.gz'.
Extracting the 'tensorflow_framework' native library to /tmp/tensorflow_scala_native_libraries657698664943866475/libtensorflow_framework.so.
Copied 16030472 bytes to /tmp/tensorflow_scala_native_libraries657698664943866475/libtensorflow_framework.so.
Extracting the 'tensorflow' native library to /tmp/tensorflow_scala_native_libraries657698664943866475/libtensorflow.so.
Copied 52157360 bytes to /tmp/tensorflow_scala_native_libraries657698664943866475/libtensorflow.so.
Extracting the 'tensorflow_jni' native library to /tmp/tensorflow_scala_native_libraries657698664943866475/libtensorflow_jni.so.
Copied 686936 bytes to /tmp/tensorflow_scala_native_libraries657698664943866475/libtensorflow_jni.so.
Extracting the 'tensorflow_ops' native library to /tmp/tensorflow_scala_native_libraries657698664943866475/libtensorflow_ops.so.
Copied 133344 bytes to /tmp/tensorflow_scala_native_libraries657698664943866475/libtensorflow_ops.so.
2018-03-19 11:11:51.165338: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
Extracting labels from file 'datasets/MNIST/train-labels-idx1-ubyte.gz'.
Extracting images from file 'datasets/MNIST/t10k-images-idx3-ubyte.gz'.
Extracting labels from file 'datasets/MNIST/t10k-labels-idx1-ubyte.gz'.
Building the logistic regression model.
Training the linear regression model.
/usr/lib/jvm/java-1.8.0-openjdk-amd64/bin/java: symbol lookup error: /tmp/tensorflow_scala_native_libraries657698664943866475/libtensorflow_jni.so: undefined symbol: TF_TryEvaluateConstant

Maven deps:

    <dependency>
        <groupId>org.platanios</groupId>
        <artifactId>tensorflow_${scala.binary.version}</artifactId>
        <version>0.1.2-SNAPSHOT</version>
        <classifier>linux-cpu-x86_64</classifier>
    </dependency>
    <dependency>
        <groupId>org.platanios</groupId>
        <artifactId>tensorflow-data_${scala.binary.version}</artifactId>
        <version>0.1.2-SNAPSHOT</version>
    </dependency>

It also happens when I use a TensorFlow lib built from source and add it via the LD_LIBRARY_PATH variable.

I can see that this symbol was added to TensorFlow 13 days ago, so might this be because of the dependency on version 1.7.0-RC0, with the JNI libs not having been rebuilt?

With version 0.1.1 it works as expected.