GPU support - Githubissues

frauzufall commented 7 years ago

The TensorFlow version from the Maven repo is CPU-only, but our networks rely on the GPU. We need a simple solution.

frauzufall commented 7 years ago

How to test:

Use https://github.com/frauzufall/CSBDeep-data/tree/master/net_project (the input node name is input). If you run the plugin, it should either crash or allocate far too much RAM until it starts swapping and eventually freezes.
How to see if the GPU is used [Linux]:
```
watch -n .2 nvidia-smi
```

This is what I got so far:

General discussion

discussion about the CPU-only maven repo: https://github.com/tensorflow/tensorflow/issues/6926

Java Native Libraries

Download: https://www.tensorflow.org/install/install_java (most relevant probably the part "Using TensorFlow with JDK" with the link to the tensorflow JNI - libtensorflow.so)
Integration: https://imagej.net/Developing_using_native_libraries
I asked Curtis about including the native library with GPU support in ImageJ and he wrote:

There are two different ways native libs work: the "bare" way in lib, or libs embedded in JAR file(s). TensorFlow opts for the latter. It might work to do what you were trying to do, but I'd need more details of what you tried. An ImageJ Forum post is a good way to proceed.

How to pack native libraries for multiple OS

https://github.com/scijava/native-lib-loader (did not work on the first try, but I did not look into it much; the code is commented out)
https://github.com/maven-nar/nar-maven-plugin

HedgehogCode commented 7 years ago

I have now looked a bit into this and found the following:

The java tensorflow API...

checks if the native library has already been loaded
tries to load it via System.loadLibrary (for that, it needs to be on the java.library.path)
unpacks a jar file containing the library and loads it
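
The load order listed above can be sketched as a plain-Java fallback chain. This is only an illustration and not TensorFlow's actual loader: the class name NativeLoadOrder, the boolean flag standing in for the "already loaded" check, and the resource path argument are all hypothetical.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical sketch of the fallback order; not TensorFlow's actual code. */
final class NativeLoadOrder {

    // Stand-in for step 1, the "already loaded" check.
    private static boolean loaded = false;

    /** Step 2: let the JVM search java.library.path. */
    static boolean tryLibraryPath(String baseName) {
        try {
            System.loadLibrary(baseName);
            return true;
        } catch (UnsatisfiedLinkError e) {
            return false; // not found on java.library.path, or not loadable
        }
    }

    /** Step 3: copy a library bundled on the classpath to a temp file and load that. */
    static boolean tryResource(String resourcePath) {
        try (InputStream in = NativeLoadOrder.class.getResourceAsStream(resourcePath)) {
            if (in == null) return false; // nothing bundled under that name
            Path tmp = Files.createTempFile("native-", ".lib");
            tmp.toFile().deleteOnExit();
            Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
            System.load(tmp.toAbsolutePath().toString());
            return true;
        } catch (IOException | UnsatisfiedLinkError e) {
            return false;
        }
    }

    /** Runs the steps in order and remembers a successful load. */
    static synchronized boolean load(String baseName, String resourcePath) {
        if (loaded) return true;
        loaded = tryLibraryPath(baseName) || tryResource(resourcePath);
        return loaded;
    }
}
```

Because the first step short-circuits, a library that was loaded earlier by other code wins over both the library path and the bundled copy.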

I think it would be best if we just load the native library before tensorflow does it. We can do this like in https://imagej.net/Developing_using_native_libraries#Support_in_ImageJ and put the library into <ImageJ-directory>/lib/<platform>/ (this can later be done by the update site).
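
As a rough sketch of where such a file would live, the snippet below builds the expected path from the OS name and architecture. The class PlatformLibPath is hypothetical, and only the folder names macosx and linux64 appear in this thread; the other folder names are assumptions about the ImageJ layout.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Locale;

/** Hypothetical helper that resolves <ImageJ-directory>/lib/<platform>/<file>. */
final class PlatformLibPath {

    /** Maps os.name and os.arch to a lib subfolder (win64, win32 and linux32 are assumed names). */
    static String platformDir(String osName, String arch) {
        String os = osName.toLowerCase(Locale.ROOT);
        boolean is64 = arch.contains("64");
        if (os.contains("mac")) return "macosx";
        if (os.contains("win")) return is64 ? "win64" : "win32";
        return is64 ? "linux64" : "linux32";
    }

    static Path resolve(Path imageJDir, String osName, String arch, String fileName) {
        return imageJDir.resolve("lib").resolve(platformDir(osName, arch)).resolve(fileName);
    }

    public static void main(String[] args) {
        Path lib = resolve(Paths.get("Fiji.app"),
                System.getProperty("os.name"),
                System.getProperty("os.arch"),
                System.mapLibraryName("tensorflow_jni")); // e.g. libtensorflow_jni.so on Linux
        System.out.println(lib);
        // System.load(lib.toAbsolutePath().toString()) would load it by absolute path,
        // which does not depend on java.library.path.
    }
}
```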

I started testing this and am pretty sure I got the native library with GPU support loaded but then I got the following error:

org.scijava.module.MethodCallException: Error executing method: mpicbg.csbd.CSBDeep#modelInitialized
    at org.scijava.module.MethodRef.execute(MethodRef.java:74)
    at org.scijava.module.AbstractModuleItem.initialize(AbstractModuleItem.java:202)
    at org.scijava.module.AbstractModule.initialize(AbstractModule.java:95)
    at org.scijava.module.process.InitPreprocessor.process(InitPreprocessor.java:62)
    at org.scijava.module.ModuleRunner.preProcess(ModuleRunner.java:105)
    at org.scijava.module.ModuleRunner.run(ModuleRunner.java:157)
    at org.scijava.module.ModuleRunner.call(ModuleRunner.java:127)
    at org.scijava.module.ModuleRunner.call(ModuleRunner.java:66)
    at org.scijava.thread.DefaultThreadService$3.call(DefaultThreadService.java:238)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:497)
    at org.scijava.module.MethodRef.execute(MethodRef.java:70)
    ... 12 more
Caused by: java.lang.IllegalArgumentException: NodeDef mentions attr 'data_format' not in Op<name=Conv3D; signature=input:T, filter:T -> output:T; attr=T:type,allowed=[DT_FLOAT, DT_DOUBLE, DT_INT64, DT_INT32, DT_UINT8, DT_UINT16, DT_INT16, DT_INT8, DT_COMPLEX64, DT_COMPLEX128, DT_QINT8, DT_QUINT8, DT_QINT32, DT_HALF]; attr=strides:list(int),min=5; attr=padding:string,allowed=["SAME", "VALID"]>; NodeDef: conv3d_1/convolution = Conv3D[T=DT_FLOAT, data_format="NDHWC", padding="SAME", strides=[1, 1, 1, 1, 1]](input, conv3d_1/kernel/read). (Check whether your GraphDef-interpreting binary is up to date with your GraphDef-generating binary.).
    at org.tensorflow.Graph.importGraphDef(Native Method)
    at org.tensorflow.Graph.importGraphDef(Graph.java:118)
    at org.tensorflow.Graph.importGraphDef(Graph.java:102)
    at mpicbg.csbd.TensorFlowService.loadGraph(TensorFlowService.java:79)
    at mpicbg.csbd.CSBDeep.loadGraph(CSBDeep.java:156)
    at mpicbg.csbd.CSBDeep.modelChanged(CSBDeep.java:228)
    at mpicbg.csbd.CSBDeep.modelInitialized(CSBDeep.java:210)
    ... 17 more

maweigert commented 7 years ago

@HedgehogCode What tf version did you use? I was building the protobuf models with tf version 1.2.0. Might be a version conflict?

HedgehogCode commented 7 years ago

@maweigert I tried it with 1.2.1 and 1.3.0. I will try 1.2.0 tomorrow. Thank you!
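
One way to see which version the loaded native library reports is to call org.tensorflow.TensorFlow.version() and compare it with the version the model was exported with. The sketch below does this through reflection so that it also compiles and runs without TensorFlow on the classpath. The class TfVersionProbe and the exact-match comparison are illustrative assumptions; an exact match is likely stricter than TensorFlow actually requires.

```java
import java.util.Optional;

/** Hypothetical probe for the version reported by the TensorFlow Java binding. */
final class TfVersionProbe {

    /** Returns the runtime version, or empty if the class or native library is unavailable. */
    static Optional<String> runtimeVersion() {
        try {
            Class<?> tf = Class.forName("org.tensorflow.TensorFlow");
            return Optional.ofNullable((String) tf.getMethod("version").invoke(null));
        } catch (ReflectiveOperationException | LinkageError e) {
            return Optional.empty();
        }
    }

    /** Deliberately strict: true only if both version strings are identical. */
    static boolean matches(Optional<String> runtime, String exportedWith) {
        return runtime.isPresent() && runtime.get().equals(exportedWith);
    }

    public static void main(String[] args) {
        Optional<String> v = runtimeVersion();
        System.out.println("runtime: " + v.orElse("not available"));
        System.out.println("matches 1.2.0: " + matches(v, "1.2.0"));
    }
}
```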

frauzufall commented 7 years ago

@HedgehogCode Nice! I tried the same before but couldn't get it to load the lib. How did you name it? On Linux, I tried Fiji.app/lib/linux64/libtensorflow.so and other variations and had no luck. Do you have access to the deep learning workplace? It runs Ubuntu, I think; you could test it there as well.

HedgehogCode commented 7 years ago

@frauzufall I named it Fiji.app/lib/macosx/libtensorflow_jni.dylib, and Fiji.app/lib/linux64/libtensorflow_jni.so should work on Linux. The important thing is to use JNI.loadLibrary("libtensorflow_jni"); and not System.loadLibrary("libtensorflow_jni"); because otherwise the folder is not on the library path (Tobias Pietzsch pointed that out to me). Yes, I have access to the deep learning workplace. I will try it there.
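
To check whether a folder is visible to System.loadLibrary at all, one can inspect the java.library.path system property. The following is a minimal sketch; the class LibraryPathCheck and its method are hypothetical names, and the Fiji.app path in main is only an example.

```java
import java.io.File;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.regex.Pattern;

/** Hypothetical check: is a directory listed on a library search path string? */
final class LibraryPathCheck {

    static boolean isOnLibraryPath(String libraryPath, String separator, Path dir) {
        Path wanted = dir.toAbsolutePath().normalize();
        return Arrays.stream(libraryPath.split(Pattern.quote(separator)))
                .filter(entry -> !entry.isEmpty())
                .map(entry -> Paths.get(entry).toAbsolutePath().normalize())
                .anyMatch(wanted::equals);
    }

    public static void main(String[] args) {
        String libraryPath = System.getProperty("java.library.path", "");
        Path dir = Paths.get("Fiji.app", "lib", "linux64"); // example folder only
        System.out.println(dir + " on java.library.path: "
                + isOnLibraryPath(libraryPath, File.pathSeparator, dir));
    }
}
```

If this prints false, System.loadLibrary will not find a library in that folder, which is consistent with the behavior described above.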

HedgehogCode commented 7 years ago

I found out that I can download different versions of the native library with GPU support for Mac, but I always get the same file with version 1.1.0 (because they dropped official GPU support for Mac after this version).

At the deep learning workplace this is not an issue. I needed some time to set it up and now get this error:

[WARNING] ShadowMenu: menu item already exists:
    existing: [Plugins, CSBDeep] : mpicbg.csbd.CSBDeep [file:/home/wilhelm/Apps/Fiji.app/jars/CSBDeep-0.1.0-SNAPSHOT.jar]
     ignored: [Plugins, CSBDeep] : mpicbg.csbd.CSBDeep [file:/home/wilhelm/Apps/Fiji.app/jars/CSBDeep-0.1.0-SNAPSHOT.jar]
setmappingdefaults
input node with name input_ not found
percentiles: 0.13636364 -> 0.6818182
factor: 183.33333
executeInceptionGraph
could not create output dataset
org.tensorflow.TensorFlowException: Failed to create session.
    at org.tensorflow.Session.allocate(Native Method)
    at org.tensorflow.Session.<init>(Session.java:70)
    at org.tensorflow.Session.<init>(Session.java:52)
    at mpicbg.csbd.CSBDeep.executeGraph(CSBDeep.java:453)
    at mpicbg.csbd.CSBDeep.run(CSBDeep.java:309)
    at org.scijava.command.CommandModule.run(CommandModule.java:199)
    at org.scijava.module.ModuleRunner.run(ModuleRunner.java:168)
    at org.scijava.module.ModuleRunner.call(ModuleRunner.java:127)
    at org.scijava.module.ModuleRunner.call(ModuleRunner.java:66)
    at org.scijava.thread.DefaultThreadService$3.call(DefaultThreadService.java:238)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
[ERROR] Module threw exception
java.lang.NullPointerException
    at mpicbg.csbd.CSBDeep.run(CSBDeep.java:310)
    at org.scijava.command.CommandModule.run(CommandModule.java:199)
    at org.scijava.module.ModuleRunner.run(ModuleRunner.java:168)
    at org.scijava.module.ModuleRunner.call(ModuleRunner.java:127)
    at org.scijava.module.ModuleRunner.call(ModuleRunner.java:66)
    at org.scijava.thread.DefaultThreadService$3.call(DefaultThreadService.java:238)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

I'll try to debug it right now. I just wanted to update where I am.

Edit: To reproduce this:

Use the branch enh/gpu-support
Build with maven into an existing Fiji installation
Download the native library using the links from https://www.tensorflow.org/install/install_java and put it into Fiji.app/lib/linux64/
Make sure CUDA and cuDNN 5.1 are on the library path (set LD_LIBRARY_PATH if necessary)
Try to run a model
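
For the library path step above, a quick way to see what LD_LIBRARY_PATH actually exposes is to list matching files in each of its directories. This is a sketch under assumptions: the class SharedLibFinder is hypothetical, and the file name prefixes libcudnn.so and libcudart.so are assumed names for the cuDNN and CUDA runtime libraries. The dynamic linker also searches system default locations, so an empty result here does not prove the libraries are missing.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Stream;

/** Hypothetical helper: finds files with a given name prefix on a search path string. */
final class SharedLibFinder {

    static List<Path> find(String searchPath, String namePrefix) {
        List<Path> hits = new ArrayList<>();
        if (searchPath == null) return hits; // variable not set
        for (String entry : searchPath.split(File.pathSeparator)) {
            if (entry.isEmpty()) continue;
            Path dir = Paths.get(entry);
            if (!Files.isDirectory(dir)) continue;
            try (Stream<Path> files = Files.list(dir)) {
                files.filter(p -> p.getFileName().toString().startsWith(namePrefix))
                     .forEach(hits::add);
            } catch (IOException e) {
                // Unreadable directory: skip it.
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        String path = System.getenv("LD_LIBRARY_PATH");
        System.out.println("cuDNN candidates: " + find(path, "libcudnn.so"));
        System.out.println("CUDA runtime candidates: " + find(path, "libcudart.so"));
    }
}
```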

frauzufall commented 7 years ago

Cool, got it working by using your branch, downloading https://storage.googleapis.com/tensorflow/libtensorflow/libtensorflow_jni-gpu-linux-x86_64-1.2.0.tar.gz, and unpacking it into Fiji.app/lib/linux64/libtensorflow_jni.so. I was running the net_project example, but had to resize the stack to 688x512px. I will discuss the issues I have with the networks and example data with @uschmidt83 and/or @maweigert in the data repository.

EDIT: I played around with it for a while and now I get an error similar to yours:

java.lang.IllegalStateException: OOM when allocating tensor with shape[1,8,50,512,688]
     [[Node: conv3d_5/convolution = Conv3D[T=DT_FLOAT, data_format="NDHWC", padding="SAME", strides=[1, 1, 1, 1, 1], _device="/job:localhost/replica:0/task:0/gpu:0"](up_sampling3d_2/Reshape_2, conv3d_5/kernel/read)]]
    at org.tensorflow.Session.run(Native Method)
    at org.tensorflow.Session.access$100(Session.java:48)
    at org.tensorflow.Session$Runner.runHelper(Session.java:295)
    at org.tensorflow.Session$Runner.run(Session.java:245)
    at mpicbg.csbd.CSBDeep.executeGraph(CSBDeep.java:482)
    at mpicbg.csbd.CSBDeep.run(CSBDeep.java:310)
    at org.scijava.command.CommandModule.run(CommandModule.java:199)
    at org.scijava.module.ModuleRunner.run(ModuleRunner.java:168)
    at org.scijava.module.ModuleRunner.call(ModuleRunner.java:127)
    at org.scijava.module.ModuleRunner.call(ModuleRunner.java:66)
    at org.scijava.thread.DefaultThreadService$3.call(DefaultThreadService.java:238)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
[ERROR] Module threw exception
java.lang.NullPointerException
    at mpicbg.csbd.CSBDeep.run(CSBDeep.java:311)
    at org.scijava.command.CommandModule.run(CommandModule.java:199)
    at org.scijava.module.ModuleRunner.run(ModuleRunner.java:168)
    at org.scijava.module.ModuleRunner.call(ModuleRunner.java:127)
    at org.scijava.module.ModuleRunner.call(ModuleRunner.java:66)
    at org.scijava.thread.DefaultThreadService$3.call(DefaultThreadService.java:238)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

No idea why it worked before -.- But here is a screenshot as proof: [image: screenshot_2017-09-13_23-41-04]
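
For a sense of scale, the size of the single tensor named in the OOM message can be worked out from its shape, assuming 4 bytes per element for DT_FLOAT. The class TensorBytes is a hypothetical helper, and the result covers only this one tensor, not the total memory the graph needs.

```java
/** Hypothetical helper: size in bytes of a dense tensor with the given shape. */
final class TensorBytes {

    static long bytes(int bytesPerElement, long... shape) {
        long elements = 1;
        for (long dim : shape) {
            elements = Math.multiplyExact(elements, dim); // throws on overflow
        }
        return Math.multiplyExact(elements, (long) bytesPerElement);
    }

    public static void main(String[] args) {
        long b = bytes(4, 1, 8, 50, 512, 688); // shape from the OOM message, DT_FLOAT
        System.out.println(b + " bytes = " + b / (1024.0 * 1024.0) + " MiB");
        // prints "563609600 bytes = 537.5 MiB"
    }
}
```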

HedgehogCode commented 7 years ago

I got it to work as well. The deep learning machine has two GPUs and tensorflow tried to use the wrong one. I set the CUDA_VISIBLE_DEVICES environment variable to 0 (the ID of the Titan Xp) and it worked. (I have no idea how to choose the GPU in Java.)
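
This does not select the GPU from inside a running JVM, but one workaround that stays in Java is to start the application as a child process with CUDA_VISIBLE_DEVICES set in its environment. A minimal sketch, with the class GpuPinnedLauncher and the launcher path in the usage message as hypothetical placeholders:

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical launcher: runs a command with CUDA_VISIBLE_DEVICES set for the child only. */
final class GpuPinnedLauncher {

    static ProcessBuilder withVisibleDevices(List<String> command, String deviceIds) {
        ProcessBuilder pb = new ProcessBuilder(command);
        pb.environment().put("CUDA_VISIBLE_DEVICES", deviceIds); // affects the child process only
        pb.inheritIO(); // forward the child's output to this console
        return pb;
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            // "./ImageJ-linux64" is a placeholder for whatever starts the application.
            System.out.println("usage: java GpuPinnedLauncher ./ImageJ-linux64 [args...]");
            return;
        }
        Process p = withVisibleDevices(Arrays.asList(args), "0").start();
        System.exit(p.waitFor());
    }
}
```

The variable has to be in place before the child process initializes CUDA, which is why it is set on the ProcessBuilder rather than changed later.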

CSBDeep / CSBDeep_fiji

GPU support #1
