CSBDeep / CSBDeep_fiji


GPU support #1

Closed frauzufall closed 5 years ago

frauzufall commented 7 years ago

The tensorflow version from the maven repo is CPU only, but our networks rely on GPU. We need a simple solution.

frauzufall commented 7 years ago

How to test:

This is what I got so far:

General discussion

Java Native Libraries

There are two different ways native libs work: the "bare" way in lib, or libs embedded in JAR file(s). TensorFlow opts for the latter. It might work to do what you were trying to do, but I'd need more details of what you tried. An ImageJ Forum post is a good way to proceed.

How to pack native libraries for multiple OS

HedgehogCode commented 7 years ago

I have now looked a bit into this and found the following:

The java tensorflow API...

I think it would be best if we just load the native library before tensorflow does. We can do this as described in https://imagej.net/Developing_using_native_libraries#Support_in_ImageJ and put the library into <ImageJ-directory>/lib/<platform>/ (this can later be done by the update site).

I started testing this and am pretty sure I got the native library with GPU support loaded but then I got the following error:

org.scijava.module.MethodCallException: Error executing method: mpicbg.csbd.CSBDeep#modelInitialized
    at org.scijava.module.MethodRef.execute(MethodRef.java:74)
    at org.scijava.module.AbstractModuleItem.initialize(AbstractModuleItem.java:202)
    at org.scijava.module.AbstractModule.initialize(AbstractModule.java:95)
    at org.scijava.module.process.InitPreprocessor.process(InitPreprocessor.java:62)
    at org.scijava.module.ModuleRunner.preProcess(ModuleRunner.java:105)
    at org.scijava.module.ModuleRunner.run(ModuleRunner.java:157)
    at org.scijava.module.ModuleRunner.call(ModuleRunner.java:127)
    at org.scijava.module.ModuleRunner.call(ModuleRunner.java:66)
    at org.scijava.thread.DefaultThreadService$3.call(DefaultThreadService.java:238)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:497)
    at org.scijava.module.MethodRef.execute(MethodRef.java:70)
    ... 12 more
Caused by: java.lang.IllegalArgumentException: NodeDef mentions attr 'data_format' not in Op<name=Conv3D; signature=input:T, filter:T -> output:T; attr=T:type,allowed=[DT_FLOAT, DT_DOUBLE, DT_INT64, DT_INT32, DT_UINT8, DT_UINT16, DT_INT16, DT_INT8, DT_COMPLEX64, DT_COMPLEX128, DT_QINT8, DT_QUINT8, DT_QINT32, DT_HALF]; attr=strides:list(int),min=5; attr=padding:string,allowed=["SAME", "VALID"]>; NodeDef: conv3d_1/convolution = Conv3D[T=DT_FLOAT, data_format="NDHWC", padding="SAME", strides=[1, 1, 1, 1, 1]](input, conv3d_1/kernel/read). (Check whether your GraphDef-interpreting binary is up to date with your GraphDef-generating binary.).
    at org.tensorflow.Graph.importGraphDef(Native Method)
    at org.tensorflow.Graph.importGraphDef(Graph.java:118)
    at org.tensorflow.Graph.importGraphDef(Graph.java:102)
    at mpicbg.csbd.TensorFlowService.loadGraph(TensorFlowService.java:79)
    at mpicbg.csbd.CSBDeep.loadGraph(CSBDeep.java:156)
    at mpicbg.csbd.CSBDeep.modelChanged(CSBDeep.java:228)
    at mpicbg.csbd.CSBDeep.modelInitialized(CSBDeep.java:210)
    ... 17 more

maweigert commented 7 years ago

@HedgehogCode What tf version did you use? I was building the protobuf models with tf version 1.2.0. Might be a version conflict?

HedgehogCode commented 7 years ago

@maweigert I tried it with 1.2.1 and 1.3.0. I will try 1.2.0 tomorrow. Thank you!
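The hint at the end of the error message ("Check whether your GraphDef-interpreting binary is up to date with your GraphDef-generating binary") suggests comparing the version of the loaded native library with the version that exported the model. A minimal sketch of such a check follows. The class `TfVersionCheck` and its methods are hypothetical and not part of the plugin. In practice the runtime string could come from `org.tensorflow.TensorFlow.version()`; here it is passed in as a plain string so the example runs without TensorFlow on the classpath.

```java
// Hypothetical helper, not part of CSBDeep: compares "major.minor" of two
// TensorFlow version strings such as "1.2.0" or "1.3.0".
class TfVersionCheck {

    // Extracts the major and minor numbers from a dotted version string.
    static int[] majorMinor(String version) {
        String[] parts = version.trim().split("\\.");
        return new int[] { Integer.parseInt(parts[0]), Integer.parseInt(parts[1]) };
    }

    // Returns true if the runtime is at least as new as the version that
    // generated the model (compared on major.minor only).
    static boolean runtimeAtLeast(String runtimeVersion, String modelVersion) {
        int[] runtime = majorMinor(runtimeVersion);
        int[] model = majorMinor(modelVersion);
        if (runtime[0] != model[0]) {
            return runtime[0] > model[0];
        }
        return runtime[1] >= model[1];
    }
}
```

With the model built on 1.2.0, `TfVersionCheck.runtimeAtLeast("1.1.0", "1.2.0")` returns `false`, which is the kind of mismatch the error message warns about. Whether a newer runtime can always read an older graph is not something this check can guarantee.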

frauzufall commented 7 years ago

@HedgehogCode Nice! I tried the same before but couldn't get it to load the lib. How did you name it? On Linux, I tried Fiji.app/lib/linux64/libtensorflow.so and other variations and had no luck. Do you have access to the deep learning workplace? It runs Ubuntu, I think, so you could test it there as well.

HedgehogCode commented 7 years ago

@frauzufall I named it Fiji.app/lib/macosx/libtensorflow_jni.dylib; Fiji.app/lib/linux64/libtensorflow_jni.so should work on Linux. The important thing is to use JNI.loadLibrary("libtensorflow_jni"); and not System.loadLibrary("libtensorflow_jni"); because otherwise the folder is not on the library path (Tobias Pietzsch pointed that out to me). Yes, I have access to the deep learning workplace. I will try it there.
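The naming scheme above can be sketched in plain Java. This is a minimal illustration, not the plugin's code: the class `NativeLibLocator` and its methods are hypothetical, and only the two platform folders named in this thread (`macosx`, `linux64`) are handled. It uses `System.load` with an absolute path as a stand-in for `JNI.loadLibrary`, because an absolute path does not depend on `java.library.path`.

```java
import java.io.File;
import java.util.Locale;

// Hypothetical helper: resolves where, according to this thread, the
// GPU-enabled TensorFlow JNI library should sit inside a Fiji installation.
class NativeLibLocator {

    // Maps os.name / os.arch values to the lib/<platform> folder name.
    // Only the two folders mentioned in the thread are covered.
    static String platformDir(String osName, String osArch) {
        String os = osName.toLowerCase(Locale.ROOT);
        if (os.contains("mac")) {
            return "macosx";
        }
        if (os.contains("linux") && osArch.contains("64")) {
            return "linux64";
        }
        throw new IllegalArgumentException("Unsupported platform: " + osName + "/" + osArch);
    }

    // File name of the JNI library on the given operating system.
    static String fileName(String osName) {
        boolean mac = osName.toLowerCase(Locale.ROOT).contains("mac");
        return mac ? "libtensorflow_jni.dylib" : "libtensorflow_jni.so";
    }

    // Builds <fijiDir>/lib/<platform>/<file name>.
    static File resolve(File fijiDir, String osName, String osArch) {
        File platform = new File(new File(fijiDir, "lib"), platformDir(osName, osArch));
        return new File(platform, fileName(osName));
    }

    // Loads the library by absolute path if the file exists; returns false otherwise.
    static boolean loadIfPresent(File lib) {
        if (!lib.isFile()) {
            return false;
        }
        System.load(lib.getAbsolutePath());
        return true;
    }
}
```

A caller would typically pass `System.getProperty("os.name")` and `System.getProperty("os.arch")` to `resolve` and call `loadIfPresent` before any TensorFlow class is touched.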

HedgehogCode commented 7 years ago

I found out that although I can request different versions of the native library with GPU support for mac, I always get the same file, version 1.1.0 (because they dropped official GPU support for mac after this version).

At the deep learning workplace this is not an issue. I needed some time to set it up and now get this error:

[WARNING] ShadowMenu: menu item already exists:
    existing: [Plugins, CSBDeep] : mpicbg.csbd.CSBDeep [file:/home/wilhelm/Apps/Fiji.app/jars/CSBDeep-0.1.0-SNAPSHOT.jar]
     ignored: [Plugins, CSBDeep] : mpicbg.csbd.CSBDeep [file:/home/wilhelm/Apps/Fiji.app/jars/CSBDeep-0.1.0-SNAPSHOT.jar]
setmappingdefaults
input node with name input_ not found
percentiles: 0.13636364 -> 0.6818182
factor: 183.33333
executeInceptionGraph
could not create output dataset
org.tensorflow.TensorFlowException: Failed to create session.
    at org.tensorflow.Session.allocate(Native Method)
    at org.tensorflow.Session.<init>(Session.java:70)
    at org.tensorflow.Session.<init>(Session.java:52)
    at mpicbg.csbd.CSBDeep.executeGraph(CSBDeep.java:453)
    at mpicbg.csbd.CSBDeep.run(CSBDeep.java:309)
    at org.scijava.command.CommandModule.run(CommandModule.java:199)
    at org.scijava.module.ModuleRunner.run(ModuleRunner.java:168)
    at org.scijava.module.ModuleRunner.call(ModuleRunner.java:127)
    at org.scijava.module.ModuleRunner.call(ModuleRunner.java:66)
    at org.scijava.thread.DefaultThreadService$3.call(DefaultThreadService.java:238)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
[ERROR] Module threw exception
java.lang.NullPointerException
    at mpicbg.csbd.CSBDeep.run(CSBDeep.java:310)
    at org.scijava.command.CommandModule.run(CommandModule.java:199)
    at org.scijava.module.ModuleRunner.run(ModuleRunner.java:168)
    at org.scijava.module.ModuleRunner.call(ModuleRunner.java:127)
    at org.scijava.module.ModuleRunner.call(ModuleRunner.java:66)
    at org.scijava.thread.DefaultThreadService$3.call(DefaultThreadService.java:238)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

I'll try to debug it right now. I just wanted to update where I am.

Edit: To reproduce this:

frauzufall commented 7 years ago

Cool, got it working by using your branch, downloading https://storage.googleapis.com/tensorflow/libtensorflow/libtensorflow_jni-gpu-linux-x86_64-1.2.0.tar.gz, and unpacking it so that the library ends up at Fiji.app/lib/linux64/libtensorflow_jni.so. I was running the net_project example, but had to resize the stack to 688x512px. I will discuss the issues I have with the networks and example data with @uschmidt83 and/or @maweigert in the data repository.

EDIT: I played around with it for a while and now I get an error similar to yours:

java.lang.IllegalStateException: OOM when allocating tensor with shape[1,8,50,512,688]
     [[Node: conv3d_5/convolution = Conv3D[T=DT_FLOAT, data_format="NDHWC", padding="SAME", strides=[1, 1, 1, 1, 1], _device="/job:localhost/replica:0/task:0/gpu:0"](up_sampling3d_2/Reshape_2, conv3d_5/kernel/read)]]
    at org.tensorflow.Session.run(Native Method)
    at org.tensorflow.Session.access$100(Session.java:48)
    at org.tensorflow.Session$Runner.runHelper(Session.java:295)
    at org.tensorflow.Session$Runner.run(Session.java:245)
    at mpicbg.csbd.CSBDeep.executeGraph(CSBDeep.java:482)
    at mpicbg.csbd.CSBDeep.run(CSBDeep.java:310)
    at org.scijava.command.CommandModule.run(CommandModule.java:199)
    at org.scijava.module.ModuleRunner.run(ModuleRunner.java:168)
    at org.scijava.module.ModuleRunner.call(ModuleRunner.java:127)
    at org.scijava.module.ModuleRunner.call(ModuleRunner.java:66)
    at org.scijava.thread.DefaultThreadService$3.call(DefaultThreadService.java:238)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
[ERROR] Module threw exception
java.lang.NullPointerException
    at mpicbg.csbd.CSBDeep.run(CSBDeep.java:311)
    at org.scijava.command.CommandModule.run(CommandModule.java:199)
    at org.scijava.module.ModuleRunner.run(ModuleRunner.java:168)
    at org.scijava.module.ModuleRunner.call(ModuleRunner.java:127)
    at org.scijava.module.ModuleRunner.call(ModuleRunner.java:66)
    at org.scijava.thread.DefaultThreadService$3.call(DefaultThreadService.java:238)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

No idea why it worked before -.- But here is a screenshot as proof: screenshot_2017-09-13_23-41-04
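To put the OOM message in perspective, the size of the tensor it names can be estimated from its shape. The sketch below is a hypothetical helper (`TensorMemory` is not part of the plugin) that multiplies the dimensions and the element size. It assumes 4 bytes per element, which matches the `DT_FLOAT` type shown in the trace.

```java
// Hypothetical helper: estimates the memory footprint of a dense tensor.
class TensorMemory {

    // Number of elements: the product of all dimensions.
    // Math.multiplyExact throws instead of silently overflowing.
    static long elementCount(long[] shape) {
        long count = 1;
        for (long dim : shape) {
            count = Math.multiplyExact(count, dim);
        }
        return count;
    }

    // Total size in bytes for the given element size (4 for DT_FLOAT).
    static long bytes(long[] shape, int bytesPerElement) {
        return Math.multiplyExact(elementCount(shape), (long) bytesPerElement);
    }
}
```

For the shape in the error, `TensorMemory.bytes(new long[] {1, 8, 50, 512, 688}, 4)` gives 563,609,600 bytes, about 537.5 MiB for this one tensor alone. Intermediate tensors of other layers come on top of that.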

HedgehogCode commented 7 years ago

I got it to work as well. The deep learning machine has two GPUs and tensorflow tried to use the wrong one. I set the CUDA_VISIBLE_DEVICES environment variable to 0 (the id of the Titan Xp) and it worked. (I have no idea how to choose the GPU in Java.)
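Because a running JVM cannot change its own environment, one way to apply the CUDA_VISIBLE_DEVICES workaround from Java is to set the variable on a child process that starts Fiji. The following is only a sketch under assumptions: the class `GpuSelectingLauncher` is hypothetical, and the launcher path is whatever executable starts Fiji on the machine in question.

```java
import java.io.File;

// Hypothetical helper: prepares a child process whose environment restricts
// which CUDA devices are visible to it.
class GpuSelectingLauncher {

    // Returns a ProcessBuilder for the given launcher with
    // CUDA_VISIBLE_DEVICES set; nothing is started until start() is called.
    static ProcessBuilder configure(File launcher, String visibleDevices) {
        ProcessBuilder builder = new ProcessBuilder(launcher.getPath());
        builder.environment().put("CUDA_VISIBLE_DEVICES", visibleDevices);
        builder.inheritIO(); // forward the child's output to this console
        return builder;
    }
}
```

For example, `GpuSelectingLauncher.configure(new File("Fiji.app/ImageJ-linux64"), "0").start()` would launch Fiji with only device 0 visible, assuming the launcher is named `ImageJ-linux64`. This does not select a GPU from inside an already running Fiji instance.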