frauzufall closed 5 years ago
input
and if you run the plugin, it should either crash or allocate far too much RAM until it starts swapping and eventually freezes.
watch -n .2 nvidia-smi
There are two different ways native libs work: the "bare" way in lib, or libs embedded in JAR file(s). TensorFlow opts for the latter. It might work to do what you were trying to do, but I'd need more details of what you tried. An ImageJ Forum post is a good way to proceed.
I have now looked a bit into this and found the following:
The java tensorflow API...
System.loadLibrary
(For that, it needs to be on the java.library.path.)
I think it would be best if we just load the native library before TensorFlow does. We can do this as described in https://imagej.net/Developing_using_native_libraries#Support_in_ImageJ and put the library into \<ImageJ-directory>/lib/\<platform>/ (this can later be done by the update site).
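A minimal sketch of the "load it ourselves first" idea, assuming the library sits under \<ImageJ-directory>/lib/\<platform>/ as proposed above. The class and method names (`NativePreload`, `nativeLibraryFile`, `preload`) are hypothetical and not part of the plugin. Whether TensorFlow's Java binding then reuses the already loaded library has not been verified here.

```java
import java.io.File;

public class NativePreload {

    /** Builds <imagejDir>/lib/<platform>/<OS-specific file name of libName>. */
    static File nativeLibraryFile(String imagejDir, String platform, String libName) {
        // System.mapLibraryName adds the OS-specific prefix and suffix,
        // e.g. "tensorflow_jni" becomes "libtensorflow_jni.so" on Linux.
        String fileName = System.mapLibraryName(libName);
        return new File(new File(new File(imagejDir, "lib"), platform), fileName);
    }

    /** Loads the library by absolute path if the file exists; returns whether it was loaded. */
    static boolean preload(File lib) {
        if (!lib.isFile()) {
            return false;
        }
        // System.load takes an absolute path, so java.library.path is not consulted.
        System.load(lib.getAbsolutePath());
        return true;
    }

    public static void main(String[] args) {
        // "Fiji.app" and "linux64" are example values taken from this thread.
        File lib = nativeLibraryFile("Fiji.app", "linux64", "tensorflow_jni");
        System.out.println(lib + " loaded: " + preload(lib));
    }
}
```

Loading by absolute path sidesteps the java.library.path requirement, at the cost of having to know the install directory at runtime.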
I started testing this and am pretty sure I got the native library with GPU support loaded but then I got the following error:
org.scijava.module.MethodCallException: Error executing method: mpicbg.csbd.CSBDeep#modelInitialized
at org.scijava.module.MethodRef.execute(MethodRef.java:74)
at org.scijava.module.AbstractModuleItem.initialize(AbstractModuleItem.java:202)
at org.scijava.module.AbstractModule.initialize(AbstractModule.java:95)
at org.scijava.module.process.InitPreprocessor.process(InitPreprocessor.java:62)
at org.scijava.module.ModuleRunner.preProcess(ModuleRunner.java:105)
at org.scijava.module.ModuleRunner.run(ModuleRunner.java:157)
at org.scijava.module.ModuleRunner.call(ModuleRunner.java:127)
at org.scijava.module.ModuleRunner.call(ModuleRunner.java:66)
at org.scijava.thread.DefaultThreadService$3.call(DefaultThreadService.java:238)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.scijava.module.MethodRef.execute(MethodRef.java:70)
... 12 more
Caused by: java.lang.IllegalArgumentException: NodeDef mentions attr 'data_format' not in Op<name=Conv3D; signature=input:T, filter:T -> output:T; attr=T:type,allowed=[DT_FLOAT, DT_DOUBLE, DT_INT64, DT_INT32, DT_UINT8, DT_UINT16, DT_INT16, DT_INT8, DT_COMPLEX64, DT_COMPLEX128, DT_QINT8, DT_QUINT8, DT_QINT32, DT_HALF]; attr=strides:list(int),min=5; attr=padding:string,allowed=["SAME", "VALID"]>; NodeDef: conv3d_1/convolution = Conv3D[T=DT_FLOAT, data_format="NDHWC", padding="SAME", strides=[1, 1, 1, 1, 1]](input, conv3d_1/kernel/read). (Check whether your GraphDef-interpreting binary is up to date with your GraphDef-generating binary.).
at org.tensorflow.Graph.importGraphDef(Native Method)
at org.tensorflow.Graph.importGraphDef(Graph.java:118)
at org.tensorflow.Graph.importGraphDef(Graph.java:102)
at mpicbg.csbd.TensorFlowService.loadGraph(TensorFlowService.java:79)
at mpicbg.csbd.CSBDeep.loadGraph(CSBDeep.java:156)
at mpicbg.csbd.CSBDeep.modelChanged(CSBDeep.java:228)
at mpicbg.csbd.CSBDeep.modelInitialized(CSBDeep.java:210)
... 17 more
@HedgehogCode What tf version did you use? I was building the protobuf models with tf version 1.2.0. Might be a version conflict?
@maweigert I tried it with 1.2.1 and 1.3.0. I will try 1.2.0 tomorrow. Thank you!
@HedgehogCode Nice! I tried the same before but couldn't get it to load the lib. How did you name it? On Linux, I tried Fiji.app/lib/linux64/libtensorflow.so plus other variations and had no luck. Do you have access to the deep learning workplace? It runs Ubuntu, I think, so you could test it there as well.
@frauzufall I named it Fiji.app/lib/macosx/libtensorflow_jni.dylib, and Fiji.app/lib/linux64/libtensorflow_jni.so should work on Linux. The important thing is to use JNI.loadLibrary("libtensorflow_jni") and not System.loadLibrary("libtensorflow_jni"), because otherwise the folder is not on the library path (Tobias Pietzsch pointed that out to me).
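To see why a plain System.loadLibrary call can fail here, it helps to check whether the Fiji lib folder is actually listed in java.library.path. The following is a small diagnostic sketch; `LibraryPathCheck` and its methods are hypothetical helper names, and the folder in `main` is just the example path from this thread.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

public class LibraryPathCheck {

    /** Splits a library path string into its non-empty entries. */
    static List<String> entries(String libraryPath) {
        List<String> result = new ArrayList<>();
        for (String entry : libraryPath.split(File.pathSeparator)) {
            if (!entry.isEmpty()) {
                result.add(entry);
            }
        }
        return result;
    }

    /** Returns true if dir is one of the entries of libraryPath (compared as absolute paths). */
    static boolean contains(String libraryPath, String dir) {
        File wanted = new File(dir).getAbsoluteFile();
        for (String entry : entries(libraryPath)) {
            if (new File(entry).getAbsoluteFile().equals(wanted)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String libraryPath = System.getProperty("java.library.path", "");
        // Example folder from this thread; adjust to the local installation.
        String dir = "Fiji.app/lib/linux64";
        System.out.println(dir + " on java.library.path: " + contains(libraryPath, dir));
    }
}
```

If the folder is not listed, System.loadLibrary has no way to find the file there, which matches the behaviour described above.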
Yes, I have access to the deep learning workplace. I will try it there.
I found out that I can download different versions of the native library with GPU support for Mac, but I always get the same file with version 1.1.0 (because they dropped official GPU support for Mac after this version).
At the deep learning workplace this is not an issue. I needed some time to set it up and now get this error:
[WARNING] ShadowMenu: menu item already exists:
existing: [Plugins, CSBDeep] : mpicbg.csbd.CSBDeep [file:/home/wilhelm/Apps/Fiji.app/jars/CSBDeep-0.1.0-SNAPSHOT.jar]
ignored: [Plugins, CSBDeep] : mpicbg.csbd.CSBDeep [file:/home/wilhelm/Apps/Fiji.app/jars/CSBDeep-0.1.0-SNAPSHOT.jar]
setmappingdefaults
input node with name input_ not found
percentiles: 0.13636364 -> 0.6818182
factor: 183.33333
executeInceptionGraph
could not create output dataset
org.tensorflow.TensorFlowException: Failed to create session.
at org.tensorflow.Session.allocate(Native Method)
at org.tensorflow.Session.<init>(Session.java:70)
at org.tensorflow.Session.<init>(Session.java:52)
at mpicbg.csbd.CSBDeep.executeGraph(CSBDeep.java:453)
at mpicbg.csbd.CSBDeep.run(CSBDeep.java:309)
at org.scijava.command.CommandModule.run(CommandModule.java:199)
at org.scijava.module.ModuleRunner.run(ModuleRunner.java:168)
at org.scijava.module.ModuleRunner.call(ModuleRunner.java:127)
at org.scijava.module.ModuleRunner.call(ModuleRunner.java:66)
at org.scijava.thread.DefaultThreadService$3.call(DefaultThreadService.java:238)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
[ERROR] Module threw exception
java.lang.NullPointerException
at mpicbg.csbd.CSBDeep.run(CSBDeep.java:310)
at org.scijava.command.CommandModule.run(CommandModule.java:199)
at org.scijava.module.ModuleRunner.run(ModuleRunner.java:168)
at org.scijava.module.ModuleRunner.call(ModuleRunner.java:127)
at org.scijava.module.ModuleRunner.call(ModuleRunner.java:66)
at org.scijava.thread.DefaultThreadService$3.call(DefaultThreadService.java:238)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
I'll try to debug it right now. I just wanted to update where I am.
Edit: To reproduce this:
enh/gpu-support
Fiji.app/lib/linux64/
LD_LIBRARY_PATH
if necessary)
Cool, got it working by using your branch, downloading https://storage.googleapis.com/tensorflow/libtensorflow/libtensorflow_jni-gpu-linux-x86_64-1.2.0.tar.gz, and unpacking it into Fiji.app/lib/linux64/libtensorflow_jni.so. I was running the net_project example, but had to resize the stack to 688x512px.
I will discuss the issues I have with the networks and example data with @uschmidt83 and/or @maweigert in the data repository.
EDIT: I played around with it for a while and now I get an error similar to yours:
java.lang.IllegalStateException: OOM when allocating tensor with shape[1,8,50,512,688]
[[Node: conv3d_5/convolution = Conv3D[T=DT_FLOAT, data_format="NDHWC", padding="SAME", strides=[1, 1, 1, 1, 1], _device="/job:localhost/replica:0/task:0/gpu:0"](up_sampling3d_2/Reshape_2, conv3d_5/kernel/read)]]
at org.tensorflow.Session.run(Native Method)
at org.tensorflow.Session.access$100(Session.java:48)
at org.tensorflow.Session$Runner.runHelper(Session.java:295)
at org.tensorflow.Session$Runner.run(Session.java:245)
at mpicbg.csbd.CSBDeep.executeGraph(CSBDeep.java:482)
at mpicbg.csbd.CSBDeep.run(CSBDeep.java:310)
at org.scijava.command.CommandModule.run(CommandModule.java:199)
at org.scijava.module.ModuleRunner.run(ModuleRunner.java:168)
at org.scijava.module.ModuleRunner.call(ModuleRunner.java:127)
at org.scijava.module.ModuleRunner.call(ModuleRunner.java:66)
at org.scijava.thread.DefaultThreadService$3.call(DefaultThreadService.java:238)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
[ERROR] Module threw exception
java.lang.NullPointerException
at mpicbg.csbd.CSBDeep.run(CSBDeep.java:311)
at org.scijava.command.CommandModule.run(CommandModule.java:199)
at org.scijava.module.ModuleRunner.run(ModuleRunner.java:168)
at org.scijava.module.ModuleRunner.call(ModuleRunner.java:127)
at org.scijava.module.ModuleRunner.call(ModuleRunner.java:66)
at org.scijava.thread.DefaultThreadService$3.call(DefaultThreadService.java:238)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
No idea why it worked before -.- But here is screenshot proof
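For scale, the tensor named in the OOM message above can be sized by hand: shape [1,8,50,512,688] with DT_FLOAT (4 bytes per element). The sketch below only does that arithmetic; `TensorSize` is a hypothetical helper, and the figure covers this single tensor, not the total memory the network needs.

```java
public class TensorSize {

    /** Number of elements in a tensor of the given shape. */
    static long elementCount(long... shape) {
        long n = 1;
        for (long dim : shape) {
            n = Math.multiplyExact(n, dim); // throws on overflow instead of wrapping
        }
        return n;
    }

    /** Size in bytes for the given element width and shape. */
    static long bytes(int bytesPerElement, long... shape) {
        return Math.multiplyExact(elementCount(shape), (long) bytesPerElement);
    }

    public static void main(String[] args) {
        // Shape taken from the OOM message; DT_FLOAT is a 32-bit float.
        long size = bytes(4, 1, 8, 50, 512, 688);
        System.out.println(size + " bytes = " + (size / (1024.0 * 1024.0)) + " MiB");
        // prints "563609600 bytes = 537.5 MiB"
    }
}
```

A single intermediate tensor of about half a gigabyte suggests that the input size (here 688x512 with 50 slices) has a strong effect on whether the graph fits on the GPU.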
I got it to work as well. The deep learning machine has two GPUs and TensorFlow tried to use the wrong one. I set the CUDA_VISIBLE_DEVICES environment variable to 0 (the ID of the Titan Xp) and it worked. (I have no idea how to choose the GPU in Java.)
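One workaround that stays within plain Java is to set CUDA_VISIBLE_DEVICES for a child process, since a running JVM cannot portably change its own environment. This is only a sketch under that assumption: `GpuLauncher` is a hypothetical class, and the launcher path in `main` is an example that has to be adapted. It does not select a GPU from inside an already running Fiji.

```java
import java.util.Arrays;
import java.util.List;

public class GpuLauncher {

    /** Prepares a child process that only sees the given CUDA device IDs. */
    static ProcessBuilder withVisibleDevices(String devices, List<String> command) {
        ProcessBuilder pb = new ProcessBuilder(command);
        // The variable must be set before CUDA initializes in the child process.
        pb.environment().put("CUDA_VISIBLE_DEVICES", devices);
        pb.inheritIO(); // forward the child's stdout/stderr to this console
        return pb;
    }

    public static void main(String[] args) {
        // Assumed launcher location; "0" is the device ID used in this thread.
        ProcessBuilder pb = withVisibleDevices("0", Arrays.asList("Fiji.app/ImageJ-linux64"));
        System.out.println("Would run " + pb.command()
                + " with CUDA_VISIBLE_DEVICES=" + pb.environment().get("CUDA_VISIBLE_DEVICES"));
        // Calling pb.start() would actually launch the process.
    }
}
```

The same effect can be had from a shell by exporting the variable before starting Fiji, which is what was done on the deep learning machine.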
The tensorflow version from the maven repo is CPU only, but our networks rely on GPU. We need a simple solution.