frankkramer-lab / MIScnn

A framework for Medical Image Segmentation with Convolutional Neural Networks and Deep Learning
GNU General Public License v3.0

UnknownError: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize #146

Closed. Qcatbot closed this issue 2 years ago

Qcatbot commented 2 years ago

Hello, I am getting the above error when I test the kits19 dataset on Google Colab. Error message: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize

Here is my environment: TensorFlow 2.7.0, Python 3.7.13, CUDA version 11.2

Here is the cuDNN output:

>> !cat /usr/include/cudnn.h | grep CUDNN_MAJOR -A 1
#define CUDNN_MAJOR 7
#define CUDNN_MINOR 6
--
#define CUDNN_VERSION (CUDNN_MAJOR * 1000 + CUDNN_MINOR * 100 + CUDNN_PATCHLEVEL)

I tried to downgrade TensorFlow to version 1.x, but then I get a MIScnn import error: ModuleNotFoundError: No module named 'tensorflow_core.keras'

I would appreciate it if you could help me figure out what the problem is. Thank you.
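The header check above can also be done programmatically. The following is a minimal sketch, not part of MIScnn: `parse_cudnn_version` and `tensorflow_build_info` are hypothetical helper names, and the sample text is copied from the grep output in this comment. It assumes the header uses the `#define CUDNN_MAJOR` / `CUDNN_MINOR` layout shown above. As far as TensorFlow's published tested build configurations go, the prebuilt TensorFlow 2.7 wheels target cuDNN 8.1 with CUDA 11.2, so a 7.6 header would suggest a mismatch, although the header alone does not prove which library is loaded at runtime.

```python
import re


def parse_cudnn_version(header_text):
    """Extract CUDNN_MAJOR / CUDNN_MINOR / CUDNN_PATCHLEVEL from cudnn.h-style text.

    Only the defines that are present in the text are returned.
    """
    found = {}
    for key in ("MAJOR", "MINOR", "PATCHLEVEL"):
        match = re.search(
            r"^\s*#define\s+CUDNN_%s\s+(\d+)" % key, header_text, re.MULTILINE
        )
        if match:
            found[key.lower()] = int(match.group(1))
    return found


def tensorflow_build_info():
    """Return the CUDA/cuDNN versions TensorFlow reports it was built against.

    Assumes TensorFlow >= 2.3, where tf.sysconfig.get_build_info() exists.
    The keys may be missing on CPU-only builds, hence the .get() calls.
    """
    import tensorflow as tf

    info = tf.sysconfig.get_build_info()
    return {"cuda": info.get("cuda_version"), "cudnn": info.get("cudnn_version")}


# Sample text copied from the grep output in this comment.
sample = """#define CUDNN_MAJOR 7
#define CUDNN_MINOR 6
--
#define CUDNN_VERSION (CUDNN_MAJOR * 1000 + CUDNN_MINOR * 100 + CUDNN_PATCHLEVEL)
"""

print(parse_cudnn_version(sample))  # {'major': 7, 'minor': 6}
```

In a Colab cell, calling `tensorflow_build_info()` and comparing its `cudnn` entry with the parsed header version gives a quick first hint about whether the installed TensorFlow and the system cuDNN belong together.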

Qcatbot commented 2 years ago

Here is the error message:

Epoch 1/100

UnknownError Traceback (most recent call last)

in ()
      2
      3 sample_list = data_io.get_indiceslist()
----> 4 model.train(sample_list[0:50], epochs=100)

2 frames

/usr/local/lib/python3.7/dist-packages/tensorflow/python/eager/execute.py in quick_execute(op_name, num_outputs, inputs, attrs, ctx, name)
     57     ctx.ensure_initialized()
     58     tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
---> 59                                         inputs, attrs, num_outputs)
     60   except core._NotOkStatusException as e:
     61     if name is not None:

UnknownError: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[node model/conv3d/Conv3D (defined at /usr/local/lib/python3.7/dist-packages/keras/layers/convolutional.py:238) ]] [Op:__inference_train_function_5792]

Errors may have originated from an input operation.
Input Source operations connected to node model/conv3d/Conv3D:
In[0] IteratorGetNext (defined at /usr/local/lib/python3.7/dist-packages/keras/engine/training.py:866)
In[1] model/conv3d/Conv3D/ReadVariableOp:

Operation defined at: (most recent call last)

File "/usr/lib/python3.7/runpy.py", line 193, in _run_module_as_main "__main__", mod_spec)

File "/usr/lib/python3.7/runpy.py", line 85, in _run_code exec(code, run_globals)

File "/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py", line 16, in app.launch_new_instance()

File "/usr/local/lib/python3.7/dist-packages/traitlets/config/application.py", line 846, in launch_instance app.start()

File "/usr/local/lib/python3.7/dist-packages/ipykernel/kernelapp.py", line 499, in start self.io_loop.start()

File "/usr/local/lib/python3.7/dist-packages/tornado/platform/asyncio.py", line 132, in start self.asyncio_loop.run_forever()

File "/usr/lib/python3.7/asyncio/base_events.py", line 541, in run_forever self._run_once()

File "/usr/lib/python3.7/asyncio/base_events.py", line 1786, in _run_once handle._run()

File "/usr/lib/python3.7/asyncio/events.py", line 88, in _run self._context.run(self._callback, *self._args)

File "/usr/local/lib/python3.7/dist-packages/tornado/platform/asyncio.py", line 122, in _handle_events handler_func(fileobj, events)

File "/usr/local/lib/python3.7/dist-packages/tornado/stack_context.py", line 300, in null_wrapper return fn(*args, **kwargs)

File "/usr/local/lib/python3.7/dist-packages/zmq/eventloop/zmqstream.py", line 452, in _handle_events self._handle_recv()

File "/usr/local/lib/python3.7/dist-packages/zmq/eventloop/zmqstream.py", line 481, in _handle_recv self._run_callback(callback, msg)

File "/usr/local/lib/python3.7/dist-packages/zmq/eventloop/zmqstream.py", line 431, in _run_callback callback(*args, **kwargs)

File "/usr/local/lib/python3.7/dist-packages/tornado/stack_context.py", line 300, in null_wrapper return fn(*args, **kwargs)

File "/usr/local/lib/python3.7/dist-packages/ipykernel/kernelbase.py", line 283, in dispatcher return self.dispatch_shell(stream, msg)

File "/usr/local/lib/python3.7/dist-packages/ipykernel/kernelbase.py", line 233, in dispatch_shell handler(stream, idents, msg)

File "/usr/local/lib/python3.7/dist-packages/ipykernel/kernelbase.py", line 399, in execute_request user_expressions, allow_stdin)

File "/usr/local/lib/python3.7/dist-packages/ipykernel/ipkernel.py", line 208, in do_execute res = shell.run_cell(code, store_history=store_history, silent=silent)

File "/usr/local/lib/python3.7/dist-packages/ipykernel/zmqshell.py", line 537, in run_cell return super(ZMQInteractiveShell, self).run_cell(*args, **kwargs)

File "/usr/local/lib/python3.7/dist-packages/IPython/core/interactiveshell.py", line 2718, in run_cell interactivity=interactivity, compiler=compiler, result=result)

File "/usr/local/lib/python3.7/dist-packages/IPython/core/interactiveshell.py", line 2828, in run_ast_nodes if self.run_code(code, result):

File "/usr/local/lib/python3.7/dist-packages/IPython/core/interactiveshell.py", line 2882, in run_code exec(code_obj, self.user_global_ns, self.user_ns)

File "", line 4, in model.train(sample_list[0:50], epochs=100)

File "/usr/local/lib/python3.7/dist-packages/miscnn/neural_network/model.py", line 137, in train max_queue_size=self.batch_queue_size)

File "/usr/local/lib/python3.7/dist-packages/keras/utils/traceback_utils.py", line 64, in error_handler return fn(*args, **kwargs)

File "/usr/local/lib/python3.7/dist-packages/keras/engine/training.py", line 1216, in fit tmp_logs = self.train_function(iterator)

File "/usr/local/lib/python3.7/dist-packages/keras/engine/training.py", line 878, in train_function return step_function(self, iterator)

File "/usr/local/lib/python3.7/dist-packages/keras/engine/training.py", line 867, in step_function outputs = model.distribute_strategy.run(run_step, args=(data,))

File "/usr/local/lib/python3.7/dist-packages/keras/engine/training.py", line 860, in run_step outputs = model.train_step(data)

File "/usr/local/lib/python3.7/dist-packages/keras/engine/training.py", line 808, in train_step y_pred = self(x, training=True)

File "/usr/local/lib/python3.7/dist-packages/keras/utils/traceback_utils.py", line 64, in error_handler return fn(*args, **kwargs)

File "/usr/local/lib/python3.7/dist-packages/keras/engine/base_layer.py", line 1083, in call outputs = call_fn(inputs, *args, **kwargs)

File "/usr/local/lib/python3.7/dist-packages/keras/utils/traceback_utils.py", line 92, in error_handler return fn(*args, **kwargs)

File "/usr/local/lib/python3.7/dist-packages/keras/engine/functional.py", line 452, in call inputs, training=training, mask=mask)

File "/usr/local/lib/python3.7/dist-packages/keras/engine/functional.py", line 589, in _run_internal_graph outputs = node.layer(*args, **kwargs)

File "/usr/local/lib/python3.7/dist-packages/keras/utils/traceback_utils.py", line 64, in error_handler return fn(*args, **kwargs)

File "/usr/local/lib/python3.7/dist-packages/keras/engine/base_layer.py", line 1083, in call outputs = call_fn(inputs, *args, **kwargs)

File "/usr/local/lib/python3.7/dist-packages/keras/utils/traceback_utils.py", line 92, in error_handler return fn(*args, **kwargs)

File "/usr/local/lib/python3.7/dist-packages/keras/layers/convolutional.py", line 246, in call outputs = self.convolution_op(inputs, self.kernel)

File "/usr/local/lib/python3.7/dist-packages/keras/layers/convolutional.py", line 238, in convolution_op name=self.__class__.__name__)

muellerdo commented 2 years ago

Hello @Qcatbot,

sadly, this is a known issue on Google Colab right now.

Google Colab uses a custom-built TensorFlow 2.8 on Python 3.7, which shouldn't be possible because TensorFlow 2.8 is intended to be built on Python 3.8.

The current live version of MIScnn has a fixed dependency on TensorFlow 2.7. So if you install miscnn via pip, pip tries to downgrade Colab's custom TensorFlow build to 2.7, and that breaks it.

Try not to change the TensorFlow version in Colab (so that it stays compatible with their CUDA version) and install MIScnn from the development branch instead:

pip install git+https://github.com/frankkramer-lab/MIScnn.git@development

The dev branch is more flexible about module versions (it accepts TensorFlow >= 2.6).

Then, it should work! :)

Cheers, Dominik
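Before installing from the development branch, it may help to confirm that the TensorFlow already present in the Colab runtime meets the ">= 2.6" requirement mentioned above and that it sees the GPU. This is a rough sketch under stated assumptions: `meets_minimum` and `report_tensorflow` are hypothetical helper names, not MIScnn or TensorFlow functions, and the version comparison only looks at the major and minor numbers.

```python
import re


def meets_minimum(version, minimum=(2, 6)):
    """Return True if a 'major.minor[.patch]' version string is >= minimum.

    Only the first two numeric components are compared, so '2.8.0' and
    '2.8rc0' are both treated as (2, 8).
    """
    numbers = [int(n) for n in re.findall(r"\d+", version)[:2]]
    numbers += [0] * (2 - len(numbers))  # pad strings such as '2' to (2, 0)
    return tuple(numbers) >= tuple(minimum)


def report_tensorflow():
    """Print the preinstalled TensorFlow version and the GPUs it can see.

    Run this in the Colab runtime without reinstalling TensorFlow first.
    """
    import tensorflow as tf

    print("TensorFlow version:", tf.__version__)
    print("Meets >= 2.6:", meets_minimum(tf.__version__))
    print("Visible GPUs:", tf.config.list_physical_devices("GPU"))


print(meets_minimum("2.8.0"))   # True
print(meets_minimum("1.15.0"))  # False
```

Calling `report_tensorflow()` in a fresh Colab session, before any pip install touches TensorFlow, shows whether the preinstalled build can be left in place as suggested above.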

Qcatbot commented 2 years ago

Hi Dominik, thank you very much for the clarification. I will try your suggestion.