NeuromorphicProcessorProject / snn_toolbox

Toolbox for converting analog to spiking neural networks (ANN to SNN), and running them in a spiking neuron simulator.
MIT License

tensorflow.python.framework.errors_impl.UnknownError: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above. #90

Closed: wangxiao5791509 closed this issue 3 years ago

wangxiao5791509 commented 3 years ago

$ python mnist_pytorch_INI.py

Test accuracy: 95.37%

Initializing INI simulator...

Loading data set from '.npz' files in /home/wangxiao/Documents/siamfc-pytorch/temp/1618541272.721811.

Pytorch model was successfully ported to ONNX.
2021-04-16 10:48:09.440141: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2021-04-16 10:48:09.440267: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-04-16 10:48:09.440862: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties: name: GeForce RTX 2080 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.635 pciBusID: 0000:01:00.0
2021-04-16 10:48:09.440913: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-04-16 10:48:09.441569: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 1 with properties: name: GeForce RTX 2080 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.635 pciBusID: 0000:03:00.0
2021-04-16 10:48:09.441778: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2021-04-16 10:48:09.442850: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2021-04-16 10:48:09.443755: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2021-04-16 10:48:09.443995: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2021-04-16 10:48:09.445239: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2021-04-16 10:48:09.446193: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2021-04-16 10:48:09.448816: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2021-04-16 10:48:09.448909: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-04-16 10:48:09.449519: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-04-16 10:48:09.450113: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-04-16 10:48:09.450690: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-04-16 10:48:09.451233: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0, 1
2021-04-16 10:48:09.451472: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2021-04-16 10:48:09.478616: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3600000000 Hz
2021-04-16 10:48:09.479184: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x557347a0d5f0 executing computations on platform Host. Devices:
2021-04-16 10:48:09.479204: I tensorflow/compiler/xla/service/service.cc:175] StreamExecutor device (0): Host, Default Version
2021-04-16 10:48:09.646857: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-04-16 10:48:09.647184: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties: name: GeForce RTX 2080 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.635 pciBusID: 0000:01:00.0
2021-04-16 10:48:09.647258: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-04-16 10:48:09.647717: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 1 with properties: name: GeForce RTX 2080 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.635 pciBusID: 0000:03:00.0
2021-04-16 10:48:09.647744: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2021-04-16 10:48:09.647753: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2021-04-16 10:48:09.647762: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2021-04-16 10:48:09.647770: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2021-04-16 10:48:09.647778: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2021-04-16 10:48:09.647785: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2021-04-16 10:48:09.647793: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2021-04-16 10:48:09.647893: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-04-16 10:48:09.648456: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-04-16 10:48:09.648923: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-04-16 10:48:09.649441: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-04-16 10:48:09.649898: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0, 1
2021-04-16 10:48:12.080253: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-04-16 10:48:12.080278: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165] 0 1
2021-04-16 10:48:12.080283: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0: N N
2021-04-16 10:48:12.080286: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 1: N N
2021-04-16 10:48:12.080495: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-04-16 10:48:12.080965: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-04-16 10:48:12.081440: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-04-16 10:48:12.081920: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-04-16 10:48:12.082342: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 9114 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2080 Ti, pci bus id: 0000:01:00.0, compute capability: 7.5)
2021-04-16 10:48:12.082581: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-04-16 10:48:12.083030: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-04-16 10:48:12.083480: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 9663 MB memory) -> physical GPU (device: 1, name: GeForce RTX 2080 Ti, pci bus id: 0000:03:00.0, compute capability: 7.5)
2021-04-16 10:48:12.085506: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5573c4cf36a0 executing computations on platform CUDA. Devices:
2021-04-16 10:48:12.085536: I tensorflow/compiler/xla/service/service.cc:175] StreamExecutor device (0): GeForce RTX 2080 Ti, Compute Capability 7.5
2021-04-16 10:48:12.085541: I tensorflow/compiler/xla/service/service.cc:175] StreamExecutor device (1): GeForce RTX 2080 Ti, Compute Capability 7.5
[0, 0, 0, 0, 0, 0, 0, 0]
Unable to use same padding. Add ZeroPadding2D layer to fix shapes.
2021-04-16 10:48:12.520943: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2021-04-16 10:48:12.652267: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2021-04-16 10:48:13.117081: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2021-04-16 10:48:13.130173: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2021-04-16 10:48:13.130234: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
	 [[{{node model/11/Conv2D}}]]
Traceback (most recent call last):
  File "mnist_pytorch_INI.py", line 189, in <module>
    main(config_filepath)
  File "/home/wangxiao/anaconda3/envs/ann2snn/lib/python3.7/site-packages/snntoolbox/bin/run.py", line 31, in main
    run_pipeline(config)
  File "/home/wangxiao/anaconda3/envs/ann2snn/lib/python3.7/site-packages/snntoolbox/bin/utils.py", line 73, in run_pipeline
    config.get('paths', 'filename_ann'))
  File "/home/wangxiao/anaconda3/envs/ann2snn/lib/python3.7/site-packages/snntoolbox/parsing/model_libs/pytorch_input_lib.py", line 129, in load
    output_keras = model_keras.predict(input_numpy)
  File "/home/wangxiao/anaconda3/envs/ann2snn/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training.py", line 909, in predict
    use_multiprocessing=use_multiprocessing)
  File "/home/wangxiao/anaconda3/envs/ann2snn/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training_v2.py", line 462, in predict
    steps=steps, callbacks=callbacks, **kwargs)
  File "/home/wangxiao/anaconda3/envs/ann2snn/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training_v2.py", line 444, in _model_iteration
    total_epochs=1)
  File "/home/wangxiao/anaconda3/envs/ann2snn/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training_v2.py", line 123, in run_one_epoch
    batch_outs = execution_function(iterator)
  File "/home/wangxiao/anaconda3/envs/ann2snn/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training_v2_utils.py", line 86, in execution_function
    distributed_function(input_fn))
  File "/home/wangxiao/anaconda3/envs/ann2snn/lib/python3.7/site-packages/tensorflow_core/python/eager/def_function.py", line 457, in __call__
    result = self._call(*args, **kwds)
  File "/home/wangxiao/anaconda3/envs/ann2snn/lib/python3.7/site-packages/tensorflow_core/python/eager/def_function.py", line 526, in _call
    return self._concrete_stateful_fn._filtered_call(canon_args, canon_kwds)  # pylint: disable=protected-access
  File "/home/wangxiao/anaconda3/envs/ann2snn/lib/python3.7/site-packages/tensorflow_core/python/eager/function.py", line 1141, in _filtered_call
    self.captured_inputs)
  File "/home/wangxiao/anaconda3/envs/ann2snn/lib/python3.7/site-packages/tensorflow_core/python/eager/function.py", line 1224, in _call_flat
    ctx, args, cancellation_manager=cancellation_manager)
  File "/home/wangxiao/anaconda3/envs/ann2snn/lib/python3.7/site-packages/tensorflow_core/python/eager/function.py", line 511, in call
    ctx=ctx)
  File "/home/wangxiao/anaconda3/envs/ann2snn/lib/python3.7/site-packages/tensorflow_core/python/eager/execute.py", line 67, in quick_execute
    six.raise_from(core._status_to_exception(e.code, message), None)
  File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.UnknownError: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
	 [[node model/11/Conv2D (defined at /home/wangxiao/anaconda3/envs/ann2snn/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py:1751) ]] [Op:__inference_distributed_function_406]

Function call stack: distributed_function

================== Dividing Line =======================

Hi, thanks for providing this conversion tool. When I run the example script mnist_pytorch_INI.py, I get the error shown above. My current setup is tensorflow-gpu==2.0 and cudnn==7.6.3. Do you have any ideas on how to address this issue? Looking forward to your reply. Thanks.
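The error text suggests looking for a warning printed earlier in the log. A minimal sketch of how that search could be automated is given here, assuming the log has been saved with one entry per line in the `timestamp: LEVEL file:line] message` layout seen above. The names `LOG_LINE` and `find_problems` are hypothetical helpers for illustration, not part of snn_toolbox or TensorFlow.

```python
import re

# Matches entries such as:
# 2021-04-16 10:48:13.117081: E tensorflow/.../cuda_dnn.cc:329] Could not create ...
LOG_LINE = re.compile(
    r"^(?P<time>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+): "
    r"(?P<level>[IWEF]) (?P<source>\S+):(?P<line>\d+)\] (?P<message>.*)$"
)


def find_problems(log_text):
    """Return (level, source, message) for every W, E, or F entry."""
    problems = []
    for raw in log_text.splitlines():
        match = LOG_LINE.match(raw.strip())
        # Lines that are not log entries (traceback, prints) are skipped.
        if match and match.group("level") in ("W", "E", "F"):
            problems.append(
                (match.group("level"), match.group("source"), match.group("message"))
            )
    return problems


# Three entries copied from the log in this issue.
sample = "\n".join([
    "2021-04-16 10:48:12.652267: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7",
    "2021-04-16 10:48:13.117081: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR",
    "2021-04-16 10:48:13.130234: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Unknown: Failed to get convolution algorithm.",
])

for level, source, message in find_problems(sample):
    print(level, source, message)
```

In this log the first non-informational entry is the `Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR` line from `cuda_dnn.cc`, which comes before the convolution failure.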

wangxiao5791509 commented 3 years ago

I solved this problem by using gpu only and TensorFlow 2.2.
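The exact change behind this fix is not shown in the thread. As a hedged illustration only, one commonly tried workaround for `CUDNN_STATUS_INTERNAL_ERROR` is to restrict the process to a single GPU and let TensorFlow allocate GPU memory on demand. The sketch below sets the `CUDA_VISIBLE_DEVICES` and `TF_FORCE_GPU_ALLOW_GROWTH` environment variables; `configure_gpu_env` is a hypothetical helper name, and whether this matches the author's fix or helps in a given setup is an assumption.

```python
import os


def configure_gpu_env(gpu_index=0, allow_growth=True):
    """Set environment variables read by CUDA and TensorFlow at start-up.

    This is a sketch of a possible workaround, not the fix confirmed in
    the issue. It must run before TensorFlow or PyTorch initializes CUDA.
    """
    # Expose only one physical GPU to the process.
    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_index)
    if allow_growth:
        # Ask TensorFlow to grow GPU memory usage on demand instead of
        # reserving most of the device memory up front.
        os.environ["TF_FORCE_GPU_ALLOW_GROWTH"] = "true"
    return {
        key: os.environ[key]
        for key in ("CUDA_VISIBLE_DEVICES", "TF_FORCE_GPU_ALLOW_GROWTH")
        if key in os.environ
    }


settings = configure_gpu_env(gpu_index=0)
print(settings)
# Import the frameworks only after the variables are set, for example:
# import tensorflow as tf
```

The call would go at the very top of the script, before any framework import, because both variables are only read when the GPU is first initialized.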