edwardyehuang / CAR

CAR: Class-aware Regularizations for Semantic Segmentation (ECCV-2022)
MIT License

UNIMPLEMENTED: DNN library is not found. #1

Closed Z1740220020 closed 2 years ago

Z1740220020 commented 2 years ago

Hello, I'm very interested in your work, but when I run train.py I get the following error. I am using a conda environment with TF 2.8 and cuDNN 8.1 installed.

2022-05-05 10:33:01.148738: W tensorflow/core/framework/op_kernel.cc:1745] OP_REQUIRES failed at conv_ops.cc:1120 : UNIMPLEMENTED: DNN library is not found.
Traceback (most recent call last):
  File "train.py", line 66, in <module>
    app.run(train)
  File "/home/CN/zizhang.wu/anaconda3/envs/jlcar/lib/python3.8/site-packages/absl/app.py", line 312, in run
    _run_main(main, args)
  File "/home/CN/zizhang.wu/anaconda3/envs/jlcar/lib/python3.8/site-packages/absl/app.py", line 258, in _run_main
    sys.exit(main(argv))
  File "train.py", line 45, in train
    training.train(
  File "/tmpnfs/junli/CAR/iseg/core_train.py", line 142, in train
    model.fit(
  File "/home/CN/zizhang.wu/anaconda3/envs/jlcar/lib/python3.8/site-packages/keras/utils/traceback_utils.py", line 67, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/home/CN/zizhang.wu/anaconda3/envs/jlcar/lib/python3.8/site-packages/tensorflow/python/eager/execute.py", line 54, in quick_execute
    tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.UnimplementedError: Graph execution error:

……

  File "/home/CN/zizhang.wu/anaconda3/envs/jlcar/lib/python3.8/site-packages/keras/engine/base_layer.py", line 1096, in __call__
    outputs = call_fn(inputs, *args, **kwargs)
  File "/home/CN/zizhang.wu/anaconda3/envs/jlcar/lib/python3.8/site-packages/keras/utils/traceback_utils.py", line 92, in error_handler
    return fn(*args, **kwargs)
  File "/home/CN/zizhang.wu/anaconda3/envs/jlcar/lib/python3.8/site-packages/keras/layers/convolutional.py", line 248, in call
    outputs = self.convolution_op(inputs, self.kernel)
  File "/home/CN/zizhang.wu/anaconda3/envs/jlcar/lib/python3.8/site-packages/keras/layers/convolutional.py", line 233, in convolution_op
    return tf.nn.convolution(
Node: 'seg_managed/resnet50/conv1_1_conv/Conv2D'
2 root error(s) found.
  (0) UNIMPLEMENTED: DNN library is not found.
    [[{{node seg_managed/resnet50/conv1_1_conv/Conv2D}}]]
    [[div_no_nan_4/ReadVariableOp/_418]]
  (1) UNIMPLEMENTED: DNN library is not found.
    [[{{node seg_managed/resnet50/conv1_1_conv/Conv2D}}]]
0 successful operations.
0 derived errors ignored. [Op:__inference_train_function_40875]

edwardyehuang commented 2 years ago

Hi, could you please paste more of the log, especially the output printed when the program starts up?

edwardyehuang commented 2 years ago

Also, please test the following code in the conda environment, and paste the output here:

import tensorflow as tf

print(tf.__version__)

tf.constant(1) + 2
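
The addition above places a tensor on the GPU, but it does not appear to exercise cuDNN, which TensorFlow seems to load only when an op such as a convolution first needs it. A minimal sketch of a quicker reproduction, assuming TensorFlow is installed in the active conda environment (the function name `cudnn_smoke_test` is hypothetical and not part of the CAR repository):

```python
def cudnn_smoke_test():
    """Run one tiny convolution; return its output shape, or None if TensorFlow is missing."""
    try:
        import tensorflow as tf
    except ImportError:
        return None
    images = tf.random.normal([1, 8, 8, 3])   # NHWC batch holding one 8x8 RGB image
    kernels = tf.random.normal([3, 3, 3, 4])  # 3x3 kernel, 3 input and 4 output channels
    # On a GPU build this op is expected to go through cuDNN, so a version
    # mismatch should surface here rather than after the shuffle buffer fills.
    outputs = tf.nn.conv2d(images, kernels, strides=1, padding="SAME")
    return tuple(outputs.shape.as_list())


print(cudnn_smoke_test())
```

If the cuDNN library found at runtime is incompatible, this call would be expected to fail with the same `DNN library is not found` error within seconds, without loading any dataset.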
Z1740220020 commented 2 years ago

`Python 3.8.13 | packaged by conda-forge | (default, Mar 25 2022, 06:04:18)
[GCC 10.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
>>> print(tf.__version__)
2.8.0
>>> tf.constant(1)+2
2022-05-05 11:26:58.035535: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-05-05 11:27:09.540225: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 1390 MB memory: -> device: 0, name: GeForce RTX 3090, pci bus id: 0000:1a:00.0, compute capability: 8.6
2022-05-05 11:27:09.585875: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 1712 MB memory: -> device: 1, name: GeForce RTX 3090, pci bus id: 0000:1b:00.0, compute capability: 8.6
2022-05-05 11:27:09.587307: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:2 with 1738 MB memory: -> device: 2, name: GeForce RTX 3090, pci bus id: 0000:3d:00.0, compute capability: 8.6
2022-05-05 11:27:09.588804: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:3 with 1738 MB memory: -> device: 3, name: GeForce RTX 3090, pci bus id: 0000:3e:00.0, compute capability: 8.6
2022-05-05 11:27:09.590385: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:4 with 1658 MB memory: -> device: 4, name: GeForce RTX 3090, pci bus id: 0000:88:00.0, compute capability: 8.6
2022-05-05 11:27:09.591782: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:5 with 1738 MB memory: -> device: 5, name: GeForce RTX 3090, pci bus id: 0000:89:00.0, compute capability: 8.6
2022-05-05 11:27:09.593622: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:6 with 22310 MB memory: -> device: 6, name: GeForce RTX 3090, pci bus id: 0000:b1:00.0, compute capability: 8.6
2022-05-05 11:27:09.596372: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:7 with 22310 MB memory: -> device: 7, name: GeForce RTX 3090, pci bus id: 0000:b2:00.0, compute capability: 8.6
`
Z1740220020 commented 2 years ago

`Use the random seed "0"
2022-05-05 10:29:39.320782: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-05-05 10:29:41.756863: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 22310 MB memory: -> device: 0, name: GeForce RTX 3090, pci bus id: 0000:b2:00.0, compute capability: 8.6
INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0',)
I0505 10:29:41.934745 139990914467584 mirrored_strategy.py:374] Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0',)
INFO:tensorflow:Mixed precision compatibility check (mixed_float16): OK
Your GPU will likely run quickly with dtype policy mixed_float16 as it has compute capability of at least 7.0. Your GPU: GeForce RTX 3090, compute capability 8.6
I0505 10:29:41.941570 139990914467584 device_compatibility_check.py:117] Mixed precision compatibility check (mixed_float16): OK
Your GPU will likely run quickly with dtype policy mixed_float16 as it has compute capability of at least 7.0. Your GPU: GeForce RTX 3090, compute capability 8.6
Processed augments = ['RandomScaleAugment', 'RandomBrightnessAugment', 'PadAugment', 'RandomCropAugment', 'RandomFlipAugment']
Processed augments = ['PadAugment']
Processed augments = ['RandomScaleAugment', 'RandomBrightnessAugment', 'PadAugment', 'RandomCropAugment', 'RandomFlipAugment']
Processed augments = ['PadAugment']
------General settings------
------head_name = nl
------apply_car = True
------apply_car_convs = True
------use_multi_lr = False
------use_aux_loss = False
------aux_loss_rate = 0.2

------Baseline settings------
------train_mode = True
------baseline_mode = False
------replace_2nd_last_conv = True

------CAR settings------
------train_mode = True
------use_intra_class_loss = True
------use_inter_class_loss = True
------intra_class_loss_rate = 1.0
------inter_class_loss_rate = 1.0
------use_batch_class_center = True
------use_last_class_center = False
------last_class_center_decay = 0.9
------pooling_rates = [1]
------inter_c2c_loss_threshold = 0.5
------inter_c2p_loss_threshold = 0.25
------intra_class_loss_remove_max = False
------use_inter_c2c_loss = True
------use_inter_c2p_loss = True
------filters = 512
------apply_convs = True
------num_class = 59
------ignore_label = 0
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I0505 10:29:45.385704 139990914467584 cross_device_ops.py:616] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I0505 10:29:45.391795 139990914467584 cross_device_ops.py:616] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I0505 10:29:45.393509 139990914467584 cross_device_ops.py:616] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I0505 10:29:45.394266 139990914467584 cross_device_ops.py:616] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I0505 10:29:45.396034 139990914467584 cross_device_ops.py:616] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I0505 10:29:45.398394 139990914467584 cross_device_ops.py:616] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I0505 10:29:45.412589 139990914467584 cross_device_ops.py:616] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I0505 10:29:45.413255 139990914467584 cross_device_ops.py:616] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I0505 10:29:45.414587 139990914467584 cross_device_ops.py:616] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I0505 10:29:45.415239 139990914467584 cross_device_ops.py:616] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
Load backbone weights models/resnet50_bn.h5 as H5 format
Epoch 1/30
car channels = 512
Using self-loss
WARNING:tensorflow:AutoGraph could not transform <bound method SegMetricWrapper.update_state of <iseg.metrics.seg_metric_wrapper.SegMetricWrapper object at 0x7f50a4098940>> and will run it as-is.
Cause: mangled names are not yet supported
To silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert
W0505 10:30:02.101555 139958505412352 ag_logging.py:142] AutoGraph could not transform <bound method SegMetricWrapper.update_state of <iseg.metrics.seg_metric_wrapper.SegMetricWrapper object at 0x7f50a4098940>> and will run it as-is.
Cause: mangled names are not yet supported
To silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert
Using self-loss
2022-05-05 10:30:31.363184: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:390] Filling up shuffle buffer (this may take a while): 333 of 4998
2022-05-05 10:30:41.337099: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:390] Filling up shuffle buffer (this may take a while): 674 of 4998
2022-05-05 10:30:52.537012: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:390] Filling up shuffle buffer (this may take a while): 860 of 4998
2022-05-05 10:31:01.348961: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:390] Filling up shuffle buffer (this may take a while): 1164 of 4998
2022-05-05 10:31:11.360664: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:390] Filling up shuffle buffer (this may take a while): 1373 of 4998
2022-05-05 10:31:21.350970: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:390] Filling up shuffle buffer (this may take a while): 1679 of 4998
2022-05-05 10:31:31.354664: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:390] Filling up shuffle buffer (this may take a while): 1994 of 4998
2022-05-05 10:31:41.341893: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:390] Filling up shuffle buffer (this may take a while): 2366 of 4998
2022-05-05 10:31:51.435053: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:390] Filling up shuffle buffer (this may take a while): 2750 of 4998
2022-05-05 10:32:01.347929: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:390] Filling up shuffle buffer (this may take a while): 3108 of 4998
2022-05-05 10:32:11.345512: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:390] Filling up shuffle buffer (this may take a while): 3436 of 4998
2022-05-05 10:32:21.352060: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:390] Filling up shuffle buffer (this may take a while): 3772 of 4998
2022-05-05 10:32:31.412115: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:390] Filling up shuffle buffer (this may take a while): 4116 of 4998
2022-05-05 10:32:41.342272: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:390] Filling up shuffle buffer (this may take a while): 4388 of 4998
2022-05-05 10:32:51.357047: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:390] Filling up shuffle buffer (this may take a while): 4726 of 4998
2022-05-05 10:32:59.413482: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:415] Shuffle buffer filled.
2022-05-05 10:33:01.127918: E tensorflow/stream_executor/cuda/cuda_dnn.cc:361] Loaded runtime CuDNN library: 8.0.4 but source was compiled with: 8.1.0. CuDNN library needs to have matching major version and equal or higher minor version. If using a binary install, upgrade your CuDNN library. If building from sources, make sure the library loaded at runtime is compatible with the version specified during compile configuration.
2022-05-05 10:33:01.148738: W tensorflow/core/framework/op_kernel.cc:1745] OP_REQUIRES failed at conv_ops.cc:1120 : UNIMPLEMENTED: DNN library is not found.
Traceback (most recent call last):
  File "train.py", line 66, in <module>
    app.run(train)
  File "/home/CN/zizhang.wu/anaconda3/envs/jlcar/lib/python3.8/site-packages/absl/app.py", line 312, in run
    _run_main(main, args)
  File "/home/CN/zizhang.wu/anaconda3/envs/jlcar/lib/python3.8/site-packages/absl/app.py", line 258, in _run_main
    sys.exit(main(argv))
  File "train.py", line 45, in train
    training.train(
  File "/tmpnfs/junli/CAR/iseg/core_train.py", line 142, in train
    model.fit(
  File "/home/CN/zizhang.wu/anaconda3/envs/jlcar/lib/python3.8/site-packages/keras/utils/traceback_utils.py", line 67, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/home/CN/zizhang.wu/anaconda3/envs/jlcar/lib/python3.8/site-packages/tensorflow/python/eager/execute.py", line 54, in quick_execute
    tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.UnimplementedError: Graph execution error:

Detected at node 'seg_managed/resnet50/conv1_1_conv/Conv2D' defined at (most recent call last):
  File "/home/CN/zizhang.wu/anaconda3/envs/jlcar/lib/python3.8/threading.py", line 890, in _bootstrap
    self._bootstrap_inner()
  File "/home/CN/zizhang.wu/anaconda3/envs/jlcar/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/home/CN/zizhang.wu/anaconda3/envs/jlcar/lib/python3.8/site-packages/keras/engine/training.py", line 1000, in run_step
    outputs = model.train_step(data)
  File "/home/CN/zizhang.wu/anaconda3/envs/jlcar/lib/python3.8/site-packages/keras/engine/training.py", line 859, in train_step
    y_pred = self(x, training=True)
  File "/home/CN/zizhang.wu/anaconda3/envs/jlcar/lib/python3.8/site-packages/keras/utils/traceback_utils.py", line 64, in error_handler
    return fn(*args, **kwargs)
  File "/home/CN/zizhang.wu/anaconda3/envs/jlcar/lib/python3.8/site-packages/keras/engine/base_layer.py", line 1096, in __call__
    outputs = call_fn(inputs, *args, **kwargs)
  File "/home/CN/zizhang.wu/anaconda3/envs/jlcar/lib/python3.8/site-packages/keras/utils/traceback_utils.py", line 92, in error_handler
    return fn(*args, **kwargs)
  File "/tmpnfs/junli/CAR/iseg/layers/core_model_ext.py", line 105, in call
    endpoints = self.backbone(backbone_inputs, training=training)
  File "/home/CN/zizhang.wu/anaconda3/envs/jlcar/lib/python3.8/site-packages/keras/utils/traceback_utils.py", line 64, in error_handler
    return fn(*args, **kwargs)
  File "/home/CN/zizhang.wu/anaconda3/envs/jlcar/lib/python3.8/site-packages/keras/engine/base_layer.py", line 1096, in __call__
    outputs = call_fn(inputs, *args, **kwargs)
  File "/home/CN/zizhang.wu/anaconda3/envs/jlcar/lib/python3.8/site-packages/keras/utils/traceback_utils.py", line 92, in error_handler
    return fn(*args, **kwargs)
  File "/tmpnfs/junli/CAR/iseg/backbones/resnet_common.py", line 193, in call
    x = conv1_fn(inputs, training=training, **kwargs)
  File "/tmpnfs/junli/CAR/iseg/backbones/resnet_common.py", line 173, in compute_3x3_resnet
    x = self.conv1_1_conv(inputs)
  File "/home/CN/zizhang.wu/anaconda3/envs/jlcar/lib/python3.8/site-packages/keras/utils/traceback_utils.py", line 64, in error_handler
    return fn(*args, **kwargs)
  File "/home/CN/zizhang.wu/anaconda3/envs/jlcar/lib/python3.8/site-packages/keras/engine/base_layer.py", line 1096, in __call__
    outputs = call_fn(inputs, *args, **kwargs)
  File "/home/CN/zizhang.wu/anaconda3/envs/jlcar/lib/python3.8/site-packages/keras/utils/traceback_utils.py", line 92, in error_handler
    return fn(*args, **kwargs)
  File "/home/CN/zizhang.wu/anaconda3/envs/jlcar/lib/python3.8/site-packages/keras/layers/convolutional.py", line 248, in call
    outputs = self.convolution_op(inputs, self.kernel)
  File "/home/CN/zizhang.wu/anaconda3/envs/jlcar/lib/python3.8/site-packages/keras/layers/convolutional.py", line 233, in convolution_op
    return tf.nn.convolution(
Node: 'seg_managed/resnet50/conv1_1_conv/Conv2D'
Detected at node 'seg_managed/resnet50/conv1_1_conv/Conv2D' defined at (most recent call last):
  File "/home/CN/zizhang.wu/anaconda3/envs/jlcar/lib/python3.8/threading.py", line 890, in _bootstrap
    self._bootstrap_inner()
  File "/home/CN/zizhang.wu/anaconda3/envs/jlcar/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/home/CN/zizhang.wu/anaconda3/envs/jlcar/lib/python3.8/site-packages/keras/engine/training.py", line 1000, in run_step
    outputs = model.train_step(data)
  File "/home/CN/zizhang.wu/anaconda3/envs/jlcar/lib/python3.8/site-packages/keras/engine/training.py", line 859, in train_step
    y_pred = self(x, training=True)
  File "/home/CN/zizhang.wu/anaconda3/envs/jlcar/lib/python3.8/site-packages/keras/utils/traceback_utils.py", line 64, in error_handler
    return fn(*args, **kwargs)
  File "/home/CN/zizhang.wu/anaconda3/envs/jlcar/lib/python3.8/site-packages/keras/engine/base_layer.py", line 1096, in __call__
    outputs = call_fn(inputs, *args, **kwargs)
  File "/home/CN/zizhang.wu/anaconda3/envs/jlcar/lib/python3.8/site-packages/keras/utils/traceback_utils.py", line 92, in error_handler
    return fn(*args, **kwargs)
  File "/tmpnfs/junli/CAR/iseg/layers/core_model_ext.py", line 105, in call
    endpoints = self.backbone(backbone_inputs, training=training)
  File "/home/CN/zizhang.wu/anaconda3/envs/jlcar/lib/python3.8/site-packages/keras/utils/traceback_utils.py", line 64, in error_handler
    return fn(*args, **kwargs)
  File "/home/CN/zizhang.wu/anaconda3/envs/jlcar/lib/python3.8/site-packages/keras/engine/base_layer.py", line 1096, in __call__
    outputs = call_fn(inputs, *args, **kwargs)
  File "/home/CN/zizhang.wu/anaconda3/envs/jlcar/lib/python3.8/site-packages/keras/utils/traceback_utils.py", line 92, in error_handler
    return fn(*args, **kwargs)
  File "/tmpnfs/junli/CAR/iseg/backbones/resnet_common.py", line 193, in call
    x = conv1_fn(inputs, training=training, **kwargs)
  File "/tmpnfs/junli/CAR/iseg/backbones/resnet_common.py", line 173, in compute_3x3_resnet
    x = self.conv1_1_conv(inputs)
  File "/home/CN/zizhang.wu/anaconda3/envs/jlcar/lib/python3.8/site-packages/keras/utils/traceback_utils.py", line 64, in error_handler
    return fn(*args, **kwargs)
  File "/home/CN/zizhang.wu/anaconda3/envs/jlcar/lib/python3.8/site-packages/keras/engine/base_layer.py", line 1096, in __call__
    outputs = call_fn(inputs, *args, **kwargs)
  File "/home/CN/zizhang.wu/anaconda3/envs/jlcar/lib/python3.8/site-packages/keras/utils/traceback_utils.py", line 92, in error_handler
    return fn(*args, **kwargs)
  File "/home/CN/zizhang.wu/anaconda3/envs/jlcar/lib/python3.8/site-packages/keras/layers/convolutional.py", line 248, in call
    outputs = self.convolution_op(inputs, self.kernel)
  File "/home/CN/zizhang.wu/anaconda3/envs/jlcar/lib/python3.8/site-packages/keras/layers/convolutional.py", line 233, in convolution_op
    return tf.nn.convolution(
Node: 'seg_managed/resnet50/conv1_1_conv/Conv2D'
2 root error(s) found.
  (0) UNIMPLEMENTED: DNN library is not found.
    [[{{node seg_managed/resnet50/conv1_1_conv/Conv2D}}]]
    [[div_no_nan_4/ReadVariableOp/_418]]
  (1) UNIMPLEMENTED: DNN library is not found.
    [[{{node seg_managed/resnet50/conv1_1_conv/Conv2D}}]]
0 successful operations.
0 derived errors ignored. [Op:__inference_train_function_40875]
`

edwardyehuang commented 2 years ago

2022-05-05 10:33:01.127918: E tensorflow/stream_executor/cuda/cuda_dnn.cc:361] Loaded runtime CuDNN library: 8.0.4 but source was compiled with: 8.1.0. CuDNN library needs to have matching major version and equal or higher minor version. If using a binary install, upgrade your CuDNN library. If building from sources, make sure the library loaded at runtime is compatible with the version specified during compile configuration.

TensorFlow is still loading cuDNN 8.0.4 at runtime. TensorFlow 2.8 was compiled against cuDNN 8.1.0, so the runtime library must be 8.1.0 or a later 8.x release.
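
The compatibility rule stated in that log message (same major version, equal or higher minor version) can be checked mechanically. A minimal sketch, assuming the message keeps the wording quoted above; the helper names `parse_cudnn_versions` and `is_compatible` are hypothetical and not part of TensorFlow:

```python
import re

# Matches the cuda_dnn.cc message quoted in this thread.
_CUDNN_MISMATCH = re.compile(
    r"Loaded runtime CuDNN library: (\d+)\.(\d+)\.(\d+) "
    r"but source was compiled with: (\d+)\.(\d+)\.(\d+)"
)


def parse_cudnn_versions(log_line):
    """Return ((loaded), (compiled)) version tuples, or None if the line does not match."""
    match = _CUDNN_MISMATCH.search(log_line)
    if match is None:
        return None
    numbers = tuple(int(group) for group in match.groups())
    return numbers[:3], numbers[3:]


def is_compatible(loaded, compiled):
    """Rule from the log: matching major version and equal or higher minor version."""
    return loaded[0] == compiled[0] and loaded[1] >= compiled[1]


line = "Loaded runtime CuDNN library: 8.0.4 but source was compiled with: 8.1.0."
loaded, compiled = parse_cudnn_versions(line)
print(loaded, compiled, is_compatible(loaded, compiled))
# prints: (8, 0, 4) (8, 1, 0) False
```

Here the loaded minor version (0) is lower than the compiled one (1), which is why the convolution fails even though the major versions match.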

Please follow https://github.com/edwardyehuang/CAR/blob/master/docs/install_tf28.md to install cudnn=8.1.0, and export the path to that cuDNN installation so that TensorFlow loads it instead of 8.0.4.
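
To see which cuDNN libraries are visible on the exported path, something like the following may help. It is only a sketch under stated assumptions: a Linux system, libraries named `libcudnn.so*`, and `LD_LIBRARY_PATH` as the variable being exported. The dynamic loader also consults other locations, so this gives a partial view, and the function name `find_cudnn_libs` is hypothetical:

```python
import os
from pathlib import Path


def find_cudnn_libs(search_path=None):
    """List libcudnn.so* files in each directory of a path string, in search order."""
    if search_path is None:
        # Assumption: the cuDNN directory was exported through LD_LIBRARY_PATH.
        search_path = os.environ.get("LD_LIBRARY_PATH", "")
    found = []
    for entry in search_path.split(os.pathsep):
        if not entry:
            continue  # skip empty entries such as a trailing separator
        directory = Path(entry)
        if directory.is_dir():
            found.extend(sorted(str(p) for p in directory.glob("libcudnn.so*")))
    return found


for library in find_cudnn_libs():
    print(library)
```

If an 8.0.x library is listed in a directory that comes before the one holding 8.1.0, that ordering is a plausible reason for the older version being loaded.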

Z1740220020 commented 2 years ago

Many thanks to the author for his patient guidance. This solved my problem, and now the code works fine.