IBM / tensorflow-large-model-support

Large Model Support in Tensorflow
Apache License 2.0

Using TFLMS and getting an error "cudnnSetTensorNdDescriptor" #60

Closed junaidjawaid1 closed 1 year ago

junaidjawaid1 commented 1 year ago

I am trying to apply TFLMS to the 3D U-Net I am training on a remote server at my university, which I access over SSH. The GPU, an NVIDIA A100 80GB, is partitioned into two virtual 40 GB GPUs (MIG), and I am using one of them. I am getting the error output below. I would really appreciate any help or guidance, thank you.

Note that the input shape is (512, 8, 512, 2).

Traceback (most recent call last):
  File "", line 1, in
ModuleNotFoundError: No module named 'nvidia'
dirname: missing operand
Try 'dirname --help' for more information.
/opt/gridengine/default/spool/compute-0-3/job_scripts/107615: line 8: $'\342\200\213': command not found
/opt/gridengine/default/spool/compute-0-3/job_scripts/107615: line 10: $'\342\200\213': command not found
Traceback (most recent call last):
  File "", line 1, in
ModuleNotFoundError: No module named 'nvidia'
dirname: missing operand
Try 'dirname --help' for more information.
2023-08-18 00:31:54.424305: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.2
2023-08-18 00:31:55.681930: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libnvinfer.so.7
2023-08-18 00:31:55.683778: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libnvinfer_plugin.so.7
2023-08-18 00:31:58.094461: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2023-08-18 00:31:58.138169: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-08-18 00:31:58.139442: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1558] Found device 0 with properties:
pciBusID: 2c8f7:00:00.0 name: NVIDIA A100 80GB PCIe MIG 1c.4g.40gb computeCapability: 8.0
coreClock: 1.41GHz coreCount: 14 deviceMemorySize: 39.25GiB deviceMemoryBandwidth: 901.22GiB/s
2023-08-18 00:31:58.139467: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.2
2023-08-18 00:31:58.139498: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2023-08-18 00:31:58.142424: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2023-08-18 00:31:58.143428: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2023-08-18 00:31:58.146270: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2023-08-18 00:31:58.147827: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2023-08-18 00:31:58.147873: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2023-08-18 00:31:58.147974: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-08-18 00:31:58.149157: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-08-18 00:31:58.150258: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1700] Adding visible gpu devices: 0
2023-08-18 00:31:58.162036: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 AVX512F FMA
2023-08-18 00:31:58.187824: I tensorflow/core/platform/profile_utils/cpu_utils.cc:101] CPU Frequency: 2249595000 Hz
2023-08-18 00:31:58.191960: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55b0d8455290 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2023-08-18 00:31:58.191991: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
2023-08-18 00:31:58.382680: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-08-18 00:31:58.383651: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55b0d84bbc10 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2023-08-18 00:31:58.383675: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): NVIDIA A100 80GB PCIe MIG 1c.4g.40gb, Compute Capability 8.0
2023-08-18 00:31:58.383972: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-08-18 00:31:58.384824: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1558] Found device 0 with properties:
pciBusID: 2c8f7:00:00.0 name: NVIDIA A100 80GB PCIe MIG 1c.4g.40gb computeCapability: 8.0
coreClock: 1.41GHz coreCount: 14 deviceMemorySize: 39.25GiB deviceMemoryBandwidth: 901.22GiB/s
2023-08-18 00:31:58.384853: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.2
2023-08-18 00:31:58.384867: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2023-08-18 00:31:58.384884: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2023-08-18 00:31:58.384893: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2023-08-18 00:31:58.384902: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2023-08-18 00:31:58.384910: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2023-08-18 00:31:58.384918: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2023-08-18 00:31:58.384975: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-08-18 00:31:58.385763: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-08-18 00:31:58.386491: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1700] Adding visible gpu devices: 0
2023-08-18 00:31:58.386521: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.2
2023-08-18 00:38:14.883150: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1099] Device interconnect StreamExecutor with strength 1 edge matrix:
2023-08-18 00:38:14.883439: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1105] 0
2023-08-18 00:38:14.883454: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1118] 0: N
2023-08-18 00:38:14.883929: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-08-18 00:38:14.884907: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-08-18 00:38:14.886230: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1244] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 37896 MB memory) -> physical GPU (device: 0, name: NVIDIA A100 80GB PCIe MIG 1c.4g.40gb, pci bus id: 2c8f7:00:00.0, compute capability: 8.0)
WARNING:tensorflow:sample_weight modes were coerced from ... to ['...']
WARNING:tensorflow:sample_weight modes were coerced from ... to ['...']
2023-08-18 00:39:18.317685: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2023-08-18 00:39:55.998760: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2023-08-18 00:48:21.634219: W tensorflow/stream_executor/gpu/redzone_allocator.cc:312] Internal: ptxas exited with non-zero error code 65280, output: ptxas fatal : Value 'sm_80' is not defined for option 'gpu-name'

Relying on driver to perform ptx compilation. This message will be only logged once.
2023-08-18 00:51:29.190383: W tensorflow/core/kernels/gpu_utils.cc:48] Failed to allocate memory for convolution redzone checking; skipping this check. This is benign and only means that we won't check cudnn for out-of-bounds reads and writes. This message will only be printed once.
2023-08-18 00:51:29.190988: F tensorflow/stream_executor/cuda/cuda_dnn.cc:516] Check failed: cudnnSetTensorNdDescriptor(handle.get(), elem_type, nd, dims.data(), strides.data()) == CUDNN_STATUS_SUCCESS (9 vs. 0)batch_descriptor: {count: 7 feature_map_count: 145 spatial: 513 9 513 value_min: 0.000000 value_max: 0.000000 layout: BatchDepthYX}
/opt/gridengine/default/spool/compute-0-3/job_scripts/107615: line 13: 88262 Aborted python $PYTHON_SCRIPT
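
A quick arithmetic check on the fatal log line may help narrow this down. The sketch below multiplies out the dimensions printed in the batch_descriptor (count 7, feature_map_count 145, spatial 513 x 9 x 513) and compares the result with a per-tensor element limit. Two things here are assumptions, not facts taken from the log: that status code 9 corresponds to CUDNN_STATUS_NOT_SUPPORTED in the cuDNN 7 enumeration, and that cuDNN caps a single tensor descriptor at about 2 giga-elements (taken here as 2**31). The names `CUDNN_MAX_ELEMENTS`, `element_count`, and `largest_fitting_batch` are hypothetical helpers for illustration only.

```python
# Dimensions copied from the batch_descriptor in the fatal log line.
count = 7                  # batch dimension
feature_map_count = 145    # channel dimension
spatial = (513, 9, 513)    # three spatial extents

# Assumption: cuDNN limits one tensor descriptor to roughly 2 giga-elements.
CUDNN_MAX_ELEMENTS = 2**31


def element_count(batch, channels, spatial_dims):
    """Return the total number of elements in a batch x channels x spatial tensor."""
    total = batch * channels
    for extent in spatial_dims:
        total *= extent
    return total


def largest_fitting_batch(channels, spatial_dims, limit=CUDNN_MAX_ELEMENTS):
    """Return the largest batch size whose element count stays within the limit."""
    per_sample = element_count(1, channels, spatial_dims)
    return limit // per_sample


total = element_count(count, feature_map_count, spatial)
max_batch = largest_fitting_batch(feature_map_count, spatial)

print(f"elements in failing tensor: {total:,}")
print(f"assumed cuDNN limit:        {CUDNN_MAX_ELEMENTS:,}")
print(f"exceeds assumed limit:      {total > CUDNN_MAX_ELEMENTS}")
print(f"largest batch that fits:    {max_batch}")
```

If those assumptions hold, this particular tensor is larger than cuDNN accepts in one descriptor regardless of how much GPU memory is free, so swapping tensors to host memory with TFLMS would probably not help on its own. A smaller batch or smaller spatial extents for this layer may be worth trying first.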