NVIDIA / framework-reproducibility

Providing reproducibility in deep learning frameworks

Exception thrown with "No algorithm worked!" message on NGC 20.09 #27

Closed: disembarrasing closed this issue 3 years ago

disembarrasing commented 3 years ago

Used script

import os
import tensorflow as tf
from tensorflow.keras.preprocessing import image_dataset_from_directory
from tensorflow.keras.callbacks import ModelCheckpoint, TensorBoard

PATH = "/home/kw/DATASETS/liveness"
train_dir = os.path.join(PATH, "train")
validation_dir = os.path.join(PATH, "validation")
model_path = os.path.join(f"{PATH}/models", 'cp_weights_test-{epoch:04d}.h5')

BATCH_SIZE = 32
IMG_SIZE = (300, 300)

train_dataset = image_dataset_from_directory(train_dir,
                                             shuffle=True,
                                             batch_size=BATCH_SIZE,
                                             image_size=IMG_SIZE)
validation_dataset = image_dataset_from_directory(validation_dir,
                                                  shuffle=True,
                                                  batch_size=BATCH_SIZE,
                                                  image_size=IMG_SIZE)
class_names = train_dataset.class_names
val_batches = tf.data.experimental.cardinality(validation_dataset)
test_dataset = validation_dataset.take(val_batches // 5)
validation_dataset = validation_dataset.skip(val_batches // 5)

AUTOTUNE = tf.data.experimental.AUTOTUNE

train_dataset = train_dataset.prefetch(buffer_size=AUTOTUNE)
validation_dataset = validation_dataset.prefetch(buffer_size=AUTOTUNE)
test_dataset = test_dataset.prefetch(buffer_size=AUTOTUNE)

data_augmentation = tf.keras.Sequential([
    tf.keras.layers.experimental.preprocessing.RandomFlip('horizontal'),
    tf.keras.layers.experimental.preprocessing.RandomRotation(0.2),
])

preprocess_input = tf.keras.applications.resnet50.preprocess_input
IMG_SHAPE = IMG_SIZE + (3,)

base_model = tf.keras.applications.ResNet50(input_shape=IMG_SHAPE,
                                            include_top=False,
                                            weights='imagenet')

base_model.trainable = False
global_average_layer = tf.keras.layers.GlobalAveragePooling2D()

prediction_layer = tf.keras.layers.Dense(1, activation="sigmoid")

inputs = tf.keras.Input(shape=IMG_SHAPE)
x = data_augmentation(inputs)
x = preprocess_input(x)
x = base_model(x, training=False)
x = global_average_layer(x)
x = tf.keras.layers.Dropout(0.2)(x)
outputs = prediction_layer(x)
model = tf.keras.Model(inputs, outputs)

base_learning_rate = 0.0001
model.compile(optimizer=tf.keras.optimizers.Adam(lr=base_learning_rate),
              loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              metrics=["accuracy"])

cb_checkpointer = ModelCheckpoint(
    filepath=model_path,
    verbose=1,
    monitor='val_loss',
    save_best_only=True,
    save_weights_only=False,
    mode='auto'
)

tensorboard_callback = TensorBoard(log_dir=f"{PATH}/training_process")

initial_epochs = 20

loss0, accuracy0 = model.evaluate(validation_dataset)

print("initial loss: {:.2f}".format(loss0))
print("initial accuracy: {:.2f}".format(accuracy0))

history = model.fit(train_dataset,
                    epochs=initial_epochs,
                    validation_data=validation_dataset,
                    callbacks=[tensorboard_callback, cb_checkpointer])

Run command

docker run --gpus all --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 --rm -it -v /home/kw/DATASETS/liveness:/home/kw/DATASETS/liveness -v /home/kw/PycharmProjects/liveness_resnet/:/home/kw/PycharmProjects/liveness_resnet --name tf_2009 nvcr.io/nvidia/tensorflow:20.09-tf2-py3

Bug description

After running the above script, an exception is thrown with the message "No algorithm worked!".

Traceback (most recent call last):
  File "resnet50_test.py", line 80, in <module>
    loss0, accuracy0 = model.evaluate(validation_dataset)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/training.py", line 108, in _method_wrapper
    return method(self, *args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/training.py", line 1379, in evaluate
    tmp_logs = test_function(iterator)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/def_function.py", line 780, in __call__
    result = self._call(*args, **kwds)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/def_function.py", line 840, in _call
    return self._stateless_fn(*args, **kwds)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 2829, in __call__
    return graph_function._filtered_call(args, kwargs)  # pylint: disable=protected-access
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 1848, in _filtered_call
    cancellation_manager=cancellation_manager)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 1924, in _call_flat
    ctx, args, cancellation_manager=cancellation_manager))
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 550, in call
    ctx=ctx)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/execute.py", line 60, in quick_execute
    inputs, attrs, num_outputs)
tensorflow.python.framework.errors_impl.NotFoundError:  No algorithm worked!
     [[node functional_1/resnet50/conv1_conv/Conv2D (defined at resnet50_test.py:80) ]] [Op:__inference_test_function_9650]

Function call stack:
test_function

Full output

2020-10-02 14:56:25.284418: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
Found 3745 files belonging to 2 classes.
2020-10-02 14:56:26.080110: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2020-10-02 14:56:26.114436: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-10-02 14:56:26.114830: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties: 
pciBusID: 0000:26:00.0 name: GeForce RTX 2070 SUPER computeCapability: 7.5
coreClock: 1.77GHz coreCount: 40 deviceMemorySize: 7.79GiB deviceMemoryBandwidth: 417.29GiB/s
2020-10-02 14:56:26.114847: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2020-10-02 14:56:26.121483: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2020-10-02 14:56:26.125960: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2020-10-02 14:56:26.127291: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2020-10-02 14:56:26.134645: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
2020-10-02 14:56:26.136239: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.11
2020-10-02 14:56:26.136378: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2020-10-02 14:56:26.136497: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-10-02 14:56:26.137051: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-10-02 14:56:26.137504: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1858] Adding visible gpu devices: 0
2020-10-02 14:56:26.143549: I tensorflow/core/platform/profile_utils/cpu_utils.cc:104] CPU Frequency: 3600005000 Hz
2020-10-02 14:56:26.143911: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5e00e40 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-10-02 14:56:26.143922: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2020-10-02 14:56:26.211870: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-10-02 14:56:26.212328: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x576b780 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2020-10-02 14:56:26.212351: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): GeForce RTX 2070 SUPER, Compute Capability 7.5
2020-10-02 14:56:26.212605: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-10-02 14:56:26.213213: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties: 
pciBusID: 0000:26:00.0 name: GeForce RTX 2070 SUPER computeCapability: 7.5
coreClock: 1.77GHz coreCount: 40 deviceMemorySize: 7.79GiB deviceMemoryBandwidth: 417.29GiB/s
2020-10-02 14:56:26.213243: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2020-10-02 14:56:26.213268: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2020-10-02 14:56:26.213281: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2020-10-02 14:56:26.213297: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2020-10-02 14:56:26.213318: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
2020-10-02 14:56:26.213337: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.11
2020-10-02 14:56:26.213354: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2020-10-02 14:56:26.213457: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-10-02 14:56:26.214100: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-10-02 14:56:26.214679: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1858] Adding visible gpu devices: 0
2020-10-02 14:56:26.214710: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2020-10-02 14:56:26.450418: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1257] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-10-02 14:56:26.450448: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1263]      0 
2020-10-02 14:56:26.450465: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1276] 0:   N 
2020-10-02 14:56:26.450641: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-10-02 14:56:26.451012: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-10-02 14:56:26.451334: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1402] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7019 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2070 SUPER, pci bus id: 0000:26:00.0, compute capability: 7.5)
WARNING:tensorflow:The operation `tf.image.convert_image_dtype` will be skipped since the input and output dtypes are identical.
WARNING:tensorflow:The operation `tf.image.convert_image_dtype` will be skipped since the input and output dtypes are identical.
WARNING:tensorflow:The operation `tf.image.convert_image_dtype` will be skipped since the input and output dtypes are identical.
WARNING:tensorflow:The operation `tf.image.convert_image_dtype` will be skipped since the input and output dtypes are identical.
Found 2402 files belonging to 2 classes.
WARNING:tensorflow:The operation `tf.image.convert_image_dtype` will be skipped since the input and output dtypes are identical.
WARNING:tensorflow:The operation `tf.image.convert_image_dtype` will be skipped since the input and output dtypes are identical.
WARNING:tensorflow:The operation `tf.image.convert_image_dtype` will be skipped since the input and output dtypes are identical.
WARNING:tensorflow:The operation `tf.image.convert_image_dtype` will be skipped since the input and output dtypes are identical.
Downloading data from https://storage.googleapis.com/tensorflow/keras-applications/resnet/resnet50_weights_tf_dim_ordering_tf_kernels_notop.h5
94773248/94765736 [==============================] - 4s 0us/step
2020-10-02 14:56:32.404900: I tensorflow/core/profiler/lib/profiler_session.cc:164] Profiler session started.
2020-10-02 14:56:32.404939: I tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1391] Profiler found 1 GPUs
2020-10-02 14:56:32.409414: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcupti.so.11.0
2020-10-02 14:56:32.509387: I tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1513] CUPTI activity buffer flushed
2020-10-02 14:56:33.290028: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2020-10-02 14:56:34.436539: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2020-10-02 14:56:36.117721: W tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at conv_ops.cc:1115 : Not found: No algorithm worked!
Traceback (most recent call last):
  File "resnet50_test.py", line 80, in <module>
    loss0, accuracy0 = model.evaluate(validation_dataset)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/training.py", line 108, in _method_wrapper
    return method(self, *args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/training.py", line 1379, in evaluate
    tmp_logs = test_function(iterator)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/def_function.py", line 780, in __call__
    result = self._call(*args, **kwds)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/def_function.py", line 840, in _call
    return self._stateless_fn(*args, **kwds)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 2829, in __call__
    return graph_function._filtered_call(args, kwargs)  # pylint: disable=protected-access
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 1848, in _filtered_call
    cancellation_manager=cancellation_manager)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 1924, in _call_flat
    ctx, args, cancellation_manager=cancellation_manager))
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 550, in call
    ctx=ctx)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/execute.py", line 60, in quick_execute
    inputs, attrs, num_outputs)
tensorflow.python.framework.errors_impl.NotFoundError:  No algorithm worked!
     [[node functional_1/resnet50/conv1_conv/Conv2D (defined at resnet50_test.py:80) ]] [Op:__inference_test_function_9650]

Function call stack:
test_function

System information

The script provided above works perfectly on a twin configuration (same OS, Docker image, Docker version, and driver version), but with a GTX 1060.

duncanriach commented 3 years ago

Hi @disembarrasing, it doesn't look like you've set TF_DETERMINISTIC_OPS='true'. Are you attempting to get GPU-deterministic operation? What's your reasoning behind submitting this issue to this repo/project?

disembarrasing commented 3 years ago

Hi, I'm confused by this error, because I haven't changed anything inside the container, yet it behaves differently on two separate systems. Moreover, as stated in the documentation, NGC TensorFlow Docker images, starting with version 19.06, implement GPU-deterministic op functionality. Version 19.12 (and beyond) also implements multi-algorithm deterministic cuDNN convolutions, which solves the problem of some layer configurations causing an exception to be thrown with the message "No algorithm worked!". Is the issue I'm experiencing unrelated, and should I address it somewhere else?

duncanriach commented 3 years ago

It's cool that you found your way to this repo. That documentation did come from me. However, it's supposed to be referring to the functionality that is enabled when TF_DETERMINISTIC_OPS is set to true or 1 (perhaps that's ambiguous). When not enabling op determinism, there are other potential causes of that exception being thrown. It means that TensorFlow was unable (for some reason) to find an appropriate cuDNN convolution algorithm (for functional_1/resnet50/conv1_conv/Conv2D in the specific example that you've provided).
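To be concrete, enabling that functionality looks roughly like the sketch below; you can equally export TF_DETERMINISTIC_OPS=1 in the shell before launching the script:

import os

# Enable the GPU-deterministic op functionality in NGC TF containers (19.06+).
# To be safe, set this before TensorFlow is imported / before any GPU ops run.
os.environ['TF_DETERMINISTIC_OPS'] = '1'

import tensorflow as tf  # rest of the training script follows unchanged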

I could dig into this, but it's almost certainly not related to framework determinism in TensorFlow. I'm figuring out where a more appropriate place to report this issue might be.

Meanwhile, I have a question about your setup. I see that the 2070, the hardware on which you're seeing this exception, has 8GB of memory. How much memory is on the 1060, the hardware on which you don't get the exception?

disembarrasing commented 3 years ago

Thank you for the reply. The 1060 I'm using has 6GB of memory.

duncanriach commented 3 years ago

Wonderful. That's a useful datapoint. Would you be able to run this script for me inside a stock TensorFlow container to see if this issue repros there? (I don't think it references anything NGC-specific.) If so, try using tensorflow/tensorflow:2.3.0-gpu.
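Something along these lines should work; it's just your original command with the image swapped (the container name is arbitrary):

docker run --gpus all --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 --rm -it -v /home/kw/DATASETS/liveness:/home/kw/DATASETS/liveness -v /home/kw/PycharmProjects/liveness_resnet/:/home/kw/PycharmProjects/liveness_resnet --name tf_230 tensorflow/tensorflow:2.3.0-gpu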

disembarrasing commented 3 years ago

The issue does not occur. Here's the full output:

2020-10-02 23:24:49.057271: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
Found 3745 files belonging to 2 classes.
2020-10-02 23:24:49.839316: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcuda.so.1
2020-10-02 23:24:49.878143: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-10-02 23:24:49.878517: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties: 
pciBusID: 0000:26:00.0 name: GeForce RTX 2070 SUPER computeCapability: 7.5
coreClock: 1.77GHz coreCount: 40 deviceMemorySize: 7.79GiB deviceMemoryBandwidth: 417.29GiB/s
2020-10-02 23:24:49.878532: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2020-10-02 23:24:49.879474: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10
2020-10-02 23:24:49.880414: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcufft.so.10
2020-10-02 23:24:49.880558: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcurand.so.10
2020-10-02 23:24:49.881470: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusolver.so.10
2020-10-02 23:24:49.881986: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusparse.so.10
2020-10-02 23:24:49.883908: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.7
2020-10-02 23:24:49.883997: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-10-02 23:24:49.884399: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-10-02 23:24:49.884723: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1858] Adding visible gpu devices: 0
2020-10-02 23:24:49.884953: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN)to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2020-10-02 23:24:49.889046: I tensorflow/core/platform/profile_utils/cpu_utils.cc:104] CPU Frequency: 3600015000 Hz
2020-10-02 23:24:49.889588: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x552c280 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-10-02 23:24:49.889609: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2020-10-02 23:24:49.975379: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-10-02 23:24:49.975793: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x552e680 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2020-10-02 23:24:49.975815: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): GeForce RTX 2070 SUPER, Compute Capability 7.5
2020-10-02 23:24:49.976053: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-10-02 23:24:49.976628: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties: 
pciBusID: 0000:26:00.0 name: GeForce RTX 2070 SUPER computeCapability: 7.5
coreClock: 1.77GHz coreCount: 40 deviceMemorySize: 7.79GiB deviceMemoryBandwidth: 417.29GiB/s
2020-10-02 23:24:49.976658: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2020-10-02 23:24:49.976678: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10
2020-10-02 23:24:49.976695: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcufft.so.10
2020-10-02 23:24:49.976715: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcurand.so.10
2020-10-02 23:24:49.976731: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusolver.so.10
2020-10-02 23:24:49.976745: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusparse.so.10
2020-10-02 23:24:49.976766: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.7
2020-10-02 23:24:49.976861: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-10-02 23:24:49.977468: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-10-02 23:24:49.977996: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1858] Adding visible gpu devices: 0
2020-10-02 23:24:49.978027: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2020-10-02 23:24:50.253653: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1257] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-10-02 23:24:50.253683: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1263]      0 
2020-10-02 23:24:50.253687: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1276] 0:   N 
2020-10-02 23:24:50.253848: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-10-02 23:24:50.254221: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-10-02 23:24:50.254545: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1402] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7020 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2070 SUPER, pci bus id: 0000:26:00.0, compute capability: 7.5)
Found 2402 files belonging to 2 classes.
Downloading data from https://storage.googleapis.com/tensorflow/keras-applications/resnet/resnet50_weights_tf_dim_ordering_tf_kernels_notop.h5
94773248/94765736 [==============================] - 3s 0us/step
2020-10-02 23:24:54.890545: I tensorflow/core/profiler/lib/profiler_session.cc:164] Profiler session started.
2020-10-02 23:24:54.890581: I tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1391] Profiler found 1 GPUs
2020-10-02 23:24:54.891008: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcupti.so.10.1
2020-10-02 23:24:54.991597: E tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1441] function cupti_interface_->Subscribe( &subscriber_, (CUpti_CallbackFunc)ApiCallback, this)failed with error CUPTI_ERROR_INSUFFICIENT_PRIVILEGES
2020-10-02 23:24:55.733489: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10
2020-10-02 23:24:56.972247: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.7
32/61 [==============>...............] - ETA: 2s - loss: 0.7563 - accuracy: 0.4346

duncanriach commented 3 years ago

Another awesome datapoint. Thanks. tensorflow/tensorflow:2.3.0-gpu uses CUDA 10.1 / cuDNN 7.6 whilst the NGC container uses CUDA 11.0 / cuDNN 8.0.4. This could very well be related to a bug in one of those products. I'll get back to you when I know how to proceed with getting this to the appropriate people.

duncanriach commented 3 years ago

I've filed an internal bug report for this problem and I've linked it back to this GitHub issue.

nluehr commented 3 years ago

I can reproduce this using the 20.09-tf2 NGC container. A workaround is to enable allow_growth by exporting the following:

TF_FORCE_GPU_ALLOW_GROWTH=true

Using a development build of the upcoming 20.10-tf2 NGC container, I can no longer reproduce the failure (regardless of the allow_growth setting).
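For reference, a rough in-script equivalent of the workaround, using the same environment variable and TensorFlow's memory-growth API:

import os

# Same effect as exporting TF_FORCE_GPU_ALLOW_GROWTH=true; set it before
# TensorFlow initializes the GPU.
os.environ['TF_FORCE_GPU_ALLOW_GROWTH'] = 'true'

import tensorflow as tf

# Alternatively, request memory growth programmatically for each visible GPU.
for gpu in tf.config.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(gpu, True)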

duncanriach commented 3 years ago

@disembarrasing, will you please confirm that @nluehr's workaround (described above) solves your problem?

disembarrasing commented 3 years ago

I've tested the workaround and it works, thank you both! Would you mind if I close the issue after testing 20.10-tf2? I'd like to be 100% sure the bug is gone.

duncanriach commented 3 years ago

Wonderful. Yes, it's okay to leave this issue open until you've confirmed that the problem is also solved in 20.10-tf2 (without using the workaround).

disembarrasing commented 3 years ago

I've tested the latest version of the NGC container and the problem is solved. Thank you @nluehr and @duncanriach.