NVIDIA-AI-IOT / face-mask-detection

Face Mask Detection using NVIDIA Transfer Learning Toolkit (TLT) and DeepStream for COVID-19
MIT License

A100/Cuda11.4 tensorflow/stream_executor/cuda/cuda_blas.cc:429] failed to run cuBLAS routine: CUBLAS_STATUS_EXECUTION_FAILED #35

Open karanveersingh5623 opened 2 years ago

karanveersingh5623 commented 2 years ago

Below are the details of the Docker image and the nvidia-smi output.

Docker image - nvcr.io/nvidia/tlt-streamanalytics:v2.0_py3
Server - Dell R740
OS - CentOS Linux release 8.1.1911 (Core)
Docker -
[root@mlperf1 ~]# docker --version
Docker version 20.10.7, build f0df350
CUDA - 11.4
GPU - NVIDIA A100s

[root@mlperf1 ~]# nvidia-smi
Wed Oct 27 11:05:11 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-PCI...  Off  | 00000000:AF:00.0 Off |                    0 |
| N/A   38C    P0    36W / 250W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-PCI...  Off  | 00000000:D8:00.0 Off |                    0 |
| N/A   42C    P0    33W / 250W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
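
A note on reading the header above: the "CUDA Version" field printed by nvidia-smi is the newest CUDA version the installed driver supports, not necessarily the CUDA toolkit that a container actually loads. Below is a minimal sketch for pulling both fields out of saved nvidia-smi output. The function name parse_smi_header is hypothetical, and the regular expression assumes the header layout shown in this report.

```python
import re


def parse_smi_header(text):
    """Extract the driver version and the driver's maximum supported CUDA
    version from nvidia-smi header text. Returns None if no header is found."""
    match = re.search(
        r"Driver Version:\s*([\d.]+)\s+CUDA Version:\s*([\d.]+)", text
    )
    if match is None:
        return None
    return {"driver": match.group(1), "cuda": match.group(2)}


# Header line copied from the report (spacing collapsed).
header = "| NVIDIA-SMI 470.57.02 Driver Version: 470.57.02 CUDA Version: 11.4 |"
info = parse_smi_header(header)
print(info)
```

Comparing this value with the CUDA libraries that the training process reports loading can show whether the host driver and the container toolkit are from different CUDA generations.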

Below are the logs from when training was started:

root@d62860ede997:/workspace/face-mask-detection# tlt-train detectnet_v2 -e $SPECS_DIR/detectnet_v2_train_resnet18_kitti.txt \
                    -r $USER_EXPERIMENT_DIR/experiment_dir_unpruned \
                    -k $KEY \
                    -n resnet18_detector \
                    --gpus $NUM_GPUS

2021-10-26 10:45:21.645365: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2021-10-26 10:45:21.662062: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2021-10-26 10:45:24.668378: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2021-10-26 10:45:24.669238: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2021-10-26 10:45:24.731904: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: NVIDIA A100-PCIE-40GB major: 8 minor: 0 memoryClockRate(GHz): 1.41
pciBusID: 0000:d8:00.0
2021-10-26 10:45:24.731957: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2021-10-26 10:45:24.732028: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2021-10-26 10:45:24.733421: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2021-10-26 10:45:24.734200: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2021-10-26 10:45:24.735998: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2021-10-26 10:45:24.736317: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: NVIDIA A100-PCIE-40GB major: 8 minor: 0 memoryClockRate(GHz): 1.41
pciBusID: 0000:af:00.0
2021-10-26 10:45:24.736399: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2021-10-26 10:45:24.736515: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2021-10-26 10:45:24.737361: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2021-10-26 10:45:24.737455: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2021-10-26 10:45:24.739121: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2021-10-26 10:45:24.739726: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 1
2021-10-26 10:45:24.739765: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2021-10-26 10:45:24.740054: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2021-10-26 10:45:24.742686: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2021-10-26 10:45:24.744696: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2021-10-26 10:45:24.744790: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2021-10-26 10:45:24.779874: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2021-10-26 10:45:24.779979: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0

2021-10-26 10:51:06.808215: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-10-26 10:51:06.808281: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165] 0
2021-10-26 10:51:06.808295: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0: N
2021-10-26 10:51:06.813715: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 37457 MB memory) -> physical GPU (device: 0, name: NVIDIA A100-PCIE-40GB, pci bus id: 0000:af:00.0, compute capability: 8.0)
Using TensorFlow backend.
2021-10-26 10:51:06,828 [INFO] iva.detectnet_v2.scripts.train: Loading experiment spec at /workspace/face-mask-detection/tlt_specs/detectnet_v2_train_resnet18_kitti.txt.
2021-10-26 10:51:06,834 [INFO] iva.detectnet_v2.spec_handler.spec_loader: Merging specification from /workspace/face-mask-detection/tlt_specs/detectnet_v2_train_resnet18_kitti.txt
2021-10-26 10:51:07,208 [INFO] iva.detectnet_v2.scripts.train: Cannot iterate over exactly 3899 samples with a batch size of 24; each epoch will therefore take one extra step.
2021-10-26 10:51:07,208 [INFO] iva.detectnet_v2.scripts.train: Cannot iterate over exactly 81 steps per epoch with 24 processors; each processor will therefore take one extra step per epoch.
2021-10-26 10:51:10.812551: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-10-26 10:51:10.812602: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165] 1
2021-10-26 10:51:10.812611: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 1: N
2021-10-26 10:51:10.817200: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 37457 MB memory) -> physical GPU (device: 1, name: NVIDIA A100-PCIE-40GB, pci bus id: 0000:d8:00.0, compute capability: 8.0)
Using TensorFlow backend.
2021-10-26 10:51:10,833 [INFO] iva.detectnet_v2.scripts.train: Loading experiment spec at /workspace/face-mask-detection/tlt_specs/detectnet_v2_train_resnet18_kitti.txt.
2021-10-26 10:51:10,837 [INFO] iva.detectnet_v2.spec_handler.spec_loader: Merging specification from /workspace/face-mask-detection/tlt_specs/detectnet_v2_train_resnet18_kitti.txt
2021-10-26 10:51:11,236 [INFO] iva.detectnet_v2.scripts.train: Cannot iterate over exactly 3899 samples with a batch size of 24; each epoch will therefore take one extra step.
2021-10-26 10:51:11,236 [INFO] iva.detectnet_v2.scripts.train: Cannot iterate over exactly 81 steps per epoch with 24 processors; each processor will therefore take one extra step per epoch.


Layer (type)                     Output Shape          Param #     Connected to
input_1 (InputLayer)             (None, 3, 544, 960)   0
conv1 (Conv2D)                   (None, 64, 272, 480)  9472        input_1[0][0]
bn_conv1 (BatchNormalization)    (None, 64, 272, 480)  256         conv1[0][0]
activation_1 (Activation)        (None, 64, 272, 480)  0           bn_conv1[0][0]
block_1a_conv_1 (Conv2D)         (None, 64, 136, 240)  36928       activation_1[0][0]
block_1a_bn_1 (BatchNormalizati  (None, 64, 136, 240)  256         block_1a_conv_1[0][0]
block_1a_relu_1 (Activation)     (None, 64, 136, 240)  0           block_1a_bn_1[0][0]
block_1a_conv_2 (Conv2D)         (None, 64, 136, 240)  36928       block_1a_relu_1[0][0]
block_1a_conv_shortcut (Conv2D)  (None, 64, 136, 240)  4160        activation_1[0][0]
block_1a_bn_2 (BatchNormalizati  (None, 64, 136, 240)  256         block_1a_conv_2[0][0]
block_1a_bn_shortcut (BatchNorm  (None, 64, 136, 240)  256         block_1a_conv_shortcut[0][0]
add_1 (Add)                      (None, 64, 136, 240)  0           block_1a_bn_2[0][0]
                                                                   block_1a_bn_shortcut[0][0]
block_1a_relu (Activation)       (None, 64, 136, 240)  0           add_1[0][0]
block_1b_conv_1 (Conv2D)         (None, 64, 136, 240)  36928       block_1a_relu[0][0]
block_1b_bn_1 (BatchNormalizati  (None, 64, 136, 240)  256         block_1b_conv_1[0][0]
block_1b_relu_1 (Activation)     (None, 64, 136, 240)  0           block_1b_bn_1[0][0]
block_1b_conv_2 (Conv2D)         (None, 64, 136, 240)  36928       block_1b_relu_1[0][0]
block_1b_bn_2 (BatchNormalizati  (None, 64, 136, 240)  256         block_1b_conv_2[0][0]
add_2 (Add)                      (None, 64, 136, 240)  0           block_1b_bn_2[0][0]
                                                                   block_1a_relu[0][0]
block_1b_relu (Activation)       (None, 64, 136, 240)  0           add_2[0][0]
block_2a_conv_1 (Conv2D)         (None, 128, 68, 120)  73856       block_1b_relu[0][0]
block_2a_bn_1 (BatchNormalizati  (None, 128, 68, 120)  512         block_2a_conv_1[0][0]
block_2a_relu_1 (Activation)     (None, 128, 68, 120)  0           block_2a_bn_1[0][0]
block_2a_conv_2 (Conv2D)         (None, 128, 68, 120)  147584      block_2a_relu_1[0][0]
block_2a_conv_shortcut (Conv2D)  (None, 128, 68, 120)  8320        block_1b_relu[0][0]
block_2a_bn_2 (BatchNormalizati  (None, 128, 68, 120)  512         block_2a_conv_2[0][0]
block_2a_bn_shortcut (BatchNorm  (None, 128, 68, 120)  512         block_2a_conv_shortcut[0][0]
add_3 (Add)                      (None, 128, 68, 120)  0           block_2a_bn_2[0][0]
                                                                   block_2a_bn_shortcut[0][0]
block_2a_relu (Activation)       (None, 128, 68, 120)  0           add_3[0][0]
block_2b_conv_1 (Conv2D)         (None, 128, 68, 120)  147584      block_2a_relu[0][0]
block_2b_bn_1 (BatchNormalizati  (None, 128, 68, 120)  512         block_2b_conv_1[0][0]
block_2b_relu_1 (Activation)     (None, 128, 68, 120)  0           block_2b_bn_1[0][0]
block_2b_conv_2 (Conv2D)         (None, 128, 68, 120)  147584      block_2b_relu_1[0][0]
block_2b_bn_2 (BatchNormalizati  (None, 128, 68, 120)  512         block_2b_conv_2[0][0]
add_4 (Add)                      (None, 128, 68, 120)  0           block_2b_bn_2[0][0]
                                                                   block_2a_relu[0][0]
block_2b_relu (Activation)       (None, 128, 68, 120)  0           add_4[0][0]
block_3a_conv_1 (Conv2D)         (None, 256, 34, 60)   295168      block_2b_relu[0][0]
block_3a_bn_1 (BatchNormalizati  (None, 256, 34, 60)   1024        block_3a_conv_1[0][0]
block_3a_relu_1 (Activation)     (None, 256, 34, 60)   0           block_3a_bn_1[0][0]
block_3a_conv_2 (Conv2D)         (None, 256, 34, 60)   590080      block_3a_relu_1[0][0]
block_3a_conv_shortcut (Conv2D)  (None, 256, 34, 60)   33024       block_2b_relu[0][0]
block_3a_bn_2 (BatchNormalizati  (None, 256, 34, 60)   1024        block_3a_conv_2[0][0]
block_3a_bn_shortcut (BatchNorm  (None, 256, 34, 60)   1024        block_3a_conv_shortcut[0][0]
add_5 (Add)                      (None, 256, 34, 60)   0           block_3a_bn_2[0][0]
                                                                   block_3a_bn_shortcut[0][0]
block_3a_relu (Activation)       (None, 256, 34, 60)   0           add_5[0][0]
block_3b_conv_1 (Conv2D)         (None, 256, 34, 60)   590080      block_3a_relu[0][0]
block_3b_bn_1 (BatchNormalizati  (None, 256, 34, 60)   1024        block_3b_conv_1[0][0]
block_3b_relu_1 (Activation)     (None, 256, 34, 60)   0           block_3b_bn_1[0][0]
block_3b_conv_2 (Conv2D)         (None, 256, 34, 60)   590080      block_3b_relu_1[0][0]
block_3b_bn_2 (BatchNormalizati  (None, 256, 34, 60)   1024        block_3b_conv_2[0][0]
add_6 (Add)                      (None, 256, 34, 60)   0           block_3b_bn_2[0][0]
                                                                   block_3a_relu[0][0]
block_3b_relu (Activation)       (None, 256, 34, 60)   0           add_6[0][0]
block_4a_conv_1 (Conv2D)         (None, 512, 34, 60)   1180160     block_3b_relu[0][0]
block_4a_bn_1 (BatchNormalizati  (None, 512, 34, 60)   2048        block_4a_conv_1[0][0]
block_4a_relu_1 (Activation)     (None, 512, 34, 60)   0           block_4a_bn_1[0][0]
block_4a_conv_2 (Conv2D)         (None, 512, 34, 60)   2359808     block_4a_relu_1[0][0]
block_4a_conv_shortcut (Conv2D)  (None, 512, 34, 60)   131584      block_3b_relu[0][0]
block_4a_bn_2 (BatchNormalizati  (None, 512, 34, 60)   2048        block_4a_conv_2[0][0]
block_4a_bn_shortcut (BatchNorm  (None, 512, 34, 60)   2048        block_4a_conv_shortcut[0][0]
add_7 (Add)                      (None, 512, 34, 60)   0           block_4a_bn_2[0][0]
                                                                   block_4a_bn_shortcut[0][0]
block_4a_relu (Activation)       (None, 512, 34, 60)   0           add_7[0][0]
block_4b_conv_1 (Conv2D)         (None, 512, 34, 60)   2359808     block_4a_relu[0][0]
block_4b_bn_1 (BatchNormalizati  (None, 512, 34, 60)   2048        block_4b_conv_1[0][0]
block_4b_relu_1 (Activation)     (None, 512, 34, 60)   0           block_4b_bn_1[0][0]
block_4b_conv_2 (Conv2D)         (None, 512, 34, 60)   2359808     block_4b_relu_1[0][0]
block_4b_bn_2 (BatchNormalizati  (None, 512, 34, 60)   2048        block_4b_conv_2[0][0]
add_8 (Add)                      (None, 512, 34, 60)   0           block_4b_bn_2[0][0]
                                                                   block_4a_relu[0][0]
block_4b_relu (Activation)       (None, 512, 34, 60)   0           add_8[0][0]
output_bbox (Conv2D)             (None, 8, 34, 60)     4104        block_4b_relu[0][0]
output_cov (Conv2D)              (None, 2, 34, 60)     1026        block_4b_relu[0][0]

Total params: 11,200,458
Trainable params: 11,190,730
Non-trainable params: 9,728
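
As a sanity check on the summary above, the per-layer parameter counts can be reproduced with the usual formulas: a Conv2D layer with bias has kernel_h * kernel_w * in_channels * out_channels + out_channels parameters, and a BatchNormalization layer has 4 values per channel. This is only a sketch. The kernel sizes used below (7x7 for conv1, 3x3 for the block convolutions, 1x1 for the shortcut and output layers) are assumptions inferred from the printed counts, not taken from the training spec, and the helper names are hypothetical.

```python
def conv2d_params(kernel_h, kernel_w, in_channels, out_channels):
    """Weights plus one bias per output channel."""
    return kernel_h * kernel_w * in_channels * out_channels + out_channels


def batchnorm_params(channels):
    """gamma, beta, moving mean, moving variance: 4 values per channel."""
    return 4 * channels


# Kernel sizes are assumed; they were chosen to match the summary.
computed = {
    "conv1": conv2d_params(7, 7, 3, 64),
    "block_1a_conv_1": conv2d_params(3, 3, 64, 64),
    "block_1a_conv_shortcut": conv2d_params(1, 1, 64, 64),
    "block_2a_conv_1": conv2d_params(3, 3, 64, 128),
    "output_bbox": conv2d_params(1, 1, 512, 8),
    "output_cov": conv2d_params(1, 1, 512, 2),
    "bn_conv1": batchnorm_params(64),
}

for name, count in computed.items():
    print(name, count)
```

Under these assumptions the computed values agree with the Param # column for the listed layers, which suggests the model was built as expected and the failure is not in the network definition itself.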


2021-10-26 10:51:18,630 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: Serial augmentation enabled = False
2021-10-26 10:51:18,630 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: Pseudo sharding enabled = False
2021-10-26 10:51:18,630 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: Max Image Dimensions (all sources): (0, 0)
2021-10-26 10:51:18,630 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: number of cpus: 88, io threads: 88, compute threads: 44, buffered batches: 4
2021-10-26 10:51:18,630 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: total dataset size 3899, number of sources: 1, batch size per gpu: 24, steps: 82
2021-10-26 10:51:18,758 [INFO] iva.detectnet_v2.dataloader.default_dataloader: Bounding box coordinates were detected in the input specification! Bboxes will be automatically converted to polygon coordinates.
2021-10-26 10:51:18.793585: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: NVIDIA A100-PCIE-40GB major: 8 minor: 0 memoryClockRate(GHz): 1.41
pciBusID: 0000:af:00.0
2021-10-26 10:51:18.794755: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 1 with properties:
name: NVIDIA A100-PCIE-40GB major: 8 minor: 0 memoryClockRate(GHz): 1.41
pciBusID: 0000:d8:00.0
2021-10-26 10:51:18.794786: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2021-10-26 10:51:18.794818: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2021-10-26 10:51:18.794853: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2021-10-26 10:51:18.794874: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2021-10-26 10:51:18.794894: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2021-10-26 10:51:18.794914: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2021-10-26 10:51:18.794932: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2021-10-26 10:51:18.799183: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0, 1
2021-10-26 10:51:19,039 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: shuffle: True - shard 0 of 2
2021-10-26 10:51:19,046 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: sampling 1 datasets with weights:
2021-10-26 10:51:19,046 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: source: 0 weight: 1.000000
2021-10-26 10:51:19,592 [INFO] iva.detectnet_v2.scripts.train: Found 3899 samples in training set
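
The step counts in the data loader messages can be reproduced with a ceiling division of the dataset size by the global batch size (per-GPU batch size times the number of shards). This is a sketch under the assumption that the loader computes steps this way; the function name steps_per_epoch is hypothetical. The inputs are the figures from this log: 3899 training samples over 2 shards, and 974 validation samples on 1 shard (reported further down in the same log), both with a per-GPU batch size of 24.

```python
def steps_per_epoch(num_samples, batch_size_per_gpu, num_shards):
    """Ceiling of num_samples / (batch_size_per_gpu * num_shards)."""
    global_batch = batch_size_per_gpu * num_shards
    # Negating twice turns floor division into ceiling division.
    return -(-num_samples // global_batch)


train_steps = steps_per_epoch(3899, 24, 2)  # training set, 2 GPUs
val_steps = steps_per_epoch(974, 24, 1)     # validation set, 1 shard
print(train_steps, val_steps)
```

Both results match the "steps: 82" and "steps: 41" values printed by the loader, so the "one extra step" messages are informational and unrelated to the cuBLAS failure.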




2021-10-26 10:51:21,662 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: Serial augmentation enabled = False
2021-10-26 10:51:21,662 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: Pseudo sharding enabled = False
2021-10-26 10:51:21,662 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: Max Image Dimensions (all sources): (0, 0)
2021-10-26 10:51:21,662 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: number of cpus: 88, io threads: 88, compute threads: 44, buffered batches: 4
2021-10-26 10:51:21,663 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: total dataset size 3899, number of sources: 1, batch size per gpu: 24, steps: 82
2021-10-26 10:51:21,785 [INFO] iva.detectnet_v2.dataloader.default_dataloader: Bounding box coordinates were detected in the input specification! Bboxes will be automatically converted to polygon coordinates.
2021-10-26 10:51:21.819131: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: NVIDIA A100-PCIE-40GB major: 8 minor: 0 memoryClockRate(GHz): 1.41
pciBusID: 0000:af:00.0
2021-10-26 10:51:21.820356: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 1 with properties:
name: NVIDIA A100-PCIE-40GB major: 8 minor: 0 memoryClockRate(GHz): 1.41
pciBusID: 0000:d8:00.0
2021-10-26 10:51:21.820392: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2021-10-26 10:51:21.820428: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2021-10-26 10:51:21.820452: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2021-10-26 10:51:21.820467: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2021-10-26 10:51:21.820482: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2021-10-26 10:51:21.820496: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2021-10-26 10:51:21.820509: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2021-10-26 10:51:21.824645: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0, 1
2021-10-26 10:51:22,059 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: shuffle: True - shard 1 of 2
2021-10-26 10:51:22,066 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: sampling 1 datasets with weights:
2021-10-26 10:51:22,066 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: source: 0 weight: 1.000000
2021-10-26 10:51:22,346 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: Serial augmentation enabled = False
2021-10-26 10:51:22,346 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: Pseudo sharding enabled = False
2021-10-26 10:51:22,346 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: Max Image Dimensions (all sources): (0, 0)
2021-10-26 10:51:22,346 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: number of cpus: 88, io threads: 176, compute threads: 88, buffered batches: 4
2021-10-26 10:51:22,346 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: total dataset size 974, number of sources: 1, batch size per gpu: 24, steps: 41
2021-10-26 10:51:22,382 [INFO] iva.detectnet_v2.dataloader.default_dataloader: Bounding box coordinates were detected in the input specification! Bboxes will be automatically converted to polygon coordinates.
2021-10-26 10:51:22,596 [INFO] iva.detectnet_v2.scripts.train: Found 3899 samples in training set
2021-10-26 10:51:22,644 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: shuffle: False - shard 0 of 1
2021-10-26 10:51:22,650 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: sampling 1 datasets with weights:
2021-10-26 10:51:22,650 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: source: 0 weight: 1.000000
2021-10-26 10:51:23,017 [INFO] iva.detectnet_v2.scripts.train: Found 974 samples in validation set
2021-10-26 10:51:26.742907: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: NVIDIA A100-PCIE-40GB major: 8 minor: 0 memoryClockRate(GHz): 1.41
pciBusID: 0000:af:00.0
2021-10-26 10:51:26.742994: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2021-10-26 10:51:26.743041: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2021-10-26 10:51:26.743074: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2021-10-26 10:51:26.743090: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2021-10-26 10:51:26.743105: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2021-10-26 10:51:26.743119: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2021-10-26 10:51:26.743133: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2021-10-26 10:51:26.745478: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2021-10-26 10:51:26.968410: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: NVIDIA A100-PCIE-40GB major: 8 minor: 0 memoryClockRate(GHz): 1.41
pciBusID: 0000:d8:00.0
2021-10-26 10:51:26.968524: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2021-10-26 10:51:26.968669: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2021-10-26 10:51:26.968734: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2021-10-26 10:51:26.968761: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2021-10-26 10:51:26.968787: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2021-10-26 10:51:26.968812: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2021-10-26 10:51:26.968830: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2021-10-26 10:51:26.971026: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 1
2021-10-26 10:51:27.078355: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-10-26 10:51:27.078385: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165] 1
2021-10-26 10:51:27.078392: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 1: N
2021-10-26 10:51:27.166997: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-10-26 10:51:27.167036: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165] 0
2021-10-26 10:51:27.167045: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0: N
2021-10-26 10:51:27.285792: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 37457 MB memory) -> physical GPU (device: 1, name: NVIDIA A100-PCIE-40GB, pci bus id: 0000:d8:00.0, compute capability: 8.0)
2021-10-26 10:51:27.288712: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 37457 MB memory) -> physical GPU (device: 0, name: NVIDIA A100-PCIE-40GB, pci bus id: 0000:af:00.0, compute capability: 8.0)
2021-10-26 10:51:35.503289: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2021-10-26 10:52:00.791520: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2021-10-26 10:52:41.008907: E tensorflow/stream_executor/cuda/cuda_blas.cc:429] failed to run cuBLAS routine: CUBLAS_STATUS_EXECUTION_FAILED
2021-10-26 10:52:41.008970: E tensorflow/stream_executor/cuda/cuda_blas.cc:2437] Internal: failed BLAS call, see log for details
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
    target_list, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
  (0) Internal: Blas xGEMMBatched launch failed : a.shape=[24,3,3], b.shape=[24,3,3], m=3, n=3, k=3, batch_size=24
     [[{{node CompositeTransform_6/CompositeTransform_5/CompositeTransform_4/CompositeTransform_3/CompositeTransform_2/CompositeTransform_1/CompositeTransform/RandomFlip/MatMul}}]]
     [[gradients/resnet18_nopool_bn_detectnet_v2/block_1b_bn_1/FusedBatchNormV3_grad/FusedBatchNormGradV3/_5859]]
  (1) Internal: Blas xGEMMBatched launch failed : a.shape=[24,3,3], b.shape=[24,3,3], m=3, n=3, k=3, batch_size=24
     [[{{node CompositeTransform_6/CompositeTransform_5/CompositeTransform_4/CompositeTransform_3/CompositeTransform_2/CompositeTransform_1/CompositeTransform/RandomFlip/MatMul}}]]
0 successful operations.
0 derived errors ignored.
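
One possible reading of the log, not confirmed anywhere in this report: the container loads CUDA 10.0 libraries (libcudart.so.10.0, libcublas.so.10.0), while the A100 reports compute capability 8.0, and support for that architecture (Ampere, sm_80) was first added in CUDA 11.0. The sketch below flags that combination in a saved log. The names MIN_CUDA_FOR_CC and runtime_too_old are hypothetical, and the lookup table only covers the single case relevant here.

```python
import re

# Assumption: compute capability 8.0 (A100) needs a CUDA 11.0+ toolkit.
# Only this one entry is listed; other architectures are out of scope.
MIN_CUDA_FOR_CC = {(8, 0): (11, 0)}


def cuda_runtime_version(log):
    """Version of libcudart the process loaded, e.g. (10, 0), or None."""
    m = re.search(r"libcudart\.so\.(\d+)\.(\d+)", log)
    return (int(m.group(1)), int(m.group(2))) if m else None


def compute_capability(log):
    """GPU compute capability reported by TensorFlow, e.g. (8, 0), or None."""
    m = re.search(r"compute capability: (\d+)\.(\d+)", log)
    return (int(m.group(1)), int(m.group(2))) if m else None


def runtime_too_old(log):
    """True if the loaded CUDA runtime predates support for the GPU,
    False if it does not, None if the log lacks the needed lines."""
    runtime = cuda_runtime_version(log)
    capability = compute_capability(log)
    if runtime is None or capability is None:
        return None
    required = MIN_CUDA_FOR_CC.get(capability)
    return required is not None and runtime < required


# Two fragments copied from the log in this report.
sample = (
    "Successfully opened dynamic library libcudart.so.10.0\n"
    "physical GPU (device: 0, name: NVIDIA A100-PCIE-40GB, "
    "pci bus id: 0000:af:00.0, compute capability: 8.0)\n"
)
print(runtime_too_old(sample))
```

If this reading is right, the first GPU BLAS call (here the batched 3x3 matmul in the RandomFlip augmentation) would be expected to fail regardless of the model or dataset, and a container built on CUDA 11 or newer would be the thing to try.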

During handling of the above exception, another exception occurred:

Traceback (most recent call last): File "/usr/local/bin/tlt-train-g1", line 8, in sys.exit(main()) File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/magnet_train.py", line 55, in main File "", line 2, in main File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/utilities/timer.py", line 46, in wrapped_fn File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py", line 773, in main File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py", line 691, in run_experiment File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py", line 624, in train_gridbox File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py", line 149, in run_training_loop File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 754, in run run_metadata=run_metadata) File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1360, in run raise six.reraise(original_exc_info) File "/usr/local/lib/python3.6/dist-packages/six.py", line 693, in reraise raise value File 
"/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1345, in run return self._sess.run(args, *kwargs) File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1418, in run run_metadata=run_metadata) File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1176, in run return self._sess.run(args, **kwargs) File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 956, in run run_metadata_ptr) File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1180, in _run feed_dict_tensor, options, run_metadata) File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run run_metadata) File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call raise type(e)(node_def, op, message) tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found. (0) Internal: Blas xGEMMBatched launch failed : a.shape=[24,3,3], b.shape=[24,3,3], m=3, n=3, k=3, batch_size=24 [[node CompositeTransform_6/CompositeTransform_5/CompositeTransform_4/CompositeTransform_3/CompositeTransform_2/CompositeTransform_1/CompositeTransform/RandomFlip/MatMul (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]] [[gradients/resnet18_nopool_bn_detectnet_v2/block_1b_bn_1/FusedBatchNormV3_grad/FusedBatchNormGradV3/_5859]] (1) Internal: Blas xGEMMBatched launch failed : a.shape=[24,3,3], b.shape=[24,3,3], m=3, n=3, k=3, batch_size=24 [[node CompositeTransform_6/CompositeTransform_5/CompositeTransform_4/CompositeTransform_3/CompositeTransform_2/CompositeTransform_1/CompositeTransform/RandomFlip/MatMul (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]] 0 successful operations. 0 derived errors ignored.

Original stack trace for 'CompositeTransform_6/CompositeTransform_5/CompositeTransform_4/CompositeTransform_3/CompositeTransform_2/CompositeTransform_1/CompositeTransform/RandomFlip/MatMul': File "/usr/local/bin/tlt-train-g1", line 8, in sys.exit(main()) File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/magnet_train.py", line 55, in main File "", line 2, in main File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/utilities/timer.py", line 46, in wrapped_fn File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py", line 773, in main File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py", line 691, in run_experiment File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py", line 599, in train_gridbox File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py", line 430, in build_training_graph File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/dataloader/drivenet_dataloader.py", line 579, in get_dataset_tensors 
File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/core/build_wheel.runfiles/ai_infra/moduluspy/modulus/blocks/data_loaders/multi_source_loader/processors/pipeline.py", line 231, in call File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/core/build_wheel.runfiles/ai_infra/moduluspy/modulus/blocks/data_loaders/multi_source_loader/processors/transform_processor.py", line 146, in process File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/core/build_wheel.runfiles/ai_infra/moduluspy/modulus/processors/processors.py", line 240, in call File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/core/build_wheel.runfiles/ai_infra/moduluspy/modulus/blocks/data_loaders/multi_source_loader/processors/transform_processor.py", line 275, in call File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/core/build_wheel.runfiles/ai_infra/moduluspy/modulus/processors/processors.py", line 240, in call File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/core/build_wheel.runfiles/ai_infra/moduluspy/modulus/blocks/data_loaders/multi_source_loader/processors/transform_processor.py", line 275, in call File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/core/build_wheel.runfiles/ai_infra/moduluspy/modulus/processors/processors.py", line 240, in call File 
"/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/core/build_wheel.runfiles/ai_infra/moduluspy/modulus/blocks/data_loaders/multi_source_loader/processors/transform_processor.py", line 275, in call File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/core/build_wheel.runfiles/ai_infra/moduluspy/modulus/processors/processors.py", line 240, in call File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/core/build_wheel.runfiles/ai_infra/moduluspy/modulus/blocks/data_loaders/multi_source_loader/processors/transform_processor.py", line 275, in call File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/core/build_wheel.runfiles/ai_infra/moduluspy/modulus/processors/processors.py", line 240, in call File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/core/build_wheel.runfiles/ai_infra/moduluspy/modulus/blocks/data_loaders/multi_source_loader/processors/transform_processor.py", line 275, in call File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/core/build_wheel.runfiles/ai_infra/moduluspy/modulus/processors/processors.py", line 240, in call File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/core/build_wheel.runfiles/ai_infra/moduluspy/modulus/blocks/data_loaders/multi_source_loader/processors/transform_processor.py", line 275, in call File 
"/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/core/build_wheel.runfiles/ai_infra/moduluspy/modulus/processors/processors.py", line 240, in call File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/core/build_wheel.runfiles/ai_infra/moduluspy/modulus/blocks/data_loaders/multi_source_loader/processors/transform_processor.py", line 275, in call File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/core/build_wheel.runfiles/ai_infra/moduluspy/modulus/processors/processors.py", line 240, in call File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/core/build_wheel.runfiles/ai_infra/moduluspy/modulus/processors/augment/random_flip.py", line 78, in call File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/util/dispatch.py", line 180, in wrapper return target(*args, *kwargs) File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/math_ops.py", line 2716, in matmul return batch_mat_mul_fn(a, b, adj_x=adjoint_a, adj_y=adjoint_b, name=name) File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/gen_math_ops.py", line 1712, in batch_mat_mul_v2 "BatchMatMulV2", x=x, y=y, adj_x=adj_x, adj_y=adj_y, name=name) File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/op_def_library.py", line 794, in _apply_op_helper op_def=op_def) File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/util/deprecation.py", line 507, in new_func return func(args, **kwargs) File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 3357, in create_op attrs, op_def, compute_device) File 
"/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 3426, in _create_op_internal op_def=op_def) File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 1748, in init self._traceback = tf_stack.extract_stack()

2021-10-26 10:52:42.314258: E tensorflow/stream_executor/cuda/cuda_blas.cc:429] failed to run cuBLAS routine: CUBLAS_STATUS_EXECUTION_FAILED
2021-10-26 10:52:42.314306: E tensorflow/stream_executor/cuda/cuda_blas.cc:2437] Internal: failed BLAS call, see log for details
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
    target_list, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
  (0) Internal: Blas xGEMMBatched launch failed : a.shape=[24,3,3], b.shape=[24,3,3], m=3, n=3, k=3, batch_size=24
    [[{{node CompositeTransform_6/CompositeTransform_5/CompositeTransform_4/CompositeTransform_3/CompositeTransform_2/CompositeTransform_1/CompositeTransform/RandomFlip/MatMul}}]]
    [[gradients/AddN_57/_4723]]
  (1) Internal: Blas xGEMMBatched launch failed : a.shape=[24,3,3], b.shape=[24,3,3], m=3, n=3, k=3, batch_size=24
    [[{{node CompositeTransform_6/CompositeTransform_5/CompositeTransform_4/CompositeTransform_3/CompositeTransform_2/CompositeTransform_1/CompositeTransform/RandomFlip/MatMul}}]]
0 successful operations.
0 derived errors ignored.

During handling of the above exception, another exception occurred:

Traceback (most recent call last): File "/usr/local/bin/tlt-train-g1", line 8, in sys.exit(main()) File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/magnet_train.py", line 55, in main File "", line 2, in main File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/utilities/timer.py", line 46, in wrapped_fn File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py", line 773, in main File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py", line 691, in run_experiment File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py", line 624, in train_gridbox File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py", line 149, in run_training_loop File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 754, in run run_metadata=run_metadata) File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1360, in run raise six.reraise(original_exc_info) File "/usr/local/lib/python3.6/dist-packages/six.py", line 693, in reraise raise value File 
"/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1345, in run return self._sess.run(args, *kwargs) File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1418, in run run_metadata=run_metadata) File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1176, in run return self._sess.run(args, **kwargs) File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 956, in run run_metadata_ptr) File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1180, in _run feed_dict_tensor, options, run_metadata) File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run run_metadata) File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call raise type(e)(node_def, op, message) tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found. (0) Internal: Blas xGEMMBatched launch failed : a.shape=[24,3,3], b.shape=[24,3,3], m=3, n=3, k=3, batch_size=24 [[node CompositeTransform_6/CompositeTransform_5/CompositeTransform_4/CompositeTransform_3/CompositeTransform_2/CompositeTransform_1/CompositeTransform/RandomFlip/MatMul (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]] [[gradients/AddN_57/_4723]] (1) Internal: Blas xGEMMBatched launch failed : a.shape=[24,3,3], b.shape=[24,3,3], m=3, n=3, k=3, batch_size=24 [[node CompositeTransform_6/CompositeTransform_5/CompositeTransform_4/CompositeTransform_3/CompositeTransform_2/CompositeTransform_1/CompositeTransform/RandomFlip/MatMul (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]] 0 successful operations. 0 derived errors ignored.

Original stack trace for 'CompositeTransform_6/CompositeTransform_5/CompositeTransform_4/CompositeTransform_3/CompositeTransform_2/CompositeTransform_1/CompositeTransform/RandomFlip/MatMul': File "/usr/local/bin/tlt-train-g1", line 8, in sys.exit(main()) File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/magnet_train.py", line 55, in main File "", line 2, in main File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/utilities/timer.py", line 46, in wrapped_fn File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py", line 773, in main File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py", line 691, in run_experiment File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py", line 599, in train_gridbox File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py", line 430, in build_training_graph File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/dataloader/drivenet_dataloader.py", line 579, in get_dataset_tensors 
File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/core/build_wheel.runfiles/ai_infra/moduluspy/modulus/blocks/data_loaders/multi_source_loader/processors/pipeline.py", line 231, in call File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/core/build_wheel.runfiles/ai_infra/moduluspy/modulus/blocks/data_loaders/multi_source_loader/processors/transform_processor.py", line 146, in process File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/core/build_wheel.runfiles/ai_infra/moduluspy/modulus/processors/processors.py", line 240, in call File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/core/build_wheel.runfiles/ai_infra/moduluspy/modulus/blocks/data_loaders/multi_source_loader/processors/transform_processor.py", line 275, in call File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/core/build_wheel.runfiles/ai_infra/moduluspy/modulus/processors/processors.py", line 240, in call File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/core/build_wheel.runfiles/ai_infra/moduluspy/modulus/blocks/data_loaders/multi_source_loader/processors/transform_processor.py", line 275, in call File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/core/build_wheel.runfiles/ai_infra/moduluspy/modulus/processors/processors.py", line 240, in call File 
"/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/core/build_wheel.runfiles/ai_infra/moduluspy/modulus/blocks/data_loaders/multi_source_loader/processors/transform_processor.py", line 275, in call File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/core/build_wheel.runfiles/ai_infra/moduluspy/modulus/processors/processors.py", line 240, in call File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/core/build_wheel.runfiles/ai_infra/moduluspy/modulus/blocks/data_loaders/multi_source_loader/processors/transform_processor.py", line 275, in call File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/core/build_wheel.runfiles/ai_infra/moduluspy/modulus/processors/processors.py", line 240, in call File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/core/build_wheel.runfiles/ai_infra/moduluspy/modulus/blocks/data_loaders/multi_source_loader/processors/transform_processor.py", line 275, in call File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/core/build_wheel.runfiles/ai_infra/moduluspy/modulus/processors/processors.py", line 240, in call File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/core/build_wheel.runfiles/ai_infra/moduluspy/modulus/blocks/data_loaders/multi_source_loader/processors/transform_processor.py", line 275, in call File 
"/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/core/build_wheel.runfiles/ai_infra/moduluspy/modulus/processors/processors.py", line 240, in call File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/core/build_wheel.runfiles/ai_infra/moduluspy/modulus/blocks/data_loaders/multi_source_loader/processors/transform_processor.py", line 275, in call File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/core/build_wheel.runfiles/ai_infra/moduluspy/modulus/processors/processors.py", line 240, in call File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/core/build_wheel.runfiles/ai_infra/moduluspy/modulus/processors/augment/random_flip.py", line 78, in call File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/util/dispatch.py", line 180, in wrapper return target(*args, *kwargs) File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/math_ops.py", line 2716, in matmul return batch_mat_mul_fn(a, b, adj_x=adjoint_a, adj_y=adjoint_b, name=name) File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/gen_math_ops.py", line 1712, in batch_mat_mul_v2 "BatchMatMulV2", x=x, y=y, adj_x=adj_x, adj_y=adj_y, name=name) File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/op_def_library.py", line 794, in _apply_op_helper op_def=op_def) File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/util/deprecation.py", line 507, in new_func return func(args, **kwargs) File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 3357, in create_op attrs, op_def, compute_device) File 
"/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 3426, in _create_op_internal op_def=op_def) File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 1748, in init self._traceback = tf_stack.extract_stack()


Primary job terminated normally, but 1 process returned a non-zero exit code. Per user-direction, the job has been aborted.


mpirun.real detected that one or more processes exited with non-zero status, thus causing the job to be terminated. The first process to do so was:

Process name: [[47981,1],0]
Exit code: 1
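
One thing that stands out in the log above: TensorFlow reports both A100s as compute capability 8.0, yet it loads libcublas.so.10.0, i.e. a CUDA 10.0 build of cuBLAS. Support for compute capability 8.0 (Ampere) first appeared in CUDA 11.0, so this pairing may be why the batched GEMM launch fails. That is an inference from the log, not a confirmed diagnosis for this issue. Below is a minimal sketch of a check that scans a saved training log for that pairing. The function name `find_cuda_mismatch` and the `MIN_CUDA_MAJOR` table are hypothetical helpers written for illustration; they are not part of TLT or TensorFlow.

```python
import re

# Minimum CUDA major version per GPU compute-capability major version.
# Assumption: only the entry relevant to this log is listed
# (Ampere, 8.x -> CUDA 11); extend it for other architectures as needed.
MIN_CUDA_MAJOR = {8: 11}


def find_cuda_mismatch(log_text):
    """Return warnings for GPUs whose compute capability needs a newer
    CUDA major version than the cuBLAS library the log says was loaded."""
    # Matches e.g. "compute capability: 8.0"
    capabilities = set(re.findall(r"compute capability: (\d+)\.(\d+)", log_text))
    # Matches e.g. "libcublas.so.10.0" and keeps the major version (10)
    cublas_majors = {int(v) for v in re.findall(r"libcublas\.so\.(\d+)", log_text)}

    warnings = []
    for cc_major, cc_minor in sorted(capabilities):
        needed = MIN_CUDA_MAJOR.get(int(cc_major))
        if needed is None:
            continue  # architecture not in the table, nothing to compare
        for loaded in sorted(cublas_majors):
            if loaded < needed:
                warnings.append(
                    f"compute capability {cc_major}.{cc_minor} needs CUDA >= {needed}, "
                    f"but libcublas.so.{loaded} was loaded"
                )
    return warnings


if __name__ == "__main__":
    # Two fragments copied from the log in this issue.
    sample = (
        "physical GPU (device: 0, name: NVIDIA A100-PCIE-40GB, "
        "pci bus id: 0000:af:00.0, compute capability: 8.0)\n"
        "Successfully opened dynamic library libcublas.so.10.0\n"
    )
    for warning in find_cuda_mismatch(sample):
        print(warning)
```

If the check flags a mismatch, the next thing to try would be a container whose TensorFlow build links against CUDA 11 or later. The host driver (470.57.02, CUDA 11.4) already shown by nvidia-smi is separate from the CUDA libraries bundled inside the image.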

aravindj76 commented 2 years ago

Hi Karan,

Were you able to resolve this issue? I am getting exactly the same error. If you have resolved it, do you have any suggestions?

Thanks.