NVIDIA / tensorflow

An Open Source Machine Learning Framework for Everyone
https://developer.nvidia.com/deep-learning-frameworks
Apache License 2.0

CentOS 7.9.2009 fails with Tensorflow 1.15.5+nv21.06: bad URI https://oauth2:J64G8MymaUmqNKG_N3rR@gitlab-master.nvidia.com/cudnn/cudnn_frontend.git #64

Open · kognat-docs opened this issue 2 years ago

kognat-docs commented 2 years ago

Please make sure that this is a build/installation issue. As per our GitHub Policy, we only address code/doc bugs, performance issues, feature requests and build/installation issues on GitHub.

System information

Describe the problem

Cannot access the URL here: https://github.com/NVIDIA/tensorflow/blob/r1.15.5%2Bnv21.06/tensorflow/workspace.bzl#L135

    new_git_repository(
        name = "cudnn_frontend_archive",
        build_file = clean_dep("//third_party:cudnn_frontend.BUILD"),
        patches = [clean_dep("//third_party:cudnn_frontend_header_fix.patch")],
        patch_args = ['-p1'],
        commit = "e9ad21cc61f8427bbaed98045b7e4f24bad57619",
        remote = "https://oauth2:J64G8MymaUmqNKG_N3rR@gitlab-master.nvidia.com/cudnn/cudnn_frontend.git"
    )

Provide the exact sequence of commands / steps that you executed before running into the problem

Running bazel build to build from source (so the result links against GLIBC 2.17) fails with the error shown below.

git clone https://github.com/NVIDIA/tensorflow.git
git checkout r1.15.5+nv21.06

Install a working build toolchain: a conda environment with libstdc++ 9.5, GCC 7.3.0 from the SCL devtoolset, Bazel 0.24.1, and Python 3.8 plus the other Python build dependencies.
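For context, a rough sketch of how such a toolchain might be assembled on CentOS 7; the package names and Bazel install method below are assumptions for illustration, not the exact commands used:

    # GCC 7.3.0 host compiler from the SCL devtoolset (assumed install method)
    sudo yum install -y centos-release-scl
    sudo yum install -y devtoolset-7-gcc devtoolset-7-gcc-c++

    # Conda environment providing Python 3.8 and build-time Python dependencies
    conda create -n env-tf-1.15.5-nv21.06-centos python=3.8 numpy six wheel
    conda activate env-tf-1.15.5-nv21.06-centos

    # Bazel 0.24.1 from the upstream installer script
    curl -LO https://github.com/bazelbuild/bazel/releases/download/0.24.1/bazel-0.24.1-installer-linux-x86_64.sh
    bash bazel-0.24.1-installer-linux-x86_64.sh --user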

./configure

./configure 
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by com.google.protobuf.UnsafeUtil (file:/home/sam/.cache/bazel/_bazel_sam/install/96b7e79a4e60cc1d7fbf4394c4acc8a6/_embedded_binaries/A-server.jar) to field java.nio.Buffer.address
WARNING: Please consider reporting this to the maintainers of com.google.protobuf.UnsafeUtil
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
WARNING: --batch mode is deprecated. Please instead explicitly shut down your Bazel server using the command "bazel shutdown".
You have bazel 0.24.1- (@non-git) installed.
Please specify the location of python. [Default is /home/sam/Documents/env-tf-1.15.5-nv21.06-centos/bin/python]: 

Found possible Python library paths:
  /home/sam/Documents/env-tf-1.15.5-nv21.06-centos/lib/python3.8/site-packages
Please input the desired Python library path to use.  Default is [/home/sam/Documents/env-tf-1.15.5-nv21.06-centos/lib/python3.8/site-packages]

Do you wish to build TensorFlow with XLA JIT support? [Y/n]: 
XLA JIT support will be enabled for TensorFlow.

Do you wish to build TensorFlow with OpenCL SYCL support? [y/N]: 
No OpenCL SYCL support will be enabled for TensorFlow.

Do you wish to build TensorFlow with ROCm support? [y/N]: 
No ROCm support will be enabled for TensorFlow.

Do you wish to build TensorFlow with CUDA support? [y/N]: y
CUDA support will be enabled for TensorFlow.

Do you wish to build TensorFlow with TensorRT support? [y/N]: 
No TensorRT support will be enabled for TensorFlow.

Could not find any cuda.h matching version '' in any subdirectory:
        ''
        'include'
        'include/cuda'
        'include/*-linux-gnu'
        'extras/CUPTI/include'
        'include/cuda/CUPTI'
of:
        '/lib64'
        '/usr'
        '/usr/lib64//bind9-export'
        '/usr/lib64/atlas'
        '/usr/lib64/dyninst'
        '/usr/lib64/iscsi'
        '/usr/lib64/mysql'
        '/usr/lib64/qt-3.3/lib'
Asking for detailed CUDA configuration...

Please specify the CUDA SDK version you want to use. [Leave empty to default to CUDA 10]: 11.3

Please specify the cuDNN version you want to use. [Leave empty to default to cuDNN 7]: 8.2

Please specify the locally installed NCCL version you want to use. [Leave empty to use http://github.com/nvidia/nccl]: 2.11

Please specify the comma-separated list of base paths to look for CUDA libraries and headers. [Leave empty to use the default]: /home/sam/Documents/env-tf-1.15.5-nv21.06-centos

Found CUDA 11.3 in:
    /home/sam/Documents/env-tf-1.15.5-nv21.06-centos/lib
    /home/sam/Documents/env-tf-1.15.5-nv21.06-centos/include
Found cuDNN 8 in:
    /home/sam/Documents/env-tf-1.15.5-nv21.06-centos/lib
    /home/sam/Documents/env-tf-1.15.5-nv21.06-centos/include
Found NCCL 2 in:
    /home/sam/Documents/env-tf-1.15.5-nv21.06-centos/lib
    /home/sam/Documents/env-tf-1.15.5-nv21.06-centos/include

Please specify a list of comma-separated CUDA compute capabilities you want to build with.
You can find the compute capability of your device at: https://developer.nvidia.com/cuda-gpus.
Please note that each additional compute capability significantly increases your build time and binary size, and that TensorFlow only supports compute capabilities >= 3.5 [Default is: 3.5,7.0]: 3.5,3.7,5.0,5.2,6.0,6.1,7.0,7.5,8.0,8.6

Do you want to use clang as CUDA compiler? [y/N]: 
nvcc will be used as CUDA compiler.

Please specify which gcc should be used by nvcc as the host compiler. [Default is /opt/rh/devtoolset-7/root/usr/bin/gcc]: 

Do you wish to build TensorFlow with MPI support? [y/N]: 
No MPI support will be enabled for TensorFlow.

Please specify optimization flags to use during compilation when bazel option "--config=opt" is specified [Default is -march=native -Wno-sign-compare]: 

Would you like to interactively configure ./WORKSPACE for Android builds? [y/N]: 
Not configuring the WORKSPACE for Android builds.

CC=/opt/rh/devtoolset-7/root/usr/bin/gcc CXX=/opt/rh/devtoolset-7/root/usr/bin/g++ bazel build --config=v1 //tensorflow/tools/pip_package:build_pip_package
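For reference, if the build completed, the wheel would normally be produced and installed with the standard packaging step (the output directory below is arbitrary):

    # Standard TensorFlow pip-package step once bazel build finishes;
    # the exact wheel filename may differ for the NVIDIA build.
    ./bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg
    pip install /tmp/tensorflow_pkg/tensorflow-*.whl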

more .tf_configure.bazelrc

build --action_env PYTHON_BIN_PATH="/home/sam/Documents/env-tf-1.15.5-nv21.06-centos/bin/python"
build --action_env PYTHON_LIB_PATH="/home/sam/Documents/env-tf-1.15.5-nv21.06-centos/lib/python3.8/site-packages"
build --python_path="/home/sam/Documents/env-tf-1.15.5-nv21.06-centos/bin/python"
build:xla --define with_xla_support=true
build --config=xla
build --action_env TF_USE_CCACHE="0"
build --action_env TF_CUDA_VERSION="11.3"
build --action_env TF_CUDNN_VERSION="8.2"
build --action_env TF_NCCL_VERSION="2.11"
build --action_env TF_CUDA_PATHS="/home/sam/Documents/env-tf-1.15.5-nv21.06-centos"
build --action_env CUDA_TOOLKIT_PATH="/home/sam/Documents/env-tf-1.15.5-nv21.06-centos"
build --action_env TF_CUDA_COMPUTE_CAPABILITIES="3.5,3.7,5.0,5.2,6.0,6.1,7.0,7.5,8.0,8.6"
build --action_env LD_LIBRARY_PATH="/home/sam/Documents/env-tf-1.15.5-nv21.06-centos/lib"
build --action_env GCC_HOST_COMPILER_PATH="/opt/rh/devtoolset-7/root/usr/bin/gcc"
build --config=cuda
build --copt=-march=native
build --copt=-Wno-sign-compare
build:opt --define with_default_optimizations=true
build:v2 --define=tf_api_version=2
test --flaky_test_attempts=3
test --test_size_filters=small,medium
test --test_tag_filters=-benchmark-test,-no_oss,-oss_serial
test --build_tag_filters=-benchmark-test,-no_oss
test --test_tag_filters=-gpu
test --build_tag_filters=-gpu
build --action_env TF_CONFIGURE_IOS="0"

Any other info / logs

Would prefer to build with 1.15.5+nv21.06 as it is the oldest version with NVIDIA library support via conda.

My clients are not interested in installing more current kernel drivers.

My clients are not interested in using Docker containers.

Error log:

ERROR: Analysis of target '//tensorflow/tools/pip_package:build_pip_package' failed; build aborted: no such package '@cudnn_frontend_archive//': Traceback (most recent call last):
    File "/home/sam/.cache/bazel/_bazel_sam/a31973e2a597fcbee537e0e8a93418a8/external/bazel_tools/tools/build_defs/repo/git.bzl", line 157
        _clone_or_update(ctx)
    File "/home/sam/.cache/bazel/_bazel_sam/a31973e2a597fcbee537e0e8a93418a8/external/bazel_tools/tools/build_defs/repo/git.bzl", line 74, in _clone_or_update
        fail(("error cloning %s:\n%s" % (ctx....)))
error cloning cudnn_frontend_archive:
+ cd /home/sam/.cache/bazel/_bazel_sam/a31973e2a597fcbee537e0e8a93418a8/external
+ rm -rf /home/sam/.cache/bazel/_bazel_sam/a31973e2a597fcbee537e0e8a93418a8/external/cudnn_frontend_archive /home/sam/.cache/bazel/_bazel_sam/a31973e2a597fcbee537e0e8a93418a8/external/cudnn_frontend_archive
+ git clone https://oauth2:J64G8MymaUmqNKG_N3rR@gitlab-master.nvidia.com/cudnn/cudnn_frontend.git /home/sam/.cache/bazel/_bazel_sam/a31973e2a597fcbee537e0e8a93418a8/external/cudnn_frontend_archive
Cloning into '/home/sam/.cache/bazel/_bazel_sam/a31973e2a597fcbee537e0e8a93418a8/external/cudnn_frontend_archive'...
fatal: unable to access 'https://gitlab-master.nvidia.com/cudnn/cudnn_frontend.git/': Could not resolve host: gitlab-master.nvidia.com; Unknown error
+ git clone https://oauth2:J64G8MymaUmqNKG_N3rR@gitlab-master.nvidia.com/cudnn/cudnn_frontend.git /home/sam/.cache/bazel/_bazel_sam/a31973e2a597fcbee537e0e8a93418a8/external/cudnn_frontend_archive
Cloning into '/home/sam/.cache/bazel/_bazel_sam/a31973e2a597fcbee537e0e8a93418a8/external/cudnn_frontend_archive'...
fatal: unable to access 'https://gitlab-master.nvidia.com/cudnn/cudnn_frontend.git/': Could not resolve host: gitlab-master.nvidia.com; Unknown error
samhodge commented 2 years ago

The workaround is to clone https://github.com/NVIDIA/cudnn-frontend.git into a directory one level above the tensorflow repository and to use the r1.15.5+nv22.05 branch source code.
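A minimal sketch of that workaround, assuming the NVIDIA/tensorflow checkout lives in a directory named tensorflow (the parent path is a placeholder):

    # Place the public cudnn-frontend clone one level above the tensorflow checkout
    cd /path/to/parent-of-tensorflow
    git clone https://github.com/NVIDIA/cudnn-frontend.git
    # Then build from the r1.15.5+nv22.05 branch of NVIDIA/tensorflow
    cd tensorflow
    git checkout r1.15.5+nv22.05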

kognat-docs commented 2 years ago

This is functional on RHEL 7 with glibc 2.17

nluehr commented 2 years ago

The "workaround" of cloning cudnn-frontend from NVIDIA/cudnn-frontend noted above is the correct procedure as documented in the build instructions here.

The reference to the private repository (https://gitlab-master.nvidia.com/cudnn/cudnn_frontend.git) was replaced starting in 21.08 with the following:

    native.new_local_repository(
        name = "cudnn_frontend_archive",
        build_file = clean_dep("//third_party:cudnn_frontend.BUILD"),
        path = "../cudnn-frontend",
    )
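The relative path appears to be resolved against the TensorFlow workspace root, so the cudnn-frontend clone has to sit directly beside the tensorflow checkout. A quick sanity check (header location assumed from the public NVIDIA/cudnn-frontend layout):

    # Run from the TensorFlow workspace root; should list the cuDNN frontend header
    ls ../cudnn-frontend/include/cudnn_frontend.h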
kognat-docs commented 2 years ago

[image attachment]

Sorry, looks like I needed to get my glasses out and read the fine print.

Thanks for the update

Sam