Is it possible to build tf1.15 with cuda11, to run tf1.x code on RTX 30XX?

Fannhhyy commented 3 years ago

https://github.com/nvidia/tensorflow This version of tf1.15 can run on RTX 30XX, but only on Linux. I tried to build it on Windows, but failed.

fo40225 commented 3 years ago

I don't have an RTX 3090, but I think you can try these steps.

Download the latest CUDA toolkit and install only the driver from it.
Download the CUDA toolkit of the same version that was used to compile the tensorflow binary you are using. Run the installer, skip the driver install, install the CUDA runtime, and check that PATH contains the CUDA bin folder.
Download the cuDNN of the same version that was used to compile the tensorflow binary you are using; its CUDA version must also match the CUDA runtime you just installed. Place the .dll in the CUDA bin folder.

If you use this repo's whl, 1.15 is built with cuda 10.1.243_426.00 / cudnn 7.6.4.38 for cuda 10.1. If you use the official pip package, I guess the cuda/cudnn version is 10.0/7.6.x.
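The PATH check in the steps above can be sketched in Python. This is only an illustration: the helper name `dirs_on_path_containing` is hypothetical, and the DLL file names are assumptions for a CUDA 10.0 / cuDNN 7 build, so adjust them to the versions your tensorflow binary was compiled against.

```python
import os


def dirs_on_path_containing(filename, path_value=None):
    """Return every directory listed in PATH that contains `filename`."""
    if path_value is None:
        path_value = os.environ.get("PATH", "")
    hits = []
    for entry in path_value.split(os.pathsep):
        entry = entry.strip('"')  # PATH entries are sometimes quoted on Windows
        if entry and os.path.isfile(os.path.join(entry, filename)):
            hits.append(entry)
    return hits


# Assumed DLL names for CUDA 10.0 and cuDNN 7; other versions use other names.
EXPECTED_DLLS = ["cudart64_100.dll", "cudnn64_7.dll"]

if __name__ == "__main__":
    for dll in EXPECTED_DLLS:
        found = dirs_on_path_containing(dll)
        print(dll, "->", found if found else "NOT FOUND on PATH")
```

If a DLL shows up in more than one directory, the first entry in the list is the one Windows would normally pick up, so a stale copy earlier in PATH is worth ruling out.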

Fannhhyy commented 3 years ago

Neither version works; the model's inference results are wrong.

fo40225 commented 3 years ago

You should use the CPU version of tensorflow to confirm that your model and code work.

A misconfigured CUDA environment usually causes an exception and an exit.
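One way to act on this is to run the same inputs through the CPU build and the GPU build, save both outputs, and compare them numerically. The sketch below assumes the outputs have already been flattened into plain lists of floats; the function name and the tolerances are illustrative choices, not values from this thread.

```python
import math


def compare_outputs(cpu_values, gpu_values, rel_tol=1e-3, abs_tol=1e-5):
    """Compare two flat sequences of floats and return a small report."""
    if len(cpu_values) != len(gpu_values):
        raise ValueError("outputs have different lengths")
    # NaN in the GPU output is reported separately because it is a common symptom.
    nan_count = sum(1 for v in gpu_values if math.isnan(v))
    # math.isclose() is False whenever either side is NaN, so NaN also counts here.
    mismatches = sum(
        1
        for c, g in zip(cpu_values, gpu_values)
        if not math.isclose(c, g, rel_tol=rel_tol, abs_tol=abs_tol)
    )
    return {
        "nan_in_gpu": nan_count,
        "mismatches": mismatches,
        "ok": nan_count == 0 and mismatches == 0,
    }
```

If the CPU output looks right and the report shows mismatches or NaN values, the model and code are probably fine and the GPU stack is the likelier suspect.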

Fannhhyy commented 3 years ago

I have a machine with three graphics cards: GTX 1080 Ti, RTX 2080 Ti, and RTX 3070. Only the RTX 3070 does not work.
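To isolate one card on a multi-GPU machine like this, the same script can be launched once per GPU with `CUDA_VISIBLE_DEVICES` set. The sketch below is an illustration only: `env_for_gpu` is a hypothetical helper, `train.py` in the comment is a placeholder script name, and which index maps to which card depends on your system. Setting `CUDA_DEVICE_ORDER=PCI_BUS_ID` makes the numbering follow PCI bus order instead of the default fastest-first order.

```python
import os


def env_for_gpu(gpu_index, base_env=None):
    """Return a copy of the environment that exposes only one GPU to CUDA."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"      # number devices by PCI bus position
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_index)  # hide every other device
    return env


# Example use (train.py is a placeholder for your own script):
#   import subprocess, sys
#   for index in range(3):
#       subprocess.run([sys.executable, "train.py"], env=env_for_gpu(index))
```

Running each card in its own process keeps the driver version, CUDA libraries, and code identical, so any difference in results comes down to the device.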

fo40225 commented 3 years ago

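So you have one machine with three generations of graphics cards installed, all using the same driver version, CUDA libraries, tf version, source code, and model, yet only the Ampere card produces wrong results. You may really have hit a bug in the older CUDA/cuDNN on the new card.

You can first try emptying %APPDATA%\NVIDIA\ComputeCache and setting the environment variable CUDA_CACHE_MAXSIZE=4294967295 to see whether that solves the problem.

Building the original tf1.15 with CUDA 11/cuDNN 8 would probably require a great deal of porting work. Fixing the Windows build problems in the NVIDIA version of the source code should be simpler.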
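The cache reset suggested here could be scripted roughly as follows. This is a sketch under assumptions: the helper names are hypothetical, the cache location is taken from the path quoted in this comment, and the environment variable is assumed to need setting before tensorflow initializes the GPU in the same process.

```python
import os
import shutil


def compute_cache_dir(environ=None):
    """Return %APPDATA%\\NVIDIA\\ComputeCache, or None if APPDATA is not set."""
    environ = os.environ if environ is None else environ
    appdata = environ.get("APPDATA")
    if appdata is None:
        return None
    return os.path.join(appdata, "NVIDIA", "ComputeCache")


def clear_directory(path):
    """Delete everything inside `path` but keep the directory; return the count."""
    if not os.path.isdir(path):
        return 0
    removed = 0
    for name in os.listdir(path):
        full = os.path.join(path, name)
        if os.path.isdir(full) and not os.path.islink(full):
            shutil.rmtree(full)
        else:
            os.remove(full)
        removed += 1
    return removed


# Example use, before importing tensorflow:
#   cache = compute_cache_dir()
#   if cache is not None:
#       clear_directory(cache)
#   os.environ["CUDA_CACHE_MAXSIZE"] = "4294967295"
```

The usage lines are left as comments so that nothing is deleted until you call the helpers deliberately.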

Fannhhyy commented 3 years ago

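Yes, the same machine has three generations of graphics cards, all running the keras example code. I have tried both clearing the cache and using the environment variable to stop it from using the cache, and neither works. Even tf2.4 built on cuda11 does not work properly. Only tf2.5 dev works, but our code is hard to migrate to it. I am trying to compile the NVIDIA version, but it hard-codes some things, so it cannot be compiled on Windows. Do you live in mainland China? If you need it, I can lend you the RTX 3070.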

fo40225 commented 3 years ago

Could you describe the steps you used to reproduce the problem with the keras example?

I think I should be able to borrow a 3090 for testing.

Fannhhyy commented 3 years ago

With tf1.15 and keras 2.3, even an example like keras\examples\cifar10_resnet.py cannot be trained; training results in NaN.

fo40225 commented 3 years ago

Test result

Windows
AMD Ryzen 7 5800x
gigabyte x570 aorus elite F30
4x ADATA DDR4-3200 32GB
Crucial P5 1TB
GIGABYTE RTX 3090 TURBO 24GB
Windows 10 Pro 1903
NVIDIA Driver 460.89
Anaconda 2020.02
keras 2.3.1

tensorflow-gpu 1.15.5 from pip
CUDA 10.0.130
CUDNN 7.6.5.32 for cuda10.0
error: CUBLAS_STATUS_EXECUTION_FAILED

tensorflow from this repo 1.15.0\py37\CPU+GPU\cuda101cudnn76avx2
CUDA 10.1.243
CUDNN 7.6.5.32 for cuda10.1
loss: nan

Linux
2x Intel Xeon Gold 6248R
16x Samsung DDR4-2933 64GB ECC RDIMM
Samsung PM983 1.92TB
2x GIGABYTE RTX 3090 TURBO 24GB
ubuntu 20.04 5.4.0-62
NVIDIA Driver 460.32.03
keras 2.3.1

nvcr.io/nvidia/tensorflow:20.03-tf1-py3 slow JIT, slow execute
nvcr.io/nvidia/tensorflow:20.06-tf1-py3 JIT, slow execute
nvcr.io/nvidia/tensorflow:20.07-tf1-py3 JIT, slow execute
nvcr.io/nvidia/tensorflow:20.08-tf1-py3 JIT, slow execute
nvcr.io/nvidia/tensorflow:20.09-tf1-py3 JIT, slow execute
nvcr.io/nvidia/tensorflow:20.10-tf1-py3 OK
nvcr.io/nvidia/tensorflow:20.11-tf1-py3 OK
nvcr.io/nvidia/tensorflow:20.12-tf1-py3 OK

fo40225 commented 3 years ago

I have fixed NVIDIA's code. The changes are in https://github.com/NVIDIA/tensorflow/pull/14

A whl built from this PR is at https://github.com/fo40225/tensorflow-windows-wheel/tree/master/1.15.4+nv20.12/

Build environment: visual studio 2019 16.8, cuda 11.1.1, cudnn 8.0.5.39

fo40225 / tensorflow-windows-wheel

Is it possible to build tf1.15 with cuda11 , to run tf1.x code on RTX 30XX? #167