ROCm / tensorflow-upstream

TensorFlow ROCm port
https://tensorflow.org
Apache License 2.0

docker pull the latest version of the rocm/tensorflow image, use python to run the built-in /tf_cnn_benchmarks.py to perform inference testing on MI210 #2335

Open buaimaoxiansheng opened 8 months ago

buaimaoxiansheng commented 8 months ago

Issue type

Bug

Have you reproduced the bug with TensorFlow Nightly?

No

Source

source

TensorFlow version

latest

Custom code

No

OS platform and distribution

Linux Ubuntu22.04.3

Mobile device

Linux Ubuntu22.04.3

Python version

3.9

Bazel version

No response

GCC/compiler version

gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0

CUDA/cuDNN version

No response

GPU model and memory

MI210

Current behavior?

I'm using Ubuntu 22.04.3 with ROCm version 5.7.1.

I want to run python ./tf_cnn_benchmarks.py --forward_only=True --data_name=imagenet --model=resnet50 --num_batches=50000 --batch_size=8 --num_gpus=4 as an inference test on MI210. In addition, could you please provide the training and testing method for NVIDIA's ./bencher.sh 0,1,2,3 tool?

For inference testing, I used docker to pull rocm/tensorflow:latest. Running tf_cnn_benchmarks.py with python under /benchmarks/scripts/tf_cnn_benchmarks runs into two problems:

The first problem is that python ./tf_cnn_benchmarks.py --forward_only=True --data_name=imagenet --model=resnet50 --num_batches=50000 --batch_size=8 --num_gpus=4 fails with keras or keras.api errors when the --model=resnet50 parameter is used. The model can't be found.

The second problem is that the inference test fails with tensorflow.python.framework.errors_impl.UnknownError: Failed to query available memory for GPU 0.

Standalone code to reproduce the issue

Inside the rocm/tensorflow:latest container (host: Ubuntu 22.04.3, ROCm 5.7.1), from /benchmarks/scripts/tf_cnn_benchmarks:

python ./tf_cnn_benchmarks.py --forward_only=True --data_name=imagenet --model=resnet50 --num_batches=50000 --batch_size=8 --num_gpus=4

This produces the two errors described under "Current behavior?".
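For anyone reproducing this, the flags are easy to mistype when pasted as one long line. The following is a minimal sketch that assembles the same command from keyword arguments; the helper name build_benchmark_command is hypothetical and not part of tf_cnn_benchmarks, and the flag names are the ones quoted in this issue.

```python
import shlex


def build_benchmark_command(script="./tf_cnn_benchmarks.py", **flags):
    """Return the argv list for one tf_cnn_benchmarks run.

    Hypothetical helper: each keyword becomes a --name=value flag,
    in the order the keywords were given.
    """
    argv = ["python", script]
    for name, value in flags.items():
        argv.append(f"--{name}={value}")
    return argv


cmd = build_benchmark_command(
    forward_only=True,
    data_name="imagenet",
    model="resnet50",
    num_batches=50000,
    batch_size=8,
    num_gpus=4,
)
# Print a copy-pasteable shell line; pass `cmd` to subprocess.run to execute it.
print(shlex.join(cmd))
```

Keeping the flags in a list rather than one string avoids the stray words and missing spaces that crept into the command above.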

Relevant log output

No response

wenchenvincent commented 8 months ago

The root cause of the first issue is that the tf_cnn_benchmark in the rocm/tensorflow:latest (rocm5.7.1) might not have the latest commit. You can update it to the latest commit by "git pull". See if that helps.

For the second issue, I suspect that it is due to an old driver. Could you run "rocm-smi" on the host and post the driver version?

buaimaoxiansheng commented 8 months ago

Thank you very much for your help. For the first issue, I'll try that. For the second issue, I checked the driver version with the rocm-smi --showdriverversion command. The reported driver version is 5.15.0-91-generic.

wenchenvincent commented 8 months ago

Hi, the reported driver version was consistent with my suspicion.

We got the same error with driver version 5.13.20.22.10 and didn’t have any issues with driver version 6.2.4 or 6.3.4.
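A minimal sketch of comparing a reported driver version against these data points. The function names are hypothetical, and using 6.2.4 as a floor is only an assumption drawn from the versions mentioned in this thread, not an official support matrix.

```python
import re

# Data points reported in this thread only; not an official support matrix.
REPORTED_FAILING = "5.13.20.22.10"
REPORTED_WORKING = ("6.2.4", "6.3.4")


def parse_driver_version(text):
    """Return a tuple of ints for a purely dotted version, else None."""
    text = text.strip()
    if not re.fullmatch(r"\d+(\.\d+)*", text):
        return None  # e.g. a kernel release such as '5.15.0-91-generic'
    return tuple(int(part) for part in text.split("."))


def is_at_least(version, minimum):
    """True/False for comparable versions, None if either cannot be parsed."""
    parsed = parse_driver_version(version)
    floor = parse_driver_version(minimum)
    if parsed is None or floor is None:
        return None
    return parsed >= floor


print(is_at_least(REPORTED_FAILING, "6.2.4"))     # False
print(is_at_least("6.3.4", "6.2.4"))              # True
print(is_at_least("5.15.0-91-generic", "6.2.4"))  # None
```

Returning None for a string that is not a plain dotted version keeps a kernel release from being silently compared as if it were a driver version.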

Could you try upgrading your driver to see if that helps?

buaimaoxiansheng commented 8 months ago

OK, thanks. I'll try. I'll let you know once I have updated and tested the driver.

buaimaoxiansheng commented 8 months ago

In addition, could you please provide a tool for training and testing like the NVIDIA benchmark, which is run with the ./benchmark 0,1,2,3 command? After the test completes, the results for each model are reported, as shown in the following figure and table:

WeCom screenshot_1703059699258

WeCom screenshot_1703059713134

buaimaoxiansheng commented 8 months ago

Is the version you want the one shown by rocm-smi --showdriverversion? For me, rocm-smi --showdriverversion shows the same value as uname -r. Does it require an operating system with a 6.2 kernel?

wenchenvincent commented 8 months ago

Yes, it is the version shown by rocm-smi --showdriverversion. No, you don't need to upgrade the OS kernel. Just the AMD GPU kernel driver.
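A minimal sketch of telling the two apart on the host. It assumes the amdgpu module exposes its version at /sys/module/amdgpu/version when a packaged driver is installed; that path is an assumption and the file may be absent, in which case a reported value equal to uname -r is most likely just the kernel release. The function names are hypothetical.

```python
import platform
from pathlib import Path


def read_amdgpu_module_version(path="/sys/module/amdgpu/version"):
    """Return the module version string, or None if the file is missing.

    The default sysfs path is an assumption, not confirmed in this thread.
    """
    try:
        return Path(path).read_text().strip()
    except OSError:
        return None


def describe_driver(reported, kernel_release, module_version):
    """Explain what a reported 'driver version' most likely refers to."""
    if module_version:
        return f"amdgpu module version {module_version}"
    if reported == kernel_release:
        return "reported value is the kernel release, not an amdgpu driver version"
    return f"reported driver version {reported}"


# '5.15.0-91-generic' is the value posted earlier in this issue.
print(describe_driver(
    reported="5.15.0-91-generic",
    kernel_release=platform.release(),
    module_version=read_amdgpu_module_version(),
))
```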

wenchenvincent commented 8 months ago

I am not sure of the ./benchmark 0,1,2,3 command that you were referring to... Was it a script from NVIDIA?

buaimaoxiansheng commented 8 months ago

Yes. It is a benchmark tool used with CUDA: we pull the TensorFlow image in docker and use ./benchmark.sh 0,1,2,3 to train and test on four GPUs.

buaimaoxiansheng commented 8 months ago

uname -r and rocm-smi --showdriverversion display the same value for me, so I am not sure what the driver version actually is, as shown below. Can you tell me how to upgrade the AMD GPU kernel driver to version 6.2.4 or 6.3.4?

WeCom screenshot_17031283886987

wenchenvincent commented 8 months ago

https://rocm.docs.amd.com/en/docs-5.7.1/deploy/linux/os-native/upgrade.html

wenchenvincent commented 8 months ago

I suppose you can also use the script from NVIDIA when running tf_cnn_benchmarks on ROCm (with minor modification).

buaimaoxiansheng commented 8 months ago

Hello. I need to change version=5.7 in the figure below to 6.2.4, right?

image

No. 5.7 is the ROCm version. It is different from the amdgpu driver version. The page shows how to upgrade the amdgpu kernel driver (to the latest version) with ROCm 5.7.

buaimaoxiansheng commented 8 months ago

I know that ROCm HIP can convert CUDA code into HIP code. Could you please explain how to install and use it?

wenchenvincent commented 8 months ago

I don't think you need conversion from CUDA code to HIP code here. The benchmark script from NVIDIA probably does not have any CUDA code in it. It is probably a shell script for invoking benchmarking commands and processing the logs.
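To illustrate the "processing the logs" part, here is a minimal sketch that pulls the throughput out of a tf_cnn_benchmarks log. It assumes the run ends with a line of the form "total images/sec: <number>"; the sample log text below is made up for illustration, and the function names are hypothetical rather than taken from NVIDIA's script.

```python
import re

# Matches the summary line, e.g. "total images/sec: 1234.56".
TOTAL_RE = re.compile(r"^total images/sec:\s*([0-9]+(?:\.[0-9]+)?)", re.MULTILINE)


def total_images_per_sec(log_text):
    """Return the reported throughput as a float, or None if it is absent."""
    match = TOTAL_RE.search(log_text)
    return float(match.group(1)) if match else None


def summarize(logs):
    """Map model name -> throughput for a dict of {model: log text}."""
    return {model: total_images_per_sec(text) for model, text in logs.items()}


# Made-up sample output, only to exercise the parser.
sample_log = (
    "Step\tImg/sec\ttotal_loss\n"
    "10\timages/sec: 100.0 +/- 0.0\n"
    "----------------\n"
    "total images/sec: 1234.56\n"
    "----------------\n"
)
print(summarize({"resnet50": sample_log, "broken_run": "error"}))
```

Nothing here depends on CUDA or HIP; it only reads text, which is why a wrapper of this kind should carry over between GPU vendors.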

buaimaoxiansheng commented 8 months ago

How do I modify the code of the benchmark test script? I'm not good at writing code in deep learning.

buaimaoxiansheng commented 8 months ago

We pull the TensorFlow image in the docker environment. For the training benchmark, we use the ./benchmark.sh script from NVIDIA's lambda-tensorflow-benchmark. The command we use is ./benchmark.sh 0,1,2,3 (0,1,2,3 are the IDs of the four GPUs). For the inference benchmark, we use ./benchmarks/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py from benchmarks-master, run as python ./benchmarks/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --forward_only=True --data_name=imagenet --model=resnet50 --num_batches=50000 --batch_size=8 --num_gpus=4. I have a question: these are benchmark tools used with CUDA, so can they be used with the AMD MI210 drivers? NVIDIA's drivers are not the same as AMD's. Can you provide a benchmarking tool for training and inference on the AMD MI210 GPU? Thanks.
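A minimal sketch of how a GPU ID list such as 0,1,2,3 could be mapped onto a ROCm run. It assumes the list is meant to select devices: HIP_VISIBLE_DEVICES is the ROCm environment variable for restricting visible GPUs, while the helper name gpu_run_settings is hypothetical and this is not how benchmark.sh itself is known to work.

```python
import os


def gpu_run_settings(gpu_ids, base_env=None):
    """Turn '0,1,2,3' into (environment, number of GPUs) for one run.

    Hypothetical helper: the environment is a copy, so the caller's
    os.environ is left untouched.
    """
    ids = [part.strip() for part in gpu_ids.split(",") if part.strip()]
    env = dict(os.environ if base_env is None else base_env)
    env["HIP_VISIBLE_DEVICES"] = ",".join(ids)
    return env, len(ids)


env, num_gpus = gpu_run_settings("0,1,2,3", base_env={})
print(env["HIP_VISIBLE_DEVICES"], f"--num_gpus={num_gpus}")
```

The returned environment can be passed as env= to subprocess.run together with the tf_cnn_benchmarks command line, with num_gpus filling the --num_gpus flag.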

image

image

wenchenvincent commented 8 months ago

If you look at the content of the benchmark script, I suspect it is a shell script and it is platform independent. You can try that script on AMD GPUs.

wenchenvincent commented 8 months ago

The tf_cnn_benchmark is one way of benchmarking the CNN training and inference. But I think we're going to deprecate it soon.

When you say you're looking for benchmarking on AMD GPU, is there a performance metric that you're looking at? Also, do the models need to be in TensorFlow specifically? Could the models be in PyTorch or ONNX?

buaimaoxiansheng commented 8 months ago

I tried to run the benchmark script on the AMD GPU platform, but the test failed with an error.

buaimaoxiansheng commented 8 months ago

Yes.

The root cause of the first issue is that the tf_cnn_benchmarks checkout in rocm/tensorflow:latest (ROCm 5.7.1) might not be at the latest commit. You can update it with "git pull"; see if that helps. For the second issue, I suspect an old driver. Could you run "rocm-smi" on the host and post the driver version?

Thank you very much for your help. I'll try your suggestion for the first issue. For the second issue, I checked the driver version with the rocm-smi --showdriverversion command. The driver version is 5.15.0-91-generic.

Hi, the reported driver version is consistent with my suspicion. We got the same error with driver version 5.13.20.22.10 and didn't have any issues with driver version 6.2.4 or 6.3.4. Could you try upgrading your driver to see if that helps?
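To make the version comparison above concrete, here is a minimal Python sketch. It only compares the leading dotted-numeric part of the version strings quoted in this thread. The helper names (`parse_version`, `looks_like_kernel_release`) are hypothetical, and the assumption that numeric ordering says anything about whether the error occurs is mine, not something the thread confirms. The second helper is a rough heuristic for spotting strings shaped like a Linux kernel release (such as `5.15.0-91-generic`), which may be worth double-checking before treating the value as a driver package version.

```python
import re


def parse_version(text):
    """Return the leading dotted-numeric part of a version string as a tuple of ints."""
    match = re.match(r"\d+(?:\.\d+)*", text.strip())
    if match is None:
        raise ValueError(f"no leading numeric version in {text!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def looks_like_kernel_release(text):
    """Heuristic (assumption): matches strings shaped like '5.15.0-91-generic'."""
    return re.fullmatch(r"\d+\.\d+\.\d+-\d+-[A-Za-z]+", text.strip()) is not None


# Version strings quoted in this thread.
failing = parse_version("5.13.20.22.10")
working = [parse_version("6.2.4"), parse_version("6.3.4")]
reported = "5.15.0-91-generic"

# Is the reported version numerically older than the oldest known-good one?
print(parse_version(reported) < min(working))
# Does the reported string have the shape of a kernel release string?
print(looks_like_kernel_release(reported))
```

Comparing tuples of integers avoids the usual pitfall of comparing version strings lexicographically.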

In addition, could you please provide a tool for training and testing like the NVIDIA benchmark that is run with ./benchmark 0,1,2,3? After the test completes, the results for each model are reported, as shown in the following figure and table: [image: WeCom screenshot 1703059699258] [image: WeCom screenshot 1703059713134]

I am not sure which command you are referring to (./benchmark 0,1,2,3). Was it a script from NVIDIA?

Yes. It is a benchmark tool for CUDA: pull the TensorFlow image in Docker and use ./benchmark.sh 0,1,2,3 to train and test on four GPUs.

I suppose you can also use the script from NVIDIA when running tf_cnn_benchmarks on ROCm (with minor modification).

I know that ROCm HIP can convert CUDA code into HIP code. Could you please explain how to install and use it?

I don't think you need conversion from CUDA code to HIP code here. The benchmark script from NVIDIA probably does not have any CUDA code in it. It is probably a shell script for invoking benchmarking commands and processing the logs.

We are pulling the TensorFlow image in the Docker environment. For the training benchmark, we use the ./benchmark.sh script from NVIDIA's lambda-tensorflow-benchmark; the command is ./benchmark.sh 0,1,2,3 (0,1,2,3 denotes the four GPUs). It uses ./benchmarks/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py from benchmarks-master to drive the benchmark test. For the inference benchmark, we run python ./benchmarks/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --forward_only=True --data_name=imagenet --model=resnet50 --num_batches=50000 --batch_size=8 --num_gpus=4. I have a question: these are benchmark tools for CUDA, so can they be used with the AMD MI210 drivers, given that NVIDIA's drivers are not the same as AMD's? Can you provide a benchmarking tool for training and inference on the AMD MI210 GPU? Thanks.
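As a rough illustration of the point made earlier in the thread (the wrapper is probably just invoking benchmarking commands), the following Python sketch builds the tf_cnn_benchmarks command lines for a comma-separated GPU list. It is not the NVIDIA/Lambda script. The function names `parse_gpu_ids` and `build_command` are hypothetical, the script path is copied from the comment above, and the flags are the ones already used in this issue.

```python
import shlex

# Path copied from the comment above; adjust to where the benchmarks repo is checked out.
SCRIPT = "./benchmarks/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py"


def parse_gpu_ids(spec):
    """Turn a string such as '0,1,2,3' into a list of integer GPU ids."""
    return [int(item) for item in spec.split(",") if item.strip()]


def build_command(model, gpu_ids, batch_size=8, num_batches=50000, forward_only=True):
    """Build one tf_cnn_benchmarks invocation as an argument list (nothing is executed)."""
    return [
        "python", SCRIPT,
        f"--forward_only={forward_only}",
        "--data_name=imagenet",
        f"--model={model}",
        f"--num_batches={num_batches}",
        f"--batch_size={batch_size}",
        f"--num_gpus={len(gpu_ids)}",
    ]


gpu_ids = parse_gpu_ids("0,1,2,3")
# Model names below are the ones that appear in this thread.
for model in ["resnet50", "resnet152", "alexnet"]:
    print(shlex.join(build_command(model, gpu_ids)))
```

To actually run a benchmark, each argument list could be passed to `subprocess.run`. Restricting which devices are visible (for example through the `HIP_VISIBLE_DEVICES` environment variable on ROCm) is deliberately left out of this sketch.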

The tf_cnn_benchmark is one way of benchmarking the CNN training and inference. But I think we're going to deprecate it soon.

When you say you're looking for benchmarking on AMD GPU, is there a performance metric that you're looking at? Also, do the models need to be in TensorFlow specifically? Could the models be in PyTorch or ONNX?

Yes. I am looking for performance metrics to test the GPU with models such as Vgg6, ssd300, resnet50, resnet152, ince4, ince3, and alexnet. The metric is the number of images processed per second when training or running inference with these models. The models are not limited to TensorFlow; they can also be in PyTorch.
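The images-per-second metric described above can be sketched in Python as follows. The first helper computes throughput from a run's parameters, assuming `--batch_size` is the per-GPU batch size. The second pulls the summary figure out of a benchmark log, assuming the log ends with a line of the form `total images/sec: <number>`. Both function names are hypothetical, and the sample log and its number are made up for illustration; they are not measurements from an MI210.

```python
import re

# Assumed shape of the summary line printed at the end of a run.
TOTAL_RE = re.compile(r"total images/sec:\s*([0-9]+(?:\.[0-9]+)?)")


def images_per_sec(batch_size_per_gpu, num_gpus, num_batches, elapsed_seconds):
    """Throughput = total images processed divided by wall-clock time."""
    return batch_size_per_gpu * num_gpus * num_batches / elapsed_seconds


def parse_total_images_per_sec(log_text):
    """Return the last 'total images/sec' value found in a log, or None if absent."""
    matches = TOTAL_RE.findall(log_text)
    return float(matches[-1]) if matches else None


# Made-up log excerpt, for illustration only (not a real measurement).
sample_log = "Step Img/sec total_loss\n...\ntotal images/sec: 1234.56\n"

print(parse_total_images_per_sec(sample_log))
# 8 images per GPU * 4 GPUs * 100 batches in 2 seconds.
print(images_per_sec(8, 4, 100, 2.0))
```

Collecting this one number per model gives the kind of per-model table shown in the screenshots earlier in the thread.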

buaimaoxiansheng commented 7 months ago

If you don't have a benchmark for NV, do you have a test for TFLOPS? Or MLPerf?

wenchenvincent commented 7 months ago

@sunway513 Do you know if we have any public benchmarks for training and inference on MI200?