docker pull the latest version of the rocm/tensorflow image, use python to run the built-in /tf_cnn_benchmarks.py to perform inference testing on MI210

ROCm / tensorflow-upstream

TensorFlow ROCm port

https://tensorflow.org

Apache License 2.0

docker pull the latest version of the rocm/tensorflow image, use python to run the built-in /tf_cnn_benchmarks.py to perform inference testing on MI210 #2335

Open buaimaoxiansheng opened 8 months ago

buaimaoxiansheng commented 8 months ago

Issue type

Bug

Have you reproduced the bug with TensorFlow Nightly?

Source

source

TensorFlow version

latest

Custom code

OS platform and distribution

Linux Ubuntu22.04.3

Mobile device

Linux Ubuntu22.04.3

Python version

3.9

Bazel version

No response

GCC/compiler version

gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0

CUDA/cuDNN version

No response

GPU model and memory

MI210

Current behavior?

I'm using Ubuntu 22.04.3 with ROCm 5.7.1.

I want to run an inference test on MI210 with python ./tf_cnn_benchmarks.py --forward_only=True --data_name=imagenet --model=resnet50 --num_batches=50000 --batch_size=8 --num_gpus=4. In addition, could you please provide a training and testing method equivalent to NVIDIA's ./benchmark.sh 0,1,2,3 tool?

For inference testing, I used docker to pull rocm/tensorflow:latest. Running tf_cnn_benchmarks.py with python under /benchmarks/scripts/tf_cnn_benchmarks runs into two problems:

The first problem is that the command above fails with a keras or keras.api error when the --model=resnet50 parameter is used. The model can't be found.

The second problem is this error: tensorflow.python.framework.errors_impl.UnknownError: Failed to query the available memory for GPU 0.

Standalone code to reproduce the issue

Environment: Ubuntu 22.04.3 with ROCm 5.7.1, inside the rocm/tensorflow:latest container, in /benchmarks/scripts/tf_cnn_benchmarks:

python ./tf_cnn_benchmarks.py --forward_only=True --data_name=imagenet --model=resnet50 --num_batches=50000 --batch_size=8 --num_gpus=4

This produces the keras or keras.api error for --model=resnet50 and the tensorflow.python.framework.errors_impl.UnknownError: Failed to query the available memory for GPU 0 error.

Relevant log output

No response

wenchenvincent commented 8 months ago

The likely root cause of the first issue is that the tf_cnn_benchmarks checkout in rocm/tensorflow:latest (rocm5.7.1) does not have the latest commit. You can update it to the latest commit with "git pull". See if that helps.

For the second issue, I suspect that it is due to an old driver. Could you run "rocm-smi" on the host and post the driver version?

buaimaoxiansheng commented 8 months ago

Thank you very much for your help. For the first issue, I'll try that. For the second issue, I checked the driver version using the rocm-smi --showdriverversion command. The driver version is 5.15.0-91-generic.

wenchenvincent commented 8 months ago

Hi, the reported driver version was consistent with my suspicion.

We got the same error with driver version 5.13.20.22.10 and didn’t have any issues with driver version 6.2.4 or 6.3.4.

Could you try upgrading your driver to see if that helps?

buaimaoxiansheng commented 8 months ago

OK, thanks. I'll try. I'll let you know once I have updated the driver and tested it.

buaimaoxiansheng commented 8 months ago

In addition, could you please provide a tool for training and testing like the NVIDIA benchmark, which is run with the ./benchmark 0,1,2,3 command? After the test completes, it reports the results for each model, as shown in the following figure and table:

[Image: WeCom screenshot 1703059699258]

[Image: WeCom screenshot 1703059713134]

buaimaoxiansheng commented 8 months ago

Is the version you want to know the one shown by rocm-smi --showdriverversion? On my machine, rocm-smi --showdriverversion reports the same value as uname -r. Does it require an operating system with a 6.2 kernel?

wenchenvincent commented 8 months ago

Yes, it is the version shown by rocm-smi --showdriverversion. No, you don't need to upgrade the OS kernel. Just the AMD GPU kernel driver.
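A minimal sketch of the check discussed above, assuming the two strings have already been copied from rocm-smi --showdriverversion and uname -r. The helper names are hypothetical, and the 6.2.4 threshold is only the oldest version reported as working in this thread, not an official support boundary.

```python
import re

def numeric_version(version: str) -> tuple:
    """Return the leading dotted-numeric part of a version string as ints."""
    match = re.match(r"\d+(?:\.\d+)*", version.strip())
    if match is None:
        raise ValueError(f"no numeric version found in {version!r}")
    return tuple(int(part) for part in match.group(0).split("."))

# Oldest driver version reported as working in this thread (assumption:
# used here only as a rough threshold, not an official minimum).
KNOWN_GOOD = numeric_version("6.2.4")

def driver_looks_old(driver_version: str) -> bool:
    """True if the reported driver version sorts below KNOWN_GOOD."""
    return numeric_version(driver_version) < KNOWN_GOOD

def matches_kernel_release(driver_version: str, kernel_release: str) -> bool:
    """True if rocm-smi reported the same string as `uname -r`."""
    return driver_version.strip() == kernel_release.strip()

if __name__ == "__main__":
    reported = "5.15.0-91-generic"  # value posted earlier in this thread
    print(driver_looks_old(reported))
    print(matches_kernel_release(reported, "5.15.0-91-generic"))
```

If both checks are true, as they are for the values posted here, upgrading the amdgpu kernel driver (not the OS kernel) is the step suggested above.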

wenchenvincent commented 8 months ago

I am not sure of the ./benchmark 0,1,2,3 command that you were referring to... Was it a script from NVIDIA?

buaimaoxiansheng commented 8 months ago

Yes. It is a benchmark tool for CUDA: pull the TensorFlow image in docker and use ./benchmark.sh 0,1,2,3 to train and test on four GPUs.

buaimaoxiansheng commented 8 months ago

wenchenvincent commented 8 months ago

https://rocm.docs.amd.com/en/docs-5.7.1/deploy/linux/os-native/upgrade.html

wenchenvincent commented 8 months ago

I suppose you can also use the script from NVIDIA when running tf_cnn_benchmarks on ROCm (with minor modification).

buaimaoxiansheng commented 8 months ago

Hello. Do I need to change version=5.7 in the figure below to 6.2.4?

No. 5.7 is the ROCm version. It is different from the amdgpu driver version. The page shows how to upgrade the amdgpu kernel driver (to the latest) with ROCm 5.7.

buaimaoxiansheng commented 8 months ago

I know that ROCm HIP can convert CUDA code into HIP code. Could you please provide installation and usage instructions?

wenchenvincent commented 8 months ago

I don't think you need conversion from CUDA code to HIP code here. The benchmark script from NVIDIA probably does not have any CUDA code in it. It is probably a shell script for invoking benchmarking commands and processing the logs.

buaimaoxiansheng commented 8 months ago

How do I modify the code of the benchmark test script? I'm not good at writing code in deep learning.

buaimaoxiansheng commented 8 months ago

We pull the TensorFlow image in the docker environment. For the training benchmark, we use the ./benchmark.sh command from NVIDIA's lambda-tensorflow-benchmark. The command is ./benchmark.sh 0,1,2,3 (0,1,2,3 stands for the four GPUs). It uses ./benchmarks/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py from benchmarks-master to run the benchmark. For the inference benchmark, we use python ./benchmarks/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --forward_only=True --data_name=imagenet --model=resnet50 --num_batches=50000 --batch_size=8 --num_gpus=4. I have a question: these are benchmark tools for CUDA, so can they be used with the AMD MI210 drivers? NVIDIA's drivers are not the same as AMD's. Can you provide a benchmarking tool for training and inference on the AMD MI210 GPU? Thanks.

wenchenvincent commented 8 months ago

If you look at the content of the benchmark script, I suspect it is a shell script and it is platform independent. You can try that script on AMD GPUs.
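A minimal sketch of what such a platform-independent wrapper could look like, assuming it only needs to assemble the tf_cnn_benchmarks command line already quoted in this thread. The helper names are hypothetical and this is not the contents of NVIDIA's script. HIP_VISIBLE_DEVICES is the ROCm environment variable for restricting visible GPUs; whether the original script sets a comparable variable is an assumption.

```python
import shlex

# Script path as quoted in this thread; adjust to the local checkout.
SCRIPT = "./benchmarks/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py"

def parse_gpu_list(arg):
    """Turn a string such as "0,1,2,3" into a list of GPU indices."""
    return [int(item) for item in arg.split(",") if item.strip()]

def build_command(model, gpu_ids, forward_only=True, batch_size=8,
                  num_batches=50000):
    """Assemble the command with the flags used in this thread.

    forward_only=True runs inference; forward_only=False runs training.
    """
    return [
        "python", SCRIPT,
        f"--forward_only={forward_only}",
        "--data_name=imagenet",
        f"--model={model}",
        f"--num_batches={num_batches}",
        f"--batch_size={batch_size}",
        f"--num_gpus={len(gpu_ids)}",
    ]

def build_env(gpu_ids):
    """Environment entries that limit the run to the selected GPUs."""
    return {"HIP_VISIBLE_DEVICES": ",".join(str(i) for i in gpu_ids)}

if __name__ == "__main__":
    gpus = parse_gpu_list("0,1,2,3")
    # Print the commands instead of running them, so they can be reviewed.
    for model in ("resnet50", "resnet152", "alexnet"):
        print(shlex.join(build_command(model, gpus)))
```

The printed commands can be pasted into the container shell, or passed to subprocess.run together with the entries from build_env merged into os.environ.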

wenchenvincent commented 8 months ago

The tf_cnn_benchmark is one way of benchmarking the CNN training and inference. But I think we're going to deprecate it soon.

When you say you're looking for benchmarking on AMD GPUs, is there a performance metric that you're looking at? Also, do the models need to be in TensorFlow specifically? Could the models be in PyTorch or ONNX?

buaimaoxiansheng commented 8 months ago

buaimaoxiansheng commented 8 months ago

Yes. I am looking for a performance metric to test the GPU against models such as vgg16, ssd300, resnet50, resnet152, inception4, inception3, alexnet, etc. The metric is the number of images processed per second when training or running inference with these models. The models are not limited to TensorFlow; PyTorch is also fine.
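The images-per-second figure can be collected from the benchmark logs. A minimal sketch, assuming the log ends with the "total images/sec: <value>" summary line that tf_cnn_benchmarks prints; the function name and the sample log are illustrative only, not measured results.

```python
import re

# Summary line printed at the end of a tf_cnn_benchmarks run.
THROUGHPUT_RE = re.compile(r"total images/sec:\s*([0-9]+(?:\.[0-9]+)?)")

def parse_images_per_sec(log_text):
    """Return the last reported total images/sec as a float, or None."""
    matches = THROUGHPUT_RE.findall(log_text)
    return float(matches[-1]) if matches else None

if __name__ == "__main__":
    # Made-up log fragment for illustration, not a real measurement.
    sample = "Step\tImg/sec\ttotal_loss\n...\ntotal images/sec: 123.45\n"
    print(parse_images_per_sec(sample))
```

Collecting this value per model and per GPU count gives a table comparable to the one described above.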

buaimaoxiansheng commented 7 months ago

If you don't have a benchmark for NV, do you have a test for TFLOPS? Or MLPerf?

wenchenvincent commented 7 months ago

@sunway513 Do you know if we have any public benchmarks for training and inference on MI200?