buaimaoxiansheng opened this issue 8 months ago
The root cause of the first issue is that the tf_cnn_benchmarks checkout in the rocm/tensorflow:latest image (ROCm 5.7.1) might not have the latest commit. You can update it to the latest commit with "git pull". See if that helps.
For the second issue, I suspect it was due to an old driver. Could you run "rocm-smi" on the host and post the driver version?
Thank you very much for your help. For the first issue, I'll try that. For the second, I checked the driver version with the rocm-smi --showdriverversion command; it reports 5.15.0-91-generic.
Hi, the reported driver version was consistent with my suspicion.
We got the same error with driver version 5.13.20.22.10 and didn’t have any issues with driver version 6.2.4 or 6.3.4.
Could you try to upgrade your driver to see if it helps?
OK, thanks. I'll try upgrading the driver and let you know once I've tested it.
In addition, could you please provide a tool like the NVIDIA benchmark, which trains and tests with a ./benchmark 0,1,2,3 command? After the test completes, the results for each model appear, as shown in the following figure and table:
Is the version you want the one shown by rocm-smi --showdriverversion? On my machine, rocm-smi --showdriverversion reports the same string as uname -r. Does this require a 6.2 kernel?
Yes, it is the version shown by rocm-smi --showdriverversion. No, you don't need to upgrade the OS kernel, just the AMD GPU kernel driver.
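As an aside, a host where uname -r and rocm-smi report the same string most likely has only the in-tree amdgpu driver: the DKMS driver exposes its own version in /sys/module/amdgpu/version, and rocm-smi appears to fall back to the kernel release when that file is absent. A small sketch of that check (the optional root argument is only a hypothetical knob so the function can be tested against a fake sysfs tree; it is not part of any real tool):

```shell
#!/bin/sh
# Report the amdgpu driver version: prefer the module's own version
# string from sysfs, fall back to the kernel release otherwise.
amdgpu_driver_version() {
  root="${1:-}"                         # test-only override of "/"
  f="$root/sys/module/amdgpu/version"
  if [ -r "$f" ]; then
    cat "$f"        # DKMS driver installed, e.g. "6.3.4"
  else
    uname -r        # in-tree driver: only the kernel release is known
  fi
}

amdgpu_driver_version "$@"
```

On a box like the one above this would keep printing 5.15.0-91-generic until the DKMS driver is installed, after which it should print something like 6.2.4.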
I am not sure of the ./benchmark 0,1,2,3 command that you were referring to... Was it a script from NVIDIA?
Yes. It's a benchmark tool for CUDA: pull a TensorFlow image in Docker and use ./benchmark.sh 0,1,2,3 to train and test on four GPUs.
On my machine uname -r and rocm-smi --showdriverversion display the same string, so I don't really know what the driver version is, as shown below. Can you tell me how to upgrade the AMD GPU kernel driver to version 6.2.4 or 6.3.4?
https://rocm.docs.amd.com/en/docs-5.7.1/deploy/linux/os-native/upgrade.html
I suppose you can also use the script from NVIDIA when running tf_cnn_benchmarks on ROCm (with minor modification).
Hello. I need to change version=5.7 in the figure below to 6.2.4, right?
No. 5.7 is the ROCm version; it is different from the amdgpu driver version. The page shows how to upgrade the amdgpu kernel driver (to the latest) alongside ROCm 5.7.
I know that ROCm HIP can convert CUDA code into HIP code. Could you please explain how to install and use it?
I don't think you need conversion from CUDA code to HIP code here. The benchmark script from NVIDIA probably does not have any CUDA code in it. It is probably a shell script for invoking benchmarking commands and processing the logs.
How do I modify the benchmark script? I'm not good at writing deep-learning code.
We pulled the TensorFlow image in a Docker environment. For the training benchmark we use the ./benchmark.sh script from NVIDIA's lambda-tensorflow-benchmark repository, run as ./benchmark.sh 0,1,2,3 (0,1,2,3 selects the four GPUs); internally it drives ./benchmarks/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py from the benchmarks-master checkout. For the inference benchmark we run: python ./benchmarks/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --forward_only=True --data_name=imagenet --model=resnet50 --num_batches=50000 --batch_size=8 --num_gpus=4. My question: these are CUDA benchmark tools, and NVIDIA's drivers are not the same as AMD's, so can the AMD MI210 driver run them? Can you provide a benchmarking tool for training and inference on the AMD MI210? Thanks.
If you look at the content of the benchmark script, I suspect it is a shell script and it is platform independent. You can try that script on AMD GPUs.
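For what it's worth, a wrapper of that kind typically just maps the GPU list onto tf_cnn_benchmarks flags. A minimal hypothetical sketch (the function name, model list, and batch size are illustrative placeholders, not taken from NVIDIA's actual script):

```shell
#!/bin/sh
# Hypothetical stand-in for a benchmark.sh-style wrapper: turn a
# comma-separated GPU list (e.g. "0,1,2,3") into a tf_cnn_benchmarks
# command line like the ones used earlier in this thread.
build_cmd() {
  gpus="$1"; model="$2"
  # count the entries in "0,1,2,3" -> 4
  ngpus=$(printf '%s\n' "$gpus" | tr ',' '\n' | wc -l | tr -d ' ')
  printf 'python3 tf_cnn_benchmarks.py --model=%s --num_gpus=%s --batch_size=64\n' \
    "$model" "$ngpus"
}

# print the command line for each model in a small sweep
for m in resnet50 alexnet; do
  build_cmd "0,1,2,3" "$m"
done
```

On ROCm, GPU selection is the kind of "minor modification" mentioned above: e.g. exporting HIP_VISIBLE_DEVICES=0,1,2,3 before launching, where a CUDA wrapper would set CUDA_VISIBLE_DEVICES.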
The tf_cnn_benchmark is one way of benchmarking the CNN training and inference. But I think we're going to deprecate it soon.
When you say you're looking for benchmarking on AMD GPU, is there a performance metric that you're looking at? Also do the models need to be in Tensorflow specifically? Could the models be in Pytorch or ONNX?
I tried to run the benchmark script on the AMD GPU platform, but the test failed with an error.
Yes. I am looking for a performance metric when training or running inference with models such as vgg16, ssd300, resnet50, resnet152, inception4, inception3, and alexnet: specifically, the number of images processed per second. The models are not limited to TensorFlow; PyTorch models would also work.
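If images/sec is the metric, tf_cnn_benchmarks already prints it at the end of each run on a summary line of the form "total images/sec: 1234.56", on both CUDA and ROCm builds, so the same harness can collect it on either vendor. A tiny extraction helper as a sketch (the sample log below is fabricated for the demo):

```shell
#!/bin/sh
# tf_cnn_benchmarks ends each run with a summary line such as:
#   total images/sec: 1234.56
# This helper pulls just the number out of a captured log on stdin.
images_per_sec() {
  sed -n 's|^total images/sec: *||p'
}

# demo on a fake two-line log
printf 'step 100\ntotal images/sec: 1234.56\n' | images_per_sec   # prints 1234.56
```

Piping each model's run log through a helper like this gives the per-model images/sec table you described.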
The root cause of the first issue is that the tf_cnn_benchmark in the rocm/tensorflow:latest (rocm5.7.1) might not have the latest commit. You can update it to the latest commit by "git pull". See if that helps. For the second issue, I suspect that it was due to old driver. Could you run "rocm-smi" on the host and post the driver version?
Thank you very much for your help. For the first question, I'll try. For your second problem solution, I checked the driverversion using the rocm-smi --showdriverversion command. The driver version is 5.15.0-91-generic.
Hi, the reported driver version was consistent with my suspicion. We got the same error with driver version 5.13.20.22.10 and didn’t have any issues with driver version 6.2.4 or 6.3.4. Could you try to upgrade your driver to see if helps?
In addition, could you please provide a tool for training and testing like NVIDIA's benchmark, which is run with ./benchmark 0,1,2,3? After the test completes, it reports the results for each model, as shown in the following figure and table:
I am not sure of the command that you were referring to... Was it a script from NVIDIA?
./benchmark 0,1,2,3
Yes, it is a CUDA benchmark tool: pull the TensorFlow image in Docker and use ./benchmark.sh 0,1,2,3 to train and test four GPUs.
I suppose you can also use the script from NVIDIA when running tf_cnn_benchmarks on ROCm (with minor modification).
I know that ROCm HIP can convert CUDA code into HIP code. Could you please provide the installation and usage instructions?
I don't think you need conversion from CUDA code to HIP code here. The benchmark script from NVIDIA probably does not have any CUDA code in it. It is probably a shell script for invoking benchmarking commands and processing the logs.
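To reproduce what such a wrapper script does on ROCm, one option is a small Python driver that invokes tf_cnn_benchmarks.py once per model and extracts the "total images/sec" line that tf_cnn_benchmarks prints. This is a sketch, not the NVIDIA script itself; the script path and flag values are placeholders to adjust for your setup:

```python
import re
import subprocess
from typing import Optional

MODELS = ["resnet50", "resnet152", "alexnet", "vgg16"]
THROUGHPUT_RE = re.compile(r"total images/sec:\s*([\d.]+)")

def parse_throughput(log_text: str) -> Optional[float]:
    """Extract the 'total images/sec' value from a benchmark log, if present."""
    match = THROUGHPUT_RE.search(log_text)
    return float(match.group(1)) if match else None

def run_benchmarks(script: str = "tf_cnn_benchmarks.py", num_gpus: int = 4) -> dict:
    """Run each model and collect its throughput (paths/flags are placeholders)."""
    results = {}
    for model in MODELS:
        proc = subprocess.run(
            ["python", script, f"--model={model}", "--batch_size=8",
             f"--num_gpus={num_gpus}", "--num_batches=100"],
            capture_output=True, text=True)
        results[model] = parse_throughput(proc.stdout)
    return results

if __name__ == "__main__":
    print(parse_throughput("total images/sec: 1234.56"))  # 1234.56
```

The parsing part is the essential piece; the invocation loop is just shell-script logic expressed in Python.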
If you don't have a benchmark for NV, do you have a test for TFLOPS? Or MLPerf?
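For a rough TFLOPS-style check independent of MLPerf, a common sketch is timing a large matrix multiply and dividing its known FLOP count (2*n^3 per n x n matmul) by the elapsed time. The version below measures whatever backend NumPy uses (typically CPU BLAS); on a GPU one would time the framework's device arrays instead:

```python
import time
import numpy as np

def matmul_tflops(n: int = 1024, repeats: int = 5) -> float:
    """Estimate TFLOPS from timed n x n float32 matrix multiplies."""
    a = np.random.rand(n, n).astype(np.float32)
    b = np.random.rand(n, n).astype(np.float32)
    np.dot(a, b)  # warm-up call, excluded from timing
    start = time.perf_counter()
    for _ in range(repeats):
        np.dot(a, b)
    elapsed = time.perf_counter() - start
    flops = 2 * n**3 * repeats
    return flops / elapsed / 1e12

if __name__ == "__main__":
    print(f"~{matmul_tflops():.3f} TFLOPS")
```

This only gives a ballpark figure; MLPerf remains the standardized way to compare training/inference performance across vendors.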
@sunway513 Do you know if we have any public benchmarks for training and inference on MI200?
Issue type
Bug
Have you reproduced the bug with TensorFlow Nightly?
No
Source
source
TensorFlow version
latest
Custom code
No
OS platform and distribution
Linux Ubuntu 22.04.3
Mobile device
Linux Ubuntu 22.04.3
Python version
3.9
Bazel version
No response
GCC/compiler version
gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
CUDA/cuDNN version
No response
GPU model and memory
MI210
Current behavior?
I'm using Ubuntu 22.04.3 with ROCm version 5.7.1.
I want to run python ./tf_cnn_benchmarks.py --forward_only=True --data_name=imagenet --model=resnet50 --num_batches=50000 --batch_size=8 --num_gpus=4 as the inference test for the MI210. In addition, could you please provide the training and testing method for NVIDIA's ./benchmark.sh 0,1,2,3 tool?
For inference testing, I used Docker to pull the rocm/tensorflow:latest image. Running the tf_cnn_benchmarks.py file under /benchmarks/scripts/tf_cnn_benchmarks with Python runs into two problems:
Running python ./tf_cnn_benchmarks.py --forward_only=True --data_name=imagenet --model=resnet50 --num_batches=50000 --batch_size=8 --num_gpus=4 causes keras or keras.api errors when the --model=resnet50 parameter is used: the model can't be found.
The second problem is that the inference test raises tensorflow.python.framework.errors_impl.UnknownError: Failed to query available memory for GPU 0.
Standalone code to reproduce the issue
Relevant log output
No response