ROCm / MIOpen

AMD's Machine Intelligence Library
https://rocm.docs.amd.com/projects/MIOpen/en/latest/

Rather low training performance on "AI Benchmark" with MI100 #2048

Open Epliz opened 1 year ago

Epliz commented 1 year ago

Hi,

To check whether my machine with an MI100 GPU is set up correctly, I ran the "AI Benchmark" from https://ai-benchmark.com/ranking_deeplearning_detailed.html . The inference speed is pretty good, but for some sub-benchmarks the training speed is quite far from where I would expect it to be.

Installation instructions are at https://ai-benchmark.com/alpha.html .

Results I get:

>>   AI-Benchmark-v.0.1.2   
>>   Let the AI Games begin..

*  TF Version: 2.11.0
*  Platform: Linux-5.15.0-67-generic-x86_64-with-glibc2.35
*  CPU: N/A
*  CPU RAM: 31 GB
*  GPU/0: AMD Instinct MI100
*  GPU RAM: 31.0 GB
*  CUDA Version: N/A
*  CUDA Build: N/A

The benchmark is running...
The tests might take up to 20 minutes
Please don't interrupt the script

1/19. MobileNet-V2

1.1 - inference | batch=50, size=224x224: 33.0 ± 1.1 ms
1.2 - training  | batch=50, size=224x224: 2560 ± 12 ms

2/19. Inception-V3

2.1 - inference | batch=20, size=346x346: 39.8 ± 0.9 ms
2.2 - training  | batch=20, size=346x346: 1989 ± 9 ms

3/19. Inception-V4

3.1 - inference | batch=10, size=346x346: 40.3 ± 2.1 ms
3.2 - training  | batch=10, size=346x346: 1482 ± 4 ms

4/19. Inception-ResNet-V2

4.1 - inference | batch=10, size=346x346: 53.0 ± 4.4 ms
4.2 - training  | batch=8, size=346x346: 950 ± 9 ms

5/19. ResNet-V2-50

5.1 - inference | batch=10, size=346x346: 25.5 ± 1.1 ms
5.2 - training  | batch=10, size=346x346: 66.5 ± 0.9 ms

6/19. ResNet-V2-152

6.1 - inference | batch=10, size=256x256: 35.0 ± 2.2 ms
6.2 - training  | batch=10, size=256x256: 108 ± 6 ms

7/19. VGG-16

7.1 - inference | batch=20, size=224x224: 55.9 ± 0.9 ms
7.2 - training  | batch=2, size=224x224: 71.1 ± 0.7 ms

8/19. SRCNN 9-5-5

8.1 - inference | batch=10, size=512x512: 50.7 ± 0.6 ms
8.2 - inference | batch=1, size=1536x1536: 44.8 ± 0.7 ms
8.3 - training  | batch=10, size=512x512: 111 ± 2 ms

9/19. VGG-19 Super-Res

9.1 - inference | batch=10, size=256x256: 52.0 ± 0.9 ms
9.2 - inference | batch=1, size=1024x1024: 86.9 ± 0.7 ms
9.3 - training  | batch=10, size=224x224: 110.3 ± 0.7 ms

10/19. ResNet-SRGAN

10.1 - inference | batch=10, size=512x512: 64.4 ± 0.8 ms
10.2 - inference | batch=1, size=1536x1536: 60.4 ± 0.9 ms
10.3 - training  | batch=5, size=512x512: 90.0 ± 1.3 ms

11/19. ResNet-DPED

11.1 - inference | batch=10, size=256x256: 63.2 ± 0.7 ms
11.2 - inference | batch=1, size=1024x1024: 110.4 ± 0.8 ms
11.3 - training  | batch=15, size=128x128: 93.8 ± 0.9 ms

12/19. U-Net

12.1 - inference | batch=4, size=512x512: 108.6 ± 0.7 ms
12.2 - inference | batch=1, size=1024x1024: 118.6 ± 0.7 ms
12.3 - training  | batch=4, size=256x256: 142.5 ± 0.9 ms

13/19. Nvidia-SPADE

13.1 - inference | batch=5, size=128x128: 53.1 ± 0.9 ms
13.2 - training  | batch=1, size=128x128: 75.2 ± 3.0 ms

14/19. ICNet

14.1 - inference | batch=5, size=1024x1536: 161 ± 3 ms
14.2 - training  | batch=10, size=1024x1536: 426 ± 10 ms

15/19. PSPNet

15.1 - inference | batch=5, size=720x720: 290.8 ± 0.8 ms
15.2 - training  | batch=1, size=512x512: 393 ± 2 ms

16/19. DeepLab

16.1 - inference | batch=2, size=512x512: 62.0 ± 1.3 ms
16.2 - training  | batch=1, size=384x384: 129 ± 6 ms

17/19. Pixel-RNN

17.1 - inference | batch=50, size=64x64: 496 ± 16 ms
17.2 - training  | batch=10, size=64x64: 2764 ± 98 ms

18/19. LSTM-Sentiment

18.1 - inference | batch=100, size=1024x300: 591 ± 27 ms
18.2 - training  | batch=10, size=1024x300: 1589 ± 195 ms

19/19. GNMT-Translation

19.1 - inference | batch=1, size=1x20: 197 ± 6 ms

Device Inference Score: 17625
Device Training Score: 9906
Device AI Score: 27531

For more information and results, please visit http://ai-benchmark.com/alpha

I installed the MIOpen kernels for gfx908 through the package manager. I am on Ubuntu 22.04.2 LTS, ROCm 5.4.3, TensorFlow 2.11.

I would appreciate it if you could tell me whether this is the performance I should expect as of now, or whether some tuning could improve it. Given that the training scores are not that great compared to the inference ones, I feel like something is wrong and training should be faster.

Best regards, Epliz
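
To make the gap in the report above concrete, here is a minimal sketch that parses result lines in the format printed by the benchmark and compares training time per sample with inference time per sample. The helper names and the 10x threshold are hypothetical and chosen only for illustration. The comparison is crude: it is only meaningful for tests where training and inference use the same input size, which holds for the three tests in the sample.

```python
import re

# Matches result lines such as
# "1.2 - training  | batch=50, size=224x224: 2560 ± 12 ms"
RESULT_RE = re.compile(
    r"^(?P<test>\d+)\.\d+ - (?P<kind>inference|training)\s*\|\s*"
    r"batch=(?P<batch>\d+), size=\d+x\d+: (?P<ms>[\d.]+) ± [\d.]+ ms$"
)


def per_sample_ms(log_text):
    """Map test number -> {"inference": ms per sample, "training": ms per sample}."""
    results = {}
    for line in log_text.splitlines():
        match = RESULT_RE.match(line.strip())
        if match is None:
            continue
        ms_per_sample = float(match["ms"]) / int(match["batch"])
        # Keep only the first entry of each kind when a test reports several.
        kinds = results.setdefault(int(match["test"]), {})
        kinds.setdefault(match["kind"], ms_per_sample)
    return results


def training_ratios(log_text):
    """Map test number -> training time per sample / inference time per sample."""
    return {
        test: kinds["training"] / kinds["inference"]
        for test, kinds in per_sample_ms(log_text).items()
        if "training" in kinds and "inference" in kinds
    }


def flag_slow(ratios, threshold=10.0):
    """Return test numbers whose ratio exceeds the threshold (arbitrary, an assumption)."""
    return sorted(test for test, ratio in ratios.items() if ratio > threshold)


# Lines copied from the results posted in this issue (tests 1, 2 and 5).
SAMPLE = """\
1.1 - inference | batch=50, size=224x224: 33.0 ± 1.1 ms
1.2 - training  | batch=50, size=224x224: 2560 ± 12 ms
2.1 - inference | batch=20, size=346x346: 39.8 ± 0.9 ms
2.2 - training  | batch=20, size=346x346: 1989 ± 9 ms
5.1 - inference | batch=10, size=346x346: 25.5 ± 1.1 ms
5.2 - training  | batch=10, size=346x346: 66.5 ± 0.9 ms
"""

ratios = training_ratios(SAMPLE)
for test, ratio in sorted(ratios.items()):
    print(f"test {test}: training/inference per sample = {ratio:.1f}x")
print("flagged:", flag_slow(ratios))
```

On these lines, MobileNet-V2 and Inception-V3 come out at roughly 78x and 50x, while ResNet-V2-50 is at about 2.6x, which is the kind of spread the report describes.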

junliume commented 1 year ago

@JehandadKhan could we provide some way of on-site tuning for the user?
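
One possible shape for such on-site tuning, sketched here as an assumption and not as an official procedure: MIOpen documents a `MIOPEN_FIND_ENFORCE` environment variable whose `SEARCH` (3) and `SEARCH_DB_UPDATE` (4) settings make the library auto-tune kernels and store the results in the user performance database. The helper names below are hypothetical, and whether this actually closes the gap on the MI100 is untested.

```python
import os
import subprocess

# Values of MIOPEN_FIND_ENFORCE as listed in the MIOpen documentation.
FIND_ENFORCE = {
    "NONE": "1",
    "DB_UPDATE": "2",
    "SEARCH": "3",
    "SEARCH_DB_UPDATE": "4",
    "DB_CLEAN": "5",
}


def tuning_env(base_env=None, mode="SEARCH"):
    """Return a copy of the environment with MIOpen auto-tuning requested."""
    base = os.environ if base_env is None else base_env
    env = dict(base)
    env["MIOPEN_FIND_ENFORCE"] = FIND_ENFORCE[mode]
    return env


def run_with_tuning(command, mode="SEARCH"):
    """Run a command (for example the benchmark) once with tuning enabled."""
    return subprocess.run(command, env=tuning_env(mode=mode), check=True)
```

A hypothetical invocation would be `run_with_tuning(["python", "-c", "from ai_benchmark import AIBenchmark; AIBenchmark().run_training()"])`, assuming the `ai_benchmark` package exposes `run_training()` in the installed version. The tuning run itself is expected to be much slower than a normal run; the comparison that matters is a second run afterwards without the variable set.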

Epliz commented 1 year ago

Would it be possible for you to run the benchmark to get a reference? It takes less than 20 minutes to run.

ppanchad-amd commented 1 month ago

@Epliz Apologies for the lack of response. Can you please re-test with the latest ROCm 6.1.2 and check whether the issue still occurs? Thanks!