aime-team / pytorch-benchmarks

A benchmark framework for Pytorch
MIT License

Question about the benchmark result #2

Open knightXun opened 1 month ago

knightXun commented 1 month ago

Dear aime-team, I used your benchmark tool to test PyTorch performance on A100s. According to your published benchmark results, two A100s are about 90%-95% faster than a single A100, but when I ran the tool in my local environment the speedup was only 50%-60%. Could your team provide more information about your benchmarks, such as the hardware configuration and the parameters passed to the benchmark tool?

Thank you!

carlovogel commented 1 month ago

Hi knightXun, thank you for your feedback on our PyTorch benchmark tool. We tried to reproduce your issue and repeated the benchmark with two A100 80GB GPUs on an AIME Server A4000 with an AMD EPYC 7543 CPU and 512 GB RAM.

With PyTorch 2.3.1 and CUDA 12.1 in an AIME ML container, using the default benchmark model resnet50, we got the following results:

1xA100 80GB:

score = 1012 images/s

Command: python3 main.py --num_gpus 1 --benchmark_train --batch_size 768

2xA100 80GB:

score = 1994 images/s

Command: python3 main.py --num_gpus 2 --benchmark_train --batch_size 768

The flag --benchmark_train limits the number of training steps (60 by default) and calculates the mean images per second when finished.
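
For reference, here is a minimal sketch of that idea in isolation: run a fixed number of training steps on synthetic data and report the mean images per second. This is not the actual main.py code; the function name and defaults are made up for illustration, and a batch size of 768 assumes an 80 GB GPU.

```python
import time
import torch
import torch.nn as nn
import torchvision

def measure_train_throughput(num_steps=60, batch_size=768, device="cuda"):
    """Hypothetical helper: mean training throughput of resnet50 in images/s."""
    model = torchvision.models.resnet50().to(device)
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    # Synthetic ImageNet-sized batch so storage/dataloading is not measured.
    images = torch.randn(batch_size, 3, 224, 224, device=device)
    labels = torch.randint(0, 1000, (batch_size,), device=device)

    # One warm-up step so CUDA initialization is not counted.
    optimizer.zero_grad()
    criterion(model(images), labels).backward()
    optimizer.step()
    torch.cuda.synchronize()

    start = time.perf_counter()
    for _ in range(num_steps):
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start

    return num_steps * batch_size / elapsed  # mean images per second

if __name__ == "__main__":
    print(f"{measure_train_throughput():.0f} images/s")
```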

Maybe there is an issue with your NVIDIA driver or CUDA configuration? How much RAM does your machine provide? Your system RAM should be at least the size of your combined GPU VRAM for full performance. Is your CPU performant enough? Please post your configuration so we can help you investigate the issue.
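
To check the points above, something like the following (a hypothetical helper, not part of the benchmark repo) prints the relevant versions and compares system RAM against combined GPU VRAM on Linux:

```python
import torch

# Versions that matter for driver/CUDA debugging.
print("PyTorch:", torch.__version__, "| CUDA build:", torch.version.cuda)
print("CUDA available:", torch.cuda.is_available())

# Combined VRAM of all visible GPUs.
total_vram = 0
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    total_vram += props.total_memory
    print(f"GPU {i}: {props.name}, {props.total_memory / 2**30:.0f} GiB VRAM")

# Linux only: MemTotal in /proc/meminfo is reported in kiB.
with open("/proc/meminfo") as f:
    mem_total_kib = int(next(l for l in f if l.startswith("MemTotal")).split()[1])
print(f"System RAM: {mem_total_kib / 2**20:.0f} GiB, "
      f"combined VRAM: {total_vram / 2**30:.0f} GiB")
```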

Best regards

Carlo Vogel Software Developer @ AIME - HPC Cloud & Hardware

knightXun commented 1 month ago

Here is my hardware information:

CPU: Intel(R) Xeon(R) Gold 6326 @ 2.90GHz, limited to 12 cores via Docker
RAM: 180 GB DDR4 at 3200 MT/s
GPU: 2x nvidia.com/gpu-A100PCIE80GB (A100 PCIe 80GB)
Shm: 180 GiB
Network: InfiniBand
Storage: CephFS on SSD and HDD, 4k random read at ~11k IOPS

My CUDA and GPU driver info: NVIDIA-SMI 525.105.17, Driver Version: 525.105.17, CUDA Version: 12.2

My PyTorch version: 2.3.1

I guess my poor performance is due to the Docker runtime and the limited resources; our container does not have exclusive access to physical CPU cores.
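
For anyone hitting the same thing: a rough way to see the effective CPU budget from inside the container (a sketch, assuming Linux and a Docker cgroup CPU quota) is:

```python
import os

# --cpuset-cpus shows up in the affinity mask, --cpus as a cgroup quota.
print("CPU affinity:", len(os.sched_getaffinity(0)), "cores")

def read_cpu_quota():
    # cgroup v2: /sys/fs/cgroup/cpu.max contains "<quota> <period>" or "max <period>".
    try:
        quota, period = open("/sys/fs/cgroup/cpu.max").read().split()
        return None if quota == "max" else int(quota) / int(period)
    except FileNotFoundError:
        pass
    # cgroup v1 fallback.
    try:
        quota = int(open("/sys/fs/cgroup/cpu/cpu.cfs_quota_us").read())
        period = int(open("/sys/fs/cgroup/cpu/cpu.cfs_period_us").read())
        return None if quota < 0 else quota / period
    except FileNotFoundError:
        return None

limit = read_cpu_quota()
print("cgroup CPU quota:", f"{limit:g} cores" if limit else "none")
```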

carlovogel commented 1 month ago

Yeah, I'm afraid you're right. Our AIME ML containers are also Docker based, but they do not limit resources. Also, we don't have much experience with Intel CPUs since we mainly use AMD, so I can't guarantee good performance with your CPU even without the Docker limitations.