Intel MPI Benchmarks (IMB-MPI1) performance issue? with EFA and Rocky Linux 8 custom image

panda1100 commented 6 months ago

Required Info:

AWS ParallelCluster version [e.g. 3.1.1]: 3.8.0
Full cluster configuration without any credentials or personal data.
- cluster configuration https://rpa.st/AAJQ
- whole procedure https://ciq.com/blog/how-to-use-aws-parallelcluster-3-8-0-with-rocky-linux-8/
Cluster name: rocky8-cluster

Output of pcluster describe-cluster command.

{
"creationTime": "2023-12-20T16:19:01.897Z",
"headNode": {
"launchTime": "2023-12-20T16:23:42.000Z",
"instanceId": "i-******",
"publicIpAddress": "***.***.***.***",
"instanceType": "t2.xlarge",
"state": "running",
"privateIpAddress": "10.0.0.230"
},
"version": "3.8.0",
"clusterConfiguration": {
"url": "******"
},
"tags": [
{
  "value": "3.8.0",
  "key": "parallelcluster:version"
},
{
  "value": "rocky8-cluster",
  "key": "parallelcluster:cluster-name"
}
],
"cloudFormationStackStatus": "CREATE_COMPLETE",
"clusterName": "rocky8-cluster",
"computeFleetStatus": "RUNNING",
"cloudformationStackArn": "arn:aws:cloudformation:ap-northeast-1:******:stack/rocky8-cluster/******",
"lastUpdatedTime": "2023-12-20T16:19:01.897Z",
"region": "ap-northeast-1",
"clusterStatus": "CREATE_COMPLETE",
"scheduler": {
"type": "slurm"
}
}

[Optional] Arn of the cluster CloudFormation main stack:

Bug description and how to reproduce:

Issue
- Intel MPI Benchmark IMB-MPI1 PingPong performance with EFA looks not as expected
- It looks like EFA was used according to the log attached to this ticket
- I would like to confirm if this is the expected range of performance.
- [result 1] with FI_PROVIDER=EFA FI_LOG_LEVEL=Debug srun --mpi=pmix ~/mpi-benchmarks/src_c/IMB-MPI1 -msglog 3:28 PingPong
- [result 2] with FI_PROVIDER=EFA FI_LOG_LEVEL=Debug mpirun --mca pml cm --mca mtl ofi ~/mpi-benchmarks/src_c/IMB-MPI1 -msglog 3:28 PingPong
```
#---------------------------------------------------
# Benchmarking PingPong 
# #processes = 2 
#---------------------------------------------------
#bytes #repetitions      t[usec]   Mbytes/sec
268435456            1     99276.95      2703.91
```
The steps to reproduce the behavior.
- Build Intel MPI Benchmark
- ```
cd ~/
git clone https://github.com/intel/mpi-benchmarks.git
cd mpi-benchmarks/src_c
make all
```
- Submit job. Example job script job.sh https://rpa.st/WBZQ
- Output slurm-*.out https://rpa.st/22UA

jdeamicis commented 6 months ago

Hi, I'm not able to see any of the attachments you provided to the issue.

In addition to this, have you tried running the benchmarks using the installation of OpenMPI provided on the AMI?

panda1100 commented 6 months ago

@jdeamicis My apologies for the inconvenience. I uploaded agin. (It was misconfiguration of expiry date ...) Yes, OpenMPI is /opt/amazon/openmpi.

Thank you

jdeamicis commented 6 months ago

Apologies, I had misread the title of the ticket and automatically assumed you wanted to benchmark Intel MPI on a PC cluster :)

I can now see the attachments, thanks.

jdeamicis commented 6 months ago

Could you please repeat the experiment increasing the number of iterations at large message sizes? You should be able to control it via the -time or the -iter_policy options of the IMB benchmarks.

jdeamicis commented 6 months ago

Also, what happens if you use a parallel transfer benchmark such as the IMB1 Uniband or the OSU bw_mbw?

panda1100 commented 6 months ago

@jdeamicis Thank you, this is PingPong result. Tried -iter 10 and -iter 30 and results are almost identical.

IMB-MPI1 -msglog 3:28 -iter_policy off -iter 10 PingPong https://rpa.st/RZVA

panda1100 commented 6 months ago

I have issue with IMB-MPI1 Uniband, get back to here once it's solved.. https://rpa.st/UJZQ

panda1100 commented 6 months ago

If I increased --ntasks-per-node to more than 2, PingPong (single transfer benchmark) results gets better...

#!/bin/bash
#SBATCH --partition=c5n18
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8

https://rpa.st/ZC2A

jdeamicis commented 6 months ago

If I increased --ntasks-per-node to more than 2, PingPong (single transfer benchmark) results gets better...
#!/bin/bash
#SBATCH --partition=c5n18
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8
https://rpa.st/ZC2A

Depending on the task distribution settings used in your job, you may be using 2 processes on the same node here, so this may be a shared memory transfer rather than over EFA. Only parallel transfer benchmarks like Uniband and the OSU bw_mbw can really exploit multiple pairs of ranks (or you could use some collectives), but please make sure you are communicating across nodes and not within nodes!

jdeamicis commented 6 months ago

Another thing: could you please try using the OSU benchmarks to exclude any (unlikely) issue with the IMB benchmarks? Thanks!

hgreebe commented 6 months ago

I ran the Intel MPI Benchmarks and got similar results: https://rpa.st/AU5A

When I ran the osu_bw benchmark I got better performance: https://rpa.st/W7LQ

So the difference in performance is either something about how the applications were compiled, or something within the actual applications.

panda1100 commented 6 months ago

@jdeamicis Thank you, good point. Will do OSU, my apologies for the delay. @hgreebe Thank you. Could you please paste again with more longer life time (rpaste --life 1week) or use this https://rpa.st/ with Expiry (forever).

panda1100 commented 6 months ago

@jdeamicis OSU mbw_mr results srun --mpi=pmix ~/osu-micro-benchmarks-5.6.2/mpi/pt2pt/osu_mbw_mr 2 nodes 2 pair https://rpa.st/YLSA 2 nodes 8 pair https://rpa.st/W3TQ

hgreebe commented 6 months ago

Intel MPI Benchmarks: https://rpa.st/U5WQ

OSU Benchmarks: https://rpa.st/X3ZQ

jdeamicis commented 6 months ago

@panda1100 OK, it seems to me that the difference we are seeing is related to the type of MPI communication used in the two benchmarks: IMB PingPong (and osu_latency) use blocking communication, while osu_bw (and IMB Uniband) use non-blocking communication. Some difference is expected, but I personally wasn't expecting that much. This should be further investigated.

panda1100 commented 6 months ago

Thank you @jdeamicis . Please let me know if I can help on this.

This is OSU benchmarks osu_bw result. (previous one I used osu_mbw_mr

OSU Benchmarks (osu_bw) https://rpa.st/H4NQ

Intel MPI Benchmarks (pingpong) https://rpa.st/RZVA

aws / aws-parallelcluster

Intel MPI Benchmarks (IMB-MPI1) performance issue? with EFA and Rocky Linux 8 custom image #5995