Open xin3liang opened 2 months ago
Software versions:
OS: openEuler 22.03 LTS SP3, kernel 5.10.0-192.0.0.105.oe2203sp3
Lustre: 2.15.4
io500: io500-isc24_v3, master
openMPI: v4.1.x branch, 4.1.7a1
UCX: 1.16.0

Lustre cluster:
network: 100Gb IB
filesystem_summary: 54.9T
client_num: 9, cores_per_node: 16, np: 144
mdt: 24, ost: 96
Client(s): 9: client1,client10,client11,client13,client14,client2,client3,client4,client5
Server(s): 6: server1,server2,server3,server4,server5,server6
mgs nodes: 1: server1
mdts nodes: 6: server1 server2 server3 server4 server5 server6
osts nodes: 6: server1 server2 server3 server4 server5 server6
It looks like the MPIIO run wrote about 860 GiB in 355s:
Using actual aggregate bytes moved = 923312332800
And the POSIX run wrote about 7153 GiB in 1897s:
Using actual aggregate bytes moved = 7680164783616
So the POSIX run wrote 8.3x as much data in 5.35x as much time, which means its bandwidth was about 1.55x as high.
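The ratios above can be checked directly from the two "actual aggregate bytes moved" values and run times quoted from the ior output:

```python
# Recompute the ratios from the ior output quoted above.
mpiio_bytes, mpiio_secs = 923312332800, 355
posix_bytes, posix_secs = 7680164783616, 1897

GiB = 2**30
print(f"MPIIO: {mpiio_bytes / GiB:.0f} GiB in {mpiio_secs}s "
      f"-> {mpiio_bytes / mpiio_secs / GiB:.2f} GiB/s")
print(f"POSIX: {posix_bytes / GiB:.0f} GiB in {posix_secs}s "
      f"-> {posix_bytes / posix_secs / GiB:.2f} GiB/s")
print(f"data ratio {posix_bytes / mpiio_bytes:.2f}x, "
      f"time ratio {posix_secs / mpiio_secs:.2f}x, "
      f"bandwidth ratio "
      f"{(posix_bytes / posix_secs) / (mpiio_bytes / mpiio_secs):.2f}x")
```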
I've always thought that MPIIO collective IO should be faster for the ior-hard-write phase, but for some reason it is not. That is far outside my area of expertise, so I can't speculate why it would be slower, but it should be aggregating the IO into large chunks and writing them linearly to the storage.
OK, this explains the question, thanks a lot @adilger.
I thought they both should finish the ior test and write the same file size (Expected aggregate file size = 67691520000000), but the ior tests can be stopped early, likely caused by deadlineForStonewalling.
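Comparing the expected aggregate file size against the actual bytes moved (both taken from the ior output in this thread) shows that neither run came close to finishing; the stonewall deadline stopped both:

```python
# How far each run got toward the expected file size before the
# stonewall deadline cut it off (numbers from the ior output above).
expected = 67691520000000      # Expected aggregate file size
mpiio_actual = 923312332800    # MPIIO: actual aggregate bytes moved
posix_actual = 7680164783616   # POSIX: actual aggregate bytes moved

print(f"MPIIO wrote {100 * mpiio_actual / expected:.1f}% of the expected size")
print(f"POSIX wrote {100 * posix_actual / expected:.1f}% of the expected size")
```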
Yes, according to my tests. For a small-scale cluster (1Gb TCP network, several HDD OSTs/MDTs), the ior-hard-write bandwidth results rank POSIX < ROMIO < OMPIO. But for a large-scale cluster (100Gb IB network, dozens of NVMe OSTs/MDTs), the bandwidth results rank POSIX > ROMIO >> OMPIO.
It seems the communication overhead of the MPI processes matters. Some research may explain why: http://aturing.umcs.maine.edu/~phillip.dickens/pubs/Poster1.doc.pdf https://phillipmdickens.github.io/pubs/paper1.pdf
The reason for the time and performance difference is that with MPI-IO the I/O is synchronized. The benchmark runs on each process independently for 300s, then the processes exchange information about how many I/Os were done. In the final stage each process writes up to the same number of I/Os. This is the stonewalling feature with wear-out.
With MPI-IO, all processes have completed the same number of I/Os at every iteration, so they finish quickly after 300s. With independent I/O, some processes might be much faster than others, leading to a long wear-out period.
MPI-IO performance with collectives is only good in some cases, unfortunately.
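The stonewalling-with-wear-out behavior described above can be illustrated with a toy model (all numbers and the rank-speed spread are made up for illustration; this is not how ior is implemented):

```python
import random

def simulate_stonewall(num_procs=144, deadline_s=300.0, io_time_s=0.01,
                       synchronized=False, seed=0):
    """Toy model of stonewalling with wear-out (illustrative numbers only).

    Each process issues I/Os until the deadline; per-I/O cost varies by
    process unless `synchronized` (the collective case, where all ranks
    move in lockstep at the pace of the slowest rank).
    Returns the total wall time in seconds.
    """
    rng = random.Random(seed)
    # Per-process I/O cost: fast and slow ranks differ by up to 4x.
    costs = [io_time_s * rng.uniform(1.0, 4.0) for _ in range(num_procs)]
    if synchronized:
        costs = [max(costs)] * num_procs  # collectives pace everyone together
    # Phase 1: each rank makes independent progress until the deadline.
    done = [int(deadline_s / c) for c in costs]
    # Phase 2 (wear-out): every rank catches up to the fastest rank's count.
    target = max(done)
    wearout = max((target - d) * c for d, c in zip(done, costs))
    return deadline_s + wearout

# Collective I/O: all ranks hit the same count, so wear-out is ~0 and the
# run ends right at the deadline. Independent I/O: slow ranks keep writing
# long past the deadline to match the fastest rank's I/O count.
```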
Thanks, @JulianKunkel, for the explanation, I have a clearer understanding now.
For independent I/O, after running 300s, all the processes still need to wait for the syncs to flush data to disk, right? Because the ior test specifies the option -e (fsync: perform fsync upon POSIX write close). So the total time depends on how long the syncs take to finish?
With -e fsync, it depends on the file system. If it does not have a client-side write cache - and your Lustre shouldn't - then the fsync doesn't add much: the data was already transferred to the servers during each I/O.
When I run ior-hard-write with API MPIIO+collective (openMPI ROMIO) with the same np=144, I find that although the running time is reduced a lot, the bandwidth is smaller than the POSIX API result.
result (API: MPIIO+collective)
result (API: POSIX)
Why is the running time better but the bandwidth worse?