NCAR / ParallelIO

A high-level Parallel I/O Library for structured grid applications
Apache License 2.0

Questions Raised by pioperf #1893

rjdave opened this issue 2 years ago

rjdave commented 2 years ago

I have been testing PIO 2.5.4 in the ROMS ocean model for a while now. Late last week I started benchmarking the cluster I'm working on with the tests/performance/pioperf test provided by PIO. I have only tried generated data since the Subversion repository mentioned in tests/performance/Pioperformance.md is password protected. This required a switch to building with cmake instead of autotools (#1892), but the results I'm getting seem fairly in line with what I'm seeing in my PIO-enabled ROMS ocean model. My ROMS build uses PIO 2.5.4 configured with autotools without timing enabled, but with all compilers, libraries, and other options the same as the cmake build.

I am running on 3 nodes of a research cluster. Each node has dual 16-core Intel Skylake processors connected by Infiniband HDR (100Gb/s) adapters and storage is provided by IBM Spectrum Scale (GPFS). Below is my pioperf.nl:

&pioperf
 decompfile = 'BLOCK',
 pio_typenames = 'pnetcdf' 'netcdf' 'netcdf4c' 'netcdf4p'
 rearrangers = 1,2
 nframes = 10
 nvars = 8
 niotasks = 6
 varsize = 100000
/

And the results are:

 (t_initf) Read in prof_inparm namelist from: pioperf.nl
  Testing decomp: BLOCK
 iotype=           1  of            4
 RESULT: write       BOX         1         6         8     1343.6480252327
  RESULT: read       BOX         1         6         8     6796.6537469962
 RESULT: write    SUBSET         1         6         8     2037.0848491243
  RESULT: read    SUBSET         1         6         8     2213.6412788202
 iotype=           2  of            4
 RESULT: write       BOX         2         6         8      878.4271379151
  RESULT: read       BOX         2         6         8     1686.7799257658
 RESULT: write    SUBSET         2         6         8      835.9381757852
  RESULT: read    SUBSET         2         6         8     1702.5454246362
 iotype=           3  of            4
 RESULT: write       BOX         3         6         8     1007.6473058227
  RESULT: read       BOX         3         6         8     2030.3052886453
 RESULT: write    SUBSET         3         6         8      942.8001105710
  RESULT: read    SUBSET         3         6         8     2156.0752216195
 iotype=           4  of            4
 RESULT: write       BOX         4         6         8      223.5714932068
  RESULT: read       BOX         4         6         8     2925.2752645271
 RESULT: write    SUBSET         4         6         8      232.4932193447
  RESULT: read    SUBSET         4         6         8     4293.8078647335

As you can see, the slowest write time is for parallel NetCDF4/HDF5 files. On this system, HDF5 v1.10.6, NetCDF4 v4.7.4, and PNetCDF v1.12.2 are configured and built by me with the Intel compiler and MPI (v19.1.5).

I also have access to a second research cluster with dual 20-core Intel Skylake processors connected by Infiniband HDR (100Gb/s) adapters and Lustre storage. Not quite apples to apples, but fairly close. On this machine, HDF5 v1.10.6, NetCDF4 v4.7.4, and PNetCDF v1.12.1 are all configured and built with Intel 2020 and Intel MPI by the system administrators. Here are the results on that system with the same pioperf.nl:

(t_initf) Read in prof_inparm namelist from: pioperf.nl
  Testing decomp: BLOCK
 iotype=           1  of            4
 RESULT: write       BOX         1         6         8     1267.8699411769
  RESULT: read       BOX         1         6         8     1091.6380573150
 RESULT: write    SUBSET         1         6         8     1235.7559619398
  RESULT: read    SUBSET         1         6         8     1412.1266585430
 iotype=           2  of            4
 RESULT: write       BOX         2         6         8      392.0196185289
  RESULT: read       BOX         2         6         8     1763.8134047182
 RESULT: write    SUBSET         2         6         8      397.0943986050
  RESULT: read    SUBSET         2         6         8     1830.4833729873
 iotype=           3  of            4
 RESULT: write       BOX         3         6         8      553.0218402955
  RESULT: read       BOX         3         6         8     3070.8227757982
 RESULT: write    SUBSET         3         6         8      537.8703873321
  RESULT: read    SUBSET         3         6         8     3111.6566202294
 iotype=           4  of            4
 RESULT: write       BOX         4         6         8      300.8776448667
  RESULT: read       BOX         4         6         8     3015.9222552535
 RESULT: write    SUBSET         4         6         8      348.4993834234
  RESULT: read    SUBSET         4         6         8     3060.4128763664

All tests were run at least five times on each cluster. I did not average them, but the runs shown are consistent with the other runs on each system. You can see that both clusters perform pretty well with pnetcdf (iotype=1) and pretty poorly with parallel writes using the NetCDF4/HDF5 library (iotype=4). Obviously, there are other intriguing differences here, but I would like to focus on the poor parallel writing speeds for NetCDF4/HDF5. Even compared to serial writes with NetCDF4/HDF5 (iotype=3), the parallel writing is slower.

Does anyone have any insights as to what may be happening here?

jedwards4b commented 2 years ago

I can't say much except that this is consistent with my own experience.

edwardhartnett commented 2 years ago

That is happening because PIO automatically turns on zlib compression for data in netCDF/HDF5 files. That's quite slow.

Using the new netCDF integration feature, you can use PIO with the netCDF APIs, and it does not automatically turn on compression - you must explicitly turn it on for each variable in the netCDF API. In that case, you will see much faster write times for netCDF/HDF5 files.
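For reference, a minimal sketch of what turning compression on per variable looks like with the plain netCDF-C API (illustrative names; this is the generic netCDF call, not a PIO-specific one):

```c
#include <netcdf.h>

/* Illustrative: enable zlib for one variable. Must be called in define
 * mode, before the variable is first written. shuffle usually helps zlib. */
int enable_deflate(int ncid, int varid)
{
    int shuffle = 1;   /* byte-shuffle filter on            */
    int deflate = 1;   /* zlib on                           */
    int level   = 1;   /* 1 = fastest ... 9 = smallest file */
    return nc_def_var_deflate(ncid, varid, shuffle, deflate, level);
}
```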

I am presenting a paper at the AGU about compression; here's a graph that illustrates how much zlib impacts performance: [graph: write rates by compression setting]

Note how large the write rate is for compression = "none".

jedwards4b commented 2 years ago

@edwardhartnett although this is true for iotype=3, I don't think it's the case for iotype=4.

rjdave commented 2 years ago

It does not appear that any of the modes use compression. When I run ncdump -hs on each of the output files, none of them has a _DeflateLevel attribute.

edwardhartnett commented 2 years ago

OK, sorry, you are quite right. So why so slow?

Are the chunksizes set to match the chunks of data being written?
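As a hedged aside, the same settings that ncdump -hs reports below can also be queried in code with the netCDF-C inquiry calls (error handling omitted):

```c
#include <stdio.h>
#include <netcdf.h>

/* Illustrative: print what ncdump -hs shows as _Storage, _ChunkSizes
 * and _DeflateLevel for one variable. */
void report_var_layout(int ncid, int varid, int ndims)
{
    int storage, shuffle, deflate, level;
    size_t chunks[NC_MAX_VAR_DIMS];

    nc_inq_var_chunking(ncid, varid, &storage, chunks);
    nc_inq_var_deflate(ncid, varid, &shuffle, &deflate, &level);

    printf("storage=%s deflate=%d level=%d chunks:",
           storage == NC_CHUNKED ? "chunked" : "contiguous", deflate, level);
    for (int i = 0; i < ndims; i++)
        printf(" %zu", chunks[i]);
    printf("\n");
}
```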

rjdave commented 2 years ago

When I run the test built with #define VARINT 1 or #define VARREAL 1, the chunks are 1/10 of a record in size:

netcdf pioperf.1-0006-3 {
dimensions:
        dim000001 = 9600000 ;
        time = UNLIMITED ; // (10 currently)
variables:
        int vari0001(time, dim000001) ;
                vari0001:_FillValue = -2147483647 ;
                vari0001:_Storage = "chunked" ;
                vari0001:_ChunkSizes = 1, 960000 ;
                vari0001:_Endianness = "little" ;
                vari0001:_NoFill = "true" ;
...

When I switch to #define VARDOUBLE 1 it's approximately 1/19 of a record:

netcdf pioperf.1-0006-4 {
dimensions:
        dim000001 = 9600000 ;
        time = UNLIMITED ; // (10 currently)
variables:
        double vard0001(time, dim000001) ;
                vard0001:_FillValue = 9.96920996838687e+36 ;
                vard0001:_Storage = "chunked" ;
                vard0001:_ChunkSizes = 1, 505264 ;
                vard0001:_Endianness = "little" ;
                vard0001:_NoFill = "true" ;
...

It also might be worth noting that with doubles the write speed for iotype 4 goes from the low-to-mid 200s to the mid-to-high 500s. Still by far the slowest iotype, but better.

edhartnett commented 2 years ago

Try making the chunksize for the first dimension greater than 1, and the chunksize for the second dimension smaller. Chunks do better when they are more square shaped.

Also, what is the write pattern of each processor? That would be the best chunksize...
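A sketch of that suggestion against the 10 x 9,600,000 variables shown above, with made-up chunk numbers (nc_def_var_chunking has to be called in define mode, before the variable is first written):

```c
#include <netcdf.h>

/* Illustrative only: instead of the default 1 x 960000 chunks, group
 * several time steps per chunk and shrink the big spatial dimension. */
int set_squarer_chunks(int ncid, int varid)
{
    size_t chunks[2] = { 10,      /* all 10 frames in one chunk         */
                         96000 }; /* 1/100 of the 9,600,000 dimension   */
    return nc_def_var_chunking(ncid, varid, NC_CHUNKED, chunks);
}
```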

jedwards4b commented 2 years ago

Is there anything we can do to make a better guess about chunksize within the parallelIO library? Perhaps there is information in the decomp we can use to improve netcdf4 parallel performance?

edwardhartnett commented 2 years ago

It is a hard problem. Default chunksizes are chosen by the netcdf-c library, and it's very hard to choose good ones. Essentially, the programmer must match the chunksizes to their IO.

If you have a bunch of processors each writing slices of data of size X, Y, Z - then X, Y, Z is a good chunksize. But how am I going to guess that with just the information in netcdf metadata? There is no clue.

Using the decomp is a good idea to come up with a different set of chunksizes, but I don't have time to look at that - I've just taken over NOAA's GRIB libraries and there is so much to do there...
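To make the X, Y, Z point concrete with the numbers from this issue (my arithmetic, not verified against pioperf internals): 3 nodes x 32 cores = 96 compute tasks, and 96 x varsize 100000 gives the 9,600,000-element dimension above; with niotasks = 6 and the BOX rearranger, each I/O task should write one contiguous 1,600,000-element slice per frame, so a chunk shape matching that write pattern would be roughly:

```c
#include <netcdf.h>

/* Hypothetical helper: pick chunksizes that match one I/O task's write,
 * assuming each of `niotasks` tasks writes a contiguous slice of the
 * second dimension for one time step at a time. */
int chunk_to_match_io(int ncid, int varid, size_t dimlen, size_t niotasks)
{
    size_t chunks[2] = { 1, dimlen / niotasks }; /* e.g. 1 x 1600000 */
    return nc_def_var_chunking(ncid, varid, NC_CHUNKED, chunks);
}
```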

jedwards4b commented 2 years ago

In PIO 2.5.5 we have added a Fortran interface to PIOc_write_nc_decomp and PIOc_read_nc_decomp, and I have written a program to translate decomps in the old text format to the new netCDF format. I would like to store these files someplace that is publicly accessible instead of on the cgd Subversion server - any suggestions as to where would be the best place?