rjdave opened this issue 2 years ago
I can't say much except that this is consistent with my own experience.
That is happening because PIO automatically turns on zlib compression for data in netCDF/HDF5 files. That's quite slow.
Using the new netCDF integration feature, you can use PIO with the netCDF APIs, and it does not automatically turn on compression - you must explicitly turn it on for each variable in the netCDF API. In that case, you will see much faster write times for netCDF/HDF5 files.
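For context, a minimal sketch (not from this thread) of what explicitly turning compression on for one variable looks like through the netCDF C API; the file and variable names are illustrative:

```c
/* Minimal sketch: with the netCDF APIs, compression stays off unless you
 * request it per variable. File/variable names are illustrative.
 * Build with something like: cc demo.c -lnetcdf */
#include <stdio.h>
#include <netcdf.h>

#define CHECK(e) do { int rc_ = (e); if (rc_) { \
    fprintf(stderr, "%s\n", nc_strerror(rc_)); return rc_; } } while (0)

int main(void)
{
    int ncid, dimid, varid;

    CHECK(nc_create("demo.nc", NC_NETCDF4 | NC_CLOBBER, &ncid));
    CHECK(nc_def_dim(ncid, "x", 9600000, &dimid));
    CHECK(nc_def_var(ncid, "vari0001", NC_INT, 1, &dimid, &varid));

    /* Explicitly enable shuffle + zlib level 1 (the fastest deflate level);
     * omit this call and the variable is written uncompressed. */
    CHECK(nc_def_var_deflate(ncid, varid, /*shuffle*/ 1, /*deflate*/ 1,
                             /*deflate_level*/ 1));

    CHECK(nc_close(ncid));
    return 0;
}
```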
I am presenting a paper at the AGU about compression; here's a graph that illustrates how much zlib impacts performance:
Note how large the write rate is for compression = "none".
@edwardhartnett although this is true for iotype=3, I don't think it's the case for iotype=4.
It does not appear that any of the modes use compression. When I `ncdump -hs` each of the output files, none of them have a `_DeflateLevel` attribute.
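The same check can also be done programmatically; a small sketch, with the path and variable name as placeholders:

```c
/* Sketch: query a variable's deflate settings with the netCDF C API.
 * Path and variable name are placeholders; error handling is omitted. */
#include <netcdf.h>

static int deflate_level_of(const char *path, const char *varname)
{
    int ncid, varid, shuffle = 0, deflate = 0, level = 0;

    nc_open(path, NC_NOWRITE, &ncid);
    nc_inq_varid(ncid, varname, &varid);
    nc_inq_var_deflate(ncid, varid, &shuffle, &deflate, &level);
    nc_close(ncid);

    return deflate ? level : 0;   /* 0 means the variable is not compressed */
}
```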
OK, sorry, you are quite right. So why so slow?
Are the chunksizes set to match the chunks of data being written?
When I run the test built with `#define VARINT 1` or `#define VARREAL 1`, the chunks are 1 record in size:
```
netcdf pioperf.1-0006-3 {
dimensions:
        dim000001 = 9600000 ;
        time = UNLIMITED ; // (10 currently)
variables:
        int vari0001(time, dim000001) ;
                vari0001:_FillValue = -2147483647 ;
                vari0001:_Storage = "chunked" ;
                vari0001:_ChunkSizes = 1, 960000 ;
                vari0001:_Endianness = "little" ;
                vari0001:_NoFill = "true" ;
...
```
When I switch to `#define VARDOUBLE 1`, it's approximately 1/19 of a record:
```
netcdf pioperf.1-0006-4 {
dimensions:
        dim000001 = 9600000 ;
        time = UNLIMITED ; // (10 currently)
variables:
        double vard0001(time, dim000001) ;
                vard0001:_FillValue = 9.96920996838687e+36 ;
                vard0001:_Storage = "chunked" ;
                vard0001:_ChunkSizes = 1, 505264 ;
                vard0001:_Endianness = "little" ;
                vard0001:_NoFill = "true" ;
...
```
It might also be worth noting that the write speed for iotype 4 goes from the low-to-mid 200s to the mid-to-high 500s. It is still by far the slowest iotype, but better.
Try making the chunksize for the first dimension greater than 1 and the chunksize for the second dimension smaller; chunks do better when they are more square shaped.
Also, what is the write pattern of each processor? That would be the best chunksize...
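For example, a rough sketch of overriding the default chunk sizes through the netCDF C API; the numbers are purely illustrative, and the right values depend on the write pattern:

```c
/* Sketch: set explicit chunk sizes for the 2-D record variable shown above.
 * nc_def_var_chunking() must be called after nc_def_var() and before the
 * first write; the sizes below are illustrative, not a tuned recommendation. */
#include <netcdf.h>

static int set_chunks(int ncid, int varid)
{
    /* A few records along the unlimited (time) dimension and a narrower
     * slice along the data dimension, ideally matching what each rank writes. */
    size_t chunksizes[2] = { 4, 300000 };

    return nc_def_var_chunking(ncid, varid, NC_CHUNKED, chunksizes);
}
```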
Is there anything we can do to make a better guess about chunksize within the parallelIO library? Perhaps there is information in the decomp we can use to improve netcdf4 parallel performance?
It is a hard problem. Default chunksizes are chosen by the netcdf-c library, and it's very hard to choose good ones. Essentially, the programmer must match the chunksizes to their IO.
If you have a bunch of processors each writing slices of data of size X, Y, Z - then X, Y, Z is a good chunksize. But how am I going to guess that with just the information in netcdf metadata? There is no clue.
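For illustration only, a hypothetical helper (not PIO code) that turns the per-task write extents into proposed chunk sizes, following the "chunk = what each process writes" idea:

```c
#include <stddef.h>

/* Hypothetical heuristic: propose chunk sizes from the extent of the slab
 * each I/O task writes (local_count), clamped to the global dimension
 * lengths (gdims). Names and logic are illustrative, not part of PIO. */
static void propose_chunksizes(int ndims, const size_t *gdims,
                               const size_t *local_count, size_t *chunksizes)
{
    for (int d = 0; d < ndims; d++)
    {
        size_t c = local_count[d];

        /* Fall back to the full dimension when the local extent is unknown
         * or larger than the dimension itself. */
        if (c == 0 || c > gdims[d])
            c = gdims[d];

        chunksizes[d] = c;
    }
}
```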
Using the decomp is a good idea to come up with a different set of chunksizes, but I don't have time to look at that - I've just taken over NOAA's GRIB libraries and there is so much to do there...
In pio2.5.5 we have added a Fortran interface to PIOc_write_nc_decomp and PIOc_read_nc_decomp, and I have written a program to translate decomps from the old text format to the new netCDF format. I would like to store these files someplace that is publicly accessible instead of on the cgd subversion server - any suggestions as to where would be the best place?
I have been testing PIO 2.5.4 in the ROMS ocean model for a while now. Late last week I started testing the cluster I'm working on with the tests/performance/pioperf test provided by PIO. I have only tried with generated data since the Subversion repository mentioned in tests/performance/Pioperformance.md is password protected. This required a switch to building with cmake instead of autotools (#1892), but the results I'm getting seem fairly in line with what I'm seeing in my PIO-enabled ROMS ocean model. My ROMS model uses PIO 2.5.4 configured with autotools without timing enabled, but with all compilers, libraries, and other options the same as the cmake build.
I am running on 3 nodes of a research cluster. Each node has dual 16-core Intel Skylake processors connected by Infiniband HDR (100Gb/s) adapters, and storage is provided by IBM Spectrum Scale (GPFS). Below is my pioperf.nl:
And the results are:
As you can see, the slowest write time is for parallel NetCDF4/HDF5 files. On this system, HDF5 v1.10.6, NetCDF4 v4.7.4, and PNetCDF v1.12.2 are configured and built by me with the Intel compiler and MPI (v19.1.5).
I also have access to a second research cluster with dual 20-core Intel Skylake processors connected by Infiniband HDR (100Gb/s) adapters and Lustre storage. Not quite apples to apples, but fairly close. On this machine, HDF5 v1.10.6, NetCDF4 v4.7.4, and PNetCDF 1.12.1 are all configured and built with Intel 2020 and Intel MPI by the system administrators. Here are the results on that system with the same pioperf.nl:
All tests were run at least five times on each cluster. I did not average them, but the runs shown are consistent with the other runs on each system. You can see that both clusters perform pretty well with PnetCDF (iotype=1) and pretty poorly with parallel writes using the NetCDF4/HDF5 library (iotype=4). Obviously, there are other intriguing differences here, but I would like to focus on the poor parallel writing speeds for NetCDF4/HDF5. Even compared to serial writes with NetCDF4/HDF5 (iotype=3), the parallel writing is slower.
Does anyone have any insights as to what may be happening here?