Unidata / netcdf-c

Official GitHub repository for netCDF-C libraries and utilities.
BSD 3-Clause "New" or "Revised" License

Speed up NCDEFAULT_get/put_vars code #1381

Open DennisHeimbigner opened 5 years ago

DennisHeimbigner commented 5 years ago

It occurs to me that we should consider improving the performance of NCDEFAULT_get/put_vars since they are still being used for some dispatch tables.

The current implementations operate by reading/writing one element at a time, which is seriously inefficient.

We should explore some alternative implementations, such as the following (a rough sketch appears after the list):

  1. Instead of reading one element at a time, we allocate a larger block of memory that covers multiple strided elements at a time.
  2. The stride code then reads one of these blocks and extracts the relevant strided elements.
  3. The cost is the memory for the block (allocated per call to get_vars); should we keep a block cache?
  4. For writing, we would read the block, insert the strided elements, and then write out the whole block.
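
A minimal 1-D sketch of steps (1) and (2); `read_contiguous()` and `BLOCK_ELEMS` are hypothetical stand-ins, and a real implementation would have to handle multiple dimensions:

```c
#include <stdlib.h>
#include <string.h>

#define BLOCK_ELEMS 4096   /* elements per block; tunable */

/* Hypothetical low-level read: fetch `count` contiguous elements
 * starting at element index `start` into `buf`. */
extern int read_contiguous(size_t start, size_t count, void *buf);

/* Read `count` elements at the given stride, one block at a time,
 * instead of one element at a time. */
static int
get_vars_blocked(size_t start, size_t count, size_t stride,
                 size_t elemsize, void *dst)
{
    char *out = dst;
    char *block = malloc((size_t)BLOCK_ELEMS * elemsize);
    size_t done = 0;

    if (block == NULL)
        return -1;
    while (done < count) {
        /* How many strided elements fit in one block? */
        size_t n = (BLOCK_ELEMS + stride - 1) / stride;
        if (n > count - done)
            n = count - done;
        /* Read the contiguous span covering those n elements... */
        size_t span = (n - 1) * stride + 1;
        int err = read_contiguous(start + done * stride, span, block);
        if (err) { free(block); return err; }
        /* ...and copy out every stride-th element. */
        for (size_t i = 0; i < n; i++)
            memcpy(out + (done + i) * elemsize,
                   block + i * stride * elemsize, elemsize);
        done += n;
    }
    free(block);
    return 0;
}
```

Note that when the stride exceeds the block size this degenerates gracefully to one read per element, which is no worse than the current code.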

Anyone have any other suggestions for speeding this up?

wkliao commented 5 years ago

This strategy is referred to as "data sieving" in MPI-IO. https://www.mcs.anl.gov/~thakur/papers/romio-coll.pdf

DennisHeimbigner commented 5 years ago

Thanks for the pointer.

edhartnett commented 5 years ago

I believe the only layer using the default vars is the HDF4 layer, which, as I recall, does not have a vars-style function. It only uses the read path, since HDF4 is read-only for netCDF.

DennisHeimbigner commented 5 years ago

I think the libsrc and, indirectly, the libdap2 dispatchers also use the default. I do not know about pnetcdf. [Added: the user-defined dispatch tests also use it, which probably means that many UDF implementations will as well.]

wkliao commented 5 years ago

The PnetCDF driver in NetCDF calls the PnetCDF library directly, so no.

DennisHeimbigner commented 5 years ago

How does pnetcdf implement get_vars? Can we profitably use that code for our default implementation?

wkliao commented 5 years ago

Remember, PnetCDF calls MPI-IO :) So it relies on MPI-IO to do data sieving. If you like, I can point you to the code in ROMIO that does the sieving.

DennisHeimbigner commented 5 years ago

Oh right. That pointer into ROMIO might be helpful, thanks.

wkliao commented 5 years ago

First, be warned: the data sieving code in ROMIO is not for the faint of heart :)

In the MPICH GitHub repo, the file src/mpi/romio/adio/common/ad_write_str.c implements write requests to contiguous or noncontiguous file space. Data sieving is used for the noncontiguous case. The temporary buffer used for the read-modify-write is named writebuf, while the user's write buffer is buf. The whole read-modify-write runs from line 341 to line 455. The kernel is a C macro named ADIOI_BUFFERED_WRITE, defined at line 11.

The read-side data sieving is in the file src/mpi/romio/adio/common/ad_read_str.c.
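
Stripped of all the ROMIO machinery, the read-modify-write pattern for a sieved write is roughly the following. This is a hypothetical 1-D, byte-offset sketch, not ROMIO's actual code; `read_range()` and `write_range()` are stand-ins for the underlying file I/O:

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical byte-range I/O helpers. */
extern int read_range(size_t offset, size_t len, void *buf);
extern int write_range(size_t offset, size_t len, const void *buf);

/* Write `count` elements at a regular byte stride by reading the whole
 * covering range once, patching in the new elements, and writing the
 * range back (cf. ROMIO's writebuf / ADIOI_BUFFERED_WRITE). */
static int
put_vars_sieved(size_t first_byte, size_t count, size_t stride_bytes,
                size_t elemsize, const void *src)
{
    size_t span = (count - 1) * stride_bytes + elemsize;
    char *sieve_buf = malloc(span);
    const char *in = src;
    int err;

    if (sieve_buf == NULL)
        return -1;
    err = read_range(first_byte, span, sieve_buf);          /* read   */
    if (!err) {
        for (size_t i = 0; i < count; i++)                  /* modify */
            memcpy(sieve_buf + i * stride_bytes,
                   in + i * elemsize, elemsize);
        err = write_range(first_byte, span, sieve_buf);     /* write  */
    }
    free(sieve_buf);
    return err;
}
```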

DennisHeimbigner commented 5 years ago

ok, maybe I will have to do it from scratch. I will see how bad it is.

edhartnett commented 5 years ago

I didn't realize that this was used for libsrc. That alone is good enough reason to improve it! ;-)

The pnetcdf layer calls the pnetcdf functions ncmpi_get|put_vars*, so it does not use this code.

wkliao commented 5 years ago

I think the best approach is to implement it from scratch. ROMIO has to deal with arbitrary stride lengths, where the intervals between strides can be irregular. NetCDF, on the other hand, only needs to handle the simpler case where strides are regular and every strided access is a single element.
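
Concretely, the regular single-element case means none of ROMIO's offset-list bookkeeping is ever needed; the position of the k-th requested element is a closed-form expression (hypothetical 1-D sketch):

```c
/* With a regular, single-element stride, the byte offset of the k-th
 * requested element is closed-form, so no per-request offset list
 * (as ROMIO builds for irregular requests) has to be constructed. */
static inline size_t
strided_offset(size_t base, size_t k, size_t stride, size_t elemsize)
{
    return base + k * stride * elemsize;
}
```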

czender commented 5 years ago

NCO uses an optimization called USE_NC4_SRD_WORKAROUND that is only invoked when there is a single strided dimension in a multi-dimensional variable and it is the first dimension. Who knows how often this is the case? When it is, the optimization speeds things up by using _get_vara() instead of _get_vars() to read multiple contiguous values at once.
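
A minimal sketch of that pattern using the standard netCDF-C API, assuming a double variable; this illustrates the idea only and is not NCO's actual USE_NC4_SRD_WORKAROUND code:

```c
#include <netcdf.h>

/* When only the first dimension is strided, each step along it selects
 * a contiguous slab, so one nc_get_vara_double() call per slab replaces
 * an element-at-a-time nc_get_vars_double(). */
static int
get_first_dim_strided(int ncid, int varid, int ndims,
                      const size_t *start, const size_t *count,
                      ptrdiff_t stride0, double *buf)
{
    size_t st[NC_MAX_VAR_DIMS], ct[NC_MAX_VAR_DIMS];
    size_t slab = 1;   /* elements in one contiguous slab */

    for (int d = 0; d < ndims; d++) {
        st[d] = start[d];
        ct[d] = count[d];
        if (d > 0)
            slab *= count[d];
    }
    ct[0] = 1;         /* one index along dimension 0 per call */

    for (size_t i = 0; i < count[0]; i++) {
        st[0] = start[0] + i * (size_t)stride0;
        int err = nc_get_vara_double(ncid, varid, st, ct, buf + i * slab);
        if (err != NC_NOERR)
            return err;
    }
    return NC_NOERR;
}
```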