CH-Earth / summa

Structure for Unifying Multiple Modeling Alternatives:
http://www.ral.ucar.edu/projects/summa
GNU General Public License v3.0

add a default deflate_level=4 to compress netcdf output files #486

Closed guoqiang-tang closed 2 years ago

guoqiang-tang commented 2 years ago

Description: SUMMA NetCDF output files are not compressed. For large-scale modeling, the output files occupy a large amount of storage and have to be post-processed.

Solution: A default NetCDF deflate_level=4 (the same as the default compression level in Python netCDF4) is added in def_output.f90 and modelwrite.f90. For a test case (Bow At Banff catchment; 2008-01-01 to 2009-12-31; 51 GRUs and 118 HRUs), the sizes of the output files _day.nc and _timestep.nc (outputControl.txt) are reduced from 10.6 MB and 40.7 MB to 7.2 MB and 21.4 MB, respectively.
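To give a feel for what the deflate level controls, here is a minimal sketch using Python's standard zlib module, which implements the same deflate algorithm that NetCDF-4 compression relies on. The data is a synthetic stand-in, not SUMMA output, and the helper name `deflated_size` is hypothetical; real NetCDF files are compressed chunk by chunk, so the actual ratios will differ.

```python
import struct
import zlib

# Synthetic stand-in for one output variable: a repeating, slowly varying
# series stored as 8-byte floats. This is NOT real SUMMA output.
values = [280.0 + 0.01 * (i % 500) for i in range(50_000)]
raw = struct.pack(f"{len(values)}d", *values)


def deflated_size(data: bytes, level: int) -> int:
    """Return the size in bytes of `data` after zlib deflate at `level` (0-9)."""
    return len(zlib.compress(data, level))


# Level 0 stores the data without deflation; higher levels try harder to shrink it.
sizes = {level: deflated_size(raw, level) for level in (0, 4, 9)}
for level, size in sizes.items():
    print(f"deflate_level={level}: {size} bytes ({size / len(raw):.1%} of raw)")
```

Level 0 adds a small container overhead and does not shrink the data, which is why it is the natural "off" setting.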

andywood commented 2 years ago

This hardwires the compression choice into summa -- motivating a few quick questions:

guoqiang-tang commented 2 years ago

For the deflate level, a value of zero means no deflation is used, which avoids the time cost of NetCDF compression. That cost matters most for small cases, where SUMMA computation time and read/write time are comparable. How about adding the deflate level to fileManager.txt? This needs only minimal code changes. I can set the default deflate level to zero so that existing models do not need to change anything; adding one new line to fileManager.txt would activate NetCDF compression. As far as I understand, adding this option to outputControl would require more complex code changes.

andywood commented 2 years ago

I think adding the deflate level as an option in outputControl is more logical and intuitive, because that is where other information controlling the output format is read, whereas the fileManager has only time control and file locations. Andrew's addition of the precision option can be a template for adding a compression option (e.g., 'outputCompressionLevel'). The other main difference between the two places is that the fileManager reading code is compiled before the code that sets up data structures, so the main way to propagate the variable from there is through individual public variables rather than by adding it to, say, an output options data structure. When creating the keyword-based fileManager, I started to set up a file data structure, but it became messy and I dropped it.

The other thing to do is to make the compression level read from the output control backward compatible, so that if the option is not included, it's just set to whatever the default may be. I'd lean toward making the default 0 but there are arguments for making it 4 -- some users won't ever think about it and for them maybe giving them smaller output is better?

Can you do some benchmarks on run time as well as output size? I'm just curious. The run time might not be impacted much.

guoqiang-tang commented 2 years ago

I looked at the "outputPrecision" option in outputControl and agree that this is a suitable place to set the deflate level. I have changed the code, which you can now review. The default deflate level is defined as 4 in globalData.f90. Users can add a line to the outputControl file to choose a different level of compression: deflate_level | 0 ! between 0 and 9
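The backward-compatible behaviour described here (use the value from outputControl if present, otherwise fall back to the default) can be sketched in Python as follows. This is an illustration only, not SUMMA's Fortran implementation: the function name `read_deflate_level` is hypothetical, and the line format (`|` as field separator, `!` starting a comment) is assumed from the example line quoted in the comment above.

```python
DEFAULT_DEFLATE_LEVEL = 4  # default value reported in this thread


def read_deflate_level(lines, default=DEFAULT_DEFLATE_LEVEL):
    """Return the deflate level found in outputControl-style lines.

    Falls back to `default` when no deflate_level line is present, so
    existing control files keep working unchanged.
    """
    for line in lines:
        content = line.split("!", 1)[0].strip()  # drop trailing comment
        if not content:
            continue
        fields = [field.strip() for field in content.split("|")]
        if fields[0] == "deflate_level" and len(fields) >= 2:
            level = int(fields[1])
            if not 0 <= level <= 9:
                raise ValueError(f"deflate_level must be between 0 and 9, got {level}")
            return level
    return default


# An explicit setting overrides the default; a file without the line uses it.
print(read_deflate_level(["deflate_level | 0 ! between 0 and 9"]))
print(read_deflate_level(["! a control file without the option"]))
```

Rejecting values outside 0 to 9 at read time keeps a typo in the control file from surfacing later as a NetCDF write error.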

As for computation time, I will perform a more systematic comparison later. My tests on HPC gave different computation times even for identical settings. The differences are not large, but they can affect the comparison.

guoqiang-tang commented 2 years ago

Yes. I'll change the name after a computation time test.

andywood commented 2 years ago

Note -- please remember to commit an update to the 'what's new' file for this -- it's notable enough regarding a new functionality.

guoqiang-tang commented 2 years ago

Thank you. I noticed that you have updated the 'what's new' file. For the compression time test, I chose an example basin and measured the run time for compression levels 0 to 9 on both a laptop and Compute Canada HPC. However, the results were inconsistent: lower compression levels (e.g., 0, 1) sometimes had longer run times than higher compression levels (e.g., 6, 7). I suspect this is related to chunk sizes. I was distracted by other work and set this problem aside. Overall, I still expect a positive relationship between compression level and computation time.
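Because single timings on shared hardware are noisy, one common way to make such a comparison more stable is to repeat each measurement and compare medians. The sketch below does this for in-memory zlib compression of synthetic data only; it does not measure SUMMA run time, file I/O, or chunking effects, and the helper name `median_compress_time` is hypothetical.

```python
import statistics
import struct
import time
import zlib


def median_compress_time(data: bytes, level: int, repeats: int = 5) -> float:
    """Return the median wall-clock time (seconds) to deflate `data` at `level`."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        zlib.compress(data, level)
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


# Synthetic stand-in for an output variable; NOT real SUMMA output.
values = [280.0 + 0.01 * (i % 500) for i in range(50_000)]
raw = struct.pack(f"{len(values)}d", *values)

results = {level: median_compress_time(raw, level) for level in range(10)}
for level, seconds in results.items():
    print(f"deflate_level={level}: median {seconds * 1e3:.3f} ms")
```

Applying the same repeat-and-take-the-median approach to full model runs would separate the compression cost from run-to-run variation on the HPC system.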