Slow processing of large, compressed netCDF files tied to a small chunk size

PaulWessel commented 4 years ago

I created a large global 1m x 1m compressed (--IO_NC4_DEFLATION_LEVEL=9) netCDF file from one of the Sandwell/Smith IMG files; dimensions are 21600x17280. We noticed it took "forever" to get gmt grdinfo -M to report the min/max values (over 1 minute on my iMac) and gdalinfo took many minutes; yet this varied widely among Windows and other installations. Initial investigations determined the culprit to be a small default cache size setting when the netCDF library is built [those who built their own netCDF library with a larger cache got much better response], but today I noticed that if I recreate the grid using --IO_NC4_CHUNK_SIZE=4096 the speed is back to normal. So, I am wondering if our default selection of chunk sizes (which seems to be in the 128-256 range) is just way too small, at least for large grids like this one. Perhaps the _gmtnc_set_optimalchunksize function needs some more scrutiny. Not sure if @ldldr is listening but I would like his opinion.
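To put rough numbers on the chunk sizes mentioned above, here is a minimal Python sketch of the arithmetic for the 21600x17280 grid. It assumes square chunks and 4-byte values (the stored data type is not stated in this thread), and `chunk_stats` is a hypothetical helper, not part of GMT or netCDF. It only counts chunks and their uncompressed size; it does not model the netCDF chunk cache or explain the timings reported here.

```python
import math


def chunk_stats(n_cols, n_rows, chunk, bytes_per_value=4):
    """Count square chunks and their uncompressed size for a 2-D grid.

    bytes_per_value=4 is an assumption (e.g. 32-bit floats).
    """
    chunks_across = math.ceil(n_cols / chunk)
    chunks_down = math.ceil(n_rows / chunk)
    return {
        "chunks_across": chunks_across,  # chunks touched by one full grid row
        "chunks_down": chunks_down,
        "total_chunks": chunks_across * chunks_down,
        "chunk_bytes": chunk * chunk * bytes_per_value,
    }


if __name__ == "__main__":
    # Grid dimensions quoted in the comment above; chunk sizes from the
    # 128-256 default range and the 4096 override.
    for chunk in (128, 256, 4096):
        print(chunk, chunk_stats(21600, 17280, chunk))
```

With a chunk size of 128, one full row of the grid crosses 169 chunks, whereas with 4096 it crosses only 6, at the cost of each chunk being far larger once decompressed.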

remkos commented 4 years ago

When using nccopy -d9 the (hopefully optimal) chunksize is computed automatically. Maybe we can cull some wisdom from that. A chunksize of 128-256 indeed sounds small.

Secondarily, I think compression level 9 is excessive. I have found that compression level 4 strikes a good balance between the amount of compression and the time to do the compressing and uncompressing.

We probably should review some of the resources we created with compression. Let's use this thread to:

Figure out what is the "optimal" chunksize and how nccopy does it.
Consider the default compression level of 4.
Implement this in code.
Identify netCDF sources that we provide that may need to be recompressed using nccopy.

Note: We can never satisfy all cases.

In 2D arrays where one dimension is time, the chunking in time should probably be small, as people tend to read one time record at a time.
In 2D arrays of latitude and longitude, the chunking in both directions would be similar as people tend to read areas.

PaulWessel commented 4 years ago

I chose level 9 because the key thing has been the remote download and that is strictly a file size issue. When we switch to a tiling system and the download size required shrinks then perhaps 4 is a good compromise.

PaulWessel commented 4 years ago

Just adding here that since the beginning, IO_NC4_DEFLATION_LEVEL = 3. I think you are arguing above to maybe change to 4? This is separate from the chunk size issue above.

PaulWessel commented 4 years ago

Not sure what to do. This takes over 5 minutes on my Mac:

gmt grdcut earth_relief_01m.grd -R160/180/-90/-70 -G/tmp/subset.nc=ns

That would make any user just hit Ctrl-C in desperation. These are highly compressed files: earth_relief_01m.grd: format: netCDF-4 chunk_size: 4096,4096 shuffle: on deflation_level: 9

gmt grdcut earth_relief_01m.grd -R150/170/-90/-70 -G/tmp/subset.nc=ns

takes 5 seconds. Here are some more details: We have a two row/column padding, and when we read from file we try to fill those in with data so that we have good boundary conditions for taking derivatives. Here, those two extra columns on the east side wrap around to the start of the grid, so we read in the grid in two chunks: 158:50-180:00 is read first, then the strip -180 to -179:58. Somehow, reading that first section takes forever. Can @joa-quim and @seisman reproduce this?
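The two-piece read described above can be sketched as follows. This is only an illustration in Python, not GMT's actual code: it assumes a pixel-registered global grid with 21600 columns at 1 arc-minute spacing starting at -180, a pad of 2 columns, and the 4096 chunk size reported for the file. All function names are hypothetical.

```python
def lon_to_col(lon, west=-180, per_degree=60):
    """Column index of a longitude on a 1 arc-minute grid (assumed layout)."""
    return round((lon - west) * per_degree)


def split_periodic_columns(first, last, n_cols):
    """Split the half-open column range [first, last) of a periodic grid
    into pieces that each lie inside [0, n_cols)."""
    pieces = []
    start = first
    while start < last:
        base = (start // n_cols) * n_cols  # start of the period holding `start`
        stop = min(last, base + n_cols)    # do not cross the periodic seam
        pieces.append((start - base, stop - base))
        start = stop
    return pieces


def padded_pieces(west, east, pad=2, n_cols=21600):
    """Column pieces needed to read [west, east] plus `pad` columns per side."""
    return split_periodic_columns(
        lon_to_col(west) - pad, lon_to_col(east) + pad, n_cols
    )


def chunk_columns(pieces, chunk=4096):
    """Sorted indices of the chunk columns touched by the given pieces."""
    touched = set()
    for start, stop in pieces:
        touched.update(range(start // chunk, (stop - 1) // chunk + 1))
    return sorted(touched)


if __name__ == "__main__":
    for west, east in ((160, 180), (150, 170)):
        pieces = padded_pieces(west, east)
        print(west, east, pieces, chunk_columns(pieces))
```

Under these assumptions the -R160/180 request splits into a main piece plus a 2-column strip at the start of the grid, so it touches one more chunk column than the -R150/170 request, which stays in a single piece. The sketch does not by itself account for the difference between 5 seconds and 5 minutes.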

seisman commented 4 years ago

Yes, same to me (netcdf 4.7.4).

joa-quim commented 4 years ago

It has been running for some time now, so I can confirm it. The padding is a potentially slow operation. There are probably many instances where it could be skipped, like for grdcut.

PaulWessel commented 4 years ago

It seems specific to the case where we add 2 columns beyond the east end of a global periodic grid. I agree there are times when the pad could be skipped like here; I will implement something - right now it does this regardless.

PaulWessel commented 4 years ago

OK, that works fine. Help me list the places where boundary conditions (BC) do not matter. I have done grdcut. Maybe it is easier to determine where they do matter: grdsample, grdtrack, grdgradient, grdimage, grdproject, ?

joa-quim commented 4 years ago

grdtrack only if profiles touch the edges. But why grdimage?

PaulWessel commented 4 years ago

It can call grdgradient implicitly via -I. Also, when I did not do this for grdcontour I got streaks across the map. So for now I am putting it in where it is needed until there are no more failures.

joa-quim commented 4 years ago

Then grdview needs it too

GenericMappingTools / gmt

Slow processing of large, compressed netCDF files tied to a small chunk size #2402