Help with slow open performance with netCDF4 Files

mathomp4 commented 7 years ago

Environment Information

What platform are you using? (please provide specific distribution/version in summary)
- [x] Linux
- [ ] Windows
- [ ] OSX
- [ ] Other
- [ ] NA
32 and/or 64 bit?
- [ ] 32-bit
- [x] 64-bit
What build system are you using?
- [x] autotools (configure)
- [ ] cmake
Can you provide a sample netCDF file or C code to recreate the issue?
- [x] Yes (please attach to this issue, thank you!)
- [ ] No
- [ ] Not at this time

Summary of Issue

I work on a climate model and I/O is slowly becoming a major bottleneck. By default we try to use nc4 files as our main format (a few Fortran binaries remain, but we are trying to rid ourselves of that). We recently did an experiment to see if our slow performance might be due to the fact that many files we write are written with an unlimited time dimension, so when we read them, they are unlimited. So, we fixed the time dimension, no change. We also compress, chunk, etc. the files but still no big changes. Finally, a colleague suggested converting to netCDF Classic, and, hooboy. Muuuuuuch faster.

Eventually I went whole hog. I took our NC4 file, decompressed it, converted it using nccopy -k ncX MERR.nc4 MERR.ncX to nc3, nc5, nc6, nc7 and hacked together a simple program to test it out which you can see here:

https://github.com/mathomp4/netcdfopentester/blob/master/testCNetcdf.c

All it does is open and close a file 1000 times (might be overkill). If I run this on my test files, I get the following output:

(1929)(master) $ ./testCNetcdf.exe
nc3 unlimit dim open Time =  0.820 seconds.
nc4 unlimit dim open Time = 19.870 seconds.
nc5 unlimit dim open Time =  1.020 seconds.
nc6 unlimit dim open Time =  1.130 seconds.
nc7 unlimit dim open Time = 20.950 seconds.
nc4 unlimit dim open with hdf5 Time =  0.250 seconds.

I added what I think is the correct HDF5 code to open and close a file, but I mainly program netCDF in Fortran and Python, so even this program is a huge success for me! (Also: why it's ugly.) It seems like a netCDF open and an HDF5 open are ~2 orders of magnitude in speed difference. Is there a way to tune netCDF to open like HDF5? Or do you pay for HDF5's quick opening later?

Or, assuming I might have got the HDF5 wrong, why does netcdf Classic open so much faster as well?

Note: The versions of the libraries I'm using are netCDF-C 4.4.1.1 and HDF5 1.8.19. I compiled with Intel 17.0.4 compilers (backed up by GCC 5.3.0). The file is one from the MERRA2 data that can be retrieved (some basic instructions in my README in boring repo mentioned above).

edhartnett commented 7 years ago

Can you do an ncdump -s -h of the netCDF-4 file, and show the metadata?

mathomp4 commented 7 years ago

I can. Here is the file that is the one I uncompressed (MERR.nc4 in the above script).

MERR.nc4.ncdumphs.txt

DennisHeimbigner commented 7 years ago

Since I do not have the necessary authorization. Can you public post an actual problematic .nc4 file somewhere temporarily?

WardF commented 7 years ago

I'm currently working on a different issue, but I can easily believe that there is an inefficiency somewhere. Using the testCNetcdf.exe above I will, as soon as I can, generate a usage profile and will post the results here for discussion and to see where the overhead is occuring.

wkliao commented 7 years ago

Interesting evaluation! I can confirm the slow open timing when using the files produced from test program "nc_test" which contain a long list of variables and attributes.

% ./read1000 test.nc4 Open test.nc4 1000 times costs 18.308271 seconds % ./read1000 test.nc5 Open test.nc5 1000 times costs 0.146202 seconds

DennisHeimbigner commented 7 years ago

Opening a file with a large metadata is a known problem because, unlike HDF5, netcdf-4 reads in all the metadata at once, rather than lazily. However, looking at the file above (MERR.nc4.ncdumphs.txt) It does not seem to have all that much metadata (~900 lines). It has about 60 variables and about 800 attributes. I wonder if our attribute handling code is the problem; without looking, I would suspect a linear search problem.

DennisHeimbigner commented 7 years ago

After a quick look, it appears that attributes are stored in a singly linked list. And as each attribute is read, it is stored at the end of the list => linear. This is the case for all netcdf-4 metadata storage. Looks like it is time for a comprehensive rebuild of the libsrc4 code to refactor this to use at least arrays rather than lists. Hash tables would be better. If memory serves someone did this for the netcdf-3 code already, so perhaps we can share the code. In any case, this is a big change, so it is will not occur until at least the next+1 release (e.g. after 4.5.0).

WardF commented 7 years ago

I agree. 4.5.1 would be better, 4.5.0 is overdue.

mathomp4 commented 7 years ago

Since I do not have the necessary authorization. Can you public post an actual problematic .nc4 file somewhere temporarily?

Well, it sounds I'm a little late to the party, but I've put a compressed version of the file on Dropbox.

Opening a file with a large metadata is a known problem because, unlike HDF5, netcdf-4 reads in all the metadata at once, rather than lazily.

Ah. This makes sense. Although for a bit there, HDF5 opened really fast because I forgot to check the error status (in Fortran). Turns out it'll open very fast if it's just erroring out. But I've confirmed it works. If only I could figure out how to cast the HDF5 file pointer to nf90...best of both worlds!

After a quick look, it appears that attributes are stored in a singly linked list. And as each attribute is read, it is stored at the end of the list => linear.

But it sounds like @DennisHeimbigner might have figured it out. When I get into work tomorrow, I might construct a series of files with varying variables. Luckily I can create a stupid-long file with our climate model and then just remove variables with NCO/CDO. I'd like to know if I can reproduce a linear trend.

mathomp4 commented 7 years ago

I'd like to know if I can reproduce a linear trend.

Wrote a quick and dirty tester and...yup: netcdf open test

The R² of the fit didn't export with the image, but it was 0.9871. (https://plot.ly/~Fortran/333)

DennisHeimbigner commented 7 years ago

Thanks for the verification.

Dave-Allured commented 7 years ago

The virtual attribute _IsNetcdf4 was added in 4.4.1. I wondered if this was the performance problem, because it requires scanning all the file top metadata.

I compared Netcdf-C 4.4.0/HDF5 1.8.16 to current versions, 4.4.1.1 with 1.10.1. These timing results were actually run interleaved between versions, to compensate for possible network biases. I grouped by version and copy/pasted into this summary.

Netcdf version:
4.4.0 of Jan 15 2016 14:13:42 $
nc4 unlimit dim open Time = 33.390 seconds.
nc4 unlimit dim open Time = 39.030 seconds.
nc4 unlimit dim open Time = 36.940 seconds.
nc4 unlimit dim open Time = 42.890 seconds.
nc4 unlimit dim open Time = 41.310 seconds.
nc4 unlimit dim open Time = 43.640 seconds.
nc4 unlimit dim open Time = 44.250 seconds.
nc4 unlimit dim open with hdf5 Time =  0.550 seconds. (1.8.16, typical)

Netcdf version:
4.4.1.1 of May 22 2017 10:44:13 $
nc4 unlimit dim open Time = 51.660 seconds.
nc4 unlimit dim open Time = 55.440 seconds.
nc4 unlimit dim open Time = 54.300 seconds.
nc4 unlimit dim open Time = 56.420 seconds.
nc4 unlimit dim open Time = 51.510 seconds.
nc4 unlimit dim open with hdf5 Time =  0.530 seconds. (1.10.1, typical)

This proves that _IsNetcdf4 is not the primary culprit for slow opens, because the attribute was completely absent from netcdf-c 4.4.0. The main impact is presumably just as Dennis described. However there seems to be a smaller time increase between versions here. This might be _IsNetcdf4 or something else.

DennisHeimbigner commented 7 years ago

If memory serves, _IsNetcdf4. is computed only when asked for. So it should not be accessed as part of the above tests.

Dave-Allured commented 7 years ago

That makes sense. Not knowing the internals, I thought it was possible that nc_open might be pre-staging _IsNetcdf4 for some reason.

Dave-Allured commented 7 years ago

@mathomp4, your experiment focused on only nc_open. How much of your model I/O time is consumed in nc_open vs. other file I/O? There are some known serious inefficiencies when the netcdf-4 chunk layout conflicts with read and write access patterns, or when chunk caches are too small.

My general experience with netcdf-4 is that performance rivals netcdf-3 when the file structure is reasonably matched with the I/O access pattern. Also in my current model project, 20 to 200 milliseconds is pretty reasonable for a file open that occurs only three times in a 40-minute compute cycle. Your model's file open pattern might be much different, of course.

mathomp4 commented 7 years ago

@Dave-Allured Well, this was sort of an odd test. We were looking at I/O and noticed this, so we sort of focused on the open. Now, in most cases, the open probably isn't as important as the read, so we are looking at that too. But we were wondering about open/read speeds when NC4 files were unlimited time v. fixed time, compressed v. uncompressed, etc. But the open was dominating!

We've also encountered what we think is a bug in Intel MPI when reading compressed NC4 files in our model. It was very weird. We found if we uncompressed, the code worked. Now that we have solved the open mystery, we'll probably look at the read more carefully. Chunking, compression, etc.

But all that said, in one way of running our model we have to open approximately 500 files at start up and ~100 every hour. In that version which was run at very high resolution, I/O (mainly I in this case) was ~25% of the entire model run. Thus, looking at I/O strategies now.

Dave-Allured commented 7 years ago

@mathomp4, all things considered, it seems that NC3 format would be a good fit to improve I/O for this particular model. File opens will be much faster as demonstrated here. I think data reads and writes will be similar to NC4, possibly faster for some access patterns. The main thing you give up with NC3 is NC4 data compression. It seems like speed is your main issue, not storage space.

Watch out for NC3 file structure issues which could bog down data writes. You should pre-define all variables and attributes in a new NC3 file, before writing any large variables. Or else utilize the NC3 padding capability. Otherwise, adding or expanding any variable or attribute can trigger an internal re-write of the whole file when you are not careful. NC4 is thankfully free of this constraint because of dynamic storage management. Please excuse me if I am explaining stuff you already know.

If you start now with netcdf-3 format, you should be able to easily upgrade to netcdf-4 if nc_open is improved in a future release.

DennisHeimbigner commented 6 years ago

Update: I have been experimenting with the netcdf4 code in an attempt to improve the performance of nc_open. I thought the issue was that when the metadata was very large -- multi-megabytes -- the linear searches used by the open() code were causing the slow open speeds. After running some experiments with highly modified code, I am beginning to believe that the problem is that the open cost is dominated by the cost for the hdf5 library to read all the metadata. Currently, we do not/cannot make use of HDF5 lazy reading of the metadata because we always read in all of the metadata.

So, what can be done. Converting netcdf to fully use lazy metadata read is known to be a large undertaking. However providing lazy read for attributes seems to be possible because it appears that most often, attributes are accessed by name. Does anyone have any insight on this?

edhartnett commented 6 years ago

I have (mostly) split the HDF5 code from the libsrc4 data model code. I now see how lazy reads can work for (almost?) all HDF5 objects. I don't think it will be as hard as I previously estimated.

I deduce from @WardF's recent activity that my HDF4 dispatch PR is soon to be merged, so I can present the HDF5/libsrc4 split after that and we can discuss it.

After that, it will be an easy step to add lazy reads. Also easy would be to use an array for atts, as we now do for vars, instead of linked lists.

DennisHeimbigner commented 6 years ago

After some experimentation, my current speculation is that the only thing that needs lazy reading is attributes, and since attributes are mostly read by name, that should be pretty doable.

edhartnett commented 6 years ago

Really? If I have a file with 10000 atts it's slow, but with 10000 vars its not? That seems surprising.

DennisHeimbigner commented 6 years ago

I am basing my comment off of the example that others gave me.

DennisHeimbigner commented 6 years ago

Also, based on that limited sample. 1000's of variables may not be common. Our problem is that we do not have a good statistical sample of files.

edhartnett commented 6 years ago

OK, but I do agree with you that lazy read of atts would be a good thing.

gsjaardema commented 6 years ago

FYI -- I do have many files that have 1000's of variables or more for compilicated models.

edhartnett commented 6 years ago

What would be great would be to send me the ncdump -h -s of some of the very worst offenders. (Or attach them here.) Pick your three or four slowest opening files, and give me their metadata.

Then I can create some samples and have some real examples to work with.

DennisHeimbigner commented 6 years ago

I created some programs to:

create a file with large amounts of metadata (parameterized)
time the open for a file. Both are attached bigmeta.zip We have one other file but it is apparently proprietary.

DennisHeimbigner commented 6 years ago

Greg- what might me helpful would be for you to:

select some number of such files
ncdump -h those files into .cdl files
send us those .cdl files.

[oops, duplicate of Eds' comment]

gsjaardema commented 6 years ago

Sorry for the delay. Here are a couple of bad files. Writing the testMeshQuad8Psi.e.nc as a netcdf-4 file can take a very long time.

testMeshQuad8Psi.e.nc.gz Block_8192.g.nc.gz

mathomp4 commented 6 years ago

Query for the netCDF developers. In re this issue, is #857 the one to watch out for for solving this issue?

edhartnett commented 6 years ago

I do have some progress here but I am working to clear a backlog of work and get it merged...

DennisHeimbigner commented 6 years ago

I have made some changes to netcdf4 that improve the performance about 10%. But the most significant driver for this performance problem is the fact that netcdf-4 reads all the metadata at once, which is what is being discussed in https://github.com/Unidata/netcdf-c/issues/857. Adding lazy metadata read to netcdf4 is a major undertaking and we have not decided yet on the best way to do this.

edhartnett commented 6 years ago

I have done some experiments with lazy reads of attributes. Seems very promising and easy to make happen in the code, but I didn't get all tests working.

All this code has changed a lot (with many changes pending as PRs or soon-to-be PRs) so I have to take another stab at this with a clean master.

I suppose the same technique could also be extended to variables.

DennisHeimbigner commented 6 years ago

I think the big problem that is going to cause problems with user code is that lazy operation will change the id numbers of attributes etc. It would be nice if users did not rely on those numbers, but some of them do.

DennisHeimbigner commented 6 years ago

In fact, if memory serves, some of our old test cases even use the numbers as well.

edhartnett commented 6 years ago

@gsjaardema when I download and gunzip your test files, the master branch ncdump tells me "Unknown file format." Am I doing something wrong?

edhartnett commented 6 years ago

OK, I have a branch which does a lazy read of all attributes.

No att numbers are changed and all tests pass without modification.

This works by only reading the atts (in a group (NC_GLOBAL) or for a variable) when the user first asks for some metadata relating to the attributes.

Since each group and var are still read as a unit, no numbers change. That is, when you ask for any info about any of the attributes of a variable, it will read all variable attributes at that time, for that variable. If you ask for info about a different attribute of that variable, it now gets it from memory without touching the file.

So the read of attributes is deferred until someone asks for them.

This seems to result in a speedup of about 100% for a file with 100 vars, each with 100 attributes. This is on branch ejh_lazy_atts_2 on the WDC clone. I can prepare a distribution if anyone wants to give it a try.

These changes will go into the queue for merging into master.

This is the easy part of #857. However, given that most real world files contain many more attributes than variables, this will probably be the biggest payoff from #857.

DennisHeimbigner commented 6 years ago

That is a pretty good compromise. The test_varperf probably is not a good test for testing this. We need to test it by doing a read of a random number of variables or groups and then compare again the current eager read code.

edhartnett commented 6 years ago

I have a new test, tst_attsperf, which does a test.

It shows a consistent speedup of a factor of 2. That is, it takes half as much time to open a file with lazy atts:

ed@mikado:~/tmp/n2/nc_test4$ make tst_attsperf && ./tst_attsperf 
make: 'tst_attsperf' is up to date.
testing speed of open with files with lots of metadata...
Lazy Atts time=76204 Read Atts at Open time=174782 Speedup=2
ok.
*** Tests successful!

DennisHeimbigner commented 6 years ago

That kind of speedup is definitely worth doing. Would it be possible to extend the idea to variable loading as well (i.e. basically load all variables in a group when any variable is referenced)?

edhartnett commented 6 years ago

I am looking into variables next. I will hit it first thing tomorrow and see what can be done. ;-)

edhartnett commented 6 years ago

OK, variables are a harder problem.

There are several difficulties, one is that we probably still want to read all dimensions (i.e. dimscale vars) on open, even if we don't read the vars.

Secondly, everything is processed in a giant loop that touches all HDF5 objects in the file (other than attributes, which are handled separately.) This happens with callback function nc4_rec_read_metadata_cb().

I will continue to look at this. I'm convinced there is a way to do this. I will continue with the libhdf5/libsrc4 separation, and as that happens, I will probably figure out how to do a lazy read of variables. Meanwhile we can get this lazy attribute read into master and that will help.

edhartnett commented 6 years ago

OK, I think I see a way to do this for vars as well. @DennisHeimbigner if you could avoid major edits to the libsrc4/libhdf5 directories for a few weeks, I can take a stab at this.

DennisHeimbigner commented 6 years ago

Ok, I am mostly working on thredds anyway.

gsjaardema commented 6 years ago

@edhartnett The files are the output from ncdump, so no need to run ncdump on them.

edhartnett commented 6 years ago

@gsjaardema thanks! Now I see. What a nutty file. This is exactly the kind of thing I never imagined when writing the netcdf-4 code. ;-)

I do have a way forward here, I think. I am working on this now - if I can get it working I will put it up as a PR.

edhartnett commented 6 years ago

I have made some good progress here on a new version of the open code which does not read the vars on open.

On open, I open groups and read user-defined types. Vars get a name and varid, but nothing else happens on open. So open is a lot faster now.

As near as I can tell, when the user asks anything about a var in a group, I will have to read all vars in the group, and all vars in all parent groups, up to the root group. I need to do this to catch all the potential dimensions.

This really won't help a lot of the user has 10000 vars in the root group. We will simply delay the reading of all this metadata until the user actually asks for some of it. So nc_open() will be quick but nc_inq() will get slow (the first time it is called).

However, users organizing variables into groups will see a significant improvement. If a group contains variables that are never used in a program, then the metadata for those vars need never be read.

edhartnett commented 6 years ago

I wanted to confirm that the recent lazy-att reads really do improve performance, so I did a little testing.

OK, I have removed this graph because I found some bugs in the test program. Updated graphs coming soon...

gsjaardema commented 6 years ago

Is the legend incorrect, or am I misreading the data -- looks like "Lazy Atts" is much slower than "4.6.1" ? ..Greg

On Fri, Jul 20, 2018 at 8:06 AM Ed Hartnett notifications@github.com wrote:

I wanted to confirm that the recent lazy-att reads really do improve performance, so I did a little testing. Here's some preliminary results that look promising:

[image: image] https://user-images.githubusercontent.com/4533212/43001616-12ed18e8-8be3-11e8-9039-d4c08913929b.png

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/Unidata/netcdf-c/issues/489#issuecomment-406581444, or mute the thread https://github.com/notifications/unsubscribe-auth/AA2xDmspqSNdzIhQcHRzqLa2YKGTrfQBks5uIcfegaJpZM4PeDSC .

edhartnett commented 6 years ago

@gsjaardema look again. I had the columns mislabeled on my first posting, but now it is correct.

So with 20K atts, the current master (plus all my queued changes) takes ~2 seconds to open the file, instead of 6 seconds for 4.6.1. So that is nice.

But why the heck does it take 2 seconds to open a file with nothing but global atts, when it does not read the global atts? That's a puzzle.

Unidata / netcdf-c

Help with slow open performance with netCDF4 Files #489

Environment Information

Summary of Issue