Unidata / netcdf

NetCDF Users Group (NUG)
MIT License

What is netCDF? #50

Open lesserwhirls opened 4 years ago

lesserwhirls commented 4 years ago

Given recent discussions regarding additional C library features (for example, Unidata/netcdf-c#1545, Unidata/netcdf-c#1546, Unidata/netcdf-c#1548), a question that has been on my mind quite a bit is “what is netCDF?” Perhaps others are wondering as well. Here is my take, with the understanding that I started using netCDF in roughly 2004. If my take does not matter, that's ok - feel free to skip to the final paragraph. Note that none of this is intended as a dig against any individual, project, group, etc., but rather to outline a limited set of concerns regarding the current state of the netCDF ecosystem and how it moves forward, as well as a suggested path forward based on approaches that seem to work pretty well in other, somewhat similar situations (although not exactly the same by any means).

One take

Up until the addition of HDF5 as a persistence format (and the only persistence format supporting the extended data model), the situation seemed pretty clear, as the persistence formats were well defined, such that anyone could implement read/write without much ambiguity (and many did). I believe this helped gather a large, language-agnostic community around netCDF and the associated persistence formats. The netCDF Data Model provided a way to think about and access data in a relatively straightforward way, and netCDF files were a stable, portable container that was usable across multiple computing environments, including languages with different memory ordering (C and FORTRAN, specifically).

With the addition of HDF5 as a persistence format, as well as the extension of the netCDF data model, the answer to the question “what is netCDF?” feels less clear. The overall tone I get from the recent discussions mentioned above, and from other similar GitHub issues, is that netCDF - the data model and the persistence formats - is defined to be whatever the C library implementation supports and enables (whether by direct implementation or as an unintended side effect of changes to the HDF5 library), and all other implementations need to keep pace. Honestly, that's a perfectly valid choice, but not one that I ever considered was an option for this particular community-driven endeavor. It's also not something that can be clearly found in the netCDF Users Guide, at least as far as I can tell, although one could argue it is hinted at in places (I'd say one gets a mixed message, at best).

netCDF-C is not the only library that attempts to read and/or write netCDF-4 files. The NCAR RAL-developed Nujan library is a pure-Java writer for netCDF-4 and HDF5 files. More akin to netCDF-C, h5netcdf is a Python library that both reads and writes netCDF-4 files using h5py (which wraps HDF5) rather than the netCDF-C library. Based on interactions with h5netcdf, I'd say maybe yes, the persistence format specification for the Enhanced Data Model can only be understood in light of netCDF-C implementation details, and really only in combination with the particular version of the HDF5 library in use by netCDF-C at the time of persistence. I know there are folks that would strongly agree with this view, and others that would disagree with equal intensity, but the current state of things with regard to the persistence format for the netCDF Enhanced Data Model feels ambiguous at best.

With the C library addition of arbitrary filter support for use in writing, netCDF-4 as a persistence format has taken a concrete step towards GRIB, in that you now need an open-ended set of filters to read the actual data encoded by an application's persistence layer (analogous to the way one can always read the octets of a GRIB message, with the question always being whether or not they can be understood and/or actually used). So now we have GitHub issues discussing which filters are supported by netCDF (for the Enhanced Data Model persistence format, HDF5), and the question is how are those filters defined? So far, for netCDF (specifically from the point of view of the C library), I've heard two methods:

Both of these revolve solely around the netCDF-C library. The CCR will help rein in the possibilities with respect to filters (which is great!), but because compression is not explicitly specified as part of the netCDF Enhanced Data Model, and I assume the netCDF-4 persistence format spec will not place limits on options now (currently the spec only lists zlib for writing), the options for writing appear truly open-ended (functionally making the data of a compressed variable in netCDF-4 opaque until a decompression technique is obtained). None of this is to say new compression methods for use in netCDF persistence are not needed. One such need driven by the operational community is clearly outlined here. However, we find ourselves in a situation where, while yes, netCDF-4 files produced by the C implementation are portable, they are not necessarily usable in the most basic sense, and therefore I would say not well suited for exchange or archiving as things currently stand, at least generally speaking ("generally" being outside of the current C library with specific filter support enabled). If you can't read the actual data values, even with all of the metadata in the world, then what's the point? Unless of course you can ensure the people you share your data file with use the exact same configurations and versions of the netCDF-C and HDF5 libraries, plus any additional filters, used to produce the file...and all of these configurations/versions would be analogous to one or more GRIB tables. One could argue this was always the case for netCDF-4, and even netCDF-3 when combined with szip. I do not think that makes where we currently stand OK; rather, it points to a deficiency in how well defined the netCDF-4 persistence format is. Given that the spec is currently written to be "sufficient to allow HDF5 users to create files that will be accessible from netCDF-4", I suppose it's not intended to be complete. One question would be, should it be? And if so, does it belong in the C documentation, or elsewhere?
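To make the concern concrete, here is a minimal sketch (mine, not taken from any of the issues above) of how an arbitrary filter gets attached on write through the netCDF-C API. Filter id 307 (bzip2 in the HDF Group's registry) is used purely as an example; nothing in the format restricts which id can appear here, and a reader without the matching plugin on its HDF5_PLUGIN_PATH will see the variable's bytes but not the data.

```c
/* Sketch: writing a netCDF-4 variable through an arbitrary HDF5 filter.
 * Filter id 307 (bzip2 in the HDF Group registry) is illustrative only;
 * a reader needs the matching plugin on HDF5_PLUGIN_PATH to decode it. */
#include <stdio.h>
#include <stdlib.h>
#include <netcdf.h>

#define CHK(e) do { int s_ = (e); if (s_ != NC_NOERR) { \
    fprintf(stderr, "%s\n", nc_strerror(s_)); exit(1); } } while (0)

int main(void) {
    int ncid, dimid, varid;
    size_t chunks[1] = {100};
    unsigned int level = 9;          /* single filter parameter: compression level */
    float data[100];
    for (int i = 0; i < 100; i++) data[i] = (float)i;

    CHK(nc_create("filtered.nc", NC_CLOBBER | NC_NETCDF4, &ncid));
    CHK(nc_def_dim(ncid, "x", 100, &dimid));
    CHK(nc_def_var(ncid, "v", NC_FLOAT, 1, &dimid, &varid));
    CHK(nc_def_var_chunking(ncid, varid, NC_CHUNKED, chunks));
    CHK(nc_def_var_filter(ncid, varid, 307, 1, &level));  /* any registered id is accepted */
    CHK(nc_put_var_float(ncid, varid, data));
    CHK(nc_close(ncid));
    return 0;
}
```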

These thoughts lead me to ask: what exactly is netCDF? Is the netCDF Enhanced Data Model, and its currently associated single persistence format, precisely defined to be anything a properly configured netCDF-C library will let you produce or consume (in addition to any unexpected side effects produced by the specific version of the HDF5 library in use, such as the under-the-hood change in the default superblock version)? In that case, none of this particular GitHub issue matters, as any other writer OR reader of data persisted in the current netCDF-4 format is, at best, a one-off that should be avoided for real work. Do users of these libraries get what they deserve? If so, let's warn them and make it clear (I'm certainly not advocating this position, but if that's it, it should be put in writing).

Where things really get hairy, and where these kinds of decisions regarding enhancements will have an incredibly widespread impact even for the netCDF-C developers, will be when additional persistence format options are added for the Enhanced Data Model (Zarr, for example). You can bet that once the way in which the netCDF Enhanced Data Model is persisted using Zarr is finalized, it will only be a matter of time until a full pure-Python read/write implementation emerges, as well as other implementations that do not rely on libnetcdf (currently the plan for netCDF-Java does not involve relying on the netCDF-C library for this). I also see the likelihood of a JavaScript-based full netCDF reader for the Zarr persistence format coming to life, and that will open up a whole new world of possibilities for getting data into the hands of decision makers, assuming the actual data chunks are readable. Once this happens, it will be critical that the development and enhancement of netCDF take a holistic view of the world, or draw a line in the sand, so to say.

When deciding how to advance netCDF, at least at the present time, there seems to be a tendency to think along the lines of “I think we should add feature X because we get it for free by using the HDF5 library”. When that feeds back into proposed changes to the Data Model, what that really says is, “I think we should let the features of a particular persistence format drive the data model”. If the only persistence format for the Enhanced Data Model is netCDF-4 (using HDF5), then you're good to go. But what if a feature from one persistence format ends up in the Data Model, yet isn't supported by another? Do we have a difference in Data Model features that can be persisted based on the choice of persistence format? Should netCDF be limited to only those formats that can persist the full data model? As I understand it, with Zarr as a persistence format as things currently stand, the answer is yes - the Zarr specification does not support all of the pieces of the Enhanced Data Model at the current time, so it will be an incomplete persistence format (at least initially). This has the potential to put users in a highly fragmented ecosystem. “Yeah, it's a netCDF-4 file, but it can only be read by the Python stack because they used the Python writer, which used X feature of Zarr, and the C Zarr reader does not support it yet.” One response would be “well, that's not a real netCDF file”, but that will only work for so long.

Without a language-agnostic way to move netCDF forward in innovative ways (data models and persistence layers alike), what is real, and how useful “real” actually is in practice, gets defined by the dominant writers...again, in many ways, just like GRIB. Already h5netcdf has the ability to write netCDF-4 files that are not readable by the C library, although not by default. Thankfully, the h5netcdf code removes the _NC_PROPERTIES attribute from those files, and will shout loudly at the user that they are doing something that is not "netCDF" (and in the future will throw an error), although it is allowed by using invalid_netcdf=True in the file creation API call. After all, they are still valid HDF5 files (there is a GRIB analog with that, but I'll leave it be for now). From the user's point of view, if they get a file with a .nc extension, and they use that library to read it in, they would never know the difference. And if they share that same file with someone who uses netCDF-C, the support question will end up here. Now these unsupported features might be perfectly nice to have, but how would the h5netcdf developers go about proposing them - as a PR to the netCDF-C library?

Suggested Path Forward

I don't believe that moving netCDF forward can be achieved in a header file or a PR against the netCDF-C repository alone. Given the current state of netCDF, I feel we are very much in need of something like a netCDF Enhancement Proposal framework, following the likes of Python Enhancement Proposals or Java Specification Requests (and, I would argue, their community-driven approval process as well) to prevent further fracturing. This can, and should, be an entirely separate discussion, but I'll throw it out there. However, it all depends on the question: what is netCDF?

Circling back

I suppose a good first place to start would be with the following question:

Is the netCDF Enhanced Data Model and its currently associated single persistence format precisely defined to be anything a specific netCDF-C library (specific to the set of configurations used with said library and dependencies) will let you produce or consume?

edwardhartnett commented 4 years ago

Sean, great discussion. I would like to start a separate issue about szip to discuss some of those specific points and how they bear on the software directly.

The answer to your thesis question is a (very carefully and thoughtfully crafted) sentence from the web page:

NetCDF (Network Common Data Form) is a set of software libraries and machine-independent data formats that support the creation, access, and sharing of array-oriented scientific data.

This is reflected in the backwards compatibility guarantee, which applies to both data files and C/Fortran code (not sure about Java - is backward compatibility fully supported in all releases? It was not when John was there - perhaps it is now).

Old files will always be readable. Old code will always continue to work and produce the same results.

This does not prevent, and never has prevented, new features from being added; it does imply that we have to add the new features everywhere.

These challenges do not just arise in the HDF5 format. The CDF5 example is instructive and a great model for progress.

CDF5 was driven by operational requirements. Pnetcdf is popular, so plenty of data producers would have used CDF5, whether or not Unidata kept up. The alternative to CDF5 is not CDF1/2, it is binary files, because the limits of CDF1/2 had already been exceeded by many science codes. CDF5 was a great advance, and making it part of netCDF was a great decision. It strengthened the ecosystem because the two projects collaborated to bring the benefits to all users.

Similarly, if netCDF/HDF5 was not available, other giant data producers would not be using CDF1/2 or even CDF5. Instead, they would have just switched everything to HDF5. Their decisions are driven by operational requirements. If it takes more programming to convert all their code to HDF5, they will hire another 100 contractors for a measly extra 10 million dollars.

Unidata users, on the other hand, would be in trouble. They just have to deal with what NASA, NOAA, and ESA are putting out. Because we wrap the benefits of HDF5 in netCDF in a way that does not violate our backward compatibility pledge, Unidata users can now use these data sets easily.

But now we are once again hitting limitations that were unthinkable 20 years ago. Never did I consider that one day, NOAA would be operationally attempting to compress and write files of 10s or even hundreds of GB. But here we are.

The answer is to move forward together, not to attempt to hold netCDF back. When a better solution becomes available within our ecosystem, and there's a strong need for it in the community, we must embrace it and provide it to netCDF users.

It is widely agreed that compression is an issue, and HDF5 has some well-developed solutions. Let's work together to bring those solutions to all our users. Otherwise large data producers will start producing important data files that cannot be read without extra filter installation efforts.

edwardhartnett commented 4 years ago

@lesserwhirls to specifically address the question of a single software stack to read HDF5 files.

Users currently have a choice between the simple binary format, and the much more complex (and evolving) HDF5 format.

Large data producers have clearly chosen HDF5, either directly, or through netCDF. Their operational requirements drive the decision. It's not a choice between HDF5 and simple netCDF classic. It's a choice between HDF5 and a similarly complex custom format. The complexity is not accidental - it's there for performance.

The HDF5 format is not secret or proprietary. It's open, as are the software libraries. As long as the implementation is open-source, that meets requirements at NASA, NOAA, etc.

DennisHeimbigner commented 4 years ago

We also need to take into account the netcdf API (as defined primarily by netcdf_xxxx.h files). Ever since DAP2 support was added, a goal has been to allow users to access other data formats as if they were (subsets of) the netcdf API. So in a sense, the API defines "what is netcdf". The nczarr implementation is a good case in point. It started with the existing zarr cloud format and extended it by attempting to make the minimal changes necessary to support the netcdf API. The result is a cloud format that is a superset of the zarr cloud format but designed so that existing zarr code can read the content as long as it ignores the nczarr extensions.
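(An illustrative sketch, added for clarity and not taken from any of the implementations mentioned: the same netCDF API calls work whether the path handed to nc_open() is a classic file, an HDF5-backed netCDF-4 file, a DAP URL, or an NCZarr store. The variable name "temperature" and the paths/URLs are hypothetical; the dispatch layer chooses the persistence format, while the application sees only the API.)

```c
/* Sketch: the netCDF API as the common surface over multiple persistence
 * layers.  The paths/URLs and the "temperature" variable are hypothetical. */
#include <stdio.h>
#include <netcdf.h>

static int print_first_value(const char *path) {
    int ncid, varid, status;
    size_t idx[1] = {0};
    float v;

    if ((status = nc_open(path, NC_NOWRITE, &ncid)) != NC_NOERR)
        return status;
    if ((status = nc_inq_varid(ncid, "temperature", &varid)) == NC_NOERR &&
        (status = nc_get_var1_float(ncid, varid, idx, &v)) == NC_NOERR)
        printf("%s: temperature[0] = %g\n", path, v);
    nc_close(ncid);
    return status;
}

int main(void) {
    /* Classic, netCDF-4/HDF5, and (with nczarr support built in) a zarr
     * store -- the calling code does not change. */
    print_first_value("classic.nc");
    print_first_value("enhanced.nc");
    print_first_value("file:///tmp/data.zarr#mode=nczarr,file");
    return 0;
}
```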

lesserwhirls commented 4 years ago

The answer is to move forward together, not to attempt to hold netCDF back.

100% agree. But I'm still trying to figure out what netCDF is, exactly, to understand what holding it back would be. For simplicity, we can limit our consideration of netCDF to the point of view of "netCDF-4 the file" (there is a Spaceballs meme in here somewhere). Is netCDF-4 the file only fully understood from the point of view of the C library?

The nczarr implementation is a good case in point. It started with the existing zarr cloud format and extended it by attempting to make the minimal changes necessary to support the netcdf API.

And similarly for the netCDF-4/HDF5 relationship? If so, maybe now I'm beginning to see my internal mismatch. I have viewed netCDF-4 files as well defined, standalone containers of data, and have been desiring a specification that includes a well defined subset of HDF5 file features utilized by netCDF-4 files, but that's not quite the right way of looking at things. In this case, netCDF-4 is any valid HDF5 file that follows a set of conventions with a few restrictions, and those conventions and restrictions are in place such that they allow HDF5 files to be read by the netCDF-C library (and associated wrappers)? That is, netcdf-4 isn't so much a file format based on HDF5, but rather an HDF5 convention?

edwardhartnett commented 4 years ago

@lesserwhirls you are correct about netCDF and HDF5.

NetCDF can read and understand almost all existing HDF5 files, and that is intentional. There are things now in the netCDF API (NC_STRING, for example) that are only there so that existing HDF5 files can be understood by netCDF. NetCDF-4 is a set of HDF5 conventions that allows us to express shared dimensions in HDF5. (Otherwise the models match pretty well and we need no conventions.) Even without those conventions in place, netCDF can do a pretty decent job of reading HDF5 files.

A big feature is that existing netCDF codes can be applied to existing HDF5 (and HDF4) data sets.
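(For the curious, a rough sketch of the shared-dimension convention described above, written against the raw HDF5 dimension-scales API. This is my simplification, not a complete netCDF-4 writer; real netCDF-4 files carry additional bookkeeping attributes such as _Netcdf4Dimid.)

```c
/* Sketch of the netCDF-4 shared-dimension idea in raw HDF5: a dimension is
 * an HDF5 dataset marked as a dimension scale, and variables attach to it.
 * Real netCDF-4 files carry additional attributes beyond what is shown. */
#include <hdf5.h>
#include <hdf5_hl.h>

int main(void) {
    hsize_t n = 10;
    hid_t file  = H5Fcreate("conv.nc", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
    hid_t space = H5Screate_simple(1, &n, NULL);

    /* The "x" dataset plays the role of a shared netCDF dimension. */
    hid_t xdim = H5Dcreate2(file, "x", H5T_NATIVE_FLOAT, space,
                            H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
    H5DSset_scale(xdim, "x");

    /* A data variable defined on that dimension attaches the scale to axis 0. */
    hid_t var = H5Dcreate2(file, "temperature", H5T_NATIVE_FLOAT, space,
                           H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
    H5DSattach_scale(var, xdim, 0);

    H5Dclose(var);
    H5Dclose(xdim);
    H5Sclose(space);
    H5Fclose(file);
    return 0;
}
```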

WardF commented 4 years ago

Thanks for starting this discussion @lesserwhirls! One thing to keep in mind as we move forward is the work we are doing to uncouple the data model from data storage. The Zarr/object storage access work is a great example of this, and of our need to address community concerns that were unthinkable a decade ago (as @edhartnett pointed out in a different thread, much has changed!). The idea that a particular data model requires a specific storage format is no longer necessarily true, and this project will provide a roadmap for adding additional storage options in the future. This isn't to say we are dropping HDF5 support, but rather reflects the broader needs of the community, which wants alternatives to the storage option(s) currently offered by netCDF. So what is netCDF in this context? It is the data model, not the underlying storage format. This will also let users select the storage format that works best for their purposes, be it HPC applications, cloud access, local storage, etc. It will also let us select a subset of functionality offered by underlying libraries/storage formats, as dictated by the needs and expressed desires of our community. There's no need to try to implement the kitchen, bathroom and utility sinks when our community has asked only for the one in the kitchen!

This also means we have to make sure we think carefully about implementation details, as @lesserwhirls points out. It is relatively easy to focus on the C library (and I've been guilty of this in the past) without giving due consideration to the C++/Fortran interfaces, or considering what the impact will be on projects like netCDF-Java, netcdf4python, NCO, etc. This risks creating a schism between user bases that might not be quickly resolved. While there are certainly situations where this could be acceptable, we need to maintain a measured approach and make sure we think it through for even relatively small changes. I have nightmares about the unintended consequences of making a small change with negative effects, as it ripples through the netCDF ecosystem. Or I think about the 1.10.0 HDF5 release, and how for a few days we had 'netCDF4' files that only some netCDF4-capable software would be able to read. Fortunately, we were able to knock out the changes on our end pretty quickly, but that was a pretty tense time.

Regarding the Suggested Path Forward, it bears some thinking about, and I imagine it will come up at the netCDF team meeting tomorrow. Let me think about this more; I'll also follow up here as I organize my thoughts. Taking a break from preparing EGU materials, and thought I'd respond to this thread in the meantime. Hooray for drafting abstracts ;).

Dave-Allured commented 4 years ago

@lesserwhirls, I also thank you for starting this discussion. I think file compatibility needs to focus on policies. Other conversations were bouncing around between policy, implementation details, and other amazing things, partly with my help.

What is netcdf? I agree with the overarching description that Ed quoted from the User Guide. However, you asked to focus on persistence formats, which I think means physical file formats. So for discussion, I offer a deliberately broadened definition to accommodate variations within reason. I am avoiding opaque storage through translation layers, such as opendap and cloud; it seems those are different topics.

A netcdf file format may be considered to be one that approximately conforms to the limited set of published and recognized specifications of known netcdf file types. Currently I see two official types: the netcdf-3 family, and the netcdf-4 family as an overlay or wrapper on HDF5 format. As noted, a third type, nczarr, may be up and coming.

By "approximately conforming" I mean that internal structural conventions are the same or similar to the official known specs. Examples of variants for netcdf-4 would be a new filter, multiple filters, or a newer HDF5 internal object version. An example for netcdf-3 would be adding new data types, as was done for CDF5/pnetcdf.

For "approximately conforming", file identification and the overall internal file structure would remain the same as in official versions. The key here is that a new construct inside the file would not make such file "not netcdf" under this broad definition. I would call it "a netcdf file with a new object type" or a new filter type, or whatever. But if it looked and smelled like netcdf, and the "file" command said netcdf or HDF5, and hackers -- er, I mean smart people -- could drill and parse through it like netcdf, I would still call it "netcdf" with a variation. But lines can appropriately be drawn between "official" versus "unofficial" or "experimental" netcdf.

Unidata is concerned with file compatibility for the broad user community. This is good. As a way forward, it would help to think about this in terms of tiers of support or sanctioning for existing and future format variants. This reflects some naturally competing purposes, such as broad format compatibility versus high performance. Here is a suggested list to start.

The lines are fuzzy. The first category would be considered best for universal compatibility. The next two might be candidates for high performance. The last would be unsanctioned variants to be used at your own risk. Good variants would be promoted when they meet some degree of approval, for example full szip support. Variants might also be distinguished here, by subsets of a data model, such as CDM versus Extended.

Please consider this as one way to manage variants, evolution, sanctioning, and diverse needs.

lesserwhirls commented 4 years ago

Excellent discussion all! I have a feeling this thread will be useful far into the future.

From what I gather at this point, a netCDF-4 file is not something that can be understood apart from the specific version of the HDF5 C library used to produce it. That is to say, it's not enough to say "it's a netCDF-4 file, open it up, and get your science on!" That's a very different view from the one I previously had of what it meant to be a netCDF file. With the other netCDF formats, and certainly during my time in graduate school, it felt sufficient to say "the data are in netCDF" and we knew we could access the data and do the work we needed to do (metadata issues aside), using any number of libraries or tools (built upon the netCDF-C stack or not).

Circling back, with the scope of the question "what is netCDF" limited to netCDF-4 files (I promise we will return to that larger question), is it both necessary and sufficient to define a netCDF-4 file as any valid HDF5 file that conforms to these conventions?

lesserwhirls commented 4 years ago

I should have refreshed before submitting, as I just saw @Dave-Allured's excellent reply. I completely agree that we need a way to mark experimental formats; otherwise there is no moving forward without breakage. This would be a great candidate for the first netCDF Enhancement Proposal. I believe if we had a netCDF Enhancement Proposal framework, we would only need two (maybe three) categories for data formats:

We'd need to discuss what it would mean to be accepted as a format through the netCDF Enhancement Proposal framework. What would it take to, say, move from the third to the second category? I think part of that acceptance should include at least two independent implementations going forward (which ncZarr would have pretty quickly between the Python and C implementations). That said, I think we should be very cautious, and very precise, when adding new formats.

edwardhartnett commented 4 years ago

@lesserwhirls It is quite easy to say "it's a netCDF-4 file, open it up, and get your science on!" That's what thousands of users do every day. There are no exceptions or conditions - netCDF guarantees full backward compatibility. A scientist who wrote data files with netcdf-c-4.0 can still open them with netcdf-c-4.7.3. And the same is true for HDF5 files.

If anyone wants to fund an independent implementation of HDF5, they may. John Caron, working alone, wrote one. But no one else sees the need. Even having an independent implementation, you guys do not develop it to keep up with changes in HDF5.

A fully independent implementation is a new requirement which, IMO, is not a wise requirement.

NetCDF-4 would not have met this requirement, and still does not, although it is extremely helpful to scientists around the world.

dopplershift commented 4 years ago

@edhartnett The existence of h5netcdf would seem to disprove your point. This means there are two additional implementations of support for HDF5-based netCDF beyond netCDF-C: h5netcdf (reading and writing) and netCDF-Java (read-only). In the days of netCDF-3, I know of three independent implementations within Python alone. When Zarr becomes a supported format, I can promise you that the use of this format will be based around an independent Python implementation, not netCDF-C.

So I would argue that the idea that the world doesn't want independent implementations of netCDF is demonstrably false. In fact, I would say that the complicated on-disk format of HDF5 may have reduced the feasibility of independent implementations, but it has made them no less desirable. Yes, this makes our world more complicated, but that's just too bad.

Also, I’m not sure how you have an “archivable” data format, without a full, software-independent specification. And I’m not sure how you validate this specification without having independent implementations of this specification.

edwardhartnett commented 4 years ago

HDF5 does not seem to have an independent implementation. It would be great if it did, but it does not (as far as I am aware).

Despite this, there already are tremendously important Earth science archives of HDF5 data. These will continue to grow.

John Caron developed a full independent implementation of HDF5 in less than a year, so I guess you can have an independent implementation if you want one...

Sorry I don't have any help to offer here. Although this would no doubt benefit all netCDF users, I am particularly concerned with operational forecasting efforts, which are all about C and Fortran. ;-)

lesserwhirls commented 4 years ago

I am particularly concerned with operational forecasting efforts, which are all about C and Fortran. ;-)

To get data out of a numerical model and onto disk, yes. But, take one step further. Send those files out to forecast offices where they will be ingested into an EDEX server for access via AWIPS CAVE. From what I can see, EDEX uses netCDF-Java to decode the data.

Even having an independent implementation, you guys do not develop it to keep up with changes in HDF5.

As a team with an effective FTE of less than one (and I am grateful that we have that, as other implementations are not so fortunate!), my priorities for HDF5 support in netCDF-Java are to maintain, at a minimum, what is needed to read netCDF-4. Second to that, we will try to address missing features in our HDF5 support as our users encounter them. The issue with that approach is that, as far as I can tell, there is no way to separate out which features of HDF5 are needed to read netCDF-4, outside of supporting read for everything enabled by the choice of superblock used when writing (because the netCDF-4 on-disk spec punts to HDF5). NetCDF-Java currently supports features up to and including superblock 2.
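(For reference, a small sketch, assuming HDF5 1.10 or later, of how one can check which superblock version a given netCDF-4/HDF5 file actually uses - roughly the knob that determines how much of HDF5 a reader like netCDF-Java has to handle.)

```c
/* Sketch: report the HDF5 superblock version of a file (HDF5 >= 1.10). */
#include <stdio.h>
#include <hdf5.h>

int main(int argc, char **argv) {
    if (argc < 2) {
        fprintf(stderr, "usage: %s file.nc\n", argv[0]);
        return 1;
    }
    hid_t file = H5Fopen(argv[1], H5F_ACC_RDONLY, H5P_DEFAULT);
    if (file < 0) {
        fprintf(stderr, "could not open %s as HDF5\n", argv[1]);
        return 1;
    }
    H5F_info2_t info;
    if (H5Fget_info2(file, &info) >= 0)
        printf("%s: superblock version %u\n", argv[1], info.super.version);
    H5Fclose(file);
    return 0;
}
```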

edwardhartnett commented 4 years ago

When first released, netCDF was completely incompatible with all other Unidata software and ways of doing things. Fortunately, at that time, Russ did not tell Glenn it should not be done. ;-)

When I wrote netCDF-4, no committee ruled out compression because it was not present in other implementations. And that's good, because compression has turned out to be one of the most popular features of netCDF-4.

No one prevented Dennis from adding new features to the DAP code. Russ never told John that he should not add IOSPs to netCDF-Java, though they certainly violate the philosophy of this "What is netCDF" thread.

And so we have the modern netCDF.

Russ did good work not forming committees and not telling us what software to not write! That was first rate technical leadership.

I trust that kind of thinking still has a place at Unidata and on the netCDF project. The cautions raised in this thread are valid and need to be considered. But these cautions should not be allowed to stop all new features. They should be weighed with each new idea, but not weighed so heavily that every idea must sink.

New development will not stop. If you attempt to dam the flow, you will simply isolate netCDF, and the features will be developed elsewhere. As the years go by and the great new features are added to other packages, then users will gradually switch away from netCDF, just as the giant data producers would have already switched away if netCDF-4 had not provided the new features they needed before they built the giant software systems that now depend on netCDF.

What is netCDF? It is a bit of a chameleon, changing to suit user needs in different eras. Historically it has grown disruptively, but with backward compatibility. Each advance has been a big leap, rather than an incremental improvement.

No one knows what great new features may be proposed tomorrow, and be considered an essential part of netCDF 10 years from now, just like all the features I've listed above. No doubt these new features will cause disruption and extra work. It's our role and privilege to cope with that work, and I'm quite confident we can continue to do so.

I hope and believe that working together is the best way to meet and surmount these difficulties, and ensure that the science world receives the most benefit and best I/O features from netCDF. I hope that netCDF can continue to incorporate the best ideas from its active and energetic user base, and to remember that we do not have to have had an idea first for it to be a good one.

dopplershift commented 4 years ago

@edhartnett I appreciate your fervor and enthusiasm for seeing netCDF advance and continue to serve its user base, as well as your continued contributions to see its use expand, especially for NOAA. I feel your chosen examples of netCDF history miss a few things:

netCDF-4 has been so successful that other users are figuring out how to create netCDF-4 formatted data without using the netCDF-C library. The problem is that, due to the lack of a netCDF-4 specification, users can create files that can reasonably be considered valid netCDF-4 files but cannot be read by any of the predominant, Unidata-based netCDF library implementations. Do we want netCDF-4 files in the wild that require people to update their libraries to read them? Or conversely, should there be valid netCDF-4 files that aren't readable on e.g. CentOS 6? The best-case scenario here is that we end up with odd bug reports. The worst case is that we end up like GRIB, with a plethora of variants of the netCDF-4 format. How is that serving the best interests of the netCDF community?

I'm not proposing blocking any or all advancement. What I am proposing is we need a document that defines what makes a valid netCDF-4 file. Zarr has a spec, ncZarr has (or will have) a spec. netCDF-4, the file format, needs a spec--one that defines it completely, including what features from HDF5 are allowable. That document should then be updated through some process to allow an orderly advancement that doesn't create confusion for our users.

edwardhartnett commented 4 years ago

@dopplershift the good news is the document you propose for netCDF-4 has already been written. It is hosted by NASA here: https://earthdata.nasa.gov/esdis/eso/standards-and-references/netcdf-4hdf5-file-format, the standard document is here: https://cdn.earthdata.nasa.gov/conduit/upload/497/ESDS-RFC-022v1.pdf. A completely up to date explanation of the performance enhancements I added to this can be found in my 2018 AMS paper "NETCDF-4 PERFORMANCE IMPROVEMENTS OPENING COMPLEX DATA FILES" found here: https://ams.confex.com/ams/2019Annual/webprogram/Manuscript/Paper350021/NetCDF-4_Performance_2018.pdf

I will leave it to you guys to work up similar documents on the Zarr format. I agree that you have some documentation work ahead of you there.

It seems unlikely that anyone is having trouble producing a netCDF-compatible file with just HDF5. NetCDF-4 can read almost every HDF5 file without modification, so creating one that netCDF cannot read takes some real effort. Anyone who is having trouble creating netCDF-4 files without the netcdf-c library should contact me with any questions or problems.

As always I welcome contributions and help with netCDF and related packages. Please let me know if there is anything I can do to help you in your understanding and documenting what has already been achieved with netCDF.

And there is much more to come!

I hope we will have some good new compression options available from the CCR this year. Hopefully this will bring some long-needed new compression algorithms to the community. Interested users should follow progress here: https://github.com/ccr/ccr

The PIO library is also gaining users, providing the scalable HPC capabilities that netcdf-c lacks. PIO combines all known netCDF flavors, allowing the user to transparently switch between netCDF classic, netCDF/HDF5, and parallel-netcdf. The upcoming PIO release will include more compression and better integration with netCDF, allowing existing netCDF C and Fortran code to use the PIO library with only minor modifications. For more on PIO see: https://github.com/NCAR/ParallelIO.

There are many exciting developments in the netCDF world!

edwardhartnett commented 3 years ago

CCR has done a 1.1.0 release, which includes:

See https://github.com/ccr/ccr for more info or to get the release.

Here's why these will be popular (figures from a poster Charlie and I are presenting at AMS in 2 weeks):

[Two benchmark figures from the AMS poster omitted.]

Some things to note:

Why do these matter?

The final size matters for long-term archival purposes. Many NASA missions, for example, use AWS S3 for backup storage. All their data gets uploaded to S3 to meet NASA requirements for an off-site data backup. Reducing that by 50% would save 50% of the money that AWS charges to the project. Since the data is backup only, slow read/write time is acceptable.

In other cases, it's the speed of write/read that matters. NOAA produces the massive UFS output files using parallel I/O with zlib, and by doing so they meet their operational time budget. But when resolution increases again, they may overrun their time budget to write the data. That would be very challenging to fix by reducing the stored data, but Zstandard gives them a drop-in replacement which will be 10 times faster. That will give them plenty of headroom for further resolution increases.

There has been some trepidation expressed by some at Unidata with respect to these new compression options. I suggest a pro-active approach which supports the needs of data producers to cope with the ever-increasing size of netCDF datasets.

I will conclude with this sobering graph from NASA: [figure omitted]

Faster and better compression is needed. The CCR project provides it, but netCDF-Java and netCDF on Windows are going to need help keeping up...

dopplershift commented 3 years ago

@edhartnett Really impressive work, and it's a huge :+1: from me to improve the quality of what netCDF offers its community. My only trepidation has been (what seems to me to be) a laissez-faire attitude about including such features in the library. In my opinion, features like this need to be thoughtfully considered as an addition to a netCDF standard--including a risk analysis with regard to implementation and support. Then, and only then, should such features enter into netCDF-C (with support for all already-supported platforms) and netCDF-Java.

This does not inhibit proactive features (just look at what's going on with the team's zarr work), but does require a methodical approach with all the time penalties that result. That IMO is our duty as stewards of such an important part of the scientific software stack.

DennisHeimbigner commented 3 years ago

I too would like some clarification. I have been assuming that CCR would provide the source (and possibly some binaries) for these compressors. Then users would put the binaries in their HDF5_PLUGIN_PATH directory so that netcdf could make use of them. However, I am not sure I understand where, for example, the implementation of nc_def_var_bzip2 resides; can you elaborate? Also, I would prefer a different name for these functions. How about something like nccr_def_var_bzip2?

edwardhartnett commented 3 years ago

@DennisHeimbigner the nc_def_var_bzip2() function is in libccr, which must also be linked to the user application. The CCR also provides the source for the plugins, which are built and installed in HDF5_PLUGIN_PATH as you note. If the user wants to use nc_def_var_filter() directly, then they don't need to link to the CCR library and they don't need nc_def_var_bzip2(). CCR installs the plugins in the normal way, and they can be used by HDF5 without CCR. The CCR function names are designed to fit well with the netCDF API. Since we've already done two releases, it's a bit late to change the names.
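(To make the layering concrete, a hedged sketch follows. Only the existence of nc_def_var_bzip2() and its location in libccr are stated above; the exact signature and the ccr.h header name are my assumptions.)

```c
/* Sketch of the two paths described above.  The convenience wrapper comes
 * from libccr (header name and exact signature assumed here); the plugin
 * installed on HDF5_PLUGIN_PATH does the actual chunk (de)compression and
 * is all a plain nc_def_var_filter() caller needs. */
#include <netcdf.h>
#include <ccr.h>   /* assumed CCR header; link with -lccr -lnetcdf */

int write_with_ccr(int ncid, int varid) {
    /* Convenience path via libccr (assumed signature: ncid, varid, level). */
    return nc_def_var_bzip2(ncid, varid, 9);
}

int write_without_ccr(int ncid, int varid) {
    /* Plugin-only path: bzip2 is registered as HDF5 filter id 307. */
    unsigned int level = 9;
    return nc_def_var_filter(ncid, varid, 307, 1, &level);
}
```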

@dopplershift I'm sorry I can't take the time to help netcdf-java, but there's only one of me. ;-) I understand you are getting some help from John Caron again. That's great news, and I strongly suggest you start taking a look at Zstandard compression in Java. According to this post, there is a complete Java port: https://stackoverflow.com/questions/60974402/lz4-and-zstd-for-java.

I believe that the CCR obviates the need and desire to put these directly into the netcdf-c library, which was my original suggestion, so it seems that's no longer a concern. Since filters are already accommodated by both netCDF and HDF5, we are all good to go. Much thanks to Dennis for all his hard work getting filters, and multiple filters, to work well, and, of course, to the hard working programmers at the HDF Group.

lesserwhirls commented 3 years ago

Echoing @dopplershift, the results you have shown in your post are great! However, this is about more than netCDF-Java (I have that covered), as you have suggested that netCDF-C on Windows will need help "keeping up" as well. And even then, this is about more than compression. The trepidation from my end is about locking out entire portions of the ecosystem without even having a discussion or a plan. Now that I have a list of three specific filters (bzip2 compression filter, BitGroom lossy compression filter, Zstandard compression filter - none of which appear to be show-stoppers), I have a shot at getting them into netCDF-Java...but what about the Windows support on the C side? What if there isn't a straightforward path for Windows support without major work? Was that topic discussed somewhere? Again, echoing @dopplershift, this is our duty as stewards of such an important part of the scientific software stack.

edwardhartnett commented 3 years ago

@lesserwhirls and @dopplershift I'm delighted with your sense of stewardship towards this vital software stack, into which Glenn, Russ, Steve, John, myself, Dennis, and Ward, and many others have poured much time, effort and care. Let's continue to work together to preserve and enhance that legacy.

Not sure what you mean by not having a discussion or plan. Seems like we've talked this topic a lot. This issue was started almost exactly a year ago! However, I'm always happy to answer questions about my work, so we can keep talking it out as long as you need to. We have a poster and paper for the AMS next week, which further discusses CCR in detail.

In terms of planning, CCR was proposed by Charlie in 2019. We told everyone our plans on the CCR website, a year ago, and we have been doing what we said we would do. Seems like we have a good plan and are following it well.

A Windows port of CCR and all its filters is on the list of things to do in the New Year. I'll start with a CMake build and see how far that takes me. I agree that a Windows port is a requirement.

lesserwhirls commented 3 years ago

Not sure what you mean by not having a discussion or plan. Seems like we've talked this topic a lot. This issue was started almost exactly a year ago!

As I said, this is about more than compression (and by this, I mean this specific github issue). It just happens to be the case that discussions around compression led me to create this issue. My trepidation is more general than the question of new compression methods being potentially added to the netCDF-C library (e.g. bumping HDF5 H5F_LIBVER_LATEST versions, adding new persistence layer formats, etc.). My comment of "locking out entire portions of the ecosystem without even having a discussion or a plan" was meant to be more general. The CCR work is great, and definitely solves a problem. My concerns are not at all directed at the work that has been and is being done there. CCR intentions and plans have been communicated, although sometimes lost in the background signal of all of the implementation specific related issues on this repository.

A Windows port of CCR and all its filters is on the list of things to do in the New Year. I'll start with a CMake build and see how far that takes me. I agree that a Windows port is a requirement.

Excellent news! It was not clear that the work was planned, and the CCR repo didn't mention it until 20 minutes ago (and thus my last question asking if that topic, more specifically, was discussed somewhere).

DennisHeimbigner commented 3 years ago

The issue for me is a namespace issue. Since C does not have C++-style namespaces, it is important to be cognizant of what names you give to functions. The nc_def_var_XXX names technically intrude on the netcdf-c namespace, so it makes me unhappy.

Re the Java issue. It is possible to use these compressors via the JNA implementation. But assuming that in many cases you took the compressor implementation from elsewhere, you might see if those same sources by any chance have Java implementations and publish a pointer to any that you find.

edwardhartnett commented 2 years ago

This was an interesting discussion, but is not really a netcdf-c issue. I recommend this issue be closed, or moved to the discussion section of GitHub.

dopplershift commented 2 years ago

The issue could be transferred to Unidata/netcdf if @WardF agrees.

WardF commented 2 years ago

Agreed.