NOAA-EMC / NCEPLIBS-bufr

The NCEPLIBS-bufr library contains routines and utilities for working with the WMO BUFR format.

Proposed new API for NCEPLIBS-bufr #125

Closed. edwardhartnett closed this issue 2 years ago.

edwardhartnett commented 3 years ago

From @aerorahul :

Hi Jeff,

Ron McLaren is developing an interface to read BUFR data (NCEP, WMO) into the JEDI IODA system as part of the ObsProc Reengineering effort, based on some simple configuration in a YAML file. We started by writing C++ interfaces around NCEPLIBS-bufr calls, and were able to make fast progress on reading simple BUFR files (files with non-repeated data, and data contained in fixed repetition sequences). The next stage of this project will involve expanding this idea to support much more elaborate BUFR files (delayed repeated sequences that might be arbitrarily nested). Eventually we hope to support any BUFR file, regardless of construction. The goal is for the user to be able to construct a mapping with minimal knowledge of BUFR.

Unfortunately NCEPLIBS-bufr is not well suited for this task. The root problem is that NCEPLIBS-bufr is not well encapsulated. The user-facing interface expects an inappropriately detailed level of understanding of the internal data structure and has an elaborate set of data retrieval calls that must be applied correctly under specialized circumstances (a minimal sketch of a typical call sequence is included after this email). Added to this, there is an elaborate procedure required for precomputing the number of repetitions for sequences, and this method may not work for all scenarios (especially data retrieved via the ufbevn call). In inspecting user-generated code based on NCEPLIBS-bufr, we found that users mostly don't precompute the expected data, but rather estimate these numbers, which is rather dangerous. Many users we've talked to seem to be very uncomfortable with their understanding of BUFR and the nuances of NCEPLIBS-bufr. We believe it's possible to create a simpler, safer, more modern interface that will be easier to maintain in the long run.

None of this work is expected to impact what is in operations currently, and your current work and focus are not for nothing. This is the reason to have this meeting at such an early stage: so we can identify the areas that need future-proofing, come up with a targeted approach for modernizing the code base, and work together.

Thanks, Rahul
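
For context, here is a minimal sketch of the kind of call sequence the email is describing, using the library's documented routines (openbf, ireadmg, ireadsb, ufbint, closbf). The file name, unit number, array sizes, and mnemonic string are placeholders, not taken from any particular application.

```fortran
! Minimal sketch of a typical NCEPLIBS-bufr read loop (placeholders throughout).
program read_sketch
  implicit none
  integer, parameter :: lunit = 11, mxmn = 4, mxlv = 255
  real(8) :: arr(mxmn, mxlv)
  integer :: ireadmg, ireadsb, idate, iret
  character(len=8) :: subset

  open (lunit, file='obs.bufr', form='unformatted')
  call openbf(lunit, 'IN', lunit)        ! BUFR tables embedded in the file itself

  do while (ireadmg(lunit, subset, idate) == 0)   ! loop over messages
    do while (ireadsb(lunit) == 0)                ! loop over subsets
      ! The caller must already know which mnemonics exist in this subset and
      ! how many levels to expect; iret returns the number of levels read.
      call ufbint(lunit, arr, mxmn, mxlv, iret, 'CLAT CLON PRLC TMDB')
    end do
  end do

  call closbf(lunit)
end program read_sketch
```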

edwardhartnett commented 3 years ago

Transferring this conversation here from email. @aerorahul are you suggesting a completely new set of code? Or some wrapper code for the existing library?

aerorahul commented 3 years ago

@edwardhartnett We won't be developing any new API or software in this repository, so a discussion here is moot. A completely new set of code designed from the ground up is what is needed, as will be demonstrated in the discussions.

edwardhartnett commented 3 years ago

You will develop and maintain this new code forever? Or you will develop it and hand it off to us to maintain?

aerorahul commented 3 years ago

Obviously we are hoping to co-develop, and hence the need for discussions before we take any action.

edwardhartnett commented 3 years ago

And for maintenance?

aerorahul commented 3 years ago

If we co-develop, we co-maintain.

edwardhartnett commented 3 years ago

In terms of requirements, I see you were in C++. Is that your requirement? Or is this for Fortran code? Or something else?

In other words, what is the language of the using applications?

aerorahul commented 3 years ago

We would write the code in C++, and create Fortran and Python APIs. The JEDI application using this code will primarily be C++. But I see use for other applications that might want a Fortran or Python interface.

edwardhartnett commented 3 years ago

So basically proposing a new bufrlib, written in C++. Would this be intended to replace NCEPLIBS-bufr? Would it support the existing Fortran API?

I didn't know you guys were using C++ on JEDI.

Normally I would think C would be the natural choice for a library such as this.

aerorahul commented 3 years ago

Like I said, we are looking forward, not at existing applications.

We can take this another way, where we start filing issues here describing the needs and interfaces, and someone from this side will have to develop and maintain them. If that is preferred, I am sure we can repoint.

aerorahul commented 3 years ago

The other alternative is to explore eccodes and see if we can co-develop with them. eccodes is not without issues.

edwardhartnett commented 3 years ago

I'm just trying to get my head around the proposal. ;-) Do you have a slideshow or anything?

Looking forward is good, but backward compatibility is also desirable. We don't want to have to maintain two completely different bufrlibs, right? That would be expensive. If there is to be a rewrite, then it would be great if it could handle existing functionality, so that existing code can be retired, and we end up with only one BUFR library to maintain.

However, rewrites are notorious. It always looks easy, but there's 36 KLOC in the existing library. Seems like that's many months of effort.

I'm reading your original email and I see some problems you have outlined with the existing Fortran API. What if we solved these problems with the existing NCEPLIBS-bufr? That would not be as fun, perhaps, but maybe a lot cheaper for NOAA?

We do also have approval to use eccodes; I would think that would be preferable to a full rewrite. What functionality do you need that is not already in eccodes?

Another thought is that it has long been my desire to make the netcdf-c library understand BUFR, as it now understands HDF4 and HDF5. Then, we could just use the netcdf APIs to read and write BUFR files. However that would have to be in C, not C++. Would you be interested in that? It would have community-wide benefits.

Can you show me your existing C++ code so I can get an idea of what you are aiming at? I would love to see a sample of how the new API will work. For example, how will it open, explore, and read a bufr file? My own understanding of BUFR is very weak, so I welcome the chance to learn more about it.

aerorahul commented 3 years ago

@edwardhartnett We have a presentation ready to show. That was the point of today's meeting. We can revisit this discussion when the group decides they want to.

jbathegit commented 3 years ago

I'm still trying to get my head around this as well, and I was hoping to learn more at today's meeting about what exactly you're proposing, and why the current library couldn't meet your needs even with a few possible tweaks and/or additional API functionality.

Bottom line is that I would caution you not to underestimate the intricacies of the BUFR format itself, nor the amount of work that @jack-woollen and I have put into this library over many years to meet the needs of the user community, both within NCEP and beyond. In other words, don't underestimate how much work it would be to build an entirely new BUFR library from scratch.

rmclaren commented 3 years ago

@edwardhartnett I do have a presentation I want to give to you guys (I was going to do it today), but I wanted to go through it with you in person so I can adequately explain things (otherwise it's just a bunch of slides without much context). It would also give you a chance to interject if there is a serious flaw in my thinking (I'm new to BUFR as well, but I've been studying it and am getting pretty comfortable with the low-level details). So when you guys have a chance, I'd be glad to talk.

aerorahul commented 3 years ago

All points are fair, and we don't take this lightly at all. Also, note that we have not started working on anything new; we have been making the necessary changes and adding interfaces to the existing library. We have been asking for help and you have been generously providing it. We have been exploring and identified some issues which, again, were supposed to be discussed at the meeting.

This was a meeting to make sure we haven't misread or misunderstood the library, or whether there are better ways, etc. I am not sure why we are getting defensive and worked up when nothing has really been decided and we haven't even talked yet. Let's meet and see what @rmclaren has to show.
After that, if you think the BUFR library is fine as is and easily "developable" for today's needs, we can go back to work.

edwardhartnett commented 3 years ago

OK, let's schedule the meeting for as soon as we can and I look forward to learning more about this.

edwardhartnett commented 3 years ago

OK, we had the meeting and it was very interesting. @rmclaren if you want to attach the presentation you gave to this issue, that would be helpful.

Some comments about the meeting from @jack-woollen :

Hi

My main takeaways from the meeting yesterday are below. My overall impression is that these motivations are entirely in keeping with the past upgrades in NMC/NCEP observation processing starting from 1962. The GDAS (or prior equivalents), for example, used ON20 data for 10 years, then switched to ON29/124 data for ~20 years, and from 1993 until now has used BUFR data, going on 30 years. A major software development, at least, is probably about due.

1) Thinking in the 5-year time frame, Rahul's group does not want to maintain the 30-year-old Fortran bufrlib, but rather would build a C++ bufrlib from the ground up.
2) A requirement for the C++BL would be backward compatibility with the F90BL. Presumably this means C++BL could read F90BL files, and vice versa.
3) Folks who are familiar with the F90BL will be needed to work with C++ programmers to make this happen.
4) No real planning to manage or implement this development was discussed.

Please chime in with any themes I might have misstated or left out.

Thanks Jack

A reply from @jbathegit :

Thanks for the summary Jack, and I'm adding Arun to the discussion.

To clarify for those who weren't on yesterday's call, the underlying point in Jack's opening paragraph is that, at some point in the future, we may want to consider using something other than BUFR for our internal tanking/databasing and archiving format. We would still need to have a BUFR library for the foreseeable future, because we're going to continue to receive incoming BUFR reports from the outside world. So we're always going to need to be able to read and process BUFR data from external sources, but that doesn't mean that we also need to continue to use BUFR as our internal storage format.

As Jack noted, that would be a major software development, and it would touch on all aspects of data ingest and preprocessing, so it would need a lot of careful thought by all of us including management.

edwardhartnett commented 3 years ago

I propose an alternative plan:

@rmclaren what this would look like is we would jointly develop a nice C API, which does not have the advanced capabilities you envision, but serves up the metadata in a basic, but complete, way. On top of that, you add your additional C++ functionality.

All language APIs would be in the same repo, and released together. So the user would download NCEPLIBS-bufr and get a (legacy) F77 API, a trendy Python API, a shiny new F90 API, a serviceable C API, and an advanced C++ API.

Proposed New APIs for NCEPLIBS-bufr

This would allow us to leverage existing code, and still provide all the advanced APIs anyone could desire for NCEPLIBS-bufr.

This is not actually a lot of work. Writing an F90 API should be straightforward. The C API is almost automatic, if we confine ourselves to simple Fortran types in the F90 API. With a decent C API, adding the proposed advanced C++ capabilities will be much quicker than a rewrite of the entire library.
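
As a rough illustration of that point, a thin F90 wrapper restricted to interoperable types can be exposed to C almost for free via iso_c_binding. The module name, routine name, and argument list below are hypothetical, not part of any existing NCEPLIBS-bufr API; the sketch simply forwards to ufbint.

```fortran
! Hypothetical sketch only: a thin bind(C) wrapper over the existing Fortran call.
module bufr_c_shim_sketch
  use iso_c_binding, only: c_int, c_double, c_char, c_null_char
  implicit none
contains
  subroutine bufr_read_values(lunit, mnemonics, values, nmn, nlv, nret) bind(C)
    integer(c_int), value, intent(in)  :: lunit, nmn, nlv
    character(kind=c_char), intent(in) :: mnemonics(*)   ! NUL-terminated C string
    real(c_double), intent(out)        :: values(nmn, nlv)
    integer(c_int), intent(out)        :: nret
    character(len=80) :: str
    integer :: i, iret
    ! Copy the NUL-terminated C string into a Fortran character variable.
    str = ' '
    do i = 1, len(str)
      if (mnemonics(i) == c_null_char) exit
      str(i:i) = mnemonics(i)
    end do
    ! Forward to the existing library; only simple, interoperable types cross the boundary.
    call ufbint(lunit, values, nmn, nlv, iret, trim(str))
    nret = iret
  end subroutine bufr_read_values
end module bufr_c_shim_sketch
```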

Should the appetite for a rewrite remain, then a future iteration could involve hollowing out the F77 library, piecemeal, and replacing it with C or C++, all while keeping the core library, and all language APIs, working.

rmclaren commented 3 years ago

So I originally started down this path, but I backtracked when I started to realize that the interface layer I was designing around the NCEPLIBS-bufr interface was more complex than it would be if I just went straight to the data and read it directly. The thing is, to make what I want to do work, I really have to have intimate knowledge of the structure of the data, basically down to the level of the descriptors. The changes/additions to NCEPLIBS-bufr would end up being quite extensive, and I feel they would take as much time as, or more time than, just taking a direct approach. The other part of this is that taking the direct approach would lead to a base of software that I feel would be easier to maintain by a wider audience of developers, and more malleable to future changes.

Toward that end I've started an exploratory implementation of BUFR in C++ to try to get at the data directly. If Jack and Jeff are correct, and reading BUFR data correctly turns out to be horrendously complicated, then your suggestion could be the fallback position.

edwardhartnett commented 3 years ago

For about the last 20 years, the consensus in the agile community has been that rewrites from scratch are bad. Here's a seminal blog post that summarizes the reasons why and gives plenty of famous examples: https://www.joelonsoftware.com/2000/04/06/things-you-should-never-do-part-i/.

A key quote, which I think is applicable here:

There’s a subtle reason that programmers always want to throw away the code and start over. The reason is that they think the old code is a mess. And here is the interesting observation: they are probably wrong.

NOAA, of course, will not fail the way a software company would. However, we can still follow industry best practices.

I would suggest a more agile, incremental approach to this problem.

jbathegit commented 3 years ago

@rmclaren could you provide a copy of your presentation slides, preferably as an attachment to this issue?

I'll be the first to admit that there are some aspects of the library that could be improved. But I'm still not convinced that a wholesale rewrite of the entire library is the best approach. The core functionality works very well and has been honed and streamlined over many years. It seems to me that some additional functionality in the form of an API sitting on top of the existing library, as @edwardhartnett suggests, could meet your needs.

As @jack-woollen noted during our first discussion, the RO data is really one of the most complicated examples of a data subtype definition that we have, with its multiple levels of nested delayed replication. Even so, I don't think it's necessarily a bad thing that the user needs to have some understanding of the details of any subtype definition. For example, where in your discussion you talked about wanting to retrieve all of the BNDA (bending angle) values, I don't see where those are of much value to any downstream user without understanding the context surrounding each value, e.g. the preceding MEFR (mean frequency) and IMPP (impact parameter) values which are associated with each successive BNDA value. And the only way to know about that context is by reading and understanding the subtype definition in the table.

That said, there may be a way to provide a new interface to the user so that they don't have to manually calculate the overall number of replications of BNDA themselves ahead of time, but rather could just call a new function to have this calculated for them so that all they need to do is allocate the necessary space and then retrieve the values they want. But again, in my mind it's a big leap from there to the point of needing to rewrite the entire library, unless I'm just not fully grasping the problem here(?)
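
As a rough sketch of what that could look like from the user's side, something like the following, where get_nreps and get_all_values are hypothetical placeholders for whatever new routines would compute the total replication count and retrieve the values; they are not existing library calls:

```fortran
! Hypothetical "count first, then allocate and retrieve" usage sketch.
subroutine read_bending_angles(lunit)
  implicit none
  integer, intent(in) :: lunit
  integer :: nreps
  real(8), allocatable :: bnda(:), mefr(:), impp(:)

  ! 1) Ask the library how many BNDA values this subset will yield, accounting
  !    for every level of (possibly nested) delayed replication.
  call get_nreps(lunit, 'BNDA', nreps)

  ! 2) Allocate exactly the space needed, then retrieve the values together
  !    with the context (MEFR, IMPP) that gives each BNDA value its meaning.
  allocate (bnda(nreps), mefr(nreps), impp(nreps))
  call get_all_values(lunit, 'MEFR', mefr)
  call get_all_values(lunit, 'IMPP', impp)
  call get_all_values(lunit, 'BNDA', bnda)
end subroutine read_bending_angles
```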

Another point which we touched on during our first discussion (but didn't have time to address again today) is the overall question of whether, as an organization, we may want to eventually migrate our internal tanking/databasing and archive storage format to be something other than BUFR. This is the question that dovetails from the previous discussion of how, historically, we used to have Office Note formats like ON29 and ON124 way back when, but then we migrated to BUFR back in the 1990s, and now maybe it's time to consider switching to something else for our internal storage format.

We'd still need to be able to read/decode BUFR since the majority of observational data from the outside world will continue to come to us in that format. But, as noted earlier, that doesn't mean that we need to continue to also use BUFR as an internal storage format. For example, maybe we just upgrade all of our decoders and ingest processes at some point so that they write out a different format besides BUFR - perhaps they could just have their own interface bindings to pass the decoded data directly into the new ODC database?

That way, at some point in the future, downstream users may no longer have any need to worry about BUFR at all, and instead they could just learn to work with the data from ODC, and then the only folks who would even need to really understand BUFR at all would be the folks like me who deal with incoming data from the outside world(?)

jack-woollen commented 3 years ago

A MODEST PROPOSAL FOR ACCOMPLISHING A BUFR TO IODA CONVERTER

After reading and rereading the materials for consideration, it strikes me we might be missing an obvious point here. Keeping in mind that the primary objective is to produce BUFR decoder to IODA encoder software, it seems to me that using the BUFRLIB as a basis for this is not at all necessary, or even desirable. All you need to begin with is a BUFR decoder, which is only a small, and unseen, part of any BUFR software library. Given BUFR tables and BUFR data, decoding is, as Ron rightly points out, straightforward. Rewriting (or replacing!) the NCEP BUFRLIB to obtain a suitable BUFR decoder makes no sense at all.

Decoding a BUFR subset just gives a vector of unpacked values. How you interface with the data structures in that vector, to reorganize the information in a useful way and move it along to the destination, is the crux of the matter. All the discussion so far has centered around the need for updated user interfaces. It makes a ton of sense to start the project by writing or repurposing some BUFR decoder software, designed to work seamlessly with new user interfaces, all of which will work together to facilitate the ultimate objective of doing IODA encoding from the BUFR data.

emilyhcliu commented 3 years ago

Meeting summary from April 26, 2021: NCEP LIB Refactoring Options, presented by Ron McLaren.

Two options:
(1) C++ rewrite --- on hold
(2) Extend the existing library with Fortran 90+ style code --- selected for development

The estimated time for extending the existing library (option 2) is approximately two months. The working branch is feature/query.

rmclaren commented 2 years ago

So the feature/query branch is a Fortran 90 implementation of the queries that were originally proposed in the following presentation: BUFR Refactoring.

How it works in a nutshell (if you're interested):

The user specifies a list of queries (e.g. */ROSEQ1/ROSEQ2/FOST[2]) in a QuerySet object (a listing of all the possible queries for a file can be generated with the print_queries.x utility). The queries are then parsed (split into components), and for each unique subset encountered in the BUFR file a list of targets (the BUFR node indexes plus some extra metadata, such as the inherent dimensionality of the data) is compiled by searching the BUFR table data. The list of targets for a subset is then used by the data collection process to find and record the data for each target, together with other metadata (such as the repeat counts for each repeated sequence).

The results for each message subset are recorded in a ResultSet object, which accumulates DataFrame objects for each message subset (data found for each target, repeat counts for each repeated sequence, other metadata). Once finished, the ResultSet object holds all the data for the BUFR file represented in a dimensionally invariant way (data plus counts at each repetition level, etc.). When the ResultSet getRawValues method is called (e.g. call result_set%get_raw_values("heightOfCloudTop", data, dims, dim_paths=dim_paths)), the data and counts are used to inflate the data (the maximum count at each repetition level over all DataFrames defines the result shape) by strategically inserting missing values, so that the data can be reshaped into their appropriate representations (1d, 2d, 3d, 4d, etc. rectangular, not jagged, arrays).

Sometimes the user will want to pivot the data (let's say the lat/lon you care about are not on the trunk of the tree but on one of the branches instead), so it is possible to include a groupby field to group captured elements in a result set by other elements. Applying this changes the dimensionality of the result (there are lots more details here as well).
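
A rough usage sketch of that workflow is below. Only the query-string syntax and the get_raw_values call come from the description above; the module name, the add and execute calls, the alias name, and the argument declarations are illustrative guesses, not the actual feature/query interface.

```fortran
! Illustrative sketch of the query workflow; names marked "hypothetical" are guesses.
program query_sketch
  use bufr_query_mod, only: QuerySet, ResultSet, execute_queries   ! hypothetical module
  implicit none
  type(QuerySet)  :: queries
  type(ResultSet) :: result_set
  real(8), allocatable :: data(:)
  integer, allocatable :: dims(:)
  character(len=:), allocatable :: dim_paths(:)

  ! 1) Declare what to extract; available query strings can be listed with print_queries.x.
  call queries%add('*/ROSEQ1/ROSEQ2/FOST[2]', 'fost')

  ! 2) Walk the BUFR file, collecting data plus repetition counts for each subset.
  call execute_queries('obs.bufr', queries, result_set)

  ! 3) Inflate into rectangular arrays, padding short repetitions with missing values.
  call result_set%get_raw_values('fost', data, dims, dim_paths=dim_paths)
end program query_sketch
```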

Implementing the above in Fortran proved to be extremely painful... What's worse is that I've found it's very difficult (for a variety of reasons) to extend the code and make changes (managing complexity of this type is hard in Fortran [no generic collection objects, inconvenient syntax, proneness to memory-corruption issues, and so on]). It's for the long-term maintenance aspect of the problem that I went ahead and re-implemented the query interface in C++. The implementation of the C++ query interface was moved to the ioda-converters project (feature/query_cxx branch), which is being managed by JCSDA.

This has several advantages. For one, this means that the changes needed to NCEPLIBS-bufr will be very minimal, so the team that manages it won't need to worry about maintaining any of the query code. The JCSDA team has a larger pool of programmers to draw from (especially for C++). This also means the ioda-converters BUFR translation capabilities will be able to evolve on their own much more quickly, because they won't be coupled as strongly to NCEPLIBS-bufr (changes there should be very rare, if needed at all). The disadvantage is that this component in ioda-converters for querying data will require a more intrusive interface to NCEPLIBS-bufr in order to work (it needs access to internal data structures there). The required changes are captured in NCEPLIBS-bufr [feature/query_cxx].