Unidata / MetPy

MetPy is a collection of tools in Python for reading, visualizing and performing calculations with weather data.
https://unidata.github.io/MetPy/
BSD 3-Clause "New" or "Revised" License
1.24k stars 413 forks source link

To what extent should MetPy try to ensure CF units/UDUNITS compatibility? #1362

Open jthielen opened 4 years ago

jthielen commented 4 years ago

This comment by @rsignell-usgs, combined with issues like #1134 and #1350, led me to wonder: what efforts should MetPy be taking to maintain compatibility with CF units/UDUNITS? Right now, it seems to have been just implementing a fix when a particular problem arises (which is a reasonable enough way to continue going forward), but I've also wondered whether it's worth it at some point to parse the UDUNITS xml files and either run tests on MetPy's Pint registry with respect to this database or use the database to build a custom Pint registry configuration file.

dopplershift commented 4 years ago

I think there's some value in it. Really, what we need to support are standards in our field's common data formats. With the exception of random text units in CSV file headers or free text fields in HDF5/netCDF files, the units string attribute in netCDF files conforming to the CF conventions is the only standard unit string I'm aware of us needing to parse. So it's important that we support it and do it right.

Luckily, thanks to @lesserwhirls and netCDF-Java, there's already a place where all the XML unit information from UDUnits2 is aggregated together. I don't think we want to be parsing that at runtime in MetPy, but I think it would be good to do some testing against it and see how much we need to add (and hopefully not modify or remove). One thing I will note is (that I just learned through digging into the "Celsius" issue) is that UDUnits-2 is case insensitive for names only, not symbols. That could present some challenges.

I think we should at least be somewhat pro-active here--at least by following this particular thread (e.g. "Celsius" and look at the XML unit database). I'm not saying we need to exhaustively test 10^5 different unit strings and make sure we get the right results. I bet just solving the two things mentioned here, on top of the mods you've already made, would get us 99% there.

dopplershift commented 4 years ago

Out of curiosity, how much of Pint's parser can we override or outright rip out?

jthielen commented 4 years ago

Out of curiosity, how much of Pint's parser can we override or outright rip out?

I'm not too sure about that! The most extensible parts of Pint right now are unit definitions, constants, and preprocessors functions. Beyond that, I'm pretty sure we'd have to rely on monkey patching or implementing changes upstream to make currently unexposed APIs extensible.

jthielen commented 4 years ago

Also adding to the list here: "Meter" (from #1459)

Should there be a short-term fix for this capitalization issue until more substantial testing can be done? Or is this something we should comment on in the unit support docs?

dopplershift commented 4 years ago

Well, can we make it so we only consider case for symbols? Hell, I don't even know if there are symbols that collide if you make everything lowercase.

jthielen commented 4 years ago

Well, can we make it so we only consider case for symbols? Hell, I don't even know if there are symbols that collide if you make everything lowercase.

Only considering case for symbols would not be easy to do exactly, but there may be ways around that.

Since Pint's parsing internals currently don't seem to distinguish between symbols and any other alias, this would at the very least require making low-level changes in Pint's unit parsing and adding a bunch of extra references back to underlying Definitions to check if it is a symbol, on top of carrying through the appropriate case sensitive options.

Symbols do collide if you make everything lowercase in a preprocessor (e.g., G for gauss vs g for gram and M for molar vs m for meter), as do prefixes (Mega- vs milli-), so a "make everything lowercase" will not work with the current API.

That being said, Pint does have proper case insensitive handling on its UnitRegistry.get_name and UnitRegistry.parse_expression (with case_sensitive=False) that handles these collisions properly, which could get us somewhere. However,

So, if we are okay with a bit of extra leniency on case-insensitive symbols, then this should actually be achievable by adding a case_sensitive_units registry-level setting in Pint, exposing case_sensitive on UnitRegistry.parse_units, and modifying the Unit and Quantity constructors to rely on the case_sensitive_units registry setting when calling the registry's parsing methods. All the hard work would go in Pint, MetPy would just change the case_sensitive_units option from its default to False (monkey-patching these long methods would be rather fragile).

If this all sounds reasonable, should we propose this upstream in Pint?

dopplershift commented 4 years ago

Given that we're not a conformance checker and our users frequently are not in control of their data, we should also lean towards being lenient, provided we're not in an ambiguous situation--which we're not here. I think that's reasonable to propose your suggestion in Pint.

jthielen commented 4 years ago

Once https://github.com/hgrecco/pint/pull/1147 is merged (and 0.15 released?), all we have to do is add the case_sensitive=False option when we create our registry. This will likely need to be done with a try/except block unless we want to bump minimum Pint version to 0.15. If we wanted, the except block could issue a warning about case sensitivity.

Update: https://github.com/hgrecco/pint/pull/1147 has been merged, so this can proceed forward now with upstream dev or wait until 0.15 release.

dopplershift commented 4 years ago

We'll have to ponder whether it's better to bump Pint. I'm doubtful of the existence of a user-base that is capable of installing a new MetPy but is unable to update their version of Pint.

dopplershift commented 4 years ago

So for the record, 0.15 has been released and conda-forge packages are out. I'm going to wait for dependabot to pick it up and that should open us up to testing on it.

jthielen commented 4 years ago

I went ahead and enabled the case_sensitive=False option on our unit registry, and just as quickly discovered problems. Unfortunately, the case sensitive symbol handling is less robust than I was expecting...prefixes with symbols breaks it (so we get things like millibytes instead of millibarnes and kilomolar instead of kilometer). This will take some time to debug, and likely some further upstream changes in Pint, so I'm going to punt on it for now. If anybody's interested, my branch with the case_sensitive=False optional conditionally set is at https://github.com/jthielen/MetPy/tree/pint-case-insensitive.

dopplershift commented 4 years ago

Le sigh.

jthielen commented 3 years ago

For the record, this dataset from #1917 contains an expression that failed our regex ((W m-2 um-1)-1). I hypothesized a fix in https://github.com/Unidata/MetPy/issues/1917#issuecomment-863475400 but didn't test it.

jthielen commented 2 years ago

https://github.com/ncar-xdev/xwrf/pull/50 uncovered another expression that fails our regex: kg(-1), which is interpretted as -1 <Unit('kilogram')> instead of the proper 1.0 <Unit('1 / kilogram')>