SciTools / cf-units

Units of measure as required by the Climate and Forecast (CF) Metadata Conventions
https://cf-units.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License
66 stars 49 forks source link

Plan toward own pure-python udunits [DRAGON!] #446

Open pp-mo opened 2 months ago

pp-mo commented 2 months ago

Since so many outstanding problems + request relate to builds for alternative OS or architectures

Probably main issues :

ocefpaf commented 2 months ago

@pp-mo I'd love to meet with you'all and talk about this topic. I already opened a few similar threads in other places, there are tons of work that could be use to achieve this quicker than starting from 0. My latest attempt, which is to leverage Pint by adding udunits units to it, is in https://github.com/ioos/compliance-checker/pull/1094

Most of the discussion is happening in https://github.com/Ouranosinc/xclim/issues/1788 Both cf_xarray and metpy already have some version of it. Mine us missing only the COARDS time standard, which I plan to implement soon. With that said, I'd love to not have my own library but to contribute to a 3rd party instead.

pelson commented 2 months ago

Found this through #454...

I see the :dragon_face: and I salute you :saluting_face:!

The first thing to state: the CF-conventions are currently (last time I checked) bound to the udunits2 implementation. This is a highly questionable situation (without casting any judgement on udunit2), as to me, a standard isn't a standard if you can't define it without an implementation... It also means that it is hard to call any other implementation CF compliant, because there is no spec to comply to. The best you can do is to validate the implementation against udunits2.

All of this said, I did a lot of work to make a udunits2 compatible parser in this project. I did this with the express goal of one-day opening the conversation with the CF-conventions body to propose adoption of a BNF grammar for units (note that udunits2 does have such a grammar... but it isn't followed in the udunits2 implementation :facepalm: https://pelson.github.io/2019/cf_units_tex/, which makes the udnuits2 BNF definition not a CF compliant specification...).

I didn't asses the amount of work that follows to implement a udunit2 compatible library, once you have parsing in-place. I don't really know how udunits2 defines the conversions, for example. It could be that you read in the XML and you're nearly done... or it could be that dragons :dragon_face: :dragon_face: :dragon_face: start flying from the nest :joy:.

For old-times sake, consider me interested in the progress on this issue!

trexfeathers commented 2 months ago

@pelson we're often battling with a tricky compromise between:

Our current solution - depending on udunits2, the same dependency that CF has - is already a good compromise in terms of a joined up ecosystem and avoiding divergence (which would be a risk if maintaining a 'parallel' implementation). But as @pp-mo has said elsewhere it is proving unrealistic while our Cython expertise remains low, making a 'parallel' Python implementation look like the best compromise, despite the known costs.

ocefpaf commented 2 months ago

It could be that you read in the XML and you're nearly done... or it could be that dragons 🐲 🐲 🐲 start flying from the nest 😂.

There is a large amount of libraries that requires CF-like units out there that already dropped udunits2 in lieu for some pint hacks (sometimes not hacks but actual support). Some examples are cf_xarray, metpy, xclim, etc. In a way, dragons are already flying.

Udunits2 is super stable, so take this with a grain of salt, but it is not well maintained. PRs languish, releases are few and far between, people hack the XML to add new units (CF ones nonetheless!) b/c it takes way too long to add them into udunits2, etc.

TL;DR I don't think a pure Python implementation is a bad idea. Maybe it is what we need to unify the ecosystem. I know some folks, that are connected to the CF leadership, are working on a Rust version of udunits. IMO, that is not going to solve out problem as we are only substituting one compiled dependency for another.

pelson commented 2 months ago

Thanks for the inputs :+1:

A pragmatic solution - finding a path we can manage with the resources and expertise at hand

On this particular topic, it is clear that it needs to be bigger than SciTools. You are absolutely correct that you probably can't (and shouldn't) do this with just the resources you have to hand. But that doesn't mean it can't be done... it is just that it would need to be done in a broader community context (Filipe is very well connected :wink:).

working on a Rust version of udunits

Well, this would be better than the current implementation IMO. You know better than me the compatibility issues that raises, but I would guess they are less concerning than C+Cython (existing knowledge aside).

But this misses the point (please feel free to pass on my regards to :tumbler_glass: :wink:) - my POV is that there should be a standard, not an implementation (even if it is in pure Python). If that means that the CF conventions need to define the BNF and the XML tables in the appendix, so be it.

For the record, I wanted to point to the tests that exist in this project at https://github.com/SciTools/cf-units/blob/main/cf_units/tests/integration/parse/test_parse.py. They are fairly short (<400 LoC) but encode a lot of ad-hoc learning about what udunits2 does under-the-hood (which you have to figure out, because there is no spec...).

pint hacks

I don't know how well this is working, but a quick search suggests a few years ago there was a bit of pain with the fact that the udunits2 implementation is its own standard, and there is a dissonance between that and other standards (e.g. https://github.com/hgrecco/pint/issues/1081). My guess is that the ship has sailed on this... you can't go back and change the CF spec for the tonnes of past data out there, so you are forced to handle the existing capabilities of udunits2 in whatever tool could replace it. It isn't obvious that this won't need to be something very specialised. Maybe though my doom and gloom is unnecessary, if in practice it is working out well with each of those libraries! :sun_with_face:

ocefpaf commented 2 months ago

Maybe though my doom and gloom is unnecessary, if in practice it is working out well with each of those libraries! 🌞

Working is a strong word 😬 . It is what we have and, when some units fails to parse, we add another workaround. The more we do things like this, the further away we are from solving this problem IMO. I agree that a Rust implementation is not all bad, but it is still udunits2, just with a fresh coat of paint and new maintainer. who may or may not stay for the long run.

I don't want to pollute this thread with my old man bitterness, and Phil nailed all the gotchas that may happen in the comment above. With that said, I don't think the CF committee will do what needs to be done to sort this out. Also, if you all want to push for a pure Python version, count me in! And I'll try to bring others along.

pelson commented 1 month ago

@ocefpaf - what would the first functionality that you would need from a pure Python implementation be?

Unit validation, conversion, simplification, XML extensions beyond the UDUNITS2 standard XML, or something else?

I have a fairly advanced proof-of-concept in https://github.com/pelson/pyudunits2/, and think that it could be interesting to refine it towards your ideas at this point.

ocefpaf commented 1 month ago

Unit validation and conversion first (actually check if is_convertible), and is_dimensionless.

Our main use is in the compliance-checker library. For example, this is how we validate that the user passed a valid unit for vertical coordinate.

pelson commented 3 weeks ago

Unit validation and conversion first (actually check if is_convertible), and is_dimensionless.

It has taken me a while to finish up (there were a few dragons :dragon:), but I have something that seems to hold up nicely against the comparisons I have tested with udunits2.

Note that your linked example can be simplified now with the new dimensionality functionality...

In [2]: for unit_str in ['m', 'km', 'yards', 'inches', 'fathoms',]:
   ...:     print(f'Dimensionality of {unit_str}: {ut_system.unit(unit_str).dimensionality()}')
   ...: 
Dimensionality of m: {'meter': 1}
Dimensionality of km: {'meter': 1}
Dimensionality of yards: {'meter': 1}
Dimensionality of inches: {'meter': 1}
Dimensionality of fathoms: {'meter': 1}

(note that the number is the order of that basis unit, so equivalent to m^1).

In [3]: length_dimensionality = ut_system.unit('fathoms').dimensionality()

In [4]: for unit_str in ['m', 'km', 'yards', 'inches', 'fathoms', 'm/s', 'kg']:
   ...:     print(f'Is {unit_str} a length unit?: {ut_system.unit(unit_str).dimensionality() == length_dimensionality}')
   ...: 
Is m a length unit?: True
Is km a length unit?: True
Is yards a length unit?: True
Is inches a length unit?: True
Is fathoms a length unit?: True
Is m/s a length unit?: False
Is kg a length unit?: False

There is definitely still things coming out of the woodwork... for example, in the udunits2 XML file there is no mention of the plural of inch, and before I just fixed it, you couldn't get a unit for inches (I was naively exposing a inchs unit). It is for this reason that I think some real-world testing would be super insightful.

I suggest we can discuss this in a dedicated issue at https://github.com/pelson/pyudunits2/issues/3.