bjornbm / dimensional

Dimensional library variant built on Data Kinds, Closed Type Families, TypeNats (GHC 7.8+).
BSD 3-Clause "New" or "Revised" License

Unit parsing #8

Open dmcclean opened 10 years ago

dmcclean commented 10 years ago

Following on to #7, the flip side to pretty-printing is parsing.

Unfortunately the SI unit notation and names have several ambiguities, documented (perhaps not extensively) by this article.

Nevertheless, we need not let that discourage us. We can pick conventions, report the ambiguities if they arise, disambiguate them because we know what dimension we were expecting the user to enter in that field, or use some combination of those strategies.

bjornbm commented 10 years ago

If the expected physical dimensions are known, the ambiguities in the article do not arise (I believe)?

If necessary we could return a list of possible parses, so that the program can decide which to use or ask the user to disambiguate.
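The two ideas above (a known expected dimension, and a list of possible parses) can be combined. The following is only a minimal sketch under stated assumptions: `Dimension`, `Candidate`, `Resolution`, and `resolve` are hypothetical names invented for illustration, not part of dimensional, and the three-constructor `Dimension` type stands in for the library's real dimension representation.

```haskell
-- Hypothetical stand-in for a physical dimension; not dimensional's own type.
data Dimension = Length | Time | Mass
  deriving (Eq, Show)

-- One possible reading of the input, tagged with its dimension.
data Candidate = Candidate
  { candName :: String
  , candDim  :: Dimension
  } deriving (Eq, Show)

-- Outcome of filtering the candidate parses by the expected dimension.
data Resolution
  = NoParse               -- no candidate has the expected dimension
  | Unique Candidate      -- exactly one candidate survives
  | Ambiguous [Candidate] -- several survive; the caller must ask the user
  deriving (Eq, Show)

-- Keep only candidates of the expected dimension and classify the result.
resolve :: Dimension -> [Candidate] -> Resolution
resolve expected cands =
  case filter ((== expected) . candDim) cands of
    []  -> NoParse
    [c] -> Unique c
    cs  -> Ambiguous cs
```

For instance, if "min" parses both as a minute (`Time`) and as a milli-inch (`Length`), `resolve Time` keeps only the minute, while a field with no expected dimension would have to fall back to showing both candidates.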

dmcclean commented 10 years ago

Those are both good thoughts.

It may be that the ambiguities disappear entirely if you know the expected dimension, although that might be tough to prove exhaustively, especially because multiple ambiguities in the same expression might interact.

Returning a list of parses is great for my purposes. Also possibly good for a quasi-quote, since it could generate an error if the unit was ambiguous.

I've never worked on a parser for a language with this kind of ambiguity, so I will need to google it a bit. I think my strategy might be to bake the list of prefixes into the parser, but to use a map that has the definitions of the actual units in play.
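A rough sketch of that strategy, assuming a fixed prefix table and a caller-supplied map of units, might look like the following. All names here (`prefixes`, `UnitDict`, `Parse`, `parseSymbol`) are hypothetical, the prefix table is deliberately incomplete, and units are represented by plain descriptive strings rather than by dimensional's unit types.

```haskell
import Data.List (stripPrefix)
import qualified Data.Map as Map

-- Hypothetical, incomplete prefix table baked into the parser:
-- (symbol, power of ten).
prefixes :: [(String, Int)]
prefixes = [("k", 3), ("h", 2), ("d", -1), ("c", -2), ("m", -3)]

-- Caller-supplied dictionary of the units in play (symbol -> unit name).
type UnitDict = Map.Map String String

-- A candidate parse: the power of ten from the prefix (0 if none)
-- and the name of the unit it applies to.
data Parse = Parse Int String
  deriving (Eq, Show)

-- Return every reading of a symbol: the bare unit (if defined),
-- followed by each prefix + unit decomposition.
parseSymbol :: UnitDict -> String -> [Parse]
parseSymbol dict s = bare ++ prefixed
  where
    bare = [Parse 0 name | Just name <- [Map.lookup s dict]]
    prefixed =
      [ Parse e name
      | (p, e) <- prefixes
      , Just rest <- [stripPrefix p s]
      , Just name <- [Map.lookup rest dict]
      ]
```

With a dictionary containing "min" (minute) and "in" (inch), `parseSymbol` returns both the minute and the milli-inch reading for "min", leaving the choice to the caller. A quasi-quoter could treat any result list longer than one as a compile-time error.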

I'm not sure whether to try supporting concatenation-as-multiplication or to require spaces. Got any thoughts on that one?

bjornbm commented 10 years ago

In your situation I would start by requiring spaces for the sake of simplicity. Concatenation could be a nice-to-have but can come later, or perhaps not at all if it comes at the cost of ambiguity. I am assuming no one else is levying requirements on you.

dmcclean commented 10 years ago

I'm going to let this one sit. It turns out I can get everything I really need from a drop down list.

It potentially might be nice for the dimensional matrix quasiquote, but we'll see.

bjornbm commented 10 years ago

Did you see this other guy's take on unit parsing, announced on the Haskell mailing list recently?

I haven't looked into it much, but if it is good, an easy solution could be along the lines of converting from his data type to your AnyQuantity and then promoting to a full Dimensional.

bjornbm commented 10 years ago

Another unambiguous representation is The Unified Code for Units of Measure:

The Unified Code for Units of Measure is a code system intended to include all units of measures being contemporarily used in international science, engineering, and business. The purpose is to facilitate unambiguous electronic communication of quantities together with their units. The focus is on electronic communication, as opposed to communication between humans.

There is also the Metric Interchange Format.

dmcclean commented 10 years ago

I like those. I think the Unified Code for Units of Measure might be the best fit.

I did happen to see the Haskell mailing list announcement you mention, but I didn't follow the link. That is interesting as it pertains to parsing, though I think that it will be better to follow a published standard.

My only problem with the Metric Interchange Format is that my problem domain of aviation is full of various legacy units. Certainly it might be nice to have parsers for both, and output to the Metric Interchange Format.

The Metric Interchange Format's decision to allow fractional exponents, so that noise densities can be expressed in the usual way, strikes me as a bit suspect, and we couldn't really support it. A note to that effect in the documentation should be enough for most people, I would think?

Similarly the UCUM's decision to allow arbitrary units for counting things doesn't square well with dimensional...

dmcclean commented 9 years ago

I think everything is pretty well in place for this in my branch, assuming the user can supply a dictionary from names to units.

For making that dictionary it will be very nice to eventually have https://ghc.haskell.org/trac/ghc/ticket/10391. I'm going to try to contribute a fix, once I can figure out how on Earth to do all the ancillary stuff involved in setting up an environment for hacking on GHC.

jdreaver commented 8 years ago

In my package, quantities, I require that any prefix be followed by a valid unit. For example, if I see "millimeter", I will identify "milli" as a valid prefix, but I will also make sure that "meter" is a valid unit too.

Secondly, if a string is already a valid unit (it matches a unit in the units list exactly), then I do no prefix matching. Take "min" for example. It can be either "minute" or "milli-inch". Since I have a definition for "min" as minute, I don't try to extract a prefix.

Another example in my test suite is the string "hr". It is obviously hour, but a naive parser would extract "h" for "hecto-", and then fail at identifying the unit "r" (assuming no definition for a unit with string "r"). However, since "hr" is already valid, I just say it's an hour.
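The two rules described above (an exact match wins, and a prefix only counts when what follows it is a valid unit) could be sketched roughly as below. This is an illustration only, not the actual code of the quantities package: `knownPrefixes`, `knownUnits`, `Parsed`, and `parseUnit` are hypothetical names, and both tables are tiny samples.

```haskell
import Data.List (stripPrefix)
import qualified Data.Map as Map
import Data.Maybe (listToMaybe)

-- Hypothetical sample prefix table: (spelling, prefix name).
knownPrefixes :: [(String, String)]
knownPrefixes = [("milli", "milli"), ("m", "milli"), ("h", "hecto"), ("k", "kilo")]

-- Hypothetical sample units table: (spelling, unit name).
knownUnits :: Map.Map String String
knownUnits = Map.fromList
  [ ("meter", "meter"), ("m", "meter"), ("in", "inch")
  , ("min", "minute"), ("hr", "hour") ]

-- Result of a parse: an optional prefix name and a unit name.
data Parsed = Parsed (Maybe String) String
  deriving (Eq, Show)

parseUnit :: String -> Maybe Parsed
parseUnit s =
  case Map.lookup s knownUnits of
    -- Rule 2: an exact match wins, so no prefix is extracted.
    Just u  -> Just (Parsed Nothing u)
    -- Rule 1: otherwise accept a prefix only if the rest is a valid unit.
    Nothing -> listToMaybe
      [ Parsed (Just prefixName) u
      | (p, prefixName) <- knownPrefixes
      , Just rest <- [stripPrefix p s]
      , Just u <- [Map.lookup rest knownUnits]
      ]
```

Under these sample tables, "millimeter" parses as milli + meter, while "min" and "hr" parse as minute and hour without any prefix extraction. Unlike a parser that returns every reading, this one commits to a single answer.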

dmcclean commented 8 years ago

Very interesting, thanks @jdreaver. We should look at adapting your parser to return the type here: https://github.com/bjornbm/dimensional-dk/blob/master/src/Numeric/Units/Dimensional/UnitNames/Internal.hs#L32

@bjornbm's comment above about returning multiple parses in ambiguous situations like that is a good idea, because it allows us to use information about the desired dimension of the result to constrain parses.

The situations you are describing in the last two paragraphs I have seen called the maximal munch rule.

Is it true that your parser aims to parse natural-language English names for units for use in calculators and so forth? If so that's very cool. Google's calculator for those kinds of things is nice but does have a few deficiencies.

jdreaver commented 8 years ago

Thanks!

Is it true that your parser aims to parse natural-language English names for units for use in calculators and so forth?

When I define each unit, I have a list of synonyms after the original definition. I just keep an internal map from synonym to unit, so the synonyms are replaced at parse time. Take a look at my definition file. Here are a couple of examples:

"hour = 60 * minute = h = hr"
"turn = 2 * pi * radian = revolution = cycle = circle"

There is definitely no natural language processing going on, just a list of synonyms that need to be exact matches (barring prefixes, of course).
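As an illustration of the synonym map, the sketch below reads definition lines shaped like the two examples above. It assumes each line has the form `name = definition = synonym = synonym ...` with the surrounding quotes removed. The function names are hypothetical and this is not the package's actual loader.

```haskell
import Data.Char (isSpace)
import Data.List (dropWhileEnd)
import qualified Data.Map as Map

-- Strip leading and trailing whitespace.
trim :: String -> String
trim = dropWhileEnd isSpace . dropWhile isSpace

-- Split a line into fields at each '=' character.
splitOnEq :: String -> [String]
splitOnEq s =
  case break (== '=') s of
    (field, [])       -> [field]
    (field, _ : rest) -> field : splitOnEq rest

-- Assumed layout: name = definition = synonym = synonym ...
-- Every synonym, and the name itself, maps to the canonical name.
synonymsFromLine :: String -> [(String, String)]
synonymsFromLine line =
  case map trim (splitOnEq line) of
    (name : _definition : syns) -> [(syn, name) | syn <- name : syns]
    _                           -> []

-- Build the synonym -> canonical name map from all definition lines.
synonymMap :: [String] -> Map.Map String String
synonymMap = Map.fromList . concatMap synonymsFromLine
```

Looking up "hr" in the map built from the two example lines would give "hour", and "cycle" would give "turn". The definition field (such as `60 * minute`) is skipped here because evaluating it is a separate concern.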

dmcclean commented 8 years ago

This is coming along really nicely in dimensional-attoparsec (which only builds against the prefixes branch of dimensional). It just needs a few minor tweaks to work around some oddities of the UCUM grammar.

One real mismatch between us and them is that they don't treat amount of substance as a dimension. They define the mole to be a dimensionless unit equal to 6.0221367 * 10^23 (the 1986 CODATA value of Avogadro's number). On the other hand they do treat angle as a dimension. This only matters if we wanted to test against the "canonical form" column of their example table, but since the section number they refer to as defining the canonical form doesn't appear to actually exist, I'm inclined to document where we depart from them and ignore it.