MycroftAI / lingua-franca

Mycroft's multilingual text parsing and formatting library
Apache License 2.0

Hyphens are not handled #83

Open JuneStepp opened 4 years ago

JuneStepp commented 4 years ago

Many consider it proper to write numbers like "sixty seven" as "sixty-seven". Currently, the hyphen makes such a number invalid. It seems that simply replacing hyphens with spaces before tokenization would solve the issue. I know that hyphens are also used in French numbers.

ChanceNCounter commented 4 years ago

It would be important to avoid removing leading hyphens, as this would mess with negative numerals such as -2.5.
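To illustrate the concern, here is a minimal Python sketch. Both function names are hypothetical and are not part of lingua-franca; the second one assumes that only hyphens sitting between two letters should become spaces.

```python
import re


def naive_dehyphenate(text: str) -> str:
    """Replace every hyphen with a space (hypothetical helper)."""
    return text.replace("-", " ")


# Assumption: a hyphen is only replaced when it has a letter on both sides.
# [^\W\d_] matches any Unicode letter (word character that is not a digit or "_").
_WORD_HYPHEN = re.compile(r"(?<=[^\W\d_])-(?=[^\W\d_])")


def dehyphenate_words(text: str) -> str:
    """Replace letter-to-letter hyphens only (hypothetical helper)."""
    return _WORD_HYPHEN.sub(" ", text)


print(repr(naive_dehyphenate("-2.5")))         # ' 2.5'  (the sign is lost)
print(repr(dehyphenate_words("sixty-seven")))  # 'sixty seven'
print(repr(dehyphenate_words("-2.5")))         # '-2.5'
```

The letter-only rule leaves digit-to-digit hyphens such as "47-48" untouched, so it says nothing about how those should be read.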

JarbasAl commented 4 years ago

Would it make sense to handle this in the normalizer?

ChanceNCounter commented 4 years ago

I think we should assume that hyphens will become syntactically important to other parsers. I don't think they'll ever be meaningful for the number extractors, but I suspect their grammatical significance will come up elsewhere.

extract_number()-specific normalization does seem to be a growing concern, though. Should we write a "sub-normalizer" for the number extractors?

JarbasAl commented 4 years ago

I don't think a subnormalizer makes much sense; it just adds extra places to keep track of.

We already do words to digits in the Normalizer, so it makes sense to handle this there, and this could even be considered a bug in words to digits

JarbasAl commented 4 years ago

Actually, I think normalize should be called before extract_numbers, which is not done at all right now. We need to test whether something breaks when doing this.

It would avoid several rounds of normalization and keep everything in one place.

ChanceNCounter commented 4 years ago

Yes, but what if some other parser cares about the hyphen? For instance, many French place names have a hyphen in them, which we wouldn't want the normalizer to lose.

On the other hand, this will be the second thing in as many days that the number parsers have to normalize themselves: numbers (and only numbers) need the input converted to lowercase before parsing, and numbers (and only numbers) need hyphens normalized away before parsing.

ChanceNCounter commented 4 years ago

Indeed, words to digits will almost certainly break extract_numbers, as it would turn "twenty two" into [20, 2].

ChanceNCounter commented 4 years ago

I've been thinking about this, and I'm now thinking subnormalizers might reduce each parser's individual complexity, and actually make it easier to track things.

For instance, normalizing a string for extract_datetime() could replace "September" with something like "{month:9}" to simplify the crawling portion of the extractor, but other parsers wouldn't want that, because there are other contexts in which the word September isn't part of a datetime.
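As a rough sketch of what such a datetime-specific pass might look like (the function name, the month table, and the exact placeholder format are all assumptions for illustration, not existing lingua-franca code):

```python
import re

# Hypothetical English month table; a real implementation would be per-language.
_MONTHS = {
    "january": 1, "february": 2, "march": 3, "april": 4,
    "may": 5, "june": 6, "july": 7, "august": 8,
    "september": 9, "october": 10, "november": 11, "december": 12,
}

# Match whole words only, case-insensitively.
_MONTH_RE = re.compile(r"\b(" + "|".join(_MONTHS) + r")\b", re.IGNORECASE)


def tag_months(text: str) -> str:
    """Replace month names with a "{month:N}" placeholder (hypothetical helper).

    This is deliberately naive: it does no disambiguation, so any word that
    happens to spell a month name is tagged as well.
    """
    return _MONTH_RE.sub(
        lambda m: "{month:%d}" % _MONTHS[m.group(1).lower()], text
    )


print(tag_months("Remind me on September 3rd"))  # Remind me on {month:9} 3rd
```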

For extract_number(), we'd want to remove hyphens unless they are found in the form:

(edit: fixed this regex) (^|\B)\-\d*((\.|,){1}\d*)?
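A minimal sketch of how that rule might be applied in Python. The first alternative of the pattern is the regex above, copied verbatim; the wrapper function, its name, and the choice to replace every other hyphen with a space are assumptions for illustration.

```python
import re

# Alternative 1 ("keep"): the regex from this comment, verbatim.
# Alternative 2: any other hyphen, which gets replaced by a space.
_HYPHEN = re.compile(r"(?P<keep>(^|\B)\-\d*((\.|,){1}\d*)?)|-")


def strip_non_numeric_hyphens(text: str) -> str:
    """Drop hyphens except those in the protected form (hypothetical helper)."""
    def _sub(match: re.Match) -> str:
        kept = match.group("keep")
        # The "keep" group is None when only the bare-hyphen alternative matched.
        return kept if kept is not None else " "
    return _HYPHEN.sub(_sub, text)


print(strip_non_numeric_hyphens("sixty-seven"))         # sixty seven
print(strip_non_numeric_hyphens("-2.5"))                # -2.5
print(strip_non_numeric_hyphens("it is -2,5 degrees"))  # it is -2,5 degrees
```

Note that because the digits are matched with `\d*`, a free-standing hyphen with spaces on both sides also falls under the protected form and is kept; tightening that to `\d+` would be a separate decision.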

Other such cases surely apply. If we piled all of these into relevant helper functions on a per-parser basis, we'd know exactly where to find them, and exactly where to put new requirements for existing parsers.

Furthermore, if these weren't classified as helpers, people using the library could employ them for intermediate purposes: maybe a Mycroft skill author would like to quickly and easily scan an utterance for that {month:9} information.

krisgesling commented 4 years ago

Just on the month side of things we need to pay special attention to "may" and "march" in English.

ChanceNCounter commented 4 years ago

Re: hyphens, here's another edge case. When parsing text,

"47-48"

should not be normalized to "47 48". It should be normalized to "47 to 48".

krisgesling commented 4 years ago

That's a tricky one. Would it depend on the context? I'm thinking of a phone number such as "555-1234-1234" as an alternative example.

ChanceNCounter commented 4 years ago

Excellent point. In fact, phone numbers probably represent an unhandled edge case for number parsing in general, not just for normalizing.

Is a hyphen-as-range context dependent? I could be missing something, but I think, barring typos, other contexts would have a space in between the numbers and the dash.
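One possible heuristic, sketched in Python purely as an assumption to test the idea: treat a hyphen as a range only when it joins exactly two digit runs, and leave longer hyphen chains alone. The function name is hypothetical and not part of lingua-franca.

```python
import re

# Two digit runs joined by one hyphen, with no digit or hyphen directly
# before or after. Longer chains such as "555-1234-1234" never match.
_RANGE = re.compile(r"(?<![\d-])(\d+)-(\d+)(?![\d-])")


def expand_numeric_ranges(text: str) -> str:
    """Rewrite "47-48" as "47 to 48" (hypothetical helper, English only)."""
    return _RANGE.sub(r"\1 to \2", text)


print(expand_numeric_ranges("47-48"))          # 47 to 48
print(expand_numeric_ranges("555-1234-1234"))  # 555-1234-1234
print(expand_numeric_ranges("-2.5"))           # -2.5
```

This still misreads a two-part phone number such as "555-1234" as a range, so it does not settle the context question.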

JuneStepp commented 4 years ago

I should also mention that the problem is the opposite in French. Numbers are only parsed properly if they have the hyphen but, as in English, people often don't use it.