MycroftAI / lingua-franca

Mycroft's multilingual text parsing and formatting library
Apache License 2.0

Handling Preceding Zeroes #204

Open firebladed opened 3 years ago

firebladed commented 3 years ago

Describe the bug

Zeroes preceding a non-zero digit are ignored, either at the start of the input or following a pause.

The problem is partly related to the unpredictability of pauses when number sequences are read aloud: "0 1 4 6 0 6" is correctly interpreted as [0.0, 1.0, 4.0, 6.0, 0.0, 6.0], but "01 46 06" incorrectly becomes [1.0, 46.0, 6.0].

To Reproduce

Steps to reproduce the behavior:

>>> extract_numbers("010 101")
[10.0, 101.0]
>>> extract_numbers("01 010 101")
[1.0, 10.0, 101.0]
>>> extract_numbers("51 21 05")
[51.0, 21.0, 5.0]
>>> extract_numbers("01 46 06")
[1.0, 46.0, 6.0]

Expected behavior

Zeros should be added to the output as separate numbers.

I think zeros preceding a single non-zero digit should be treated as a separate number, either by default or as an option, e.g.

"0 1" (zero one) -> [0, 1]
"01 46 06" (zero one four six zero six) -> [0, 1, 46, 0, 6]
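
For illustration only, the expected behavior could be sketched roughly as follows. This is not part of lingua-franca: `split_leading_zeros` is a hypothetical helper name, it only looks at digit characters (not spoken words), and it treats every leading zero of a digit group as its own number, which is one possible reading of the request.

```python
import re


def split_leading_zeros(text):
    """Hypothetical helper: emit each leading zero as a separate number."""
    numbers = []
    for token in re.findall(r"[0-9]+", text):
        stripped = token.lstrip("0")
        # One 0 for every zero that was stripped from the front of the group.
        numbers.extend([0] * (len(token) - len(stripped)))
        # Whatever is left (if anything) is kept as a single number.
        if stripped:
            numbers.append(int(stripped))
    return numbers


print(split_leading_zeros("0 1"))       # [0, 1]
print(split_leading_zeros("01 46 06"))  # [0, 1, 46, 0, 6]
```

Under these assumptions "51 21 05" would give [51, 21, 0, 5], so no zero is silently dropped.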

Additional context

This is problematic when reading out code numbers, e.g. TOTP codes, where any digit could be zero and which can be read aloud in multiple ways, e.g.

0 1 4 6 0 6 (zero one four six zero six)
34 45 65 (three four four five six five, or thirty four forty five sixty five)
234 567 (two hundred and thirty four, five hundred and sixty seven)

One aspect I'm not sure of: should 46 read as "four six" be interpreted as [46] or [4, 6] when it precedes a decimal point (or when there is no decimal point)? Digits after a decimal point are different, because the "normal" reading there is digit by digit, e.g. 0.01475 (zero point zero one four seven five).

However, "46" (forty six) can always be converted to "4 6", whereas missing zeroes cannot be recovered.

JarbasAl commented 3 years ago

you want to keep an eye on https://github.com/MycroftAI/lingua-franca/pull/150

EDIT: nvm, it's the reverse problem....

ChanceNCounter commented 3 years ago

Partially misplaced, I think. The format.pronounce_digits() function apparently planned in #150 would be a more appropriate call for the suggested behavior.

However, I'm not sure if it retains leading zeroes at the moment, either, because it uses extract_number() along the way.

The fundamental challenge here is continuing to treat the input as a string while parsing.


Relating this back to the code side, the English number extractors "chunk" numbers as they go based on powers of 10. While parsing a base-10 number left-to-right, whenever you encounter a power of 10, you scan the remainder of the number for larger powers of 10. If you do not find any, you have identified the end of a "place."

"1,075,018" -> 1000000 | 75,000 | 18 -> sum() -> 1075018
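
As a rough illustration of that chunking idea (not the actual lingua-franca extractor code), the sketch below works on tokens that have already been mapped to their numeric values. The function names `is_power_of_ten`, `chunk_places` and `chunk_value` are made up for this example.

```python
def is_power_of_ten(n):
    """True for 10, 100, 1000, ... (plain digits below 10 do not count)."""
    if n < 10:
        return False
    while n % 10 == 0:
        n //= 10
    return n == 1


def chunk_places(tokens):
    """Split numeric tokens into "places", scanning left to right."""
    chunks, current = [], []
    for i, value in enumerate(tokens):
        current.append(value)
        if is_power_of_ten(value):
            rest = tokens[i + 1:]
            # No larger power of 10 further right: this place is complete.
            if not any(is_power_of_ten(v) and v > value for v in rest):
                chunks.append(current)
                current = []
    if current:
        chunks.append(current)
    return chunks


def chunk_value(chunk):
    """Evaluate one place: powers of 10 multiply, everything else adds."""
    total = 0
    for value in chunk:
        if is_power_of_ten(value):
            total = max(total, 1) * value
        else:
            total += value
    return total


# "one million seventy five thousand eighteen"
tokens = [1, 1000000, 70, 5, 1000, 18]
places = chunk_places(tokens)
print(places)                               # [[1, 1000000], [70, 5, 1000], [18]]
print(sum(chunk_value(c) for c in places))  # 1075018
```

Because each place is evaluated and then summed, a leading zero contributes nothing to the result, which is why it disappears in this style of parsing.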

ChanceNCounter commented 3 years ago

I stand corrected. In the current version of the PR, format.pronounce_digits() does indeed preserve leading zeroes:


>>> format.pronounce_digits("014606")
'zero one four six zero six'

ChanceNCounter commented 3 years ago

On reflection, the "fail" case above is out of scope. If the input appears to mean something specific ("46" == 46.0), LF can't know whether the program calling its parsers really meant to feed it "46".

I vote one of two things:

  1. Add a sugar parameter extract_numbers(..., max_digits=0), where falsy values retain the current behavior
    • Pros: sugar, function signature isn't very long
    • Cons: edge case, needs localization and some non-English extractors already need work
  2. wontfix

krisgesling commented 3 years ago

Hey @Firebladed,

If we're looking at STT output, another option might be something like an extract_digits() method that intentionally pulls out all the digits in a string as individual numbers. I think this will be more straightforward than trying to determine when people meant to have digits expressed together or not.
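
A minimal sketch of what such a method might look like is below. No extract_digits() exists in lingua-franca as far as this thread shows; the name, the English-only `DIGIT_WORDS` table and the choice to silently skip every other word are all assumptions made for illustration.

```python
import re

# Assumed English-only mapping; a real implementation would need localization.
DIGIT_WORDS = {
    "zero": 0, "one": 1, "two": 2, "three": 3, "four": 4,
    "five": 5, "six": 6, "seven": 7, "eight": 8, "nine": 9,
}


def extract_digits(text):
    """Hypothetical: return every digit in the text as an individual int."""
    digits = []
    # Each digit character is its own token; words are matched whole.
    for token in re.findall(r"[0-9]|[a-z]+", text.lower()):
        if token.isdigit():
            digits.append(int(token))
        elif token in DIGIT_WORDS:
            digits.append(DIGIT_WORDS[token])
        # Anything else ("my", "code", "forty", ...) is ignored.
    return digits


print(extract_digits("01 46 06"))                    # [0, 1, 4, 6, 0, 6]
print(extract_digits("zero one four six zero six"))  # [0, 1, 4, 6, 0, 6]
```

Note that this sketch does not handle codes spoken as larger numbers ("forty six"); those would still need the regular number extractor.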

Can anyone think of cases other than codes or phone numbers, where this would come up?

If it won't be supported in the extract_number(s) methods, we probably need to add a note to the docstring that leading zeros will be ignored.

Probably not what you're referring to, but just in case... If it's something that you know is a number like a TOTP or PIN returned from another system, then I'd suggest that extract_numbers() is probably overkill. For example, if you typecast the string to a list you get your list of digits:

>>> totp = "012345"
>>> list(totp)
['0', '1', '2', '3', '4', '5']

If there might be spaces in the source:

>>> totp = "01 2 3 45"
>>> list(totp.replace(" ",""))
['0', '1', '2', '3', '4', '5']

or if the source may be an int you would need to do something slightly more verbose:

>>> totp = 123456   # note an int cannot have a leading zero
>>> [digit for digit in str(totp)]
['1', '2', '3', '4', '5', '6']

This could possibly act as a workaround for the STT case:

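from lingua_franca.parse import extract_numbers

# totp is the expected code as a string, e.g. "123456"
numbers = extract_numbers(utterance)
extracted_codes = [
    utterance.replace(" ", ""),               # digits read out one at a time
    str(int(numbers[0])) if numbers else "",  # whole code read as one number (leading zeros are lost)
]
if totp in extracted_codes:
    authenticated = True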