ShailChoksi / text2digits

Converts text such as "twenty three" to number/digit "23" in any sentence
MIT License

problems at start or end of a sentence where there isn't a space #37

Open mroughan opened 3 years ago

mroughan commented 3 years ago

There are problems when a number word appears at the start or end of a sentence:

t2d.convert("eight.") gives 'eight.'

or

t2d.convert("“One hundred ") gives '“One 100 '

nmstoker commented 3 years ago

With your example t2d.convert("“One hundred "), though, the problem seems to come only from the opening quote, since this works as expected:

t2d.convert("One hundred") gives '100'

I'm guessing a solution would need to split tokens slightly differently, so that number words are not treated as part of a token that also contains adjacent punctuation (e.g. a '"', a '.' or a '-').
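
To illustrate the idea, here is a minimal sketch of a tokenizer that puts words, punctuation runs and whitespace runs into separate tokens. It is a hypothetical stand-alone example, not text2digits' own lexer; the names TOKEN_RE and tokenize are made up for illustration.

```python
import re

# Hypothetical tokenizer (not the library's lexer): every character belongs to
# exactly one class, so joining the tokens reproduces the input.
TOKEN_RE = re.compile(
    r"\w+"        # a run of word characters, e.g. "hundred"
    r"|[^\w\s]+"  # a run of punctuation, e.g. "." or an opening quote
    r"|\s+"       # a run of whitespace
)

def tokenize(text):
    """Split text into word, punctuation and whitespace tokens."""
    return TOKEN_RE.findall(text)

print(tokenize("eight."))         # ['eight', '.']
print(tokenize("“One hundred "))  # ['“', 'One', ' ', 'hundred', ' ']
```

One trade-off of this simple version is that it would also split "1.5" into three tokens, so a real fix would still need to treat punctuation between digits specially.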

JulesGM commented 1 year ago

I have this problem too. If a sentence ends with a number word followed by a period, the number is not converted, e.g. "What is two plus seven ? The answer is nine." -> "What is 2 plus 7? the answer is nine."

JulesGM commented 1 year ago

@ShailChoksi

JulesGM commented 1 year ago

The bug is in the lexer

JulesGM commented 1 year ago

The error is likely in split_glues, as the dot in "nine." is identified as part of the word.

JulesGM commented 1 year ago

So, the function first finds the separators. A separator is either whitespace or a punctuation mark that has a non-digit character on both sides.

The problem is that once the last separator has been found, the function returns the rest of the sequence as is, even if it contains punctuation. This happens when the final word ends with punctuation: the pattern only matches punctuation that has a non-digit character on both sides, and at the end of the text nothing follows the punctuation.
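
The behaviour can be reproduced with a small stand-alone sketch. The pattern below is an assumption written to match the description above (whitespace, or punctuation with a non-digit on each side); the actual pattern in split_glues may differ, and SEPARATOR_RE and split_words are names made up for this example.

```python
import re

# Assumed pattern for illustration only; the library's real one may differ.
# (?<=\D) and (?=\D) each need an actual character to look at, so they cannot
# succeed at the very start or the very end of the text.
SEPARATOR_RE = re.compile(r"\s+|(?<=\D)[.,;:!?](?=\D)")

def split_words(text):
    """Return the pieces of text that lie between separators."""
    words = []
    start = 0
    for match in SEPARATOR_RE.finditer(text):
        if match.start() > start:
            words.append(text[start:match.start()])
        start = match.end()
    # Whatever follows the last separator is returned unchanged.
    if start < len(text):
        words.append(text[start:])
    return words

print(split_words("nine. Then ten"))       # ['nine', 'Then', 'ten']
print(split_words("The answer is nine."))  # ['The', 'answer', 'is', 'nine.']
print(split_words("1.5"))                  # ['1.5']
```

In the middle of the text the dot is followed by a space, so it is treated as a separator. At the end there is no following character, the lookahead fails, and the dot stays attached to "nine".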

I added code to extract the non-punctuation characters in the final word. This seems to correct the problem.

JulesGM commented 1 year ago

Added a check that there is still text left to work on; without it, the code would crash.
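
For readers without the screenshots, here is a rough sketch of the kind of change described: peel trailing punctuation off the final word, and skip the step when nothing is left. This is not the code from the pull request; the function names and the ("word", ...) / ("glue", ...) token shape are assumptions made for illustration.

```python
import string

def split_trailing_punctuation(word):
    """Return (core, trailing) where trailing is the ASCII punctuation at the end."""
    core = word.rstrip(string.punctuation)
    return core, word[len(core):]

def final_tokens(tail):
    """Turn the text after the last separator into word and glue tokens."""
    # Guard: there may be no text left after the last separator.
    if not tail:
        return []
    core, trailing = split_trailing_punctuation(tail)
    tokens = []
    if core:
        tokens.append(("word", core))
    if trailing:
        tokens.append(("glue", trailing))
    return tokens

print(final_tokens("nine."))  # [('word', 'nine'), ('glue', '.')]
print(final_tokens("nine"))   # [('word', 'nine')]
print(final_tokens(""))       # []
```

Note that string.punctuation only covers ASCII characters, so this sketch would not strip a trailing curly quote, and it does not handle punctuation at the start of a word.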

JulesGM commented 1 year ago

Passes all tests. Pull request at https://github.com/ShailChoksi/text2digits/pull/48