Switch to a lex-parse approach

JanSellner commented 4 years ago

This is a major rewrite of the library with a switch to a "lex and parse" approach. The idea is to simplify the implementation by introducing a multi-step process (inspired by how compiler work):

Lexer: the first step is to assign each word in the input sequence a token. A token knows something about the word like its type or its value. The focus is only on an individual word and not on its surrounding or how it should be combined.
Parser: the second step takes care of the combination of tokens to calculate the value of a digit. This is based on some predefined rules. Each rule first checks how many tokens it wants to consume and then applies an action on these tokens to create a new token. Currently, there are only two rules: a combination rule to handle cases like forty-two where we need to calculate something and a second concatenation rule which handles cases like twenty twenty, i.e. years where we only need to concatenate numbers.

I also took the chance and split the implementation into several files.

This is, however, not just a rewrite but also introduces new functionality/fixes some bugs. For a start, the issues #25 and #26 are fixed automatically, i.e. without further adjustments. I also added several new test cases, some of them did not work before (see attached test log). A general new feature is the support of literals in the input string to handle cases like 2.5 thousand -> 2500.

The goal was to ease the implementation process by reducing complexity making it easier to fix bugs or add new features. I hope you like it :-)

Test Results - pytest_in_tests.zip

fersarr commented 4 years ago

Interesting. The approach seems a bit more verbose but perhaps it helps maintainability. I imagine the new approach is also convenient to handle floats ('thirty-nine point seven' or 'thirty-nine and a half') and negatives ('minus nine' or 'negative nine')?

fersarr commented 4 years ago

Could you please also add the new padding tests that I added in my PR #27?

ShailChoksi commented 4 years ago

This is an amazing re-write! Thank you for taking time and doing this. I am still reviewing this but maintainability and extensibility will improve with this approach

JanSellner commented 4 years ago

Could you please also add the new padding tests that I added in my PR #27?

I added the test for the time formats (which also revealed a bug in the code which is now also fixed). I did not add the keep_zero_padding option, though, because I think that this library should not alter existing numbers in the text if they are not combined with number words. If someone wants to remove zero paddings, this is probably an additional processing step.

JanSellner commented 4 years ago

Interesting. The approach seems a bit more verbose but perhaps it helps maintainability. I imagine the new approach is also convenient to handle floats ('thirty-nine point seven' or 'thirty-nine and a half') and negatives ('minus nine' or 'negative nine')?

Yes, this is exactly the kind of new feature which should be easy to implement: we only have to think about new token types (e.g. sign) and then either adjust the existing rules or add a new one to handle these cases. I can take a look into it :-)

ShailChoksi commented 4 years ago

@JanSellner Is there any work being done on this or can I merge this in?

JanSellner commented 4 years ago

From my side it is done regarding this PR. For future extensions (e.g. decimal numbers), I would make a new PR.

fersarr commented 4 years ago

@ShailChoksi no rush, but when do you expect the new release with these fixes to come out?

ShailChoksi commented 4 years ago

@fersarr just now! done :)

fersarr commented 4 years ago

Wow great! thank you @ShailChoksi !

ShailChoksi / text2digits

Switch to a lex-parse approach #28