Closed fquirin closed 3 years ago
@rtxm do you think it's likely that this PR will end up in master? If not, I'll probably have to keep working with the fork, and the versions might diverge too far for another PR at some point :-/
Thank you for this valuable contribution! Merging now!
I've taken the existing 'german-support-proposal' branch, updated it to include the latest master changes and tried to fix the most urgent open tasks :sweat_smile:
This work builds on: https://github.com/allo-media/text2num/pull/46
Since German language support is taking the current library architecture to its limits, I've tried to introduce a few more interfaces and methods:
- `WordStreamValueParserInterface` as base class for all `WordStreamValueParser`s (Common and German for now)
- `push`, `parse` and `value` as methods for the interface. `parse` is meant to be used if tokenize-push is "complicated" or if the whole text should be parsed at once anyway
- `split_number_word` method added to the 'Language' base class to handle cases where a number-word is composed of multiple parts that need to be processed individually, e.g. the German "zweihunderteinundfünfzig" -> "zwei hundert ein und fünfzig" (251)

Overall I think German support is not very well optimized yet, but I managed to survive all the test cases ^^.
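To illustrate the interface and the `split_number_word` idea described above, here is a minimal toy sketch. It is not the code from this PR: the method signatures, the tiny vocabulary, the greedy longest-match splitting, and the `ToyGermanParser` class are all assumptions made for illustration only.

```python
from abc import ABC, abstractmethod


class WordStreamValueParserInterface(ABC):
    """Hypothetical shape of the interface; signatures are assumptions."""

    @abstractmethod
    def push(self, word: str) -> bool:
        """Consume one token; return False if it is not part of a number."""

    @abstractmethod
    def parse(self, text: str) -> bool:
        """Consume a whole text at once."""

    @property
    @abstractmethod
    def value(self) -> int:
        """Value of the number parsed so far."""


# Toy vocabulary, far from complete (assumption for this example).
UNITS = {"ein": 1, "zwei": 2, "drei": 3, "fünf": 5}
TENS = {"zwanzig": 20, "fünfzig": 50}
# Longest parts first, so "fünfzig" is tried before "fünf".
PARTS = sorted([*UNITS, *TENS, "hundert", "und"], key=len, reverse=True)


def split_number_word(word: str) -> str:
    """Split a compound number-word into space-separated parts.

    Returns the word unchanged if it cannot be fully decomposed.
    """
    parts = []
    rest = word.lower()
    while rest:
        match = next((p for p in PARTS if rest.startswith(p)), None)
        if match is None:
            return word
        parts.append(match)
        rest = rest[len(match):]
    return " ".join(parts)


class ToyGermanParser(WordStreamValueParserInterface):
    """Handles only numbers below 1000 built from the toy vocabulary."""

    def __init__(self) -> None:
        self._total = 0
        self._pending = 0  # unit seen but not yet added to the total

    def push(self, word: str) -> bool:
        if word in UNITS:
            self._pending = UNITS[word]
        elif word in TENS:
            self._total += TENS[word]
        elif word == "hundert":
            self._total += (self._pending or 1) * 100
            self._pending = 0
        elif word != "und":
            return False
        return True

    def parse(self, text: str) -> bool:
        # Split every compound first, then feed the parts to push().
        return all(
            self.push(token)
            for word in text.split()
            for token in split_number_word(word).split()
        )

    @property
    def value(self) -> int:
        return self._total + self._pending
```

With this sketch, `split_number_word("zweihunderteinundfünfzig")` gives `"zwei hundert ein und fünfzig"`, and a fresh `ToyGermanParser` that parses the same word ends up with `value == 251`. `parse` is the convenient entry point here because the caller cannot tokenize the compound without language-specific knowledge.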
I strongly recommend labeling this 'German BETA' but including it in the next version, so we can start improving the general architecture (German will not be the only language with these issues; I'm thinking of Turkish, for example) and iterate more easily. If master keeps evolving without these changes, the next attempt to add German support will start from zero again :-/
[EDIT] Btw I've fixed the errors shown by `mypy`, but there is one in `transforms.py` that makes no sense ^^