Alternative parser using a standard tokenizer with multiple customization possible

pombredanne commented 8 years ago

This is rough proof of concept that supports more or less:

11 and #18 (alternative operators)
9 (feeding sequences to the parser with via custom tokenizers)
19 (custom symbols)

It has these benefits IMHO:

Use the standard lib "tokenize" module
Accept arbitrary Python names as symbols
Support and, not, or, !, ~, +, *, &, | as operators in any combination
Parse comments and multi-line expressions
Accept any white spaces
Errors are reported with row and columns
Accept alternative tokenizer, allowing to have custom symbols and operators
Accept custom symbols class

pombredanne commented 8 years ago

@Kronuz @bastikr feedback wanted. This is just a first quick shot

pombredanne commented 8 years ago

in particular check the rough test to get a feel of the results: https://github.com/bastikr/boolean.py/pull/23/files#diff-63c8b2556011726f89a5fe31e9c596f0R29 FWIW all the tests pass.

pombredanne commented 8 years ago

Note that in contrast with #18 it accepts any combos of operators. The part that is not there and should be borrowed from #18 is the support for and and or magics.

One addition that would be needed is to customize the standard representation of a given parsed expression to have some alternative to the default +*~ operators such as and or not or `!|&~' or anything else when serializing back to an expression string, and include some pretty printing there too./

bastikr commented 8 years ago

This looks nice at a first glance, but I don't have the time to look at it in detail. But if it passes the tests and @Kronuz agrees please go ahead and merge it.

pombredanne commented 8 years ago

@Kronuz This is rebased on the latest master and includes improvement based on your feedback Note that it should everything and more that you have in #18

pombredanne commented 8 years ago

@Kronuz btw adding support for dotted name would be an easy one. Either in the base tokenizer or in a custom one. Using a stack and testing the token type to token.DOT https://docs.python.org/2/library/token.html#token.DOT

Kronuz commented 8 years ago

@pombredanne, what do you think about the class based tokenizer? I can work on it after you merge this one if you want.

bastikr commented 8 years ago

Is a class based tokenizer really necessary? I feel like its use cases are limited and it would just make the code more bloated. If someone really has to support a different style it's probably easier and straight forward for him to do some string replacements instead. What do you think @pombredanne ?

Kronuz commented 8 years ago

Maybe, although I don't believe a class would make it more bloated, it'd basically be one more line and encapsulate the globals.

Nonetheless, I would still argue the function should return actual tokens instead of stings. It could return boolean.TRUE, Symbol when it's the case and intermediate token objects LPAR, RPAR, AND, OR, etc. in other cases... then the parser would be more straightforward and probably more efficient as well (and no need for is_symbol).

pombredanne commented 8 years ago

@Kronuz you have a good point about simplifying the parsing and globals, but a class may not be always needed as @bastikr pointedd . A callable could be generic enough and could be implemented as a class too. Let me come with a refined implementation

pombredanne commented 8 years ago

@bastikr @Kronuz See the latest refinements. The parser accepts either a string and uses the standard tokenizer or an iterable of 4-tuple of objects. The standard tokenizer return an iterable of 4-tuple of objects too. Custom tokenizers just need to feed such an iterable to the parse function using whatever you like (classes, functions, etc.)

Kronuz commented 8 years ago

@pombredanne I like it, it's cleaner this way :+1:. Could it be possible for you to add to the custom tokenizer in the tests (as example), something that would make it accept symbols with dot and colons as part of the symbol? (i.e. symbols of the form namespace:class.name).

pombredanne commented 8 years ago

@Kronuz definitely. I will include a clean example tokenizer that accepts any combination of colon and . separated identifiers as a single symbol. @bastikr If you are ok with this I will rebase then merge in master sometimes tomorrow

bastikr commented 8 years ago

If both of you are satisfied with it please go ahead and merge it.

pombredanne commented 8 years ago

@Kronuz The latest commit includes an extensive test for a tokenizer recognizing symbols with dot and colons and symbols of the form namespace:class.name

Kronuz commented 8 years ago

Great! I just Added a raw commit fixing a few small issues in this merge (Python 2 and 3 compatibility, some PEP 8 and importing cleanups)

bastikr / boolean.py

Alternative parser using a standard tokenizer with multiple customization possible #23

11 and #18 (alternative operators)

9 (feeding sequences to the parser with via custom tokenizers)

19 (custom symbols)