customize TOKENS

gemerden commented 7 years ago

In the tokenize() method of BooleanAlgebra, would it be possible to customize the tokens without overriding the whole method just to change them? For example:

def tokenize(self, expr, TOKENS=None):
    ...
    TOKENS = TOKENS or {
         # current TOKENS
    }
    ...

Or perhaps define the current tokens outside the method and make them the default TOKENS instead of None above.

This makes it less likely that in future versions the inheriting class becomes outdated.
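A minimal sketch of the second option, with the defaults defined at module level. The names MiniAlgebra and DEFAULT_TOKENS, the token ids, and the token table are hypothetical stand-ins rather than the library's actual definitions, and the toy tokenize() only recognizes single-character operators:

```python
# Hypothetical stand-ins for the token ids that boolean.py defines.
TOKEN_AND, TOKEN_OR, TOKEN_NOT, TOKEN_LPAR, TOKEN_RPAR = range(1, 6)

# Defaults live outside the method (illustrative subset, not the real table).
DEFAULT_TOKENS = {
    '&': TOKEN_AND,
    '|': TOKEN_OR,
    '~': TOKEN_NOT,
    '(': TOKEN_LPAR,
    ')': TOKEN_RPAR,
}


class MiniAlgebra(object):
    """Toy class that only shows how the token table is selected."""

    def tokenize(self, expr, TOKENS=None):
        # Fall back to the shared defaults when no custom table is passed.
        tokens = DEFAULT_TOKENS if TOKENS is None else TOKENS
        for position, char in enumerate(expr):
            # Toy behaviour: anything that is not an operator is skipped.
            if char in tokens:
                yield tokens[char], char, position


algebra = MiniAlgebra()
default_result = list(algebra.tokenize('a&b'))
custom_result = list(algebra.tokenize('!a&b', TOKENS={'!': TOKEN_NOT}))
```

Testing with "is None" instead of "TOKENS or {...}" means an explicitly passed empty table is respected rather than silently replaced by the defaults.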

Cheers, Lars

pombredanne commented 7 years ago

@gemerden Thanks. This is an easy change that makes a lot of sense. Just out of curiosity, what custom TOKENS would you need? BTW, slightly related, here is an example of a tokenizer that uses custom tokens (and uses a trie/Aho-Corasick automaton for token recognition): https://github.com/nexB/license-expression/blob/f3421c1a1f409249ba86a16b7b46c2e987f6ab35/src/license_expression/__init__.py#L409

gemerden commented 7 years ago

@pombredanne: I only use '|', '&', '!', '(' and ')', and I use e.g. '*' for something else (as a wildcard). I needed to change more in tokenize(); roughly, everything that is not a token I accept as a symbol, but I need to do some more testing. Currently it looks like this:

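# Names provided by boolean.py. WildSymbol, SET_OR, SET_AND and SET_NOT are
# project-specific classes that are not shown here.
from boolean.boolean import (
    BooleanAlgebra, ParseError, PARSE_UNKNOWN_TOKEN,
    TOKEN_AND, TOKEN_OR, TOKEN_NOT, TOKEN_LPAR, TOKEN_RPAR, TOKEN_SYMBOL,
)


class KeyParser(BooleanAlgebra):

    DEFAULT_TOKENS = {
        '&': TOKEN_AND,
        '|': TOKEN_OR,
        '!': TOKEN_NOT,
        '(': TOKEN_LPAR,
        ')': TOKEN_RPAR,
    }

    def __init__(self, TOKENS=None, *args, **kwargs):
        super(KeyParser, self).__init__(Symbol_class=WildSymbol,
                                        OR_class=SET_OR,
                                        AND_class=SET_AND,
                                        NOT_class=SET_NOT,
                                        *args, **kwargs)
        # Use the caller's token table if given, else the class defaults.
        self.TOKENS = TOKENS or self.DEFAULT_TOKENS

    def tokenize(self, expr):
        # basestring exists on Python 2 only; use str on Python 3.
        if not isinstance(expr, basestring):
            raise TypeError('expr must be string but it is %s.' % type(expr))
        TOKENS = self.TOKENS
        length = len(expr)
        position = 0
        while position < length:
            tok = expr[position]

            # Any character that is not a token starts a symbol.
            sym = tok not in TOKENS
            if sym:
                # Consume characters until the next token character.
                position += 1
                while position < length:
                    char = expr[position]
                    if char not in TOKENS:
                        position += 1
                        tok += char
                    else:
                        break
                # Step back so position is the symbol's last character.
                position -= 1

            try:
                yield TOKENS[tok], tok, position
            except KeyError:
                if sym:
                    yield TOKEN_SYMBOL, tok, position
                else:
                    raise ParseError(token_string=tok, position=position, error_code=PARSE_UNKNOWN_TOKEN)
            position += 1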

By using sym = tok not in TOKENS, I leave open the possibility of putting more (or different) syntax inside the symbols. When I am happy with my project I'll make the repo public and share the link here.
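To make the behaviour concrete, here is a standalone Python 3 sketch of the same idea that runs without boolean.py. The token ids and the free function tokenize() are hypothetical stand-ins, and it handles single-character tokens only. As in the class above, a symbol is reported at the position of its last character, and whitespace is not stripped, so it ends up inside symbols:

```python
# Hypothetical stand-ins for the token ids that boolean.py defines.
(TOKEN_AND, TOKEN_OR, TOKEN_NOT,
 TOKEN_LPAR, TOKEN_RPAR, TOKEN_SYMBOL) = range(1, 7)

TOKENS = {
    '&': TOKEN_AND,
    '|': TOKEN_OR,
    '!': TOKEN_NOT,
    '(': TOKEN_LPAR,
    ')': TOKEN_RPAR,
}


def tokenize(expr, tokens=TOKENS):
    """Yield (token_id, token_string, position) triples for expr."""
    if not isinstance(expr, str):
        raise TypeError('expr must be string but it is %s.' % type(expr))
    position = 0
    length = len(expr)
    while position < length:
        char = expr[position]
        if char in tokens:
            # A known single-character operator or parenthesis.
            yield tokens[char], char, position
            position += 1
            continue
        # Everything up to the next token character forms one symbol.
        start = position
        while position < length and expr[position] not in tokens:
            position += 1
        yield TOKEN_SYMBOL, expr[start:position], position - 1


# '*' is not a token here, so it stays inside the symbol as a wildcard.
result = list(tokenize('a*|!(b&c)'))
```

Here result starts with (TOKEN_SYMBOL, 'a*', 1), followed by the '|' and '!' operators at positions 2 and 3.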

pombredanne commented 7 years ago

@gemerden OK, check also this other simpler tokenizer: https://github.com/nexB/license-expression/blob/master/src/license_expression/__init__.py#L1127

gemerden commented 7 years ago

Thanks, the code above is passing all my tests, so for now I am OK.

pombredanne commented 7 years ago

ok, your call. You can send a PR or close this as you like.

bastikr / boolean.py

customize TOKENS #74