dabeaz / sly

Sly Lex Yacc

Column index property on Token class. #98

Closed jpsnyder closed 2 years ago

jpsnyder commented 2 years ago

I'm trying to figure out an easy way to add a column_index property to the Token object, but patching this in is proving to be very complex.

Since the Lexer class keeps the input text as self.text on its instance, I thought it might be easy to attach that text to each generated Token object and add a column_index property so the column can be computed on demand. By using a property, we avoid pre-calculating the column for every token, but it is still available when we need it (such as for error reporting):

class Token(object):
    '''
    Representation of a single token.
    '''
    __slots__ = ('text', 'type', 'value', 'lineno', 'index')

    def __init__(self, text: str):
        self.text = text

    def __repr__(self):
        return f'Token(type={self.type!r}, value={self.value!r}, lineno={self.lineno}, column_index={self.column_index}, index={self.index})'

    @property
    def column_index(self) -> int:
        """Determines column index of given token"""
        last_cr = self.text.rfind('\n', 0, self.index)
        if last_cr < 0:
            last_cr = 0
        column = (self.index - last_cr) + 1
        return column

Please let me know your thoughts on this.

If you don't want to include such a feature, perhaps we could add a way to supply our own alternative Token implementation, along with a hook into token construction, so we can do whatever we want? Something like:

class MyLexer(sly.Lexer):
    Token = MyCustomToken

    def build_token(self, token):
        token.text = self.text
        return token

dabeaz commented 2 years ago

The lexer in SLY is structured as a generator. One way to implement this is to write another generator function that filters/rewrites the token stream.

class MyCustomToken:
     ...

def as_custom_tokens(tokens):
    for tok in tokens:
        yield MyCustomToken(...)

You'd wrap the original token stream with as_custom_tokens(). For example:

parser.parse(as_custom_tokens(lexer.tokenize(text)))

It's probably not the only way to do it, but this general approach can be useful for modifying any aspect of the input token stream.
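
As a minimal sketch of that idea (the COMMENT token type below is just an assumed example, not something SLY defines), the same pattern can also filter tokens out of the stream, and these wrappers compose because each one consumes and yields a token stream:

def without_comments(tokens):
    # Pass through everything except the (hypothetical) COMMENT tokens.
    for tok in tokens:
        if tok.type != 'COMMENT':
            yield tok

# Wrappers stack naturally:
# parser.parse(without_comments(as_custom_tokens(lexer.tokenize(text))))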

jpsnyder commented 2 years ago

I'm aware of this technique (my lexers already include heavily customized tokenize() routines), but I didn't think of wrapping the tokenizer outside of the call as well. Brain fart.

I was trying to avoid duplicating code across all of the lexers in my project, so this seems reasonable to do since there is only a single call to .parse().

For anyone interested, I ended up writing something like the following, which reports the column instead of the raw index when the token is printed in error messages and the like.

from dataclasses import dataclass
from typing import Any


def _as_custom_tokens(text, tokenizer):
    """
    Customizes tokens coming from the tokenizer to include column indexing.
    """
    @dataclass
    class _Token:
        type: str
        value: Any
        lineno: int
        index: int

        def __repr__(self):
            return f'Token(type={self.type!r}, value={self.value!r}, lineno={self.lineno}, column={self.column})'

        @property
        def column(self) -> int:
            """Determines column index of given token"""
            last_cr = text.rfind('\n', 0, self.index)
            if last_cr < 0:
                last_cr = 0
            return (self.index - last_cr) + 1

    for token in tokenizer:
        yield _Token(token.type, token.value, token.lineno, token.index)

...

lexer = get_lexer(language)
parser = get_parser(language)

tokens = _as_custom_tokens(code, lexer.tokenize(code))
root_node = parser.parse(tokens)
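
One place the column property pays off is in the parser's error hook. The snippet below is only a sketch meant to live inside whatever Parser subclass get_parser() returns, but since SLY's Parser.error() receives the offending token (or None at end of input), it can report the column directly:

    def error(self, token):
        # token is one of the wrapped _Token objects from above, or None at end of input
        if token is None:
            print("Syntax error at end of input")
        else:
            print(f"Syntax error at line {token.lineno}, column {token.column}: {token.value!r}")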

Thanks!