dabeaz / sly

Sly Lex Yacc

Add IGNORECASE ability to Token Remapping. #67

Open manux81 opened 3 years ago

manux81 commented 3 years ago

With reference to https://github.com/dabeaz/sly/issues/62, I've found a problem with token remapping when reflags = re.IGNORECASE is set. This change lets the special cases ignore capitalization when reflags = re.IGNORECASE is set in the Lexer class.

Example:

import re
import sly

class MyLexer(sly.Lexer):
    reflags = re.IGNORECASE
    tokens = { ID, IF, ELSE, WHILE }

    # Base ID rule
    ID = r'[a-zA-Z_][a-zA-Z0-9_]*'

    # Special cases
    ID['if'] = IF
    ID['else'] = ELSE
    ID['while'] = WHILE
    # Now if, If, IF, iF are handled as the same special case

jpsnyder commented 2 years ago

I'm running into this issue too. Any update on getting this merged in?

manux81 commented 2 years ago

Hi @jpsnyder, no, @dabeaz still has not replied to me.

dabeaz commented 2 years ago

Apologies on the delay. I do NOT see this change making it into SLY because of its special purpose nature and potential impact on performance. However, it is possible to achieve the desired effect using a function:

keywords = { 'if', 'else', 'while' }

# Inside the Lexer class:
@_(r'[a-zA-Z_][a-zA-Z0-9_]*')
def ID(self, t):
    # Retype keywords regardless of capitalization, e.g. 'If' becomes IF
    if t.value.lower() in keywords:
        t.type = t.value.upper()
    return t

dabeaz commented 2 years ago

I would just note that I'm going to keep this open for now. Even though I'm leaning towards not doing this, I might reconsider. I just don't have any timeline for it. In the meantime, use the function as a workaround.
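
The keyword check in the workaround can be tried outside SLY. The following is a minimal sketch using only the standard `re` module. The helper `classify` and the constants `KEYWORDS` and `ID_PATTERN` are hypothetical names for illustration and are not part of SLY. It only mirrors the logic of the `ID` function: lower-case the matched text to test membership, then use the upper-cased spelling as the token type.

```python
import re

# Keywords to recognize regardless of capitalization (same set as the workaround).
KEYWORDS = {'if', 'else', 'while'}

# Same identifier pattern as the ID rule in the thread.
ID_PATTERN = re.compile(r'[a-zA-Z_][a-zA-Z0-9_]*')


def classify(text):
    """Return (type, value) pairs for each identifier-like word in text.

    Hypothetical helper standing in for the lexer's ID rule.
    """
    tokens = []
    for match in ID_PATTERN.finditer(text):
        value = match.group()
        # Mirror the workaround: keywords are retyped by their upper-cased spelling.
        tok_type = value.upper() if value.lower() in KEYWORDS else 'ID'
        tokens.append((tok_type, value))
    return tokens


print(classify('If x ELSE y_1 wHiLe'))
```

The original spelling stays available in the value, so a parser can still report `wHiLe` as written while treating it as a `WHILE` token.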

jpsnyder commented 2 years ago

Fair enough. That workaround works for me. (And actually is better because I can customize the generated token to my liking.)

My only argument for accepting the MR would be that it seems like expected behavior for token remapping to be case-insensitive if the user sets the re flag to IGNORECASE. Principle of least surprise.

manux81 commented 2 years ago

I agree with @jpsnyder. Also, using reflags this way sounds like the flex approach.

jpsnyder commented 2 years ago

I'm not sure about the performance hit for something like this, but would it make sense for the string value within the brackets for token remapping to actually be a regex pattern like everything else? In that case, the regex pattern would be applied to the matches of the main token being remapped (making use of the .fullmatch() function).

This would solve the case sensitivity issue and provide more flexibility while still technically supporting the old way. It could also allow us to apply different re flags to the token mapping than what is set globally.

ID["if"] = IF
ID["else"] = ELSE
ID["(?i)while"] = WHILE   # case-insensitive mapping
ID["def[0-9]"] = DEF   # def followed by a number

On the other hand, this is probably feature creep...
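
For illustration, the pattern-based remapping idea can be sketched outside SLY with the standard `re` module. This is an assumption about how such a feature might behave, not SLY's implementation: `REMAP_RULES` and `remap` are hypothetical names, and the rules are simply tried in order with `fullmatch()` against text that the base ID rule has already matched.

```python
import re

# Hypothetical remap table: (compiled pattern, new token type), tried in order.
REMAP_RULES = [
    (re.compile(r'if'), 'IF'),
    (re.compile(r'else'), 'ELSE'),
    (re.compile(r'(?i)while'), 'WHILE'),   # inline flag: case-insensitive
    (re.compile(r'def[0-9]'), 'DEF'),      # "def" followed by one digit
]


def remap(value, default='ID'):
    """Return the remapped token type for text already matched by the ID rule."""
    for pattern, tok_type in REMAP_RULES:
        # fullmatch() requires the whole value to match, so 'define' is not DEF.
        if pattern.fullmatch(value):
            return tok_type
    return default


for word in ['if', 'IF', 'While', 'def7', 'define']:
    print(word, '->', remap(word))
```

In this sketch every identifier is checked against each rule in turn, whereas an exact-string remapping can use a single dictionary lookup. That difference is one plausible source of the performance concern raised earlier in the thread.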