dabeaz / sly

Sly Lex Yacc

Forcing a fixed (but smaller) regex prior to m.match()/m.group() at a certain state? #23

Closed egberts closed 5 years ago

egberts commented 5 years ago

Maybe I bit off more than I can chew, but I'm trying to develop a BIND9 configuration parser using Sly (and formerly Ply).

The basic problem is Sly/Ply's automatic typing of multiple ID (identifier) tokens: should I generalize all my variable fields into a single ID type, given the constraints imposed by the Sly (or Ply) design?

BIND9 configuration is a weird assortment of C-style/Python-style comments, include statements, alias dictionaries, nested LBRACE/RBRACE blocks, and newlines that are ignored because SEMICOLON is the statement terminator. I got all of that working except for one thing: discriminating between ID types (via multiple token regexes).

    include named-options.conf;
    server example.com;

My first attempt to specialize that generic ID token was to break it up into multiple ID-type tokens, defining SERVER ALIASNAME and INCLUDE FILESPEC using:

    t_SERVER_NAME = r'[A-Za-z0-9_\-\.]*'
    t_FILESPEC = r'([/\\:\._\-0-9A-Za-z]+)(?=[ \t]*;)'

I ran into the classic problem where, in a certain state, the lexer identifies the "ID" as the wrong token type.

After much reading of Google Groups, StackOverflow, and GitHub forums/issues, I've concluded that any attempt to discriminate between identifiers (variable, alias name, full domain name) at the token level is futile, because a regex cannot reliably tell these identifiers apart.

Then I thought: why not forcibly pre-assign a smaller regex at initialization time for that certain state (heck, for most states)?

At any rate, I see three choices ahead of me:

  1. Is there a way to pre-select a single (but smaller) token regex after entering the next state, instead of using the more generalized multi-token ID type identification regex?

  2. Or is verification of the variable naming convention (using just t_ID) best done inside the state-specific parser function (i.e., p_clause_server and p_clause_include) and not at the token level (i.e., t_FILESPEC and t_SERVER_NAME)?

  3. Or did I overlook another tip?
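For illustration, choice 2 could look roughly like the sketch below. It is plain Python using only `re`, not Sly or Ply code: `check_statement`, `VALIDATORS`, and the two compiled patterns are hypothetical names, and the patterns are loosely adapted from t_SERVER_NAME and t_FILESPEC above. The idea is that the lexer hands over one generic ID value and the grammar action for each clause decides whether the value is acceptable in that context.

```python
import re

# Hypothetical per-clause validators, loosely adapted from the
# t_SERVER_NAME and t_FILESPEC patterns in the post.
SERVER_NAME_RE = re.compile(r"[A-Za-z0-9_.-]+")
FILESPEC_RE = re.compile(r"[/\\:._0-9A-Za-z-]+")

VALIDATORS = {
    "server": SERVER_NAME_RE,
    "include": FILESPEC_RE,
}


def check_statement(keyword, ident):
    """Validate a generic ID value against the rule for its clause.

    In a real parser this check would sit inside the grammar action
    for the clause (e.g. p_clause_server or p_clause_include).
    """
    pattern = VALIDATORS.get(keyword)
    if pattern is None:
        raise ValueError(f"unknown keyword {keyword!r}")
    # fullmatch() requires the whole identifier to fit the pattern.
    if not pattern.fullmatch(ident):
        raise ValueError(f"{ident!r} is not a valid argument for {keyword!r}")
    return keyword, ident
```

With this arrangement, a path such as `/etc/named.conf` is accepted after `include` but rejected after `server`, without the lexer ever needing to know which clause it is in.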

If I can nail this, the NGINX configuration file format will soon follow, and I can post the result in its entirety here on GitHub for other security researchers to use.

dabeaz commented 5 years ago

Defining overlapping regexes (i.e., patterns that match the same text) is always going to be problematic to manage. It's probably better to handle it using a single tokenizing rule (i.e., t_ID) and to handle special cases there.
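A minimal sketch of that idea, written with the standard `re` module rather than Sly's own Lexer class so that it runs without any third-party package. The names `TOKEN_RE`, `KEYWORDS`, and `tokenize` are assumptions made for this example, and the ID character class is a guess at what BIND9 identifiers and file names need. One broad ID pattern matches everything identifier-like, and the special cases (here, the keywords) are sorted out after the match:

```python
import re

# A single broad identifier pattern; nothing else overlaps with it.
TOKEN_RE = re.compile(r"""
    (?P<ID>[A-Za-z0-9_./\\:-]+)
  | (?P<SEMI>;)
  | (?P<WS>\s+)
""", re.VERBOSE)

# Special cases handled after matching: reserved words become their
# own token types, everything else stays a plain ID.
KEYWORDS = {"include": "INCLUDE", "server": "SERVER"}


def tokenize(text):
    """Yield (type, value) pairs for the input text."""
    pos = 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise SyntaxError(f"bad character {text[pos]!r} at offset {pos}")
        pos = m.end()
        kind = m.lastgroup  # name of the alternative that matched
        if kind == "WS":
            continue  # whitespace and newlines are not significant
        value = m.group()
        if kind == "ID":
            kind = KEYWORDS.get(value, "ID")
        yield kind, value
```

Run on the two statements from the original post, this yields INCLUDE, ID, SEMI, SERVER, ID, SEMI. Whether an ID is a file spec or a server name is then decided by the grammar rule that consumes it, not by the lexer.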