Closed: egberts closed this issue 5 years ago
Defining overlapping regexes (i.e., patterns that match the same text) is always going to be problematic to manage. It's probably better to handle it using a single tokenizing rule (i.e., t_ID) and to handle special cases there.
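To make that suggestion concrete, here is a minimal sketch in plain Python using only the standard `re` module, not Sly itself. The `KEYWORDS` table, the token names, and the sample input are all assumptions made for illustration. The idea is that a single generic ID pattern matches every bare word, and special cases are resolved by a lookup afterwards, so no two patterns compete for the same text.

```python
import re

# Hypothetical keyword table: bare words that get their own token type.
KEYWORDS = {"server": "SERVER", "include": "INCLUDE"}
PUNCT_NAMES = {";": "SEMICOLON", "{": "LBRACE", "}": "RBRACE"}

# One generic ID pattern; no specialised ID-like patterns compete with it.
TOKEN_RE = re.compile(r"""
    (?P<WS>\s+)
  | (?P<COMMENT>\#[^\n]*|//[^\n]*)
  | (?P<STRING>"[^"\n]*")
  | (?P<PUNCT>[;{}])
  | (?P<ID>[^\s;{}"]+)
""", re.VERBOSE)


def tokenize(text):
    """Return a list of (type, value) pairs for the given text."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character {text[pos]!r} at offset {pos}")
        pos = m.end()
        kind, value = m.lastgroup, m.group()
        if kind in ("WS", "COMMENT"):
            continue  # whitespace (including newlines) and comments are ignored
        if kind == "ID":
            # Special cases are handled here, after the generic match.
            kind = KEYWORDS.get(value, "ID")
        elif kind == "PUNCT":
            kind = PUNCT_NAMES[value]
        tokens.append((kind, value))
    return tokens


if __name__ == "__main__":
    for tok in tokenize('server ns1 { };  # hypothetical\ninclude "/etc/named.keys";'):
        print(tok)
```

Because the keyword lookup works on the whole matched word, a name such as `servers` stays a plain ID instead of being split into a keyword plus a leftover.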
Maybe I bit off more than I can chew, but I'm trying to develop a BIND9 configuration parser using Sly (and formerly Ply).
The basic problem is Sly/Ply's auto-typing of multiple ID (identifier) tokens, and whether I should generalize all my variable fields into just one ID type, given the constraints imposed by the sly (or ply) design.
BIND9 configuration is a weird assortment of C-style/Python-style comments, include statements, alias dictionaries, multiple LBRACE/RBRACE nesting, and ignored newlines, all centered on using SEMICOLON as a statement terminator. I got all of that working except for one thing: the ID type discriminator (via multi-token regex).
My first attempt to further subdivide/specialize that generic ID token was to break it up into multiple ID-type tokens and define SERVER ALIASNAME and INCLUDE FILESPEC using:
I ran into the classic problem where a certain state identifies the "ID" as the wrong token type.
After much reading of Google Groups, StackOverflow, and GitHub forums/issues, I've concluded that any attempt to discriminate between identifiers (variable, alias name, full domain name) at the token level is futile, because the regexes overlap and cannot reliably tell these identifiers apart.
Then I thought: why not forcibly pre-assign a smaller regex for that particular state (heck, for most states) at initialization time?
At any rate, I see three choices ahead of me:
Is there a way to pre-select a lone (but smaller) token regex after entering the next state, instead of using the more generalized multi-token ID type identification regex?
Or is verification of the variable naming convention (using just t_ID) best done inside the state-specific parser function (i.e., p_clause_server and p_clause_include) and not at the token level (i.e., t_FILESPEC and t_SERVER_NAME)?
Or did I overlook another tip?
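The second choice could look roughly like the sketch below. It is plain Python with the standard `re` module, not real Sly grammar methods: the function names are borrowed from the question, but the signatures, the `CHECKS` table, and both naming rules are assumptions for illustration, and BIND9's actual rules may differ. The lexer would hand over a generic ID value, and each clause function validates it against the rule for its own context.

```python
import re

# Hypothetical naming rules per grammar context; BIND9's real rules may differ.
CHECKS = {
    "server": re.compile(r"[A-Za-z_][A-Za-z0-9_-]*"),
    "include": re.compile(r"[A-Za-z0-9_./-]+"),
}


def check_id(context, value):
    """Validate a generic ID value against the rule for its grammar context."""
    if CHECKS[context].fullmatch(value) is None:
        raise SyntaxError(f"{value!r} is not a valid {context} argument")
    return value


def p_clause_server(name):
    # Stand-in for a grammar action: the lexer only said "ID",
    # the server-specific check happens here.
    return ("server", check_id("server", name))


def p_clause_include(filespec):
    # Same generic ID token, different rule because the context differs.
    return ("include", check_id("include", filespec))


if __name__ == "__main__":
    print(p_clause_server("ns1"))
    print(p_clause_include("/etc/named.keys"))
```

Note that `ns1` satisfies both rules, which is exactly why the lexer cannot pick between them: only the grammar context says which rule applies.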
If I can nail this, the NGINX configuration file format will soon follow, and I can post the result in its entirety here on GitHub for other security researchers to use.