Closed kleinj closed 10 years ago
What is your preference?
I was planning to use flex's start conditions to handle this, but I can understand why context-sensitive stuff could be inconvenient.
I would like to reorganize and augment your list of solutions with some other ideas:
flex's start conditions)
Acceptance: and the next HEADER: or --BODY-- as a single token, then call a separate parser. This is tricky and makes error reporting a pain.
[FI] and [0-9]+, making F and I keywords, and fixing the specification to also allow F and I everywhere IDENTIFIER is used. This makes the implementation details discussed in the previous solution explicit in our document, somehow forcing the implementation. The only win is that the number can be decoded in the scanner.
F and I by characters that cannot be mistaken as identifiers (what about + and -?)
Acceptance: 3 F{0} & (F {1} | F!{2})& I{2}). This way the scanner can match [IF][[:space:]]*!?{[0-9]+} as a token, and this won't conflict with an identifier. As braces are also used to specify acceptance sets in automata, there is some homogeneity.
I like the last two solutions better.
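To illustrate the brace-based proposal, here is a minimal Python sketch. It assumes that `\s` is an acceptable stand-in for the POSIX class `[[:space:]]` and that identifiers follow the pattern `[a-zA-Z_][0-9a-zA-Z_-]*`; the names `ACC_ATOM`, `IDENTIFIER` and `acceptance_atoms` are hypothetical and only serve the example.

```python
import re

# Assumed translation of [IF][[:space:]]*!?{[0-9]+} into Python's re syntax:
# \s replaces the POSIX class [[:space:]], and the braces are escaped.
ACC_ATOM = re.compile(r"[IF]\s*!?\{[0-9]+\}")

# Assumed identifier pattern (an assumption of this sketch).
IDENTIFIER = re.compile(r"[a-zA-Z_][0-9a-zA-Z_-]*")


def acceptance_atoms(text):
    """Return every brace-style acceptance atom found in text."""
    return ACC_ATOM.findall(text)


atoms = acceptance_atoms("F{0} & (F {1} | F!{2})& I{2}")
print(atoms)

# An identifier can never contain a brace, so "F{0}" is not an identifier,
# whereas the brace-free spelling "F0" is.
print(IDENTIFIER.fullmatch("F{0}"), IDENTIFIER.fullmatch("F0"))
```

Because the atom pattern always consumes the brace, it is strictly longer than any identifier match starting at the same position, so a longest-match scanner would not need context to tell them apart.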
For t and f I would either:
1. t and f are also valid identifiers (dealing with this doesn't really seem very difficult)
2. t and f are BOOLEAN constants, cannot be used as identifiers, but can appear in header lines as in HEADERNAME (INT|STRING|IDENTIFIER|BOOLEAN)*.
3. t and f can only be used in guard as (f) or (t), and have the scanner recognize those tokens instead. This looks clumsy if we consider aliases as well. (I don't like this.)
For t and f, my preference would be solution 2 (BOOLEAN).
For the acceptance condition, I have a preference for 2.ii (+0, -!2), even though it doesn't look that pretty. I'm not a fan of multiple tokenization (such as F{!0} with separate tokenization afterwards), as this restricts the places where comments or whitespace can be placed. But I wouldn't object to needing a context-sensitive lexer, if there is a preference for allowing F0 and I0 for aesthetic reasons.
There is another question that occurred to me just now: What should be the interpretation of the following?
acc-name: Ra--ABORT--
Is this just an identifier? I would be fine with that, as one can prepend a blank before --ABORT-- to abort an arbitrary stream without risking being inside an identifier. Another option would be to switch to ==BODY==, ==END== and ==ABORT==; then there is no ambiguity with identifiers. Either way, I don't think there is a way to insert an --ABORT-- into a non-parsed stream at an arbitrary location, as the stream could be in the middle of a comment or "quoted string". But that's probably not a big deal.
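The ambiguity can be reproduced with a few lines of Python. This is only a sketch: it assumes identifiers match `[a-zA-Z_][0-9a-zA-Z_-]*` (hyphens allowed after the first character), and `first_token` is a hypothetical helper, not part of any existing parser.

```python
import re

# Assumed identifier pattern; the hyphen is allowed after the first character.
IDENTIFIER = re.compile(r"[a-zA-Z_][0-9a-zA-Z_-]*")


def first_token(text):
    """Return the longest identifier prefix of text, or None if there is none."""
    m = IDENTIFIER.match(text)
    return m.group() if m else None


glued = first_token("Ra--ABORT--")    # the marker is absorbed into the identifier
spaced = first_token("Ra --ABORT--")  # a blank ends the identifier before the marker
equals = first_token("Ra==ABORT==")   # '=' cannot occur in an identifier

print(glued, spaced, equals)
```

Under that assumption, both the leading blank and the `==ABORT==` spelling stop the identifier at `Ra`, while the glued form is swallowed whole.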
Branch issue/32 contains two patches marking t/f as BOOLEAN and stating that --ABORT-- should be a separate token.
So we are left with the F and I stuff. I agree + and - aren't pretty. We should probably also aim for a solution that allows future extensions. For instance, what if in the future we want to add support for "occurrence acceptance" where a transition or state has to be visited at least once?
How about this: use F and I as functions with parentheses, as in F(0) & (F (1) | F(!2))& I(2). The grammar would use identifiers instead of F or I:
acceptance-cond ::= IDENTIFIER "(" "!"? INT ")"
| (acceptance-cond)
| acceptance-cond & acceptance-cond
| acceptance-cond | acceptance-cond
Our document would only specify the meaning of the identifiers F and I.
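As a rough check that this grammar works with a context-free lexer, here is a small recursive-descent sketch in Python. Several things are assumptions of the sketch rather than part of the proposal: `&` is taken to bind tighter than `|`, both operators are treated as left-associative, identifiers follow `[a-zA-Z_][0-9a-zA-Z_-]*`, and the names `tokenize` and `parse` as well as the tuple-based tree are hypothetical.

```python
import re

# One token per match: identifier, integer, or a single-character operator.
TOKEN = re.compile(
    r"\s*(?:(?P<ID>[a-zA-Z_][0-9a-zA-Z_-]*)|(?P<INT>[0-9]+)|(?P<OP>[()!&|]))"
)


def tokenize(text):
    """Split text into (kind, lexeme) pairs without any lexer context."""
    pos, out = 0, []
    text = text.rstrip()
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if not m:
            raise SyntaxError(f"unexpected character at offset {pos}")
        out.append((m.lastgroup, m.group(m.lastgroup)))
        pos = m.end()
    return out


def parse(tokens):
    """Parse an acceptance condition into nested tuples."""
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else (None, None)

    def expect(kind, value=None):
        nonlocal pos
        k, v = peek()
        if k != kind or (value is not None and v != value):
            raise SyntaxError(f"expected {value or kind}, got {v!r}")
        pos += 1
        return v

    def atom():
        # Either a parenthesized condition or IDENTIFIER "(" "!"? INT ")".
        if peek() == ("OP", "("):
            expect("OP", "(")
            node = disjunction()
            expect("OP", ")")
            return node
        name = expect("ID")
        expect("OP", "(")
        negated = peek() == ("OP", "!")
        if negated:
            expect("OP", "!")
        index = int(expect("INT"))
        expect("OP", ")")
        return (name, negated, index)

    def conjunction():
        node = atom()
        while peek() == ("OP", "&"):
            expect("OP", "&")
            node = ("&", node, atom())
        return node

    def disjunction():
        node = conjunction()
        while peek() == ("OP", "|"):
            expect("OP", "|")
            node = ("|", node, conjunction())
        return node

    tree = disjunction()
    if pos != len(tokens):
        raise SyntaxError(f"trailing input: {peek()[1]!r}")
    return tree


tree = parse(tokenize("F(0) & (F (1) | F(!2))& I(2)"))
print(tree)
```

The parser never looks at which identifier precedes the parenthesis, so a later extension (say, a hypothetical `Occ(3)`) would parse unchanged and only need its meaning specified.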
I like the F(0) & (F (1) | F(!2))& I(2) syntax, that's a good, extensible solution.
In regard to IDENTIFIER: ([a-eg-su-zA-Z_][0-9a-zA-Z_-]*|[tf][0-9a-zA-Z_-]+), I would prefer keeping the old regexp and just stating that t and f are not identifiers. That's usually easy to ensure by the order of the lexical rules for the lexer.
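The rule-ordering idea can be sketched in Python by mimicking the usual scanner-generator policy: take the longest match and, on a tie, the rule listed first. The "old" regexp is assumed here to be `[a-zA-Z_][0-9a-zA-Z_-]*`, and `RULES` and `classify` are hypothetical names for this example only.

```python
import re

# Rules in priority order: an earlier rule wins when two matches have equal length.
RULES = [
    ("BOOLEAN", re.compile(r"[tf]")),
    ("IDENTIFIER", re.compile(r"[a-zA-Z_][0-9a-zA-Z_-]*")),  # assumed "old" regexp
]


def classify(text):
    """Return (kind, lexeme) for the longest match at the start of text."""
    best = None
    for kind, pattern in RULES:
        m = pattern.match(text)
        # Only a strictly longer match replaces the current best,
        # so equal-length matches keep the earlier rule.
        if m and (best is None or len(m.group()) > len(best[1])):
            best = (kind, m.group())
    return best


results = [classify(s) for s in ("t", "f", "true", "t1", "x")]
print(results)
```

With this ordering, a lone `t` or `f` becomes a BOOLEAN while longer words starting with those letters remain identifiers, which gives the same split as the more complicated regexp.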
OK, I've reworked these patches not to change the IDENTIFIER regexp, and also changed all F(x) and I(x). Can you check those changes and see if you agree?
Great, I fixed a small typo, otherwise this looks fine. As a next step, I would extract the example automata into an examples subdirectory, so we can use them for testing parser implementations. OK?
Thanks for the proofreading. I've put those patches on master and deleted the branch.
I stumbled upon some small potential issues for the lexical analysis. In the current version of the grammar specification, the following would be a valid acceptance condition: Acceptance: 3 F0 & F 1 & F!2 & I2
The syntax without spaces (F0) however would normally be tokenized as an identifier by the lexer. We could either
I would like to avoid requiring a context-sensitive lexer. If we make F and I keywords (option 1) then we would have to allow them in the miscellaneous headers. The same applies to 't' and 'f' as well, which would also be considered as keywords by the lexer.
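To make the tokenization issue concrete, here is a minimal Python sketch of a context-free lexer run over the example condition. The identifier pattern `[a-zA-Z_][0-9a-zA-Z_-]*` and the helper name `kinds` are assumptions of this sketch.

```python
import re

# Context-free token rules: identifier, integer, or single-character operator.
TOKEN = re.compile(
    r"(?P<IDENTIFIER>[a-zA-Z_][0-9a-zA-Z_-]*)|(?P<INT>[0-9]+)|(?P<OP>[!&|()])"
)


def kinds(text):
    """Return (kind, lexeme) pairs, skipping whitespace between tokens."""
    return [(m.lastgroup, m.group()) for m in TOKEN.finditer(text)]


tokens = kinds("F0 & F 1 & F!2 & I2")
for token in tokens:
    print(token)
```

Under these rules `F0` and `I2` come out as single identifiers, whereas `F 1` and `F!2` are split into an identifier followed by further tokens, so the parser would see the same construct in two different shapes.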