adl / hoaf

Hanoi Omega-Automata Format

Lexical analysis / tokenizer #32

Closed kleinj closed 10 years ago

kleinj commented 10 years ago

I stumbled upon some small potential issues with the lexical analysis. In the current version of the grammar specification, the following would be a valid acceptance condition: Acceptance: 3 F0 & F 1 & F!2 & I2

The syntax without spaces (F0), however, would normally be tokenized as an identifier by the lexer. We could either

  1. require a space between F or I and the integer, making F and I reserved keywords (priority over generic identifier rule)
  2. use a context-sensitive lexer (different tokenization after an Acceptance: token until the next token)
  3. use some unique symbols, e.g., <>[] / []<>
  4. allow identifiers as atoms in acceptance conditions, such that the identifier "F0" will be interpreted as F 0 (ugly in the grammar)
  5. ...?

I would like to avoid requiring a context-sensitive lexer. If we make F and I keywords (option 1), then we would have to allow them in the miscellaneous headers. The same applies to 't' and 'f', which would then also be treated as keywords by the lexer.
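The core problem is the usual maximal-munch rule: a longest-match scanner with only the draft's integer and identifier patterns will consume F0 as one identifier, while F 0 yields two tokens. A minimal sketch (Python; tokenize and TOKEN_RULES are hypothetical names, not from any HOA implementation):

```python
import re

# Hypothetical token rules; the identifier pattern is the one from the
# HOA draft: [a-zA-Z_][0-9a-zA-Z_-]*
TOKEN_RULES = [
    ("INT",        re.compile(r"[0-9]+")),
    ("IDENTIFIER", re.compile(r"[a-zA-Z_][0-9a-zA-Z_-]*")),
]

def tokenize(text):
    """Return (kind, lexeme) pairs, skipping whitespace, longest match wins."""
    tokens, pos = [], 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        # Maximal munch: among all rules that match here, take the longest.
        kind, lexeme = max(
            ((k, m.group()) for k, r in TOKEN_RULES
             if (m := r.match(text, pos))),
            key=lambda t: len(t[1]),
        )
        tokens.append((kind, lexeme))
        pos += len(lexeme)
    return tokens
```

With these two rules, tokenize("F0") returns a single IDENTIFIER token, whereas tokenize("F 0") returns an IDENTIFIER followed by an INT.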

adl commented 10 years ago

What is your preference?

I was planning to use flex's start conditions to handle this, but I can understand why context-sensitive stuff could be inconvenient.

I would like to reorganize and augment your list of solutions with some other ideas:

  1. solutions that work without changing the current grammar (I see all of those choices as implementation details left to the developer, although I agree parsing would be easier if it did not require advanced features or ugly hacks.)
    1. context-sensitive scanner (e.g. flex's start conditions)
    2. scan everything between Acceptance: and the next HEADER: or --BODY-- as a single token, then call a separate parser. This is tricky and makes error reporting a pain.
    3. scanning [IF] as well as [IF][0-9]+ as some specific token types for use in the Acceptance: grammar rule, adjusting the implementation of the other rules to allow these tokens in addition to identifiers, and decoding [IF][0-9]+ only when appropriate in the parser. (I'm not calling those tokens keywords, because this solution is just one way to implement the given grammar.)
  2. solutions that change the grammar
    1. require a space between [FI] and [0-9]+, making F and I keywords, and fixing the specification to also allow F and I everywhere IDENTIFIER is used. This makes the implementation details discussed in the previous solution explicit in our document, somehow forcing the implementation. The only win is that the number can be decoded in the scanner.
    2. replace F and I by characters that cannot be mistaken as identifiers (what about + and - ?)
    3. Force acceptance set numbers to be enclosed in braces, as in Acceptance: 3 F{0} & (F {1} | F!{2}) & I{2}. This way the scanner can match [IF][[:space:]]*!?{[0-9]+} as a token, and this won't conflict with an identifier. As braces are also used to specify acceptance sets in automata, there is some homogeneity.

I like the last two solutions better.
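The brace-based variant (solution 2.iii) can be sanity-checked with a short sketch (Python; classify is a hypothetical helper): since { can never occur inside an identifier, the combined pattern and the identifier pattern never match the same lexeme.

```python
import re

# Solution 2.iii: an acceptance atom is one token. In POSIX syntax the
# pattern is [IF][[:space:]]*!?{[0-9]+}; here it is in Python re syntax.
ACC_ATOM = re.compile(r"([IF])\s*(!?)\{([0-9]+)\}")
IDENT = re.compile(r"[a-zA-Z_][0-9a-zA-Z_-]*")

def classify(text):
    """Return 'ATOM' if the whole input is an acceptance atom,
    'IDENTIFIER' if it is an identifier, else None."""
    if ACC_ATOM.fullmatch(text):
        return "ATOM"
    if IDENT.fullmatch(text):
        return "IDENTIFIER"
    return None
```

F{0} and F !{2} classify as atoms, while F0 remains a plain identifier, so no context sensitivity is needed.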

For t and f I would either:

  1. leave the grammar as is, maybe with a warning that t and f are also valid identifiers (dealing with this doesn't really seem very difficult)
  2. declare that t and f are BOOLEAN constants, cannot be used as identifiers, but can appear in header lines as in HEADERNAME (INT|STRING|IDENTIFIER|BOOLEAN)*.
  3. state that t and f can only be used in guards as (f) or (t), and have the scanner recognize those tokens instead. This looks clumsy if we consider aliases as well. (I don't like this.)
kleinj commented 10 years ago

For t and f, my preference would be solution 2 (BOOLEAN).

For the acceptance condition, I have a preference for 2.ii (+0, -!2), even though it doesn't look that pretty. I'm not a fan of multiple tokenization (such as F{!0} with separate tokenization afterwards), as this restricts the places where comments or whitespace can be placed. But I wouldn't object to needing a context-sensitive lexer if there is a preference for allowing F0 and I0 for aesthetic reasons.

There is another question that occurred to me just now: What should be the interpretation of acc-name: Ra--ABORT--? Is this just an identifier? I would be fine with that, as one can prepend a blank before --ABORT-- to abort an arbitrary stream without risking being inside an identifier. Another option would be to switch to ==BODY==, ==END== and ==ABORT==, then there is no ambiguity with identifiers. Either way, I don't think there is a way to insert an --ABORT-- into a non-parsed stream at an arbitrary location, as the stream could be in the middle of a comment or "quoted string". But that's probably not a big deal.
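The ambiguity above can be checked directly against the identifier regexp from the current draft (Python; leading_identifier is a hypothetical helper): because the identifier character class includes -, the scanner swallows Ra--ABORT-- as a single identifier, while a blank before the marker makes it stop at Ra.

```python
import re

# The identifier regexp from the draft allows '-' inside identifiers,
# so "Ra--ABORT--" would be consumed as one identifier token.
IDENT = re.compile(r"[a-zA-Z_][0-9a-zA-Z_-]*")

def leading_identifier(text):
    """Return the identifier the scanner would match at the start of text."""
    m = IDENT.match(text)
    return m.group() if m else None
```

leading_identifier("Ra--ABORT--") returns the whole string, whereas for "Ra --ABORT--" the identifier stops at "Ra" and --ABORT-- becomes visible as a separate token.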

adl commented 10 years ago

Branch issue/32 contains two patches marking t/f as BOOLEAN and stating that --ABORT-- should be a separate token.

So we are left with the F and I stuff. I agree + and - aren't pretty. We should probably also aim for a solution that allows future extensions. For instance, what if in the future we want to add support for "occurrence acceptance", where a transition or state has to be visited at least once?

How about this: use F and I as functions with parentheses, F(0) & (F (1) | F(!2)) & I(2). The grammar would use identifiers instead of F or I:

acceptance-cond ::= IDENTIFIER "(" "!"? INT ")"
                  | "(" acceptance-cond ")"
                  | acceptance-cond "&" acceptance-cond
                  | acceptance-cond "|" acceptance-cond

Our document would only specify the meaning of the identifiers F and I.
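A recursive-descent sketch of this rule (Python; all names hypothetical, and assuming the usual convention that & binds tighter than |, which the rule as written leaves to the implementation):

```python
import re

# Tokens: identifiers, integers, and the punctuation ( ) ! & |
TOKEN = re.compile(r"\s*([A-Za-z_][0-9A-Za-z_-]*|[0-9]+|[()!&|])")

def parse(text):
    """Parse an acceptance condition into a nested tuple tree."""
    tokens = TOKEN.findall(text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def eat(expected=None):
        nonlocal pos
        tok = peek()
        if tok is None or (expected is not None and tok != expected):
            raise SyntaxError(f"expected {expected!r}, got {tok!r}")
        pos += 1
        return tok

    def atom():
        # "(" acceptance-cond ")"  or  IDENTIFIER "(" "!"? INT ")"
        if peek() == "(":
            eat("(")
            node = disjunction()
            eat(")")
            return node
        name = eat()                     # e.g. F or I
        eat("(")
        negated = peek() == "!" and bool(eat("!"))
        number = int(eat())
        eat(")")
        return (name, negated, number)

    def conjunction():                   # "&" binds tighter than "|"
        node = atom()
        while peek() == "&":
            eat("&")
            node = ("&", node, atom())
        return node

    def disjunction():
        node = conjunction()
        while peek() == "|":
            eat("|")
            node = ("|", node, conjunction())
        return node

    tree = disjunction()
    if peek() is not None:
        raise SyntaxError(f"trailing input: {tokens[pos:]}")
    return tree
```

parse("F(0) & (F(1) | F(!2)) & I(2)") then yields a tree whose top-level operator is &, with the I(2) atom as its right child; nothing in the parser is specific to F and I, so a future occurrence-acceptance function would need no grammar change.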

kleinj commented 10 years ago

I like the F(0) & (F (1) | F(!2)) & I(2) syntax, that's a good, extensible solution.

Regarding IDENTIFIER: ([a-eg-su-zA-Z_][0-9a-zA-Z_-]*|[tf][0-9a-zA-Z_-]+), I would prefer keeping the old regexp and just stating that t and f are not identifiers. That's usually easy to ensure via the order of the lexical rules in the lexer.
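The rule-ordering trick can be sketched as follows (Python; next_token and RULES are hypothetical): a BOOLEAN rule is listed before the identifier rule. In flex, longest match plus "earlier rule wins on ties" gives this for free; a simple first-match scanner needs a lookahead guard so that t0 or foo still fall through to IDENTIFIER.

```python
import re

# First-match-wins rule list: BOOLEAN before IDENTIFIER.  The negative
# lookahead keeps the BOOLEAN rule from firing on a prefix of a longer
# identifier, emulating flex's longest-match behavior.
RULES = [
    ("BOOLEAN",    re.compile(r"[tf](?![0-9a-zA-Z_-])")),
    ("IDENTIFIER", re.compile(r"[a-zA-Z_][0-9a-zA-Z_-]*")),
]

def next_token(text):
    """Classify the first token of `text` using first-match-wins order."""
    for kind, rx in RULES:
        m = rx.match(text)
        if m:
            return (kind, m.group())
    return None
```

A lone t or f is scanned as BOOLEAN, while t0, foo, and Inf are ordinary identifiers, so the IDENTIFIER regexp in the document can stay unchanged.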

adl commented 10 years ago

OK, I've reworked these patches not to change the IDENTIFIER regexp, and also changed everything to use F(x) and I(x). Can you check those changes and see if you agree?

kleinj commented 10 years ago

Great, I fixed a small typo, otherwise this looks fine. As a next step I would then extract the example automata into an examples subdirectory, so we can use them for testing parser implementations. OK?

adl commented 10 years ago

Thanks for the proof reading. I've put those patches on master and deleted the branch.