Scanner appears to incorrectly classify input in Pharo 7

apblack commented 6 years ago

I'm running SmaCC in Pharo 7 (development branch).

The input character (currentCharacter) is ⟦.

The first few lines of generated method scan1X21 read:

scan1X21
    currentCharacter isOpenPunctuation
        ifTrue: [ ^ self recordAndReportMatch: #(20) ].

⟦ is indeed an open punctuation character, so it is classified as being a symbol 20 — which is "{", according to the symbols table. Not surprisingly, the parse fails.

This seemed to work fine in Pharo 6.

The appropriate scanner definitions are

<openTP>: ⟦ | \[\[ ;
<closeTP>: ⟧ | \]\] ;

which are symbols 49 and 50.

What I can't understand is why the generated scan method is even using the isOpenPunctuation predicate. Most of the other tests in the scanner are for individual characters. In fact, further down in the same scan method we find

currentCharacter =
    (Character value: 10214)
    ifTrue: [ ^ self recordAndReportMatch: #(49) ].
currentCharacter =
    (Character value: 10215)
    ifTrue: [ ^ self recordAndReportMatch: #(50) ].

10214 and 10215 are the codepoints for ⟦ and ⟧.

apblack commented 6 years ago

I found out where the isOpenPuctuation predicate is coming from.

SmaCC includes a feature whereby it tries to find a "closely matching" Character >> #isXXX method to test the next character. This is intended to yield more compact and more readable scanners. It works by exhaustive enumeration of the characters recognized by each predicate; this is probably a bad idea when the alphabet is large (as with Unicode).

The code that implements this feature (in SmaCCSmalltalkCodeGenerator >> #closestIsExpressionsFor:seen:) also seems to be wrong — which is why I'm getting these parse errors.

Rather than trying to debug this code, I patched the class initialization method in SmaCCSmalltalkCodeGenerator to make the set of is-expressions to be considered as candidates to be empty, effectively turning off this feature.

initialize
    MaxJumpSize := 900.
    MaxLiterals := 30.
 +  self isExpressions: #()

This seems to generate a correct scanner, although it will make tests like this:

    (currentCharacter == $'
        or: [ (currentCharacter between: $0 and: $9)
                or: [ (currentCharacter between: $A and: $Z)
                        or: [ currentCharacter == $_
                                or: [ (currentCharacter between: $b and: $z)
                                        or: [ ('µæ' includes: currentCharacter)
                                                or: [ currentCharacter =
                                                        (Character value: 960) ] ] ] ] ] ])

Unfortunately, turning off this feature still doesn't allow me to use <isLetter> | <isDigit> in my scanner definitions. Even if I put

self isExpressions: #(#isLetter #isDigit)

SmaCC seems to run forever (well, for more than 40 minutes) on my grammar when I use these (Unicode) Predicates in the scanner.

ThierryGoubier commented 6 years ago

@apblack , are you able to profile / interrupt it to see where it is spending that much time? You're still working with the %unicode option?

apblack commented 6 years ago

Yes, with Unicode (or at least the subset of Unicode < 16r2FFF).

I tried interrupting; the story is much the same as before — no one method is taking all the time. There are just many, many nested executions of #do: on various things ... mostly looking on CharacterEdges.

This was what led me to to suspect that the slowness is caused by the code that looks for sets of edges in an NFA with the same label, and to look at alternative ways of computing those sets.

ThierryGoubier commented 6 years ago

Yes, I understand that. We both sort of know what has to be done, either via bdd or character ranges. However, you're also pointing out the fact the code generation will be impacted as well: in the past we had issues with methods with too many literals (or too many bytecodes).

apblack commented 6 years ago

What this issue is pointing out is that, aside from performance issues, I suspect that the code that implements character tests by means of "closest is-expressions" is just plain wrong. That's the only description that I can think of for a scanner that is supposed to be looking for a ⟦ is matching on a {.

This may not have anything to do with Unicode, or it may be that the "closest match" algorithm works only for small character sets. The difference between Pharo 7 and Pharo 6 here is that Pharo 7 implements 29 additional isXXX methods on character — the predicates that define the 29 Unicode character categories.

apblack commented 6 years ago

As a workaround for this issue, I've added the following line to the class-side initialize method of SmaCCSmalltalkCodeGenerator in my working branch of SmaCC

self isExpressions: #()

What this does is make the set of is-expressions used by the Smalltalk code generator empty. This effectively turns off the code that looks for closely-matching is-expressions for given character sets (the code that I suspect to be buggy).

With the above initialization, the scanner works. Without it, the scanner miss-recognizes tokens. Generation is also significantly faster, which is not surprising. I suggest that either the is-expression feature be removed, or that it be turned off automatically when %unicode is on.

ThierryGoubier commented 6 years ago

Solved with 401ebad

SmaCCRefactoring / SmaCC

Scanner appears to incorrectly classify input in Pharo 7 #47