Closed: apblack closed this issue 6 years ago.
I found out where the isOpenPunctuation predicate is coming from.
SmaCC includes a feature whereby it tries to find a "closely matching" Character >> #isXXX
method to test the next character. This is intended to yield more compact and more readable scanners. It works by exhaustive enumeration of
the characters recognized by each predicate; this is probably a bad idea when the alphabet is large (as with Unicode).
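To make the cost concrete, here is a minimal Python sketch (the names and structure are mine, not SmaCC's) of what matching a character class against isXXX predicates by exhaustive enumeration looks like, and why the work scales with the size of the alphabet:

```python
def accepted_set(predicate, alphabet):
    # Enumerate every character the predicate accepts: O(|alphabet|) tests.
    return {c for c in alphabet if predicate(c)}

def closest_predicate(target_chars, predicates, alphabet):
    # Find a predicate whose accepted set exactly matches the scanner's
    # character class. The match must be exact: an "almost matching"
    # predicate would make the scanner mis-recognize tokens.
    for name, pred in predicates.items():
        if accepted_set(pred, alphabet) == target_chars:
            return name
    return None

# With ASCII the alphabet has 128 characters; with Unicode it has over a
# million codepoints, so each candidate predicate costs vastly more to check.
ascii_alphabet = [chr(i) for i in range(128)]
predicates = {'isDigit': str.isdigit, 'isLetter': str.isalpha}
digits = set('0123456789')
print(closest_predicate(digits, predicates, ascii_alphabet))  # isDigit
```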
The code that implements this feature (in SmaCCSmalltalkCodeGenerator >> #closestIsExpressionsFor:seen:
) also seems to be wrong — which is why I'm getting these parse errors.
Rather than trying to debug this code, I patched the class initialization method in SmaCCSmalltalkCodeGenerator so that the set of is-expressions considered as candidates is empty, effectively turning off this feature.
initialize
    MaxJumpSize := 900.
    MaxLiterals := 30.
+   self isExpressions: #()
This seems to generate a correct scanner, although it makes for verbose tests like this:
(currentCharacter == $'
or: [ (currentCharacter between: $0 and: $9)
or: [ (currentCharacter between: $A and: $Z)
or: [ currentCharacter == $_
or: [ (currentCharacter between: $b and: $z)
or: [ ('µæ' includes: currentCharacter)
or: [ currentCharacter =
(Character value: 960) ] ] ] ] ] ])
Unfortunately, turning off this feature still doesn't allow me to use <isLetter> | <isDigit>
in my scanner definitions. Even if I put
self isExpressions: #(#isLetter #isDigit)
SmaCC seems to run forever (well, for more than 40 minutes) on my grammar when I use these (Unicode) predicates in the scanner.
@apblack , are you able to profile / interrupt it to see where it is spending that much time? You're still working with the %unicode option?
Yes, with Unicode (or at least the subset of Unicode < 16r2FFF).
I tried interrupting; the story is much the same as before — no one method is taking all the time. There are just many, many nested executions of #do: on various things ... mostly looking at CharacterEdges.
This was what led me to suspect that the slowness is caused by the code that looks for sets of edges in an NFA with the same label, and to look at alternative ways of computing those sets.
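One such alternative is to group the edges by label in a single pass over a dictionary, rather than re-scanning the edge list for each label. A sketch (illustrative Python; the names are mine, not SmaCC's):

```python
from collections import defaultdict

def group_edges_by_label(edges):
    # edges: iterable of (source, label, target) triples.
    # Returns {label: [(source, target), ...]} in a single O(n) pass,
    # versus the quadratic cost of nested scans over the edge list.
    groups = defaultdict(list)
    for source, label, target in edges:
        groups[label].append((source, target))
    return dict(groups)

edges = [(0, 'a', 1), (0, 'b', 2), (1, 'a', 2)]
print(group_edges_by_label(edges))  # {'a': [(0, 1), (1, 2)], 'b': [(0, 2)]}
```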
Yes, I understand that. We both sort of know what has to be done, either via BDDs or via character ranges. However, you're also pointing out that the code generation will be impacted as well: in the past we had issues with methods with too many literals (or too many bytecodes).
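The character-range idea, roughly: represent a transition's label as a sorted list of disjoint codepoint intervals rather than as an enumerated character set, which stays compact even over large Unicode classes. A sketch of the core merge step (illustrative Python, not SmaCC code):

```python
def merge_ranges(ranges):
    # Merge overlapping or adjacent (lo, hi) codepoint intervals into a
    # sorted list of disjoint intervals.
    merged = []
    for lo, hi in sorted(ranges):
        if merged and lo <= merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], hi))
        else:
            merged.append((lo, hi))
    return merged

# e.g. digits, uppercase and lowercase ASCII letters, plus U+27E6/U+27E7:
print(merge_ranges([(0x30, 0x39), (0x41, 0x5A), (0x61, 0x7A),
                    (10214, 10214), (10215, 10215)]))
# [(48, 57), (65, 90), (97, 122), (10214, 10215)]
```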
What this issue is pointing out is that, aside from performance issues, I suspect that the code that implements character tests by means of "closest is-expressions" is just plain wrong. That's the only description that I can think of for a scanner that is supposed to be looking for a ⟦ is matching on a {.
This may not have anything to do with Unicode, or it may be that the "closest match" algorithm works only for small character sets. The difference between Pharo 7 and Pharo 6 here is that Pharo 7 implements 29 additional isXXX methods on Character — the predicates that define the 29 Unicode character categories.
As a workaround for this issue, I've added the following line to the class-side initialize method of SmaCCSmalltalkCodeGenerator
in my working branch of SmaCC
self isExpressions: #()
What this does is make the set of is-expressions used by the Smalltalk code generator empty. This effectively turns off the code that looks for closely-matching is-expressions for given character sets (the code that I suspect to be buggy).
With the above initialization, the scanner works. Without it, the scanner mis-recognizes tokens.
Generation is also significantly faster, which is not surprising.
I suggest that either the is-expression feature be removed, or that it be turned off automatically when %unicode
is on.
Solved with 401ebad
I'm running SmaCC in Pharo 7 (development branch).
The input character (currentCharacter) is ⟦.
The first few lines of generated method scan1X21 read:
⟦ is indeed an open punctuation character, so it is classified as being a symbol 20 — which is "{", according to the symbols table. Not surprisingly, the parse fails.
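This is easy to check against the Unicode database (Python used here purely as a query tool): both characters really are in general category Ps, open punctuation, so an isOpenPunctuation test by itself cannot tell ⟦ from {.

```python
import unicodedata

# U+27E6 MATHEMATICAL LEFT WHITE SQUARE BRACKET and '{' share a category.
print(unicodedata.category('\u27E6'))  # Ps
print(unicodedata.category('{'))       # Ps
```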
This seemed to work fine in Pharo 6.
The appropriate scanner definitions are
which are symbols 49 and 50.
What I can't understand is why the generated scan method is even using the isOpenPunctuation predicate. Most of the other tests in the scanner are for individual characters. In fact, further down in the same scan method we find
10214 and 10215 are the codepoints for ⟦ and ⟧.