SicroAtGit / PB-CodeArchiv-Rebirth

A collection of useful code from the PureBasic forums and other sources.

PureBasic Lexer #7

Closed · tajmone closed this 4 years ago

tajmone commented 5 years ago

Hi Sicro,

I'm really impressed with your new Lexer, well done!

I've run-tested it and quickly sifted through the code, really interesting.

RegEx for PB Numerals

I only have some reservations about the numerals RegEx of the PB Lexer example:

      regEx$ = "[0-9]+(?:\.[0-9]+(?:e[0-9]+)?)?" + ; Integers and decimal numbers
               "|" +
               "%[01]+" + ; Binary numbers
               "|" +
               "\$[0-9A-F]+" + ; Hexadecimal numbers
               "|" +
               "'.*?'" ; 'a'

The floats RegEx doesn't seem to cover the way PureBasic defines floating-point numbers. Although PureBasic handles floats according to IEEE 754, when it comes to the actual notation of floats the PB compiler has a rather unorthodox approach to what constitutes a valid float literal. This will compile and work fine in PB:

; Who said IEEE 754 can't be fun?

float1 = 1.575e2bananas ; --> Yes, "floating bananas" are valid in PureBasic!
float2 = 1.575e2

; Compile me if you don't believe me!

If float1 = float2
  Debug "Test passed!"
EndIf

It looks like the PB compiler, at lexing time, simply eats up and ignores any letters after the float number (which in scientific notation could be a d, etc.). Your RegEx doesn't seem to contemplate letter suffixes at all.

Also, having written a few syntax highlighters for PureBasic, during extensive tests I've managed to catch a few edge cases that allowed me to refine the various RegExs. As a rule of thumb, numerical constants should be captured in this order when using a single RegEx with piped alternatives:

  1. Hex
  2. Binary
  3. Float
  4. Decimal

This is to avoid some edge cases in which a non-decimal gets parsed as a decimal.
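
To see why the order matters, here's a minimal made-up sketch (mine, not taken from your Lexer or any real syntax): backtracking engines like PCRE and Boost try piped alternatives left to right and take the first one that matches at the current position, so a Decimal-before-Float pattern captures only the integer part of a float:

badOrder$  = "[0-9]+" + "|" + "[0-9]+\.[0-9]+(?:e[0-9]+)?"  ; Decimal before Float
goodOrder$ = "[0-9]+\.[0-9]+(?:e[0-9]+)?" + "|" + "[0-9]+"  ; Float before Decimal

If CreateRegularExpression(0, badOrder$) And CreateRegularExpression(1, goodOrder$)
  If ExamineRegularExpression(0, "1.5e2") And NextRegularExpressionMatch(0)
    Debug RegularExpressionMatchString(0) ; => "1" (the float is truncated!)
  EndIf
  If ExamineRegularExpression(1, "1.5e2") And NextRegularExpressionMatch(1)
    Debug RegularExpressionMatchString(1) ; => "1.5e2"
  EndIf
EndIf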

Here's the RegEx I've used for the PureBasic syntax in Highlight (Boost RegEx engine):

Digits=[[ (?x)
        # ============ HEX ============
        # Pascal style ($FF):

          \$[0-9a-fA-F]+\b

        # ============ BINARY ============
        | %[01]+\b

        # ============ FLOAT ============
        # With decimal point separator:

        | \b[-]?\d+\.\d+(?:[eE][\-\+]?\d+)?[a-zA-Z]*\b

        # Without decimal point separator:

        | \b[-]?\d+(?:[eE][\-\+]?\d+)[a-zA-Z]*\b

        # ============ DECIMAL ============
        | (?<!\$)\b\d+\b

       ]]

This is the RegEx used in Sublime-PureBasic for floats (proprietary RegEx engine):

  num_flt: |
    (?x)
    \b[-]?\d+
    (
        \.\d+ (?:[eE][\-\+]?\d+)?   # With decimal point separator
    |         (?:[eE][\-\+]?\d+)    # Without decimal point separator
    )[a-zA-Z]*\b

PB Syntaxes References and Tests

Hoping they might be of some use, I'm pasting here some links to PB syntax definitions I've written for various highlighters, where you can look at the RegExs I've used to circumvent edge cases:

I'm also working on a Highlight syntaxes test suite, where I've started to create some PureBasic test files:

And there are also the Sublime Text test files for the PureBasic syntax:

Neither of them is yet mature, but I'll be adding more tests over time, so it might be worth checking them every now and then. I have created quite a lot of PureBasic test files while working on those syntaxes, but I need to clean them up before I can publish them in these test suites. The edge cases I came across are quite numerous, and they can be very tricky to handle via RegExs.

SicroAtGit commented 5 years ago

Hi Tristano,

It looks like the PB compiler, at lexing time, simply eats up and ignores any letters after the float number (which in scientific notation could be a d, etc.). Your RegEx doesn't seem to contemplate letter suffixes at all.

I didn't know that the official PB Lexer allows and ignores letter suffixes on floating-point numbers.

I don't want to support this behaviour in my Lexer, because I think it is faulty processing by the official Lexer. For example, if someone accidentally writes the letter "O" instead of the number "0", that "0" is missing from the floating-point number and all subsequent digits are ignored as well. Such an error will be very hard for the programmer to find.

Also, having written a few syntax highlighters for PureBasic, during extensive tests I've managed to catch a few edge cases that allowed me to refine the various RegExs. As a rule of thumb, numerical constants should be captured in this order when using a single RegEx with piped alternatives:

1. Hex
2. Binary
3. Float
4. Decimal

This is to avoid some edge cases in which a non-decimal gets parsed as a decimal.

I don't think my Lexer has this problem, because it matches every RegEx from the beginning of the remaining string.

Thanks for the links to the PB syntax references and tests!

tajmone commented 5 years ago

I don't want to support this behaviour in my Lexer, because I think it is faulty processing by the official Lexer.

Definitely not very compliant with scientific notation, but I think it was done because PB ignores those suffixes anyway (it auto-handles sizes in the background) while still accepting them in case they are being used. Different languages/compilers seem to support different suffixes, so there isn't a strict standard as to which ones should be supported, although the letters employed are somewhat standardized.

For example, if someone accidentally writes the letter "O" instead of the number "0", that "0" is missing from the floating-point number and all subsequent digits are ignored as well. Such an error will be very hard for the programmer to find.

You could make a special-case test for the "O" (which is not a used suffix), but if you reject trailing chars the risk is that you'd break up parsing. I think that users might be adopting suffixes for code clarity, especially when porting code from other languages.

The counter-argument here is that if it compiles then it's valid PB code, and you'd have to account for it.

I don't think my Lexer has this problem, because it compares all RegEx from the beginning of the string.

Test how it works with these edge cases:

; If modulo operator is not separated by space, and following number is made up
; of only "1" and "0" digits, it will be parsed as if it was a binary number:
modulo_a = 100 % 10 ; <= 100 modulo 10
modulo_a = 100%10   ; <= ... becomes a binary number
modulo_a = 100 %10  ; <= ... becomes "100" and "%10"
modulo_a = 100% 10  ; <= 100 modulo 10

; ... not so if the number has digits other than "1" and "0":
modulo_a = 100%103 ; <= 100 modulo 103

The PB IDE fails to handle these well, which breaks up the syntax colouring (you should set numbers to display in some obvious colour in the IDE to see this happening, because by default they're just black).

SicroAtGit commented 5 years ago

different languages/compilers seem to support different suffixes

I thought about it again and decided to also support the peculiarities of the PureBasic programming language. If other programming languages also support suffixes for floating-point numbers, everything is OK. Suffixes for floating-point numbers are new to me and I don't know what they are useful for.

You could make a special-case test for the "O" (which is not a used suffix), but if you reject trailing chars the risk is that you'd break up parsing.

I will test which characters the PB compiler accepts as suffixes for floating-point numbers and then add support for them to my PBLexer.

Test how it works with these edge cases:

; If modulo operator is not separated by space, and following number is made up
; of only "1" and "0" digits, it will be parsed as if it was a binary number:
modulo_a = 100 % 10 ; <= 100 modulo 10
modulo_a = 100%10   ; <= ... becomes a binary number
modulo_a = 100 %10  ; <= ... becomes "100" and "%10"
modulo_a = 100% 10  ; <= 100 modulo 10

; ... not so if the number has digits other than "1" and "0":
modulo_a = 100%103 ; <= 100 modulo 103

The PB IDE fails to handle these well, which breaks up the syntax colouring (you should set numbers to display in some obvious colour in the IDE to see this happening, because by default they're just black).

When I run my PBLexer over your code, I get the following output:

[Comment]: ; If modulo operator is not separated by space, and following number is made up
[NewLine]: 
[Comment]: ; of only "1" and "0" digits, it will be parsed as if it was a binary number:
[NewLine]: 
[Identifier]: modulo_a
[Operator]: =
[Number]: 100
[Operator]: %
[Number]: 10
[Comment]: ; <= 100 modulo 10
[NewLine]: 
[Identifier]: modulo_a
[Operator]: =
[Number]: 100
[Number]: %10
[Comment]: ; <= ... becomes a binary number
[NewLine]: 
[Identifier]: modulo_a
[Operator]: =
[Number]: 100
[Number]: %10
[Comment]: ; <= ... becomes "100" and "%10"
[NewLine]: 
[Identifier]: modulo_a
[Operator]: =
[Number]: 100
[Operator]: %
[Number]: 10
[Comment]: ; <= 100 modulo 10
[NewLine]: 
[NewLine]: 
[Comment]: ; ... not so if the number has digits other than "1" and "0":
[NewLine]: 
[Identifier]: modulo_a
[Operator]: =
[Number]: 100
[Number]: %10
[Number]: 3
[Comment]: ; <= 100 modulo 103

Perhaps it would be better if the lexer always emitted the character "%" as an operator token, and the parser later determined whether it really is an operator or the start character of a binary number.
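
Roughly like this (just a sketch of the idea; the token-type constants below are illustrative, not the ones my Lexer really uses): the parser looks at the previous meaningful token, and "%" is the modulo operator only if that token can end an expression:

Enumeration ; hypothetical token types
  #TokenType_Number
  #TokenType_Identifier
  #TokenType_ClosingBracket
  #TokenType_Operator
EndEnumeration

Procedure.b IsModuloOperator(previousTokenType.i)
  Select previousTokenType
    Case #TokenType_Number, #TokenType_Identifier, #TokenType_ClosingBracket
      ProcedureReturn #True  ; e.g. "100 % 10"
    Default
      ProcedureReturn #False ; e.g. "a = %10" => start of a binary number
  EndSelect
EndProcedure

Debug IsModuloOperator(#TokenType_Number)   ; => 1, "%" is modulo
Debug IsModuloOperator(#TokenType_Operator) ; => 0, "%" starts a binary number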


I will probably remove the token type "PassThroughASM" from my PBLexer and have it processed later by my PBParser.


I have found that operators that consist of two characters can also contain spaces:

If 1 <      > 2
  Debug "true"
EndIf

If 1 <   = 2
  Debug "true"
EndIf

I've read a lot about syntax highlighting in the meantime. Some editors highlight the code in two steps: first, the code is highlighted quickly and roughly with the help of the lexer; later, when the programmer hasn't typed anything for a certain time, the slower parser is run over the code to highlight it more precisely. Read here: https://stackoverflow.com/questions/20200784/does-a-syntax-highlighter-in-an-ide-scan-the-whole-file-every-time-a-letter-is-t/20200963#20200963
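
In pseudo-PB the idea looks something like this (a sketch only; the two highlighting procedures are empty placeholders, not real code of mine):

#IdleThresholdMs = 500 ; illustrative idle time

Global lastKeystrokeTime.q

Procedure FastLexerPass()
  ; cheap, token-level colouring (stub)
EndProcedure

Procedure PreciseParserPass()
  ; slow, precise colouring (stub)
EndProcedure

Procedure HandleKeystroke()
  lastKeystrokeTime = ElapsedMilliseconds()
  FastLexerPass()       ; runs on every keystroke
EndProcedure

Procedure HandleIdleTimer()
  If ElapsedMilliseconds() - lastKeystrokeTime >= #IdleThresholdMs
    PreciseParserPass() ; runs only once the programmer is idle
  EndIf
EndProcedure

; In a real editor these would be wired to GUI events:
HandleKeystroke()
HandleIdleTimer()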


I noticed that my PBLexer needs much more time than the PB compiler needs for its complete compilation process, but I don't want to take care of optimizations until my PBLexer and PBParser are as complete as possible.

tajmone commented 5 years ago

I thought about it again and decided to also support the peculiarities of the PureBasic programming language. If other programming languages also support suffixes for floating-point numbers, everything is OK.

I think it's a good idea. After all, PureBasic being permissive in this regard is not a demerit (although it might not be "standard compliant"): it allows people who are used to suffix notation from other languages to employ it safely in their PB sources. So I don't really see anything wrong with it; it's an added benefit (although it might look quite odd).

Suffixes for floating-point numbers are new to me and I don't know what they are useful for.

I find the whole scientific notation overly complex (probably that's why it's called "scientific", as in "rocket science" :scream:), and to be honest I've only really come across it when having to write syntax highlighter definitions (which made the work a real pain). I guess it's one of those things that you either use a lot, and become familiar with, or it remains a mystery that pops up in various articles now and then. The fact that every language employs a slightly different notation doesn't make things easier either, but then (again) it makes sense in the context where it's being used.

When I run my PBLexer over your code, I get the following output:

...

[Number]: 100
[Operator]: %
[Number]: 10

That's good! The PB IDE fails to parse 100%10 properly; your code is doing a better job.

...

[Operator]: =
[Number]: 100
[Number]: %10
[Comment]: ; <= ... becomes "100" and "%10"

...

[Operator]: =
[Number]: 100
[Operator]: %
[Number]: 10
[Comment]: ; <= 100 modulo 10

These are trickier (100 %10 and 100% 10). In the semantic context it should be number + operator + number (and the compiler will interpret it thus), but since it's written in a bad format in the source example, and general-purpose parsers (like highlighters) don't usually get that semantic about their input, I guess it's more than fine (unless you really need to know what the code is doing). Authors shouldn't write the modulo operator without a separating space, but since the compiler allows it they might as well (which is a bad habit in situations where it could be ambiguous). The argument in favour of writing it thus is that we usually think it's OK to write:

x = a +5   ; instead of: a + 5
y = c -1   ; instead of: c - 1

; counter examples:
w = d - +1 
z = e + -1 

But how is a parser going to know whether the - stands for a sign operator or for subtraction? The PB IDE highlighter and the compiler lexer obviously take different approaches here, the latter being smarter (and not fooled by additional or missing spaces).

Perhaps it would be better if the lexer always emitted the character "%" as an operator token, and the parser later determined whether it really is an operator or the start character of a binary number.

Again, it depends on what you need to do with the parsed data, and whether you need to differentiate between the various contexts (and, in the previous example, distinguish between % as a modulo operator and % as a prefix for binary numbers). If you need semantic accuracy, then yes: you'll probably need to write some algorithm that checks the extracted data to disambiguate the context.

I have found that operators that consist of two characters can also contain spaces:

Usually compilers are much smarter at contextualizing whitespace than, say, a syntax highlighter or a syntax definition for an editor: the former usually relies on a true parser-builder tool (like ANTLR, YACC, etc.), while the latter usually takes a simpler RegEx approach. Compilers can take all the time in the world to produce their output; editor syntax highlighters, on the other hand, must work fast because the user is working live on the very source they are highlighting. In general, we don't expect editor highlighters to offer a perfect representation of the syntax.
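
That said, this particular quirk is easy to absorb in a RegEx-based lexer: just let the operator pattern swallow the inner whitespace. A quick sketch (mine, not taken from any of my syntaxes):

If CreateRegularExpression(0, "<\s*[>=]|>\s*=")
  If ExamineRegularExpression(0, "1 <      > 2") And NextRegularExpressionMatch(0)
    Debug RegularExpressionMatchString(0) ; => "<      >"
  EndIf
EndIf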

I've read a lot about syntax highlighting in the meantime. Some editors highlight the code in two steps: first, the code is highlighted quickly and roughly with the help of the lexer; later, when the programmer hasn't typed anything for a certain time, the slower parser is run over the code to highlight it more precisely.

Thanks for the link! Yes, there are different approaches to editor highlighting, and good editors (like Sublime Text) really shine here: you can open the IDE with the last saved session holding 50 files of over 2000 lines each, and within 10 seconds the IDE is open and ready, with everything highlighted — and this even though Sublime Text syntaxes are RegEx based, RegExs being notorious performance killers. The trick is caching, good caching, and more caching (and good binary optimization at compile time, of course). The Sublime Text authors have even written their own RegEx engine, focused solely on syntax-highlighting tasks, to speed things up further.

Now the "big thing" of the moment seems to be shifting toward the Language Server Protocol, which provides cross-editor advanced functionality for language support:

The author of the VSCode syntax for PureBasic, which you emailed me about a few months ago, is currently working on a PB Language Server, which (once ready) could be used in any editor really:

(As you can see from the double links, the problem here is that there is one author but two packages. Which one is the "correct" one is anybody's guess.)

Parsers, lexers and tokenizers are a really hard subject, and one could probably work on them for a lifetime and still learn something new every day. My knowledge of math is very poor, so I can't really access the literature on parsers and their algorithms, for it contains lots of math which is far beyond my comprehension. My understanding of algorithms doesn't get beyond the Big O notation — I wish it did, but it doesn't.

I've purchased an online video course on algorithms, Learning Data Structures and Algorithms by Rod Stephens (author of the famous book Essential Algorithms), which proved to be easier to learn from than most academic books (including the book by the course author), and comes with practical examples (easily applicable to PB too).

I usually don't like video courses and prefer books, but some subjects do benefit from visual animations, e.g. the excellent Fasm introductory course by XORPD (a well-known coder in the Fasm community).

I noticed that my PBLexer needs much more time than the PB compiler needs for its complete compilation process, but I don't want to take care of optimizations until my PBLexer and PBParser are as complete as possible.

Chances are you'll never beat the compiler's speed on the same source, for two reasons:

  1. Using RegEx is computationally expensive (very expensive).
  2. The PB compiler is not written in PB, but probably in a mixture of Fasm and MSVC/GCC, which are all optimizing toolchains.

The classic approach to building parsers (with BNF syntaxes and an FSM compiler/lexer generator) will always be fast because it parses character by character — and usually only accounts for ASCII characters in everything but strings (which only need checking for escape sequences and terminators).

Editor highlighters can get away with RegEx-based definitions because:

  1. The syntaxes (and their RegExs) are precompiled by the editor.
  2. Buffers and caching take care to ensure that only edited text is highlighted again.
  3. Good editors implement custom RegEx engines to allow fast parsing of the source.

So, if you want to achieve a fast RegEx parser, you should probably not use the native PB PCRE engine, but a custom build compiled for ASCII only (possibly PCRE2, or Oniguruma), for you don't really need Unicode here (unless you need to know what's inside string literals).

In the last year I've come to fully realize what a poor job PB does at compiler optimization compared to other languages. That's a barrier which knows no solution, unfortunately. Especially 64-bit compilation, which produces a lot of useless stack operations. So you can't really expect PB binaries to ever be optimized for speed (i.e. unless you're willing to edit the ASM output every time). I guess that in an era where everyone seems to love using scripting languages this is not really an issue (Python being some 13 times slower than C).

If you plan to parse the same source over and over, you could optimize using caches, or some incremental-parsing technique mimicking incremental compiler optimizations (like Nim's), but it's probably not worth it. You can always polish your RegExs, making them non-greedy whenever possible; I think there are RegEx optimizers out there, especially for PCRE, it being so popular. Good RegExs can make a big difference in performance:

... but overall, a RegEx based tool will always be somewhat slow.
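
To give just one concrete example of such polishing (my own sketch; PCRE supports possessive quantifiers, which never give back characters, so a failing match is abandoned sooner instead of backtracking):

If CreateRegularExpression(0, "[0-9]++\.[0-9]++") ; "++" = possessive, no backtracking
  If ExamineRegularExpression(0, "123.456") And NextRegularExpressionMatch(0)
    Debug RegularExpressionMatchString(0) ; => "123.456"
  EndIf
EndIf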

Still, it's worth trying. After all, it's a fun experiment from which you're going to learn a lot of interesting (and important) things. As I mentioned, ever since my previous work on syntax highlighters I've wanted to try working on parsers, because they are fascinating and (after all) they are at the core of every computer language and editor.

Please keep me up to date with your progress! Although I'm currently buried in a mountain of work, I haven't lost interest nor faith in this great project of yours.

T.

SicroAtGit commented 5 years ago

Thank you very much for all the interesting information.

In the beginning my Lexer was much slower, because the complete, not yet lexed code was always passed as a string to the PB function "ExamineRegularExpression". The function doesn't accept the address of a string, so the string is copied on every call, which slows down the process enormously. I was able to release this brake as far as possible by no longer passing the entire remaining code to the function, but only a piece of limited length. The length can be set with the optional parameter "maxTokenValueLength" when creating the lexer: https://github.com/SicroAtGit/PureBasic-CodeArchiv-Rebirth/blob/c37309b42e881f261d9029c9c307b5c16bcb0963/Lexer/Lexer.pbi#L102-L112 By using this trick, I am happy with the speed of the Lexer.
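
In essence the trick looks like this (a simplified sketch; the real code is in Lexer.pbi, linked above):

#MaxTokenValueLength = 256 ; illustrative value

Procedure.s NextWindow(*position.Character, remainingLength.i)
  ; Pass only a window of the remaining code to ExamineRegularExpression(),
  ; instead of copying the entire rest of the source on every call
  If remainingLength > #MaxTokenValueLength
    remainingLength = #MaxTokenValueLength
  EndIf
  ProcedureReturn PeekS(*position, remainingLength)
EndProcedure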

Please keep me up to date with your progress! Although I'm currently buried in a mountain of work, I haven't lost interest nor faith in this great project of yours.

I'm really pleased. I will keep you up to date.

tajmone commented 5 years ago

The function doesn't accept the address of a string, so the string is copied on every call, which slows down the process enormously.

I can imagine. I had written a small Markdown parser in PB and faced the same problem. You definitely want to pass only string pointers among the various functions and leave the original string untouched — i.e. just examine substrings by moving the pointer and using a variable to determine how many chars to read from the pointed string.

If you handle all string sanitation at the beginning (i.e. substituting tabs with spaces, normalizing EOLs, etc.), then you can safely do all the rest of the work on the sanitized string by means of pointers and, if you need to manipulate the string, you can always create substrings on the fly. But passing the whole string to a function is going to slow the program down.

You could even build a table of substring pointers and char lengths, so that if you need to explore different paths and backtrack you can do it very fast and at low memory cost. Chunking the original (full) string into a table of (pointer) substrings at the beginning of the parsing process would probably speed up the whole work considerably, and you could then process each chunk in a loop and remove the processed chunks from the table as they're dealt with.
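
Something along these lines, as a rough sketch (the names and the single-chunk example are just my illustration):

Structure Chunk
  *start.Character ; pointer into the original (sanitized) string
  length.i         ; number of characters in the chunk
EndStructure

Global NewList chunks.Chunk()

; Materialize a chunk as a string only when it's really needed:
Procedure.s ChunkText(*chunk.Chunk)
  ProcedureReturn PeekS(*chunk\start, *chunk\length)
EndProcedure

source$ = "If a = 1 : Debug a : EndIf"
AddElement(chunks())
chunks()\start  = @source$     ; first chunk: the whole line
chunks()\length = Len(source$)
Debug ChunkText(@chunks())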

Also, because PB doesn't offer tail-recursion optimization, the risk is that if you need to carry out recursive operations you'll end up with memory bloat — but if you pass the strings by reference then the bloat will be limited to the procedure call frames.

In this type of code, using Goto and Gosub instead of procedure calls would be much better and faster; after all, the need for tail-recursion optimization is only really a problem in languages which don't have the Goto instruction. See the discussion on ycombinator.com:

A tail call is basically goto with arguments. You can use it to build arbitrary extensible state machines. Like a goto, it's useful but potentially confusing. You shouldn't use it when a more structured loop construct will work.

So, ultimately, if you manage to create a set of rules for the state machine, you could probably avoid procedure calls altogether and use Select blocks instead, with Gosub and Goto in the Case statements. Doing so will make the code faster because it avoids the overhead of setting up and tearing down procedure call frames.
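
A bare-bones sketch of that dispatch-loop idea (the states and transitions are only illustrative):

Enumeration ; hypothetical lexer states
  #State_Default
  #State_InString
  #State_InComment
  #State_Done
EndEnumeration

state = #State_Default

Repeat
  Select state
    Case #State_Default
      ; ... scan until a string, a comment or the end of input is reached
      state = #State_Done
    Case #State_InString
      ; ... scan until the closing quote
      state = #State_Default
    Case #State_InComment
      ; ... scan until the end of the line
      state = #State_Default
  EndSelect
Until state = #State_Done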

By looking at the compiled ASM output, you can always leverage OoOE (out-of-order execution) and fit as many calculations as you can between two blocking operations:

The good thing about PB in this respect is that the ASM compilation preview allows you to verify that the final machine-code instructions are as you want them to be.

SicroAtGit commented 5 years ago

Thank you for your suggestions and information.

I have made some improvements to the PBLexer recently. It now supports the syntax of the PureBasic programming language quite well, I think.

The speed of the Lexer/PBLexer currently has no priority for me, but I will probably work on it again sometime.

tajmone commented 5 years ago

Ciao Sicro,

just adding a few more useful links on the topic ... Also, have a look at my recent project:

https://github.com/tajmone/lemon-grove

The Lemon parser generator can be (and has been) tweaked to produce parsers for other languages too, so it might be adapted to produce PB parsers. Lemon has a template-based approach to the final parser code generation.

Articles

Tutorials

Books

This is an amazing book on how to create your own languages:

It's actually a readable book, not one of those highly technical books for academia initiates; the author simplifies the topic, but its contents remain really high quality. It's a must-have!

SicroAtGit commented 5 years ago

Thanks! I'll have a look at it all soon.