[Help wanted] Turn lexer back into flex definition

ISSOtm commented 4 years ago

The lexer is currently a 1 kloc (+ 800 kloc if you count globlex) monstrosity of a modified auto-generated file. This should really be changed back to a flex definition somehow. The modifications made to it might require a custom program skeleton (probably, I think, but eh), but that would be certainly much better than what we currently have.

ISSOtm commented 4 years ago

This is actually progressing—who would've guessed.

I have currently the following left to implement (at least, that I'm aware of):

Symbol interpolation in strings (in progress)
Block scanning (macros, if, rept)
"Naked" symbol interpolation

We do have a problem, though. Some features need to be axed in order for this to be possible..!

OPT b and g, and relevant CLI flags. Those relied on modifying the lexer's algorithm at run-time, which I don't know how to do with flex, if at all possible.
Naked symbol interpolation. This is significantly complex to handle, and cannot be handled in the lexer proper due to spanning tokens. I would much rather prefer introducing an alternative, such as insta-expanded EQUS (EVAL?).
The EQUS stack (and recursion limit). It's hard to know at which position we're exiting a symbol now. Might be doable, I need to check YY_USER_ACTION.

Side note: if the above two OPTs are removed, this only leaves OPT p, which I believe is made redundant by ds cnt, val. So then OPT could probably be removed entirely.

JL2210 commented 4 years ago

How will this affect existing code? Can you give examples of what won't work afterwards?

ISSOtm commented 4 years ago

How this affects existing code is the question I always ask myself about each feature change. We don't have any usage statistics beyond pret's disassemblies, and this doesn't even include any hacks based on them.

I do believe OPT and the flags see very little use. According to @rangi42, they use OPT b in a single place to "draw" some graphic using bits. It has been suggested to shim around this by supporting two binary / "gfx" digit sets, the current default and the one presented in the documentation.
Symbol interpolation outside of strings is actually pretty useless. People expect it to work like macro args (#269) but in reality the docs do state it: {sym} is strictly equivalent to "{sym}", except in macro args (because). I discuss my stance on implementing this below.
The EQUS stack missing comes from the fact that while it's pretty clear when an expansion begins, when it ends is much less so, since it requires char counting. This means that EQUS expansion status can no longer be given in error reports, and infinite recursion can no longer be trapped. That said, there's a facility to execute code when any rule is matched, so that might be usable for implementing EQUS tracking again.

The reason why I do not want to make "naked" interpolations work like macro args is that parsing such interpolations is very complex, especially as they can be nested. Thus, it's OK to handle inside a string, since we're already working with a blob of data, but outside of it, it's expected to span multiple tokens (like macro args).

Macro args are not handled in the lexer rules proper, since they happen beyond the concept of tokens; instead, the code responsible for filling the lexer's text buffer (YY_INPUT) is overridden to look for macro args, and "paste" their contents in. The complication is that flex puts a cap on the amount of characters that it expects from the function, so the entire contents can't be plainly dumped. To avoid unnecessary copies/allocations, and thus keep the lexer—one of the two most performance-critical pieces of RGBASM—efficient, additional logic is present to "spread" the pasting across buffer refills.

The problem that "naked" symbol interpolations introduce is that they are really complicated to process. The hardest part is that they can be nested, which opens the can of worms of where to write the expanded results to, and so on. (It worked in the old lexer because the target buffer was large)

JL2210 commented 4 years ago

The only issue I see is that some auto-generation tools (particularly sdcc) generate an .optsdcc -mgbz80.

ISSOtm commented 4 years ago

This is irrelevant, because SDCC uses a custom weird assembler as its back-end instead of RGBDS.

JL2210 commented 4 years ago

@ISSOtm And it also has an option to generate RGBDS assembly, which is where I found that example.

ISSOtm commented 4 years ago

But that wouldn't be valid RGBDS syntax; where is the code generating that?

JL2210 commented 4 years ago

My best guess would be in sdcc-4.0.0/sdas/asgb/gbpst.c, and then maybe sdcc-4.0.0/src/z80/mappings.i.

gbdev / rgbds

[Help wanted] Turn lexer back into flex definition #485