SamCoVT / TaliForth2

A Subroutine Threaded Code (STC) ANSI-like Forth for the 65c02

simple peephole optimizer #66

Open patricksurry opened 7 months ago

patricksurry commented 7 months ago

One way we could improve space/cycle performance is to use a very simple peephole optimizer. Common pairs of words of the form ( ... -- n ) ( n -- ... ) can often be replaced by a more efficient compound word which avoids the intermediate stack operations and replaces two jsrs with one. Often the compound word could just be an alternate entry point to an existing word where TOS is already in tmp1 or the A/Y registers, and/or a flag is passed via the CPU flags rather than as a stack value.

Examples might be things like ?dup if, 0= if, literal +, literal <, ...

I was thinking about trying a simple implementation where the compiler tracks the last xt it emitted along with the corresponding here. When it compiles the next word, it can check whether the pair (xt1, xt2) has a known compound. If not, proceed as before. If so, rewind and compile the compound instead. This would be a general way of handling literal folding, and it would even support longer compound sequences if that was useful (e.g. the compound for xt1-xt2 could itself get rewritten again by a third compound xt1-xt2-xt3).
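Here's a rough word-level sketch of what I have in mind (all names are hypothetical, not Tali internals; find-pair would consult a table of known pairs and is just stubbed out):

    variable last-xt     \ xt of the most recently compiled word (0 = none)
    variable last-here   \ value of here just before that word's code
    0 last-xt !          \ start with no pending word

    \ Look up (xt1 xt2) in a table of known compounds; on a hit return
    \ the fused xt and true.  Stubbed out here: never matches.
    : find-pair ( xt1 xt2 -- xt-fused true | xt2 false )  nip false ;

    : compile-xt ( xt -- )
        last-xt @ ?dup if                 \ do we have a previous word?
            swap find-pair if             \ ...that fuses with this one?
                last-here @ here - allot  \ rewind over the previous word's code
            then
        then
        here last-here !                  \ where this word's code will start
        dup last-xt !
        compile, ;                        \ emit the (possibly fused) word

    \ Anything that makes rewinding unsafe would clear the state:
    : forget-pair ( -- )  0 last-xt ! ;

Because the fused word becomes the new "last xt", a table entry like (xt1-xt2, xt3) falls out of the same mechanism.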

@SamCoVT is this something you'd be interested in? Have you looked at anything along these lines before?

SamCoVT commented 7 months ago

I have written peephole optimizers for compilers for other languages. Tali tends to end up with inx inx followed immediately by dex dex quite often, and it would be wonderful to optimize even that single combination to save 4 bytes and 8 cycles.

One of the biggest issues is going to be that the programmer has access to here. During compiling, the programmer can switch to interpreted mode, access here (perhaps for custom flow control), and then use it later. If here were saved right after a word that lays down something like inx inx, then you can't reclaim those bytes - if you do, the place that jumps to the saved here will jump to the wrong location.
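For example (a contrived sketch using only standard words - the exact body doesn't matter, only the here capture in the middle of the definition):

    : marked-swap ( a b -- b a addr )
        swap                \ code already laid down - a rewind candidate
        [ here ] literal ;  \ interpret-mode here, baked in as a run-time value

If the optimizer fused swap with whatever comes next and rewound here over swap's code, the captured address would point into (or past) the rewritten bytes - and if the programmer was using it as a branch target for custom flow control, the branch would land in the wrong place.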

Writing a post-compilation peephole optimizer would be super difficult because Tali allows mixing of data and opcodes in a word, and the optimizer would not be able to detect words created with CREATE/DOES>. You could imagine a LITERAL created with the value $E8E8 (which looks like inx inx) followed immediately by a word that starts with dex dex, just as an example.

In the same way that allow-native and always-native require the programmer to verify that there are no control structures that result in a JMP instruction, we could write an optimizer that is called on demand to process the most recently compiled word, along with a specific set of rules for when it is and is not allowed.

patricksurry commented 7 months ago

Excellent point. I was thinking about it at the Forth word level (rather than trying to do something post-compilation, which seems much more difficult, as you say). It would only maintain the previous word/here state while it stays in compile mode, so if you switched out to interpretation you'd lose that state and it wouldn't recognize a compound word spanning that "unsafe" transition.

I've analyzed my python-preprocessed Forth (a long sequence of whitespace-separated words) looking for common pairs, and ended up explicitly defining a few compounds to save some bytes, e.g. : +c@ + c@ ; or : >= < invert ;. But having common pairs recognized by the compiler would also make the code faster in some cases, as well as removing the mental load of having to remember which compounds exist. There are already some standard examples like 0=, which is a specially optimized compound. I always try to write 1= as well, but as a user it would be nice if I didn't have to think about it and could just write <literal> = and have the compiler make it faster/smaller. In that specific example you could imagine it optimizing to one of 0=, cliteral=, or literal=, where the first is the existing hand-crafted word and the other two are specialized entry points into xt_equal.
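Spelled out with stack effects, the hand-defined compounds look like this (1= is the one I keep writing by hand for the <literal> = case):

    : +c@  ( addr n -- c )  + c@ ;      \ fused pair from my code base
    : >=   ( n1 n2 -- f )   < invert ;  \ ditto
    : 1=   ( n -- f )       1 = ;       \ what I'd like "1 =" to become automatically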

patricksurry commented 7 months ago

gforth seems to do something along these lines called "superinstructions", and it keeps TOS in a register (see https://www.complang.tuwien.ac.at/forth/gforth/Docs-html/Engine.html#Engine )

FlashForth mentions that DUP and 0= before IF, WHILE, and UNTIL are optimized away, which sounds similar.

Counting common pairs in my code base yields things like:

     if drop
     and if
     = if
     0= if
     or if
     = or
     0= and
     @ dup
     1 =
     @ if
     ?dup if

Common patterns seem to be ( ... -- f ) ( f -- ... ), which could be replaced by a compound that passes the flag between the two words via a CPU flag rather than a stack value, and ( ... -- x ) ( x -- ... ), which could keep TOS in A/Y or zp tmp1 between the two words. E.g. dup followed by a word that consumes TOS could avoid the push/pop.

Of the above, the flag-plus-conditional examples seem like the lowest-hanging fruit, much like the zero_test_runtime we introduced for branch optimization.

I'm not sure a TOS register would be as much of a benefit, though words often do immediately copy TOS to tmp1, for example. Perhaps some words could optionally support receiving TOS already in a tos zp location?