SamCoVT / TaliForth2

A Subroutine Threaded Code (STC) ANSI-like Forth for the 65c02
Other
29 stars 5 forks source link

branch optimizations #54

Closed patricksurry closed 5 months ago

patricksurry commented 5 months ago

I noticed 0branch did some extra gymnastics to rts instead of jmp () like branch. also looking at using native branch for short ones

patricksurry commented 5 months ago

This has a couple of small changes (faster 0branch and native jmp for unconditional branch, like loop already does) and a bigger one that does native branch optimization for short forward branches. You can see both things in this after/before example.

The jsr BRANCH 81f becomes jmp xxx saving a couple of bytes and lots of cycles, and the jsr 0BRANCH 81a becomes jsr Z / beq +8 where Z is a tiny runtime that sets the zero flag from TOS saving a bunch of cycles on the conditional branch.

I routed other users of 0BRANCH like else and while through xt_then to benefit from same optimization. This makes conditionals faster, especially compact ones.

The other candidate for branch optimization is UNTIL which needs a backward branch. If you like this idea I'm happy to add that.

0 nc-limit ! : foo if 3 else 4 then ; see foo 
nt: 81E  xt: 829 
flags (CO AN IM NN UF HC): 0 0 0 1 0 1 
size (decimal): 18 

0829  20 01 92 F0 08 20 80 93  03 00 4C 3B 08 20 80 93   .... .. ..L;. ..
0839  04 00  ..

829   9201 jsr     Z
82C      8 beq      836 v
82E   9380 jsr     LITERAL 3 
833    83B jmp
836   9380 jsr     LITERAL 4 

which previously looked like:

nt: 800  xt: 80B 
flags (CO AN IM NN UF HC): 0 0 0 1 0 1 
size (decimal): 20 

080B  20 1B 92 1A 08 20 9F 93  03 00 20 E4 8D 1F 08 20   .... .. .. .... 
081B  9F 93 04 00  ....

80B   921B jsr     0BRANCH 81A 
810   939F jsr     LITERAL 3 
815   8DE4 jsr     BRANCH 81F 
81A   939F jsr     LITERAL 4 
 ok
patricksurry commented 5 months ago

Did some cleanup and added xt_until support. This adds +80 bytes to the ROM and compilation will be slightly slower but should have nice runtime benefit in cycles and small size saving for indefinite loops and conditionals

SamCoVT commented 5 months ago

I'm generally fine with ROM being a bit larger and compilation time being longer for a savings in cycles and/or bytes in the generated code.

patricksurry commented 5 months ago

I made that simplification you suggested, and also added a few new cycle tests to measure conditionals which show a nice boost:

Before:

: doifword 0 100 0 do i 2 and if 1 else -1 then + loop drop ;  ok
             ' doifword      cycle_test            CYCLES:  31080 ok
: buword 100 begin 1- ?dup 0= until ;  ok
             ' buword        cycle_test            CYCLES:  19752 ok
: bwword 100 begin 1- ?dup while repeat ;  ok
             ' bwword        cycle_test            CYCLES:  15074 ok

after

: doifword 0 100 0 do i 2 and if 1 else -1 then + loop drop ;  ok
             ' doifword      cycle_test            CYCLES:  26180 ok
: buword 100 begin 1- ?dup 0= until ;  ok
             ' buword        cycle_test            CYCLES:  16463 ok
: bwword 100 begin 1- ?dup while repeat ;  ok
             ' bwword        cycle_test            CYCLES:  11663 ok
patricksurry commented 5 months ago

sorry about the random walk here... I think it's in a good state now but i'll have a look at the cycle test and disasm comments you noted.

SamCoVT commented 5 months ago

Let me know when you want this PR merged.

patricksurry commented 5 months ago

i'll clean up the couple things you mentioned and then it should be good.

i assume there's no easy way to avoid the hard-coded tmp1 address in the ed test? that tripped me up for a while after its location shifted 2 bytes. (Also the trailing spaces are apparently important in ed.fs but vi was eventually my friend :)

patricksurry commented 5 months ago

ok this should be good to go now