Alternate loop optimization

patricksurry commented 5 months ago

This version takes all the loop control off the CPU/R stack and puts it at the bottom of page 1. This gives an upward growing loop control stack where we never need to move anything around, just change an index. This simplifies various things and doesn't cause any more risk to the CPU stack since items that once were there are just at the bottom of the stack page instead.

patricksurry commented 5 months ago

Pros and cons:

- currently the most common i++ inc is slower here.
- i is currently 21 bytes so defaults to slower subroutine version
- perhaps unorthodox to have no loop control on the stack?
? could actually use R across loops but that wouldn't be portable
+ less stack manipulation means we don't need AN for i,j,unloop etc; so smaller code if you don't mind speed tradeoff
+ loop/unloop setup/teardown is a lot faster, so expensive nested loops get most benefit

i have some ideas about making i++ fast again. also thinking about the i math; that's particularly time-consuming (and chunky inline code) in a normal LOOP since you have to do the 16 bit math every iteration, as opposed to 255/266 times just doing an inc.

SamCoVT commented 5 months ago

This is very interesting. Because you're no longer in zero page, you'll be indexing with a 16-bit base address (eg. $0100 and $0102 for loopindex and loopfufa), so you could move this anywhere at the same speed.

It's too bad that zero page wraps - otherwise you could use $FF and an index starting at 1 to reach those locations and save a byte on each instruction that touches the data.

patricksurry commented 5 months ago

Ya I actually had a bug for a while trying to use zero page wrapping intentionally like $fc,y but of course that is actually a 16bit index with y so it doesn’t wrap. Took me a while. But cool that I could easily debug forth with forth!

On Fri, Mar 29, 2024 at 3:50 PM SamCoVT @.***> wrote:

This is very interesting. Because you're no longer in zero page, you'll be indexing with a 16-bit base address (eg. $0100 and $0102 for loopindex and loopfufa), so you could move this anywhere at the same speed.

It's too bad that zero page wraps - otherwise you could use $FF and an index starting at 1 to reach those locations and save a byte on each instruction that touches the data.

— Reply to this email directly, view it on GitHub https://github.com/SamCoVT/TaliForth2/pull/63#issuecomment-2027674342, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABA5MKUBAE4UE3PDWIW3IFTY2XAYRAVCNFSM6AAAAABFOKIXC2VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDAMRXGY3TIMZUGI . You are receiving this because you authored the thread.Message ID: @.***>

patricksurry commented 5 months ago

Optimizing for single step inner LOOP works, particularly for long loops (below), but the cost (cycles and complexity) of runtime checks for optimization is probably too high for small loops. I'll try one other idea for doing the optimization at compile time.

The goal would be to choose the faster LOOP and I runtimes at compile time so there are no compile time choices. But this means we have to detect inner loops at compile time. The challenge is that they could occur in any word within the loop (including after I appears). It's probably feasible by taking one of the unused word status bits as a flag for "uses LOOP". We'd only flag the LOOP and EXECUTE native words. Then when compiling any new definition we'd OR together the flag values for any words it uses to set its flag value. The I optimization could be a separate choice (or we could forget about it) but it would need some kind of back-patching like LEAVE since we don't know the right runtime until we get to the closing LOOP.

I'll try a mockup adding LOOP' and I' words so I can manually optimize loops and see what the performance looks like.

: dodowordbigi 10 0 do 1024 0 do i drop loop loop ;  ok
# master
             ' dodowordbigi  cycle_test            CYCLES: 822298 ok
# this version (15% faster)
             ' dodowordbigi  cycle_test            CYCLES: 700317 ok

patricksurry commented 4 months ago

see #53

SamCoVT / TaliForth2

Alternate loop optimization #63