loop streamlining - Githubissues

patricksurry commented 5 months ago

Here are a few ideas for do loop optimization while I was figuring out how it worked.

The ROM size is +6 bytes where some space optimizations were compensated by adding a specialized loop_runtime (+1 instead of +step).

The runtime is about 40 bytes less for each instance by making do_runtime a subroutine, so slightly more expensive on loop startup, but the loop step (for step=1) is shorter and quicker (avoiding the jsr xt_one stack juggling stuff).

Saved a couple of bytes in xt_i but wondering if there's something better there. If you assume i is accessed in most loops it might be better to speculatively increment the index at the same time as incremental the fudged limit in the loop footer?

SamCoVT commented 5 months ago

Having a special runtime for +1 looping seems reasonable. In my own Forth code, I'm much more likely to be either counting or doing sequential things, so I use LOOP way more often than +LOOP.

Sometimes I use I in loops and sometimes I don't, so I'm not sure if it's worth having the fudge factor or not. It's designed to make it so the sign bit in the MSB can be used to know when to stop (note that with +LOOP, it's possible that you don't stop exactly on the ending value, so you have to handle passing the ending value). It's a hassle to "un-fudge" the index for I, though.

The testing for loops is reasonably thorough, so you'll likely know quickly if you break anything. There currently isn't any cycle counting testing for do loops. Perhaps it makes sense to define some words with various kinds of do loops and cycle count them? It would be good to have a mix of w/ and w/out I, nested loops, and using +LOOP with both positive and negative values.

Some other Forths keep the top-level loop index/limit in zero page and only push them onto the return stack when starting another loop inside a loop. This makes for much faster access to the index. I is more likely to be used than J, and J wouldn't really be any slower than it is now. Tali doesn't have much zero page free (due to a self-imposed limit of only using half of zero page), but it might be possible to make some room.

Some other Forths have a control stack, separate from the data stack and return stack. I'm not sure that helps us here.

patricksurry commented 5 months ago

I think I addressed all of your feedback.

Here's what 0 nc-limit ! : foo 0 10 1 do i + loop ; see foo looks like now.

The i word is still pretty heavy (and inlined) esp when increment is 1. I'm still tempted to also increment i in loop / +loop, and keep it on the stack instead of the base FUFA. Then i could be a lightweight subroutine (or inline) that just makes a copy.

I guess also arguable whether the end loop / +loop should always be inlined - a subroutine is a little more expensive but saves a fair bit of space. Presumably tricky to make it work both ways since the stack arithmetic would have to shift by two. Or is there already machinery for that?

size (decimal): 68 

080B  20 83 A7 20 9F 93 0A 00  20 12 98 A9 08 48 A9 4E   .. ....  ....H.N
081B  48 20 B9 8B CA CA DA BA  38 BD 02 01 FD 04 01 A8  H ...... 8.......
082B  BD 03 01 FD 05 01 FA 95  01 94 00 20 ED 99 B8 7A  ........ ... ...z
083B  C8 D0 05 68 18 69 01 48  5A 70 03 4C 1F 08 68 68  ...h.i.H Zp.L..hh
084B  68 68 68 68  hhhh

80B   A783 jsr     0
80E   939F jsr     LITERAL A 
813   9812 jsr     1
816      8 lda.#
818        pha
819     4E lda.#
81B        pha
81C   8BB9 jsr     DO
81F        dex
820        dex
821        phx
822        tsx
823        sec
824    102 lda.x
827    104 sbc.x
82A        tay
82B    103 lda.x
82E    105 sbc.x
831        plx
832      1 sta.zx
834      0 sty.zx
836   99ED jsr     +
839        clv
83A        ply
83B        iny
83C      5 bne      843 v
83E        pla
83F        clc
840      1 adc.#
842        pha
843        phy
844      3 bvs      849 v
846    81F jmp
849        pla
84A        pla
84B        pla
84C        pla
84D        pla
84E        pla

SamCoVT commented 5 months ago

The i word is still pretty heavy (and inlined) esp when increment is 1. I'm still tempted to also increment i in loop / +loop, and keep it on the stack instead of the base FUFA. Then i could be a lightweight subroutine (or inline) that just makes a copy.

I guess also arguable whether the end loop / +loop should always be inlined - a subroutine is a little more expensive but saves a fair bit of space. Presumably tricky to make it work both ways since the stack arithmetic would have to shift by two. Or is there already machinery for that?

I believe the loop words are flagged AlwaysNative, and it's precisely because if they were a JSR the values would be in a different place on the return stack. It's also to save 12 cycles of a JSR/RTS. There is currently no easy way to handle both a JSR and natively compiled version.

Moving the current value of I into zero page would give a significant speedup and size reduction, as accessing it would be just zero page accesses, but we'd need to find a place to put it. There are other zero-page system variables that could be moved into the RAM system variables to make room. I think, ideally, we should have the fudged value and the fudge factor in zero page to speed up I as much as possible. When starting a loop, the values in those would be pushed onto the stack (so J would still be slower) to make room for the current values.

Does it make sense to pull what you have so far? It is already better/faster. If you are interested in trying out a zero page version, you should let me know and I can shuffle some things around first to make room for the variables you'll need.

patricksurry commented 5 months ago

yup, i think this current state is a decent intermediate point to pull.

if you give me some help on organizing the storage (and maybe a bit more explanation of how nested loops should behave in terms of pushing things to stack) i'd be game to try out a zp version next week

patricksurry commented 4 months ago

see #53

SamCoVT / TaliForth2

loop streamlining #52