Closed patricksurry closed 8 months ago
Having a special runtime for +1 looping seems reasonable. In my own Forth code, I'm much more likely to be either counting or doing sequential things, so I use LOOP
way more often than +LOOP
.
Sometimes I use I
in loops and sometimes I don't, so I'm not sure if it's worth having the fudge factor or not. It's designed to make it so the sign bit in the MSB can be used to know when to stop (note that with +LOOP
, it's possible that you don't stop exactly on the ending value, so you have to handle passing the ending value). It's a hassle to "un-fudge" the index for I
, though.
The testing for loops is reasonably thorough, so you'll likely know quickly if you break anything. There currently isn't any cycle counting testing for do loops. Perhaps it makes sense to define some words with various kinds of do loops and cycle count them? It would be good to have a mix of w/ and w/out I
, nested loops, and using +LOOP
with both positive and negative values.
Some other Forths keep the top-level loop index/limit in zero page and only push them onto the return stack when starting another loop inside a loop. This makes for much faster access to the index. I
is more likely to be used than J
, and J
wouldn't really be any slower than it is now. Tali doesn't have much zero page free (due to a self-imposed limit of only using half of zero page), but it might be possible to make some room.
Some other Forths have a control stack, separate from the data stack and return stack. I'm not sure that helps us here.
I think I addressed all of your feedback.
Here's what 0 nc-limit ! : foo 0 10 1 do i + loop ; see foo
looks like now.
The i
word is still pretty heavy (and inlined) esp when increment is 1. I'm still tempted to also increment i
in loop
/ +loop
, and keep it on the stack instead of the base FUFA. Then i
could be a lightweight subroutine (or inline) that just makes a copy.
I guess also arguable whether the end loop
/ +loop
should always be inlined - a subroutine is a little more expensive but saves a fair bit of space. Presumably tricky to make it work both ways since the stack arithmetic would have to shift by two. Or is there already machinery for that?
size (decimal): 68
080B 20 83 A7 20 9F 93 0A 00 20 12 98 A9 08 48 A9 4E .. .... ....H.N
081B 48 20 B9 8B CA CA DA BA 38 BD 02 01 FD 04 01 A8 H ...... 8.......
082B BD 03 01 FD 05 01 FA 95 01 94 00 20 ED 99 B8 7A ........ ... ...z
083B C8 D0 05 68 18 69 01 48 5A 70 03 4C 1F 08 68 68 ...h.i.H Zp.L..hh
084B 68 68 68 68 hhhh
80B A783 jsr 0
80E 939F jsr LITERAL A
813 9812 jsr 1
816 8 lda.#
818 pha
819 4E lda.#
81B pha
81C 8BB9 jsr DO
81F dex
820 dex
821 phx
822 tsx
823 sec
824 102 lda.x
827 104 sbc.x
82A tay
82B 103 lda.x
82E 105 sbc.x
831 plx
832 1 sta.zx
834 0 sty.zx
836 99ED jsr +
839 clv
83A ply
83B iny
83C 5 bne 843 v
83E pla
83F clc
840 1 adc.#
842 pha
843 phy
844 3 bvs 849 v
846 81F jmp
849 pla
84A pla
84B pla
84C pla
84D pla
84E pla
The
i
word is still pretty heavy (and inlined) esp when increment is 1. I'm still tempted to also incrementi
inloop
/+loop
, and keep it on the stack instead of the base FUFA. Theni
could be a lightweight subroutine (or inline) that just makes a copy.I guess also arguable whether the end
loop
/+loop
should always be inlined - a subroutine is a little more expensive but saves a fair bit of space. Presumably tricky to make it work both ways since the stack arithmetic would have to shift by two. Or is there already machinery for that?
I believe the loop words are flagged AlwaysNative, and it's precisely because if they were a JSR the values would be in a different place on the return stack. It's also to save 12 cycles of a JSR/RTS. There is currently no easy way to handle both a JSR and natively compiled version.
Moving the current value of I
into zero page would give a significant speedup and size reduction, as accessing it would be just zero page accesses, but we'd need to find a place to put it. There are other zero-page system variables that could be moved into the RAM system variables to make room. I think, ideally, we should have the fudged value and the fudge factor in zero page to speed up I as much as possible. When starting a loop, the values in those would be pushed onto the stack (so J
would still be slower) to make room for the current values.
Does it make sense to pull what you have so far? It is already better/faster. If you are interested in trying out a zero page version, you should let me know and I can shuffle some things around first to make room for the variables you'll need.
yup, i think this current state is a decent intermediate point to pull.
if you give me some help on organizing the storage (and maybe a bit more explanation of how nested loops should behave in terms of pushing things to stack) i'd be game to try out a zp version next week
see #53
Here are a few ideas for do loop optimization while I was figuring out how it worked.
The ROM size is +6 bytes where some space optimizations were compensated by adding a specialized loop_runtime (+1 instead of +step).
The runtime is about 40 bytes less for each instance by making do_runtime a subroutine, so slightly more expensive on loop startup, but the loop step (for step=1) is shorter and quicker (avoiding the jsr xt_one stack juggling stuff).
Saved a couple of bytes in xt_i but wondering if there's something better there. If you assume i is accessed in most loops it might be better to speculatively increment the index at the same time as incremental the fudged limit in the loop footer?