SamCoVT / TaliForth2

A Subroutine Threaded Code (STC) ANSI-like Forth for the 65c02

loop speedups #53

Closed SamCoVT closed 4 months ago

SamCoVT commented 5 months ago

@patricksurry your loop speedup work so far has made Tali noticeably faster. I've added cycle counting with a few simple looping scenarios, and here are the before/after results:

Before merging loop speedup:
: do?word1 5 5 ?do loop ;  ok
             ' do?word1      cycle_test            CYCLES:    322 ok
: do?word2 100 0 ?do i drop loop ;  ok
             ' do?word2      cycle_test            CYCLES:  12836 ok
: doword 100 0 do loop ;  ok
             ' doword        cycle_test            CYCLES:   6700 ok
: dowordi 100 0 do i drop loop ;  ok
             ' dowordi       cycle_test            CYCLES:  12700 ok
: dodoword 100 0 do 10 0 do loop loop ;  ok
             ' dodoword      cycle_test            CYCLES:  90500 ok
: dodowordij 100 0 do 10 0 do i drop j drop loop loop ;  ok
             ' dodowordij    cycle_test            CYCLES: 210500 ok
: doword+loop 100 0 do 5 +loop ;  ok
             ' doword+loop   cycle_test            CYCLES:   2420 ok

After merging loop speedup:
: do?word1 5 5 ?do loop ;  ok
             ' do?word1      cycle_test            CYCLES:    322 ok
: do?word2 100 0 ?do i drop loop ;  ok
             ' do?word2      cycle_test            CYCLES:   8384 ok
: doword 100 0 do loop ;  ok
             ' doword        cycle_test            CYCLES:   2148 ok
: dowordi 100 0 do i drop loop ;  ok
             ' dowordi       cycle_test            CYCLES:   8248 ok
: dodoword 100 0 do 10 0 do loop loop ;  ok
             ' dodoword      cycle_test            CYCLES:  44748 ok
: dodowordij 100 0 do 10 0 do i drop j drop loop loop ;  ok
             ' dodowordij    cycle_test            CYCLES: 165748 ok
: doword+loop 100 0 do 5 +loop ;  ok
             ' doword+loop   cycle_test            CYCLES:   2418 ok
patricksurry commented 5 months ago

nice!

SamCoVT commented 5 months ago

Moving the remaining zero page system variables is going to be more work than I was expecting (probably because I already did all of the easy ones). To get started, I'm going to recommend you simply steal two cells off of the data stack. The zero page system variables start at label user0 (from the platform file) and grow toward higher addresses, while the data stack starts at dsp0 (set a little below zpage_end, from the platform file, to allow non-catastrophic underflow) and grows down toward the zero page system variables. The "expendable" zero page system variables are placed at higher addresses so they can be overwritten by the data stack if needed. The loop variables are not "expendable", so they shouldn't be placed right at the end.

In definitions.asm, I'd recommend you place your variables (I'm thinking you'll want the fudged value of I and the fudge factor) starting at user0+34 (tmpbranch is currently there) and shuffle all the others up. There is a comment that calculates how much space is left for the Data Stack - you should fix up that comment as well.
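
A rough sketch of what that could look like in definitions.asm (loopindex and loopfufa are placeholder names, not a final decision, and the surrounding variables are omitted):

    ; sketch only - placeholder names, two cells each
    loopindex = user0+34    ; zp copy of the (fudged) innermost loop index
    loopfufa  = user0+36    ; zp copy of the fudge factor ("fufa")
    tmpbranch = user0+38    ; shuffled up two cells from user0+34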

When starting a loop, the contents of these variables will be pushed onto the return stack. When ending a loop, those contents will be pulled back into these variables (in case we are exiting an inner loop and an outer loop still needs them). The words I, LOOP, and +LOOP can use these variables directly. The word J will need to be adjusted (maybe?) if the outer loop's index is in a different place on the return stack than it used to be.

I don't know how much speedup there will be here, so this is also a bit of an experiment. The push/pull only happens once per loop, and then words like I won't need to play with X to access things on the return stack (but they will still need to subtract the fudge factor).
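
In other words, something roughly like this around each loop (a sketch of the idea only, not the actual DO/LOOP runtime, using the placeholder names from above):

    ; on loop entry: save the enclosing loop's zp copies on the return stack
            lda loopindex
            pha
            lda loopindex+1
            pha
            lda loopfufa
            pha
            lda loopfufa+1
            pha
            ; ... then load this loop's fudged index and fudge factor into zp

    ; on loop exit: restore them for the enclosing loop
            pla
            sta loopfufa+1
            pla
            sta loopfufa
            pla
            sta loopindex+1
            pla
            sta loopindex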

patricksurry commented 5 months ago

Oh, I think I see what you're saying, so we can handle situations like below by pushing/pulling the tmp vars from the stack on entry/exit?

: tri 0 swap 1 do i + loop ;  ok
: trisum 0 10 2 do i tri + loop ;  ok
3 tri . 3  ok
trisum . 120  ok
patricksurry commented 5 months ago

Another thought: would it be feasible for the compiler to track an is-i-used? status flag after it sees do, so that we could choose between two entry points to loop_runtime? E.g. one would flag that it should update i along with the fudged limit, and the other would skip the update (though i would still have a slot in the stack). This way i could replace fufa on the stack and xt_i wouldn't require any math.

SamCoVT commented 5 months ago

Yes to pushing/pulling the zero page variables - setting up and tearing down the loop will probably take about the same time (perhaps even a bit longer), but the speedup should appear both in incrementing the loop (either with LOOP or with +LOOP, and those can still have separate runtimes) and in accessing I. For nested loops, J will likely take the exact same amount of time to run (only the offsets into the return stack might be different, but the general code can stay the same). I think I is used way more than J, so I don't think we need to try to speed up J accesses.

I don't think it's going to be worth the complexity to try to keep track of whether I is accessed or not. One of the goals is for Tali to stay understandable for those who want to look under the hood. It's not that many cycles to "unfudge" I.

patricksurry commented 5 months ago

btw are those new looping cycle tests in master or coming soon? they'd be useful to watch as benchmarks for this

SamCoVT commented 5 months ago

The tests for do loops are already in master. Search results.txt for do?word1.

patricksurry commented 5 months ago

I can probably get rid of tmpbranch completely after my branch opts. Would it also make sense to move nc_limit out of zp? It isn't immediately obvious that it needs to be there.

Same with state, I guess - seems a shame to use 2 bytes of zp for effectively one bit. Does it need zp addressing modes, or am I missing something else?

SamCoVT commented 5 months ago

Yes, nc_limit absolutely can move - it only gets checked once per word and only when compiling. It should be one of the easier items to move, as it's not referenced in a bunch of places. You can look at the beginning of xt_get_order: to see how data in the RAM system variables are accessed using Y indexing against UP (although #order is a single byte, in that case). I think 50 is the next available "offset" after the block I/O vectors in definitions.asm. Up to 255 is available here (256 bytes are reserved on Tali startup).
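
For example, reading nc_limit out of the user area could look roughly like this (a sketch only - offset 50 and the label name are just the suggestion above, not something that exists yet; the real code should follow xt_get_order's pattern):

    ; sketch: fetch a byte-sized RAM system variable via UP and Y indexing
    nc_limit_offset = 50        ; assumed next free offset, per above

            ldy #nc_limit_offset
            lda (up),y          ; nc_limit is now in A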

That should free up locations for putting loop variables in zero page without having to steal any locations from the data stack.

state can also go, but it is likely used in more places so it will take a bit (pun intended?) to wrangle into RAM system variables. It's only used once per word and only when interpreting (unless the user invokes it directly).

Unless something needs to be fast, the only item that really needs to be in zero page is UP, which points to where the RAM variables are. Scot (the original author) had a vision of multitasking, where UP could be swapped out for a different task and that task would have all of its own state info there, potentially even supporting a multi-user system (thus the name UserPointer for the RAM system variables).

SamCoVT commented 5 months ago

Also, there is no need to use two bytes for items that are not directly accessible from Forth, but you will want to use two bytes for anything whose address you might hand out to be used with @ and ! from within Forth. The Forth word STATE gives the address of the state variable, so it will have to be two bytes - even though Forth programs are not allowed to modify it.
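
If state does move into the RAM system variables, handing out its address would amount to something like this (a sketch only, not the actual xt_state; state_offset is a placeholder, and it assumes the usual X-indexed zero page data stack with the LSB at 0,x):

    ; sketch: push the address up+state_offset so Forth code can read it with @
    state_offset = 52           ; placeholder offset, not a real assignment
            dex
            dex                 ; make room on the data stack
            clc
            lda up
            adc #state_offset
            sta 0,x             ; address LSB
            lda up+1
            adc #0
            sta 1,x             ; address MSB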

SamCoVT commented 5 months ago

Just some additional notes when editing the zero page variables: the initial values are given in cold_zp_table, below xt_quit:. The interpret: routine in taliforth.asm uses state.

The initial values for the RAM system variables are in cold_user_table:, also below xt_quit:.

patricksurry commented 5 months ago

Cool, I'll play around with this at some point in the near future. I remember having some initial confusion about how the offsets in definitions were separated from the actual cold_xx_tables even though they must be kept in sync. Maybe there is some clearer way to organize that. Hmm.

SamCoVT commented 5 months ago

Tali generally is not fussy about where stuff gets put, so I'd certainly entertain moving the starting-values tables into definitions.asm to keep them together with the offsets. Currently there is no code in that file and it's just... well... definitions.

Putting code in that file would have the nasty side effect that COLD would no longer be the first thing at the beginning of "tali space". The definitions.asm file is .include-ed in taliforth.asm before any of the others. tass64 can handle forward references, so moving it later in the list works fine, and the only penalty is an extra pass by tass64 (4 passes instead of 3, and I can't tell any speed difference on my PC).

SamCoVT commented 5 months ago

So now there are three loop contenders. Here are my thoughts based on your comments in all three PRs:

I don't think I want to use a status bit to try to keep track of which words contain loops and which do not. That feels too much like a house of cards and it makes it more difficult for others to understand what is happening under the hood if they want to take a peek.

I don't mind some complexity during compiling to make the compiled code cleaner, smaller, or faster as long as it is easy to follow.

I don't mind adjusting the default setting for nc-limit - 20 was just a guesstimate of where it should be.

You've already sped up the loops by a significant margin, so the final solution doesn't have to be the fastest and it may make sense to pick one because it's smaller or clearer instead.

So with all that said, which do you think is the best, and what advantages/disadvantages does it have over the other two (or even compared with what we have now)?

patricksurry commented 5 months ago

Here's the cycles summary for the three versions (plus master). push/pull #60 is the standard R stack approach pulling the loop index/fufa to zp; lcb cmplx #63 is the complex version of the LCB stack; and lcb simple #64 is the streamlined version.

BTW, I was thrown off for a while because the tali.fs test changes the default nc-limit, so I was getting different answers running talitest cycles.fs directly. I modified cycles.fs to reset to that same default, so all these comparisons use that.

Obviously these tests aren't real benchmarks, so I wouldn't necessarily read a lot into the comparison, but the two clear contenders are push/pull and lcb simple. The TL;DR of the performance difference is that push/pull is 6 cycles faster for xt_i because it's using ZP directly, but lcb simple is significantly better with loop nesting, since it only needs to shadow one byte from zp instead of four and there's no data movement for adding/removing a loop level (it just increments the lcb sp rather than doing any push/pull).

LCB is also nice because the code is simpler to understand (no stack shenanigans), it uses less zp, and it means various loop words like I and J can be compiled native or as subroutines to reduce compiled code size if you want. (That is a big one for me.) As you said, the LCB stack could move anywhere, so you could also get more CPU stack space back if you need it. I guess it's "non-standard" if anyone is expecting loop control on the R stack, but afaik that's meant to be opaque and shouldn't be depended on. (And in fact, with the LCB approach Tali users could use the return stack during loops if they wanted non-portable code :)
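
To make the nesting difference concrete: entering a nested loop with the LCB stack is basically just a pointer bump, something like this (a simplified sketch - lcbsp and lcb_size are placeholder names, not the real ones):

    ; sketch: no data moves when a loop level is added, the LCB stack
    ; pointer just advances by one block (contrast the pha/pla sequence
    ; the push/pull approach needs for each nesting level)
    lcb_size = 4                ; placeholder block size
            lda lcbsp
            clc
            adc #lcb_size
            sta lcbsp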

So overall I'd vote for #64. If you agree, I can do a cleanup of the comments and doc.

: do?word1 5 5 ?do loop ;
    master      322     ; all about the same
    push/pull   325
    lcb cmplx   325
    lcb simple  325

: do?word2 100 0 ?do i drop loop ;  
    master      8384
*   push/pull   6052    ; -6 cycles for i 
    lcb cmplx   7218    ; speculatively incrementing i
    lcb simple  6658    

: doword 100 0 do loop ;
    master      2148
*   push/pull   1304    ; simple one-level loop
    lcb cmplx   2651
*   lcb simple  1310    ; about the same

: dowordi 100 0 do i drop loop ;
    master      8248
    push/pull   5904    ; -6 cycles for i
    lcb cmplx   7169
    lcb simple  6511

: dodoword 100 0 do 10 0 do loop loop ;
    master      44748
    push/pull   42704
    lcb cmplx   51318
*   lcb simple  33410   ; 9294/100 => 90 cycles better for nested loops

: dodowordij 100 0 do 10 0 do i drop j drop loop loop ;
    master      165748
    push/pull   149704  ; not a huge difference when all native compiled
    lcb cmplx   158217
    lcb simple  156410  ; default J is a JSR; +6 cycles for i
*   lcb simple' 144410  ; forcing J native compile

: dodowordbigi 10 0 do 1024 0 do i drop loop loop ; 
    master      822298
*   push/pull   587424  ; -6 cycles for xt_i
    lcb cmplx   700317  
    lcb simple  648060  ; diff 60636 = +61440 for `i`, 804 better otherwise

: doword+loop 100 0 do 5 +loop ;
    master      2418
*   push/pull   2291
    lcb cmplx   2601
*   lcb simple  2213    ; faster when step <256
SamCoVT commented 5 months ago

That sounds good. I'll take a look at the lcb simple version. Thanks for putting in all the work. It's neat to see the multiple possible implementations and their strengths/weaknesses. Once you are done (or before, even), the folks in the Forth forum on 6502.org might be interested in a description of your implementation and resulting speedup.

SamCoVT commented 5 months ago

I noticed you were using the current master branch when you really should be using the original version from before your first loop speedups were merged. I've added those below as original, and I think that gives a much better sense of the scale of your speedups, as well as showing that the differences between the various options are actually quite small. I agree that paying a few cycles for simplicity is worth it here.

: do?word1 5 5 ?do loop ;
    original    322
    master      322     ; all about the same
    push/pull   325
    lcb cmplx   325
    lcb simple  325

: do?word2 100 0 ?do i drop loop ;  
    original   12836
    master      8384
*   push/pull   6052    ; -6 cycles for i 
    lcb cmplx   7218    ; speculatively incrementing i
    lcb simple  6658    

: doword 100 0 do loop ;
    original    6700
    master      2148
*   push/pull   1304    ; simple one-level loop
    lcb cmplx   2651
*   lcb simple  1310    ; about the same

: dowordi 100 0 do i drop loop ;
    original   12700
    master      8248
    push/pull   5904    ; -6 cycles for i
    lcb cmplx   7169
    lcb simple  6511

: dodoword 100 0 do 10 0 do loop loop ;
    original    90500
    master      44748
    push/pull   42704
    lcb cmplx   51318
*   lcb simple  33410   ; 9294/100 => 90 cycles better for nested loops

: dodowordij 100 0 do 10 0 do i drop j drop loop loop ;
    original    210500
    master      165748
    push/pull   149704  ; not a huge difference when all native compiled
    lcb cmplx   158217
    lcb simple  156410  ; default J is a JSR; +6 cycles for i
*   lcb simple' 144410  ; forcing J native compile

: dodowordbigi 10 0 do 1024 0 do i drop loop loop ; 
    original    really long
    master      822298
*   push/pull   587424  ; -6 cycles for xt_i
    lcb cmplx   700317  
    lcb simple  648060  ; diff 60636 = +61440 for `i`, 804 better otherwise

: doword+loop 100 0 do 5 +loop ;
    original    2420
    master      2418
*   push/pull   2291
    lcb cmplx   2601
*   lcb simple  2213    ; faster when step `<256`
SamCoVT commented 5 months ago

I checked out the "original" codebase to run your dodowordbigi test for cycle count:

: dodowordbigi 10 0 do 1024 0 do i drop loop loop ; 
    original   1282730
    master      822298
*   push/pull   587424  ; -6 cycles for xt_i
    lcb cmplx   700317  
    lcb simple  648060  ; diff 60636 = +61440 for `i`, 804 better otherwise
SamCoVT commented 5 months ago

OK to close the other two loop PRs?

patricksurry commented 5 months ago

Yup, I just did. Want to close this also?

SamCoVT commented 5 months ago

If you want to edit the documentation, let's leave this open until that's done. If you're not interested in documentation, let me know and I can make the changes and just have you proofread it to make sure I got the technical details right.