SamCoVT / TaliForth2

A Subroutine Threaded Code (STC) ANSI-like Forth for the 65c02

Some simple benchmarks #102

Closed patricksurry closed 4 months ago

patricksurry commented 5 months ago

I didn't add this to the standard test suite since it takes a few seconds to run, but you can run it manually like this:

cat tests/bench.fs | c65/c65 -r taliforth-c65.bin

Here's a quick comparison to a Nov'23 binary with and without strip-underflow set to true. (nc-limit is the default 20)

tl;dr: the loop speedups and other optimizations gained about 25%, and the underflow improvements add roughly another 10%.

cat tests/bench.fs | c65/c65 -r taliforth-py65mon-20231101.bin  # false strip-underflow
ddbench: 3965206 cycles.
intcalcs: 25882848 cycles.
fib2-bench: 28490885 cycles.
nesting: 25165977 cycles.
sieve-bench: 11737688 cycles.
gcd1-bench: 37438386 cycles.
pal-bench: 55207542 cycles.
coll-bench: 35334127 cycles.
Complete: 223222659 cycles  ok

cat tests/bench.fs | c65/c65 -r taliforth-py65mon-20231101.bin  # true strip-underflow
ddbench: 2916598 cycles.
intcalcs: 24090688 cycles.
fib2-bench: 21501909 cycles.
nesting: 25165929 cycles.
sieve-bench: 9499012 cycles.
gcd1-bench: 32771202 cycles.
pal-bench: 54089462 cycles.
coll-bench: 33462223 cycles.
Complete: 203497023 cycles  ok    \ 9% faster than 20231101 with underflow

cat tests/bench.fs | c65/c65 -r taliforth-c65.bin # false strip-underflow
ddbench: 2197901 cycles.
intcalcs: 24298747 cycles.
fib2-bench: 20633878 cycles.
nesting: 25165977 cycles.
sieve-bench: 9528394 cycles.
gcd1-bench: 25723171 cycles.
pal-bench: 27761597 cycles.
coll-bench: 29928374 cycles.
Complete: 165238039 cycles  ok  \ 26% faster than 20231101 with underflow

cat tests/bench.fs | c65/c65 -r taliforth-c65.bin # true strip-underflow
ddbench: 1149293 cycles.
intcalcs: 22050559 cycles.
fib2-bench: 13644902 cycles.
nesting: 25165929 cycles.
sieve-bench: 7280223 cycles.
gcd1-bench: 16415883 cycles.
pal-bench: 26406021 cycles.
coll-bench: 28056470 cycles.
Complete: 140169280 cycles  ok \ 31% faster than 20231101 with no underflow
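
For anyone who wants to recompute the percentages from the raw output, here is a minimal Python sketch. It is not part of the PR, and `parse_bench` and `pct_faster` are hypothetical helper names. It assumes the output lines look exactly like the ones above and reads "N% faster" as the percentage reduction in cycle count. Only two lines from the first and last runs are pasted in to keep it short.

```python
import re

# Excerpts copied from the first and last runs above, trimmed to two lines each.
BASELINE = """\
ddbench: 3965206 cycles.
Complete: 223222659 cycles  ok
"""
CURRENT_STRIPPED = """\
ddbench: 1149293 cycles.
Complete: 140169280 cycles  ok
"""


def parse_bench(text):
    """Return {name: cycles} for every line shaped like 'name: N cycles'."""
    results = {}
    for match in re.finditer(r"^([\w-]+):\s+(\d+)\s+cycles", text, re.MULTILINE):
        results[match.group(1)] = int(match.group(2))
    return results


def pct_faster(old_cycles, new_cycles):
    """Percentage reduction in cycles going from old_cycles to new_cycles."""
    return 100.0 * (1.0 - new_cycles / old_cycles)


old = parse_bench(BASELINE)
new = parse_bench(CURRENT_STRIPPED)
for name in old:
    print(f"{name}: {pct_faster(old[name], new[name]):.1f}% fewer cycles")
```

Fed the `Complete` totals above, this gives roughly 26% for the new binary with underflow checks kept and roughly 37% with them stripped, both measured against the Nov'23 binary with underflow checks.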

patricksurry commented 5 months ago

One nice twist: if you tag allow-native after each word in the nesting benchmark, it drops from 25165929 cycles to 0 since it happily inlines away all the empty words 🎉

Is the current never-native default for user words just for safety? Perhaps it should switch to allow-native ?

SamCoVT commented 5 months ago

The current never-native status on new words is indeed for safety because we currently don't have a way to tell if a word contains a JMP or not. If there is a JMP, then native compiling is dangerous because it will jump back into the original word (and return at the end of that word, completely skipping the stuff that was compiled AFTER that word in the newly compiled word). If the user did not use any flow control or JMP instructions, then they can apply allow-native or always-native as they see fit. This is described in the "Native Compiling" section of the manual (search for allow-native and you'll find it pretty quickly).

If you can devise a way where we can absolutely determine that a word doesn't contain JMP (or mess with the return address, which is another case where it has to be JSR'd to), then it would theoretically be possible to flag a new word allow-native (which is what you get if neither the AN (always native) nor the NN (never native) flag is set) automatically.
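
As a rough illustration of what such a check might look like, here is a deliberately conservative Python sketch. Nothing in this thread says Tali does this, and `definitely_no_jmp` is a hypothetical name. It scans the raw bytes of a compiled word for the three 65C02 JMP opcodes ($4C absolute, $6C indirect, $7C absolute indexed indirect). Because it does not decode instructions, an operand byte that happens to equal one of those values also counts as a hit, so it can only err on the side of "not safe". It says nothing about words that manipulate the return address.

```python
# 65C02 JMP opcodes: JMP abs, JMP (ind), JMP (abs,X).
JMP_OPCODES = {0x4C, 0x6C, 0x7C}


def definitely_no_jmp(code: bytes) -> bool:
    """True only if no byte in the word could be a JMP opcode.

    Conservative by design: operand or data bytes equal to a JMP opcode
    also cause a False result, so False means "unknown", not "has a JMP".
    Return-address tricks are NOT detected by this check.
    """
    return not any(byte in JMP_OPCODES for byte in code)


# INX INX: no candidate byte anywhere.
print(definitely_no_jmp(bytes([0xE8, 0xE8])))
# JMP $1234: contains a real JMP.
print(definitely_no_jmp(bytes([0x4C, 0x34, 0x12])))
# LDA #$4C: no JMP at all, but the operand byte trips the scan.
print(definitely_no_jmp(bytes([0xA9, 0x4C])))
```

The last case shows the cost of skipping instruction decoding: some harmless words would stay never-native. A decoder with a full instruction-length table would avoid that, at the price of more code.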

patricksurry commented 5 months ago

oh, so this would cover cases where words contained literal assembly code, like cycles in the tests?

i was thinking of typical words written purely in forth (without knowledge that it's a 6502 underneath) - presumably those would all be safe? i was assuming this would cover most normal use: the former seems like an advanced use case where the user might be expected to know they should tag as never-native ?

SamCoVT commented 5 months ago

Words written in Forth are not safe either, which is why Never-Native is the default. If they have flow control that uses JMP (such as IF/ELSE/THEN or a loop), then they can't be natively compiled. Here is an example:

: aword dup if . else drop then ;
allow-native
100 nc-limit !
: bword 5 aword ." Never makes it here" ;
bword ( should print "5 Never makes it here" but it does not )

Try see aword and see bword and see if you can spot the problem.
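
For readers without Tali at hand, here is a toy model of what goes wrong. It is a Python sketch of a made-up machine, not Tali's actual code generation: the addresses, the `PRINT` and `NOP` pseudo-instructions, and the `run` helper are all invented for illustration. The only thing it shares with the real case is the mechanism described earlier in the thread: the copied body still contains a JMP to an absolute address inside the original word.

```python
def run(memory, entry):
    """Execute toy instructions from `entry`; return everything printed.

    An RTS with an empty return stack stands for returning to the interpreter.
    """
    printed = []
    return_stack = []
    pc = entry
    while True:
        op, arg = memory[pc]
        if op == "PRINT":
            printed.append(arg)
            pc += 1
        elif op == "NOP":
            pc += 1
        elif op == "JMP":           # absolute jump: target does not move when copied
            pc = arg
        elif op == "JSR":
            return_stack.append(pc + 1)
            pc = arg
        elif op == "RTS":
            if not return_stack:
                break
            pc = return_stack.pop()
        else:
            raise ValueError(f"unknown op {op!r}")
    return "".join(printed)


memory = {
    # "aword" with the IF branch taken: print, then ELSE's JMP skips to THEN.
    100: ("PRINT", "5 "),
    101: ("JMP", 103),
    102: ("NOP", None),             # stand-in for the skipped ELSE branch
    103: ("RTS", None),
    # "bword" compiled with a subroutine call (the never-native behaviour).
    200: ("JSR", 100),
    201: ("PRINT", "Never makes it here"),
    202: ("RTS", None),
}

# "bword" with aword's body (minus its RTS) copied in verbatim at 300.
for offset in range(3):
    memory[300 + offset] = memory[100 + offset]
memory[303] = ("PRINT", "Never makes it here")
memory[304] = ("RTS", None)

print(repr(run(memory, 200)))   # called version
print(repr(run(memory, 300)))   # inlined version
```

In the inlined copy the JMP at 301 still targets 103, which is the RTS of the original word. Nothing was pushed by a JSR, so that RTS returns straight to the interpreter and the trailing print is never reached.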

patricksurry commented 5 months ago

💡

SamCoVT commented 5 months ago

I do like having this benchmark option, especially with the ability on GitHub to go back and fetch an earlier binary. You've made a ton of great progress on speeding up Tali during general use.

If it's not going to be run with the regular test suite, I'd recommend putting a comment at the very top showing how to run it. You could also add the above benchmarks you've done (with dates from the binaries used) so there is something to reference against if someone runs it in the future. Once I merge this, the documentation you have here will be hidden amongst the many merged pull requests.

patricksurry commented 5 months ago

Good call. I added some instructions and started a results log. indicates master along with the bugfixes here, which were breaking a couple of the tests. Both bugs (mismatching UF between header and body, and skipping the underflow optimization if the word also had stack juggling) also have fixes in other branches, but I'll just update as you merge things.

SamCoVT commented 5 months ago

I just merged a bunch of PRs - I think you merged master into this PR somewhere in the middle of my merge-o-thon. Let me know once you like this one for pulling.

A key for the table might be nice - it took me a minute or two of looking at it to realize SUF was Strip UnderFlow (my mind was stuck on "suffix" for some reason). I was able to suss it out after looking at the commands at the top.

patricksurry commented 5 months ago

Yup, this looks good to go now - updated the comments plus the latest results since all the fixes here were picked up in other branches.