Closed patricksurry closed 7 months ago
This sounds very cool, and I'd be very interested to see if you could make it work, but it's probably not something I'd add to Tali because of the added complexity and my own limited time to maintain Tali. I feel it also might make the disassembly even less clear than it currently is for new users.
Tali is public domain, so you are welcome to take as much or little as you like and create your own Forth that you maintain separately. Because Forth is a little like a scaffold for you to build the language you actually want, it's generally expected that folks will fork Tali and customize it heavily for their specific use case.
I am generally interested in things that make Tali (in descending order of interest): Easier to understand or use Simpler Faster Smaller
One issue that I would very much appreciate help in is speeding up DO
loops. This is one place that Tali is not particularly fast, but it's a pretty tricky topic and I don't know how much speedup is even possible.
wow do
does look pretty heavy in both space and cycles
played with this idea a little. a simplified version is just to add a static six byte prologue in front of any word you want to opt in for constant folding. this basically just adds a jsr literal_runtime
entrypoint in front of such words. the compiler works as described above: it remembers if the last thing it compiled was a literal. if the current word supports constant folding and isn't inlined, it can rewind and replace jsr literal_runtime
with jsr word-6
or jsr word-4
to indicate that the subsequent 2 or 1 bytes should be pushed to TOS before falling thru normally to the word. (if the literal msb is 0 it rewinds the cp one byte and uses the one-byte entrypoint) essentially just using the word's predefined entrypoint for literal_runtime instead of adding one in the code stream.
This trades 6 bytes for each optin word definition for saving 3-4 bytes each time the word is called immediately after a literal. In my code base which compiles to about 20K (with nc-limit 0), there are about 500 literal references (mostly single bytes). Opting in ten common native words would save about 1.2K of generated code. Exposing a flag for user words to also opt-in would increase potential saving to 10%.
; prologue to handle inline literal
; word-6 entry for 2 byte trailing literal
sec ; 2 byte word is stored after RTS
.byte $24 ; mask clc using BIT zp so C unchanged (bcs is +1 byte, -0 cycle)
; word-4 entry for 1 byte trailing literal
clc ; 1 byte is stored after RTS; MSB is 0
jsr fetch_literal ; pushes inline literal to TOS
xt_word: ; normal entry point
... code unchanged ...
the fetch_literal
code looks like this:
fetch_literal:
; We want to fetch the one (C=0) or two (C=1) byte constant folded into the code stream
; to TOS The calling seuqence was like this:
;
; ; compiled code stream
; jsr xt_word-6 ; enter prologue
; data: lsb msb ; folded data
; jsr xt_next ; next word
; ...
;
; ; the xt_word prologue where we've been called
; xt_word-6:
; ...
; jsr fetch_literal
; xt_word:
; ...
;
; all of which to say the stack currently looks like this:
;
; $100+SP:
; .byte ? ; next free space on stack
; .word xt_word-1
; .word data-1 ; fetch and increment
;
; we need to fetch the byte(s) at data, and increment the return address
dex ; make space on forth stack
dex
stz 1,x ; zero MSB in case C=0
txa
tay ; move forth stack ptr to Y ...
tsx ; ... so we can move cpu SP to X
phy ; also stash forth stack so we can modify Y
lda $103,x ; #<(data-1)
sta tmp1
lda $104,x ; #>(data-1)
sta tmp1+1
_fetch:
inc tmp1
bne +
inc tmp1+1
+ lda (tmp1) ; fetch folded byte
sta 0,y
bcc _cleanup ; original C=0 means single byte
iny
clc
bra _fetch ; second byte
_cleanup:
lda tmp1 ; update return address
sta $103,x
lda tmp1+1
sta $104,x
plx ; restore forth stack
rts
I was musting on #46 some more. Rather than saving a byte on xt_literal -> TOS, what about a simple system where words could opt-in to supporting their first argument in the code path rather than the stack? If below makes sense i could try a prototype.
currently Tali compiles
1234 foo ( ... n -- ... )
likeimagine
foo
has an alternate entry pointfoo'
which pullsn
from the code path likeliteral_runtime
. then the compiler could rewind and replacext_literal
in the output stream, saving 3 bytes and potentially improving performance sincefoo'
could avoid a stack push/pop:A simple opt-in macro would add the alternate entry-point at a fixed offset before the normal entry point to indicate a first arg at the return address rather than on the stack. For example:
The header would be unchanged other than flagging
nt_foo
with an unused status bit likeCF=64
. The compiler would generate the literal code normally, setting a flag to remember it just emitted one. When it compiles the word following a literal it would check if the word was opted in via CF status bit. If the bit was set it would rewind and overwritejsr xt_literal
withjsr xt_foo-2
, then advance the code pointer past the literal. If CF not set, it would generate code as normal.