SamCoVT / TaliForth2

A Subroutine Threaded Code (STC) ANSI-like Forth for the 65c02
Other
29 stars 5 forks source link

constant folding proposal #50

Closed patricksurry closed 4 months ago

patricksurry commented 5 months ago

I was musting on #46 some more. Rather than saving a byte on xt_literal -> TOS, what about a simple system where words could opt-in to supporting their first argument in the code path rather than the stack? If below makes sense i could try a prototype.

currently Tali compiles 1234 foo ( ... n -- ... ) like

    jsr xt_literal
.word 1234
    jsr xt_foo

imagine foo has an alternate entry pointfoo' which pulls n from the code path like literal_runtime. then the compiler could rewind and replace xt_literal in the output stream, saving 3 bytes and potentially improving performance since foo' could avoid a stack push/pop:

    jsr xt_foo'
.word 1234

A simple opt-in macro would add the alternate entry-point at a fixed offset before the normal entry point to indicate a first arg at the return address rather than on the stack. For example:

.macro CF_OPTIN 
    bra \1      ; or we could set a flag like C=1/0
.endmacro

CF_OPTIN foo_inline_handler   
xt_foo:
    ... original code 
foo_inline_handler: 
   ... deal with first arg on the stack, e.g.
    jsr fetch_following_literal_as_A+Y
    ... do stuff directly with A/Y or push to stack and use original code path
z_foo:

The header would be unchanged other than flagging nt_foo with an unused status bit like CF=64. The compiler would generate the literal code normally, setting a flag to remember it just emitted one. When it compiles the word following a literal it would check if the word was opted in via CF status bit. If the bit was set it would rewind and overwrite jsr xt_literal with jsr xt_foo-2, then advance the code pointer past the literal. If CF not set, it would generate code as normal.

SamCoVT commented 5 months ago

This sounds very cool, and I'd be very interested to see if you could make it work, but it's probably not something I'd add to Tali because of the added complexity and my own limited time to maintain Tali. I feel it also might make the disassembly even less clear than it currently is for new users.

Tali is public domain, so you are welcome to take as much or little as you like and create your own Forth that you maintain separately. Because Forth is a little like a scaffold for you to build the language you actually want, it's generally expected that folks will fork Tali and customize it heavily for their specific use case.

I am generally interested in things that make Tali (in descending order of interest): Easier to understand or use Simpler Faster Smaller

One issue that I would very much appreciate help in is speeding up DO loops. This is one place that Tali is not particularly fast, but it's a pretty tricky topic and I don't know how much speedup is even possible.

patricksurry commented 5 months ago

wow do does look pretty heavy in both space and cycles

patricksurry commented 5 months ago

played with this idea a little. a simplified version is just to add a static six byte prologue in front of any word you want to opt in for constant folding. this basically just adds a jsr literal_runtime entrypoint in front of such words. the compiler works as described above: it remembers if the last thing it compiled was a literal. if the current word supports constant folding and isn't inlined, it can rewind and replace jsr literal_runtime with jsr word-6 or jsr word-4 to indicate that the subsequent 2 or 1 bytes should be pushed to TOS before falling thru normally to the word. (if the literal msb is 0 it rewinds the cp one byte and uses the one-byte entrypoint) essentially just using the word's predefined entrypoint for literal_runtime instead of adding one in the code stream.

This trades 6 bytes for each optin word definition for saving 3-4 bytes each time the word is called immediately after a literal. In my code base which compiles to about 20K (with nc-limit 0), there are about 500 literal references (mostly single bytes). Opting in ten common native words would save about 1.2K of generated code. Exposing a flag for user words to also opt-in would increase potential saving to 10%.

; prologue to handle inline literal
; word-6 entry for 2 byte trailing literal
        sec                 ; 2 byte word is stored after RTS
        .byte $24           ; mask clc using BIT zp so C unchanged (bcs is +1 byte, -0 cycle)
; word-4 entry for 1 byte trailing literal
        clc                 ; 1 byte is stored after RTS; MSB is 0
        jsr fetch_literal   ; pushes inline literal to TOS
xt_word:        ; normal entry point
       ... code unchanged ...

the fetch_literal code looks like this:

fetch_literal:

; We want to fetch the one (C=0) or two (C=1) byte constant folded into the code stream
; to TOS  The calling seuqence was like this:
;
;               ; compiled code stream
;               jsr xt_word-6       ; enter prologue
;       data:   lsb msb             ; folded data
;               jsr xt_next         ; next word
;               ...
;
;               ; the xt_word prologue where we've been called
;       xt_word-6:
;               ...
;               jsr fetch_literal
;       xt_word:
;               ...
;
; all of which to say the stack currently looks like this:
;
;       $100+SP:
;               .byte ?             ; next free space on stack
;               .word xt_word-1
;               .word data-1        ; fetch and increment
;
; we need to fetch the byte(s) at data, and increment the return address

        dex             ; make space on forth stack
        dex
        stz 1,x         ; zero MSB in case C=0
        txa
        tay             ; move forth stack ptr to Y ...
        tsx             ; ... so we can move cpu SP to X
        phy             ; also stash forth stack so we can modify Y

        lda $103,x      ; #<(data-1)
        sta tmp1
        lda $104,x      ; #>(data-1)
        sta tmp1+1

_fetch:
        inc tmp1
        bne +
        inc tmp1+1
+       lda (tmp1)      ; fetch folded byte
        sta 0,y
        bcc _cleanup    ; original C=0 means single byte
        iny
        clc
        bra _fetch      ; second byte

_cleanup:
        lda tmp1        ; update return address
        sta $103,x
        lda tmp1+1
        sta $104,x

        plx             ; restore forth stack
        rts