Avoid underflow checks in internal calls to native words

patricksurry commented 5 months ago

One thing I noticed profiling tests was how heavily the underflow_n routines are used. One reason for that is that all internal calls to native words includes the underflow testing. For example jsr xt_dup occurs in 44 places. Given that the coverage tests and quality of internal words is good, it seems unnecessary to include the underflow test.

They could be avoided in a couple of ways:

explicitly do jsr xt_dup+3 but that's horrible and hard to remember which words have underflow
add a new internal label like xt_dup_no_underflow to all these words
move the xt_dup label after the underflow test, and make the header macro correct for it. This would save changing any existing code that calls native words - jsr xt_dup would just go a bit faster.

I like the third option but would love your opinion. For example if we wrote:

                jsr underflow_1
xt_dup:
                dex
                dex

                lda 2,x         ; LSB
                sta 0,x
                lda 3,x         ; MSB
                sta 1,x

z_dup:          rts

then we could use a macro like header(dup, <optional string name>, flags) to generate the header records, which adjusts to xt_dup-3 when the UF flag is set. Using a header() macro would also simplify headers.asm and avoid need for most of the nt_xxx symbols.

I think the underflow check is hard to avoid in the few words that also have stack juggling via strip_table, but I don't think those are used much internally anyway.

patricksurry commented 5 months ago

Here's work-in-progress for how the headers could be simplified with a macro: https://github.com/patricksurry/TaliForth2/blob/skip-internal-underflow/words/headers.asm

That would make it easy to move the xt_ tag after underflow and adjust the macro accordingly

patricksurry commented 5 months ago

BTW, is the z_<word> part of the dictionary header used for anything other than xt_wordsize for see and compile,? There's only one word that longer than 255 bytes (s_quote is 260 bytes) so could argue for storing only a single byte length in the header, and automatically flagging anything longer than 255 as NN. This would save just under 0.5Kb on the core image, and a byte on every new word, plus make inline compile a little simpler/faster.

(Would also be easy to shorten s_quote by moving the lookup table _esc_tr_table out of the word body.)

patricksurry commented 5 months ago

it also feels like you might be able to get rid of a second header byte since next header offset is most often - but not always - within 256 bytes. but maybe the conditional logic would be too expensive.

SamCoVT commented 5 months ago

Testing is now so fast with c65 that I don't see the excessive underflow testing as an issue. It's true that many of Tali's words come with underflow checking at the beginning, but the user has the option of using uf-strip (just set it to TRUE) to remove all of these checks from their compiled code (it skips over the first three bytes, like you are suggesting).

I don't think z- in the dictionary is used anywhere else, so using a byte for that is interesting. The value of 255 could be used as a sentinel that the value is elsewhere (perhaps after the name, although that's a bit of juggling to access). There may be a bit left in the status (flags) portion of the header, in which case that could determine if it's an 8-bit length or a 16 bit value (either length or end location). That's probably the best bet. I'd be much more interested in exploring this vs the underflow checking.

patricksurry commented 5 months ago

Yes, the top level of underflow gets stripped, e.g. if I use dup in my forth it would strip that underflow check.

But if I use (say) . it strips one underflow from xt_dot itself, but that word has native calls to dup (uf), abs (uf), zero, less_number_sign, number_sign_s (uf), rot (uf), sign (uf), number_sign_greater (uf), type (uf) and space. So it always runs at least seven more underflow checks that aren't stripped just at the top level, with more checks happening in any of those words that call other native word implementations.

If we move the xt_xxx label after the underflow, the internal calls would skip the underflow checks, but calls from forth could either include or skip the check based on the strip-underflow setting. (The header macro would adjust the xt_xxx label for underflow so the dictionary stays unchanged)

patricksurry commented 5 months ago

i'll try a small test with a word like . to see what the potential difference looks like

SamCoVT commented 5 months ago

I think I like having xt_xxx point to to the actual beginning of the word (including it's depth check), but I'd be happy with an additional label for each word that could be used for assembly programming, eg. you might say:

; ## DOT ( u -- ) "Print TOS"
; ## "."  auto  ANS core
        ; """https://forth-standard.org/standard/core/d"""

xt_dot:
                jsr underflow_1
asm_dot:
                jsr asm_dup                      ; ( n n )
                jsr asm_abs                      ; ( n u )
                jsr asm_zero                     ; ( n u 0 )
                jsr asm_less_number_sign         ; ( n u 0 )
                jsr asm_number_sign_s            ; ( n ud )
                jsr asm_rot                      ; ( ud n )
                jsr asm_sign                     ; ( ud )
                jsr asm_number_sign_greater      ; ( addr u )
                jsr asm_type
                jsr asm_space

z_dot:          rts

It doesn't have to be asm as the prefix - perhaps nc_xxx for no check. Words that don't have underflow checking will have xt_xxx and asm_xxx labels at the same place. This makes it a label/assembler issue and you can choose checked (xt_xxx) or unchecked (asm_xxxx) when writing your code for tass64. When you are in Tali, you can choose checking or not with strip-underflow[edited to fix name], although I suppose that will be defeated if you use nc-limit set to 0? I'd have to look at that code and see exactly what it does, but we could modify the compiler to compile a jsr to xt+3 if the word has underflow checking (there is a bit in the header that indicates this) and nc-limit prohibits native compiling.

This will break the disassembly labels unless we teach disasm about it. I don't mind disasm being slow, so it could search for (jsr_target - 3) if it doesn't find the xt on the first pass. I can't think of a situation where that would give an incorrect label off the top of my head, but that would be something to check for.

I think . is a good word to test out with - issue a PR if you have something you like and we can take a look at it and see what would be required to implement it.

SamCoVT commented 5 months ago

I think this means I like your option 2 the best. I don't like things that tie Tali too strongly to a particular assembler. As it is, moving to a different assembler would require handling some of the "weak" stuff that we've used recently but I don't think that's too painful as you can just delete the stuff you don't want (or delete the weak statement and use the default). I'm a bit afraid to use macros too heavily as they work differently on different assemblers and have different levels of capability on different assemblers. I also like that you can look at the code and see exactly where every byte comes from or goes. Adding macros will hide a lot of that and make the structure more obtuse for those that want to work with the internals.

If we do go the label route, I think it's important to have a uniform naming structure that will work for both words that do and don't have underflow checking so the programmer doesn't have to look them up.

SamCoVT commented 5 months ago

Looking at compile,, It already calculates the +3 address, but then throws it away if it decides to compile as a JSR because of nc-limit. That could be changed to use the +3 address in the JSR. It's a bit tricky as that code has two different codepaths that reach it with different things on the stack once you get there, but adding in a new branch target should make it work.

SamCoVT commented 5 months ago

If a new prefix is used for post-check labels (rather than doing it as a suffix), and all words have these labels added, then I think it should be safe to search/replace jsr xt_ with jsr asm_ (or whatever the new prefix is).

patricksurry commented 5 months ago

I did a little test with xt_dot. TL;DR is that skipping internal underflow checks when possible saves about 7% on cycle count. I'm test-driving the prefix w (for word) which seems to read nicely as an 'internal prefix' between xt_word and z_word. But we can always search and replace to whatever, I don't have a strong preference.

This led me to explore xt_number_sign a bit more, since the return stack operators are slow and can't avoid the internal underflow, which generated some unrelated savings:

7067 ticks from xt_dot+3 to z_dup in original code 6603 ticks from w_dot to z_dot (skipping internal underflow checks when possible; 7% faster) 5955 ticks after refactoring the inline ud/mod in# to avoid return stack operators (16% faster) 3515 ticks after short-circuiting one pass of um/mod in # when the hi word is 0 (50% faster)

So I think I'll carry on. A nice benefit of leaving the xt_ prefixes and just adding the new ones is that I can change incrementally and the assembler helps me find cases where I haven't added the new entrypoint yet.

Are there any standard benchmark words you run to assess overall performance? It would be nice to have a make cbench target to track progress.

Here's how dot currently:

xt_dot:
                jsr underflow_1
w_dot:
                jsr w_dup                      ; ( n n )
                jsr w_abs                      ; ( n u )
                jsr w_zero                     ; ( n u 0 )
                jsr w_less_number_sign         ; ( n u 0 )
                jsr w_number_sign_s            ; ( n ud )
                jsr w_rot                      ; ( ud n )
                jsr w_sign                     ; ( ud )
                jsr w_number_sign_greater      ; ( addr u )
                jsr w_type
                jsr w_space

z_dot:          rts

SamCoVT commented 5 months ago

That's a pretty significant savings in cycles. At first I did not fully comprehend how many stack depth checks might be performed in a single word. Most of my work in Tali has been writing in full assembly rather than jsr type programming, but Tali has a lot of words that use other words so this can lead to a substantial speedup. It's also nice that the xt_xxx label can still be used if you want the depth check. It's probably worth looking at the first word or two of words that don't do a stack depth check to see if they would benefit from leaving in the stack checking of the first word or two that they call (or add a stack depth check to them). Here's one I found after a quick search:

; ## BUFFER_COLON ( u "<name>" -- ; -- addr ) "Create an uninitialized buffer"
; ## "buffer:"  auto  ANS core ext
                ; """https://forth-standard.org/standard/core/BUFFERColon
                ; Create a buffer of size u that puts its address on the stack
                ; when its name is used.
                ; """
xt_buffer_colon:
                jsr xt_create
                jsr xt_allot
z_buffer_colon: rts

The xt_allot should stay xt_allot (or a stack depth of 1 should be checked)

I like w_ as the prefix, while keeping xt_ as the actual xt for the word.

I do think it's worth making strip-underflow work even when nc-limit is zero. The code that's there now ALMOST does it (it calculates the +3 starting point, but then throws it away and uses the original xt). The disassembler should also be updated to know that words can start at +3 as well.

I can use some editor macros to automate most of this work, so if you like how it looks with the words you are testing I'm happy to change it all over on a weekend if you don't want to do it manually. We could also divvy up the work - I'd be happy to change all of the words over if you want to work on COMPILE, and disasm or vice versa.

We currently don't have any benchmarking except for the cycle count tests. When the tests were put into place, we were still working on getting Tali to be fully functional and ANS-2012 compliant - and less concerned with optimizing it (many words were still in Forth, which is why we have both forth_words.fs and user_words.fs - I think that could be simplified down to just one now). You are welcome to add cycle tests for words that don't currently have them, if that's adequate for your benchmarking needs. You could make a Makefile target cbench that just runs those tests, and you now have the total tick count at the bottom (when running bye) that would be a good "overall" indicator.

patricksurry commented 5 months ago

The xt_allot should stay xt_allot (or a stack depth of 1 should be checked)

I'd vote for all internal calls to words to go via the w_ entrypoint, and add underflow checking at the top level for any words where it makes sense. Otherwise any new native implementation that calls the word won't be able to avoid the lower level underflow. It also makes it easier to reason about the code rather than have to think carefully about which version of each word to call.

There are a couple others noted in the headers: erase and blank say underflow is checked by fill, but I think it's better if they have explicit underflow and then call w_fill internally. That way any native word built on top of fill would also avoid the lower level underflow check.

I don't mind having a first pass on the search and replace from jsr xt to jsr w and then adding enough w_ entrypoints to make it build. Then we can review any words that need underflow added, and work on your disasm and strip-underflow points.

I was also going to try to shrink the header from 10 to 8 bytes as mentioned. I'll put up a work in progress PR for you to look at

SamCoVT commented 5 months ago

I'd vote for all internal calls to words to go via the w_ entrypoint, and add underflow checking at the top level for any words where it makes sense. That makes sense to me.

I'll work on COMPILE, (you can still add a w_compile_comma: label to it if you want as I will be editing down in the word itself). It should work using just the xt and the UF flag, regardless of whether there is a w_ label or not. I can also work on disasm so it will report the name even with a jsr xt_whatever+3 (I only have to edit the disasm_jsr: handler). I'll put it up as a separate PR as well so you can see what is going on there, but I think it shouldn't be too much trouble to merge both of them once we are happy.

SamCoVT commented 5 months ago

There are a couple others noted in the headers: erase and blank say underflow is checked by fill, but I think it's better if they have explicit underflow and then call w_fill internally. That way any native word built on top of fill would also avoid the lower level underflow check.

I agree with this. It might be worth a quick look through all the words that DON'T have the UF flag in their headers.

patricksurry commented 5 months ago

yup, i did at least a cursory pass and added a handful of new checks in the PR

leepivonka commented 5 months ago

The code used to have labels like xt_dup_nouf after the underflow check.

Another option for never-native words is to include cpx & bcs instead of the jsr underflow_n . That reduces the overhead from 16 to 4 cycles, but slightly increases the code size & would require a different strip algorithm.

SamCoVT / TaliForth2

Avoid underflow checks in internal calls to native words #93