SamCoVT / TaliForth2

A Subroutine Threaded Code (STC) ANSI-like Forth for the 65c02

Other

28 stars 4 forks source link

Some ideas on shrinking the dictionary headers #101

Open patricksurry opened 2 months ago

patricksurry commented 2 months ago

A few ideas for discussion.

Current state:

Each dictionary entry is 8 bytes plus the name (see below). The name is in lowercase but can include any digits or ascii symbols (presumably ok to restrict no ctrl chr or 8bit?). That's an alphabet of 68 chars (0x32-0x7f is 96 chars minus 26 uppercase, space, and del). The current core dictionary has 277 words with a combined length of 1436 characters, so about 277*8 + 1436 = 3,652 bytes. The length distribution looks like {1: 25, 2: 41, 3: 32, 4: 39, 5: 38, 6: 32, 7: 15, 8: 18, 9: 10, 10: 3, 11: 5, 12: 6, 13: 3, 14: 2, 15: 4, 17: 1, 18: 2, 19: 1}.

;               LSB       MSB
; nt_word ->  +--------+--------+
;          +0 | Length | Status |
;             +--------+--------+
;          +2 | Next Header     | -> nt_next_word
;             +-----------------+
;          +4 | Start of Code   | -> xt_word
;             +-----------------+
;          +6 | End of Code     | -> z_word
;             +--------+--------+
;          +8 | Name   |        |
;             +--------+--------+
;             |        |        |
;             +--------+--------+
;             |        |  ...   | (name string does not end with a zero)
;          +n +--------+--------+

Observations

Most dictionary entries are within 255 bytes of the prior one. Currently in ROM each header points to the following one (at a higher address), whereas in RAM they grow the other way (dictionary entries point to the previous one at a lower address). The order of the ROM ones could easily be reversed so all short entries would point to a preceding one at a lower address. We could use a free status bit to flag a "far" entry which stores a two byte pointer to the next entry, vs a "short" entry which only stores the LSB of the next entry (the MSB is either the same or increments by one if the new LSB is less than the current one). This would save one byte per word, and no extra computation when searching.

Similarly most start-of-code pointers will be within 255 of the previous one. We could have the status bit do double duty - either both entries are a single byte (LSB of next entry), or both are stored as a full bytes. But you'd need to remember the context of the MSB associated with the current NT, or it would be tricky to reconstruct the full XT. If workable it'd save another byte per word.

I think the end-of-code pointer is only used for native compile and SEE. All of the core words (and most user words?) are less than 255 bytes. We could use a single byte length with 255 flagging length of 255 or more so SEE would only disassemble the first 255 bytes, and native compile would automatically be forbidden. This would save another byte per word but would need some special cases in SEE and COMPILE,.

The name itself could be stored hi-bit terminated instead of adding the extra length byte. But currently the dictionary search checks for length match and then (partial) first character match as quick "don't match" tests. That would end up a lot slower, and would need a specialized print routine to convert back to a normal string. The 68 character set doesn't seem to have much scope for squashing, e.g. 4x6bit chars in 3 bytes won't work, and is complex to decode anyway. This is almost half the total storage, so maybe there are some useful tricks, but presumably most will make it harder to actually compare and extract the name strings.

During dictionary search the length check is not that effective as a hash: on average there are about 16 hits on a random word from the core set, and up to 32+ on more common short words. The subsequent first letter check has about 6 hits on a random word (obviously less if we already matched the length). A much better single byte hash is length ^ (initial <o< 3) where the second part of the xor is the first byte of the name circular shifted 3 bits left (%abcd efgh <o< 3 == %defg habc). This hits less than twice on a random word from the core dictionary. In fact a full one byte hash of the full string along with the length is only slightly better.

So a compromise might be to leave the name string as is, but replace the raw length byte with length ^ (initial <o< 3) - that would make the dictionary search more efficient, and it's easy to re-extract the length when needed for full string compare and disasm etc by just calculating hash ^ (initial <o< 3) == length.

patricksurry commented 2 months ago

on second thought I don't think i'd mess with length/name string in the header. doesn't seem worth the added complexity.

a much better way to improve find-name performance would be a small cache of recently found words. e.g. 48 bytes could store 16 single byte hash keys as above with the 16 corresponding two-byte nt addresses. then a fast loop could check if we've recently seen the current string, with a random replacement on cache miss. would need some care to invalidate the cache on word redefinition, marker or wordlist changes. could be a fun experiment.

SamCoVT commented 2 months ago

I agree that any messing with the headers should not make it too compled to understand. I like the fact that I can just dump a section of memory and look at raw headers. I'd be sad if the names were obfuscated in a raw memory dump.

I'm somewhat less worried about compilation speed than I am running speed for words. I think even a simple hashing/caching scheme is going to get very complicated very fast, especially when you consider wordlists (which allow words of the same name to exist simultaneously in other wordlists) and you can change the search order at any time, which would change which version of the word was actually being used at that moment. You can even have your CURRENT wordlist (the one you are compiling into) not be in your search order, for extra messiness.

patricksurry commented 2 months ago

Similarly most start-of-code pointers will be within 255 of the previous one

A more useful observation might be that in the RAM user dictionary most xt_ pointers are within 255 bytes of their nt header, just like the next header is usually close. If we declared headers for native words near the words themselves (rather than listing them all together in headers.asm) that would be true in ROM too. Maybe there is a downside to that? I suppose you might also want to put more commonly used words higher in ROM so their dictionary entries were found faster - I think ROM headers are ordered like this now?

(sorry for this stream of consciousness, this is the kind of thing I think about walking the dog 🐕)

SamCoVT commented 2 months ago

I don't think there is an issue with intermingling headers and words, except for words that "fall into" the next word (there are several). The fact that Tali allows separate headers allows the headers to be anywhere. Are you proposing potentially re-attaching the headers to the "body" of the word? That would mean the pointer in the header to the XT could be eliminated as it would be right after the name and you could calculate the XT from the NT (nt + word_length + offset for all the bytes in the header you keep).

I think Scot (the original author) had listed separate headers for words as a feature of Tali, so I'd need to give it some serious thought. Scot was interested in fast much more than small. Things like locating the NT from the XT would work a little differently, but should be easy to change. I'd have to revisit DOES> (it's a bit of a brain twister) to see how that might work. Deferred words are also a bit funny (I think they currently swap out the XT in the header)

In stream of consciousness fashion, I also think it would be more difficult than it presently is to give snippets of pre-existing ROM code (like OS routines that might already be in ROM) a name if the NT has to be just before the XT. You can currently do that in Tali with the detached headers by just making a header for it.

One other "feature" of separate headers, that Tali does not currently use, is the ability to remove the headers but keep the words. The dictionary entries are there during compiling, but only the "application" word is kept in the main dictionary and all of the headers for the "helper" words can be forgotten. Using this feature requires arranging for the headers to be compiled in a different location, but that's relatively easy to do in CREATE and Tali has been modified this way in the past by users (for putting the headers in a different bank on a banked RAM system).

patricksurry commented 2 months ago

I wasn't thinking of intermingling as fundamentally changing the header structure, nor requiring it to be adjacent to the implementation. I was more thinking about matching the pattern of the dynamic dictionary which tends to grow like header_a ... body_a ... header_b ... body_b .... With that pattern, in most cases header_b could use 8-bit offsets both to point ahead to body_b and back to header_a. Currently the core dictionary uses a different pattern, where we have headers grouped together and ordered so that the "previous" dictionary entry is actually at a higher memory address.

If we intermingled and instead declared each word's header within 256 bytes of the start of the word definition we could use the same "relative header" with 8 bit offsets instead of 16 bit absolute addresses. Depending on how much you care about the search order in the core dictionary, it might also suggest breaking up the alphabetical sorting of word implementations. But that's probably only really important for the very frequently used words which should be near the top of the dictionary search?

leepivonka commented 1 month ago

The name length byte in the header could be split to have some hash bits & some name length bits. This would speed up wordlist searching & not change the header size.

A wordlist could be split into multiple sub-wordlists, selected by a hash of the name. Wordlist searching would be faster but more complex. Word header size remains the same, but wordlist info & code grows.

https://github.com/leepivonka/TaliForth2 is a worked example of:

Code appended to headers & "start of code" pointer removed from header.
"End of code" pointer in header changed to code length & reduced to 1 byte.
Name length byte in header split into 3 bits of hash & 5 bits of name length to speed up wordlist searching.

patricksurry commented 1 month ago

Putting aside the question of faster compilation for now (find-word), here's a scheme that saves 4 bytes on most words.

First imagine a world where:

most words have names less than 128 characters
most word implementations are less than 256 bytes of code
most headers are within 256 bytes after the header they link back to in the dictionary
most word implementations are within 256 bytes following the header
most words have no special status flags set (more below)

Then we could use a 4 byte compact header like this:

nt_word: ; 4 + n bytes offset 0: 7 bit name length with bit 7=0 indicating compact header offset 1: one byte offset back to previous header offset 2: one byte offset forward to word body offset 3: one byte length of word body offset 4..n+3: name string (not zero terminated)

If any of the conditions above are false, we would use a 4 + 6 byte extended header like this:

nt_word: ; 4 + n bytes offset 0: 128 + up to 7 flag bits ; bit 7=1 indicates extended header, bit 0-6 are flags offset 1: 8 bit name length ; if we mandate name length < 128 chars we could keep offset 0 = 128 + 7bit name length, offset 1 as 8 flag bits offset 2-3: pointer to 6 byte extended header offset 4..n+3: name string (not zero terminated) ... nt_ext: ; 6 bytes not necessarily contiguous with nt_word offset 0: address of previous header offset 2: address of word body offset 4: length of word body

This uses 2 extra bytes for extended words with the extra pointer so that we can still write the name string in a fixed location and defer the decision on compact vs extended until after we've written the word body (more below).

Of the conditions, 1 (name < 128) is almost always true and could even be mandated; 2 (body < 256) is almost always true already; 3 and 4 (prev next, word body within 256 bytes) are almost always true already for user words produced by create/compile and could be true for native words if we rearranged the code a bit and interspersed headers near implementations. So 5 (no status flags) is the only tricky one.

At the moment, most flags are rare except UF, NN (all user words) and HC (which I think accidentally is on all user words).

UF is only used by compile, and could be removed if we instead inferred by checking the first three bytes of the word body for $20 followed by an address within 16 bytes of underflow_1. This would remove the need to manually flag each word's underflow status, and we could let users add underflow checking to new words subject to normal stripping rules. (e.g. 2 : plus check-underflow + ; could generate an initial jsr underflow_2 which would get handled automatically.)

NN is currently on all user words for safety, in case they include an internal jmp instruction which is not relocatable. There's an unused bit in the compile status byte which could be used to track a NN flag during compilation. We'd start with NN=0 and any native word that inserts non-relocatable runtime like zbranch_runtime would set NN=1. Then ; would turn the compact header into an extended one on completion, writing the extended block past the end of the word body. Users doing crazy things with native assembly would have to set the flag explicitly with never-native, which would again convert the compact header to an extended one if needed.

I think both the proposed UF and NN changes could be done as independent PRs to explore the theory we committed to any changes to the current header structure.

SamCoVT commented 1 month ago

Hi @leepivonka and thanks for the example of re-attached headers to look at. Reattaching the headers is possibly something to look into, but it reduces the flexibility to do things like put headers in a different page (on paged memory systems), strip the headers, or metacompile. It does eliminate the need for the xt to be saved at all, as it can always be computed from the nt.

I am concerned about extra complexity, as I'd like Tali to be as simple as possible for a beginner/intermediate 6502 programmer to look at the internals and understand what is going on. I'm not sure if I like the split header idea, as it seems like too much complexity to save a few bytes.

What about using a bit to say whether the xt immediately follows the header as a comprimise? This still allows detached headers while also saving 2 bytes on words with attached headers (xt can be removed and calculated as an offset from nt)?

I am interested in trying to have Tali track if NN (Never Native) is actually needed. Most of the complexity will be in the compiling words, and someone looking at the generated code should see not any big surprises. We also already have never-native to allow someone to flag a word if it's doing something "dangerous" or they always want it compiled as a JSR.

I'm also interested in the proposed change to UF - I would argue this might be a reduction in complexity (puts it all into the compiling word and words can't accidentally be flagged) and reclaims a bit in the status byte.

patricksurry commented 1 month ago

107 shows the feasibility of removing UF flag and modifying the test in SEE and COMPILE,

I'm fooling around with tracking whether NN is needed on user words. Most (all?) of the JMPs get compiled by some cmpl_jump... routines in compile.asm so those could flip an unused status bit which is picked up by ; to update the header. First I'll probably tidy xt_create and friends a bit to fix the HC bug and avoid re-writing/deleting the default jsr dovar

patricksurry commented 1 month ago

What about using a bit to say whether the xt immediately follows the header as a comprimise? This still allows detached headers while also saving 2 bytes on words with attached headers (xt can be removed and calculated as an offset from nt)?

I like that idea, saving two bytes on all user words built via create (and native ones if we decide to reorganize headers).

I'd also store code length (z_xxx - xt_xxx) as a single byte instead of z_xxx. Words > 256 could be flagged NN and marked with length 255 tho this case seems v rare. SEE could show a size of '255+' and disassemble the first 255 bytes.

Another thought would be to use a spare bit to indicate a one-byte vs two-byte offset to previous header. Most words could get away with a single byte there. Those changes would still get you to 4 bytes saving on most words

Tangentially - what do you think about documenting and enforcing a max name length of 127 (or 31 or 63 even)? It looks like currently the limit is implicitly 255-8 = 247 before some of the one-byte arithmetic in create would fail. Very long names seem overkill in a 6502 forth: the longest current name is 19 chars for block-ramdrive-init. The extra bit(s) might be used for status or as some hash bits like @leepivonka suggests which would improve list search a bit.

patricksurry commented 1 month ago

Here's what I'm currently considering based on the previous discussions. It assumes that both HC and ST flags are removed though there are still two bits free so that doesn't really matter.

I think it'd also be good to restrict (and enforce) max name length of 31 (or 63) which leaves another 2-3 length bits free. Those could just be zero'd or we could later use them to improve find-name a little.

;   prev NT --> +---------------+
;               :               :
;               :               :
;
;                     ...
;
;  NT start --> +---------------+
;               |  Name length  |   Current max length is 255 - header length 
;               |   (1 byte)    |   (longer may cause unpredictable results)
;               +---------------+
;               | Status flags  |   See description of flag bits below
;               |   (1 byte)    |
;               +---------------+
;               |  Previous NT  |   FP=0: prev NT = NT start - one byte offset
;               |  (1|2 bytes)  |   FP=1: prev NT = two byte pointer
;               +---------------+
;               | Start of code |   DB=0: XT start = NT end
;               |  (0|2 bytes)  |   DB=1: XT start = two byte pointer
;               +---------------+
;               |  Code length  |   Code length for native compile, max 255
;               |   (1 byte)    |   Longer code flagged as NN with size=255
;               +---------------+
;               |  Name string  |   Name is 7-bit lower case ascii
;               :               :
;               :               :   Note string is not zero-terminated
;    NT end --> +---------------+
;                     ...
;  XT start --> +---------------+
;               |  Word body    |   When BD=0 the code body adjoins the header
;               : (65c02 code)  :    
;               :               :
;    XT end --> +---------------+
;
;
;  The total header length is between 4 and 7 bytes plus the the word's name.     
;  Flag bits 0 and 1, FP and DB, can be used to calculate the variable part
;  of the header length directly by adding 4 to `flags & %00000011`.
;  For example a header with both FP and DB equal to zero has length 4,
;  whereas a header with FP=0 and DB=1 has length 6.
;
; Header flag bits
;
;  bit    7    6    5    4    3    2    1    0
;       +----+----+----+----+----+----+----+----+
;       | 0  | 0  | AN | NN | IM | CO | DB | FP |
;       +----+----+----+----+----+----+----+----+
;                   
;       FP - Far previous NT (two byte pointer rather than one byte offset)
;       DB - Disjoint body (two byte pointer rather than adjoining body code)
;       CO - Compile Only
;       IM - Immediate Word
;       NN, AN - Never/always native flags using this table:
;
;            NN  AN
;           +---+---+
;           | 0 | 0 |  -- : Normal word called by JSR (non-native) or inlined (native)
;           | 1 | 0 |  NN : Word can only be called by JSR (never native)
;           | 0 | 1 |  AN : Word can only be inlined (always native)
;           | 1 | 1 |  ST : Normal word with return stack juggling that
;           +---+---+       must be removed when inlining (R>, R@, >R etc)

SamCoVT commented 1 month ago

I'm fine with 31 for the name length limit - in a block, that would be half of a line for just a single name.

You'll need to update never-native and always-native to turn off the other flag, and right now it is possible to set them both (I believe NN wins in that scenario). I'm also OK with leaving the existing bits as they are and using one of the unused bits.

I'll have to look up how deferred words work - those might need to be declared with a disjoint body if the XT is replaced in the header... nope, they have a jmp to DODEFER in the body and the address is replaced there, so those could have an attached header.

The one downside I see to this is that we lose the ability to native compile words greater than 255 bytes. If Tali is going to be used as a metacompiler, that feature could be important as it might be used to create a full binary in a known RAM area that could then be saved by saving only that RAM area (eg. to run on another machine that DOESN'T have Tali installed). I'm not aware of anyone metacompiling with Tali at the moment, but it is something I'm mildly interested in. I think there are some workarounds if we do end up with a 254 byte native compile limit.

patricksurry commented 1 month ago

The one downside I see to this is that we lose the ability to native compile words greater than 255 bytes.

We could use one of the spare bits to flag 1 vs 2 byte code length and maintain that flexibility (LB = large/long body?). Shouldn't be much extra complexity.

I'll mess around with this over the next couple of weeks. I'm sure it'll take a while to figure out where all the hardwired header offset assumptions are and make them more flexible.

I'll probably do a separate small change to enforce a shorter name length and test that.

SamCoVT commented 1 month ago

That seems like a reasonable compromise. I agree with you that the majority of words will be <=255 bytes

SamCoVT commented 3 weeks ago

One note I just realized while looking at your separate pull request using high-bit terminated strings - that's fine for the constant strings in Tali, but the words in the dictionary should allow high-ASCII characters. While the use of them is implementation-dependent according to the standard, it allows use of some non-english languages in the Forth code. Scot, the original author, speaks German and might use a name like Übersquirrel on his system, as that's the name of his single board computer.

That currently works on Tali (except for pasting into a Windows terminal, which is determined to change that first letter to unicode any time I copy/paste - using ALT+220 on numpad works fine, though).