COMPILE doesn't handle optimisation from CALL to CALLR

RigTig commented 7 years ago

Here is the .asm for +LOOP:

;       +LOOP   ( a +n -- )
;       Terminate a DO - +LOOP loop.

        .dw     LINK

        LINK =  .
        .db     (IMEDD+5)
        .ascii  "+LOOP"
PLOOP:
        CALL    COMPI
        CALL    DOPLOOP
        CALL    HERE
        CALL    OVER            ; use mark from DO/FOR, apply negative offset
        DoLitC  14
        CALL    SUBB
        CALL    STORE           ; patch DO runtime code for LEAVE
        JP      COMMA

So, I coded the equivalent in Forth to get:

: +LOOP   ( a +n -- )
  \ Terminate a DO - +LOOP loop.
  [COMPILE] COMPILE (+loop) 
  HERE OVER \ use mark from DO/FOR, apply negative offset
  $0E - ! \ patch DO runtime code for LEAVE
  , ;
  IMMEDIATE

Now, (+loop) is just been compiled, so the reference is close enough to be optimised to CALLR (+loop). That breaks the operation of the newly compiled +LOOP, since COMPILE is looking for the 3-byte CALL rather than the 2-byte CALLR. Note that the .asm documentation for COMPILE says

Compile next jsr in colon list to code dictionary.

However, COMPILE is really supposed to treat the next word in the definition as the item to be COMPILEd, rather than the implementation-dependent jsr.

This issue raises a question about optimisation: does the savings in space from optimisation save enough to cover the extra coding for COMPILE to be made to work correctly?

TG9541 commented 7 years ago

Sophisticated optimization of CALL to CALLR only yields sufficient savings when one has the means to shoving code around (which can be a non-trivial optimization task). I'm sure that there are many cases when back-references have to be resolved, and compiled code would have to be moved by (-1) bytes to "harvest" a saving. In Forth code it's also difficult to take advantage of positive address offsets (which is frequently the case in hand optimized code).

I think that it would be better to travel the "internal metacompilation to ITC" road, i.e. put an inner interpreter for "bytecode" on top of STC.

RigTig commented 7 years ago

ITC is effectively a list of addresses rather than a series of calls. ITC may be slightly slower to execute than a subroutine-based forth, but can be significantly smaller. It seems to me that replacing the subroutine-based mechanism with ITC is non-trivial. I am not sure how I can help, but I'll certainly do what I can. Please can you expand somewhat on the meaning of "put an inner interpreter for "bytecode" on top of STC"? In particular, the concept of "on top of".

RigTig commented 7 years ago

Maybe I am just going back to Bill Muench's idea of a minimal core and do the rest in Forth. If so, then perhaps a better approach might be to start over with a simple ITC core, and just implement on STM8. If so, then perhaps better to be another repository, probably pinching whatever is useful from STM8EF. Just some thoughts and ramblings. STM8EF has been good for me to get going again and I do not want to throw out the baby with the bathwater!!

RigTig commented 7 years ago

Back to STM8EF and COMPILE, I've been trying to figure out exactly what the CALLR addressing actually means. The STM8 programming manual is not as clear as I'd like, at least the bits that I have read. Am I correct that the range of relative addressing is + or - 127 bytes, or equivalent to the following Forth code snippet assuming the input is the address of the CALLR opcode? 1+ C@ DUP $80 < NOT IF $FF00 + THEN + Note I used '$FF00 +' rather than the preferred '$FF00 OR', simply because + is still in core, but OR is not (and is after this code). Effectively, the second byte of CALLR has its high bit extended to full cell length, and is then added to the effective address (actually being the byte address of next instruction following the CALLR). Thanks for advice and confirmation.

TG9541 commented 7 years ago

Hi RigTig, you're right about the interpretation of the CALLR/JRxx offset. The problem with posiand tive offsets is that we don't know whether the address offset is within a range of +127 bytes (thus the problem with moving code). You say that "ITC is effectively a list of addresses rather than a series of calls" but that's the case with DTC. ITC is a list of values that translate to code addresses, e.g. by table lookup. What I mean to say is that using 8 bit tokens (index values, byte code in UCSD p-Code or ) it would be possible to pack basic Forth code rather dense.

The code field contains a call to an interpreter routine (the inner interpreter) that takes 8 bit p-code/bytecode/token,
the inner interpreter essentially makes a table lookup to "translate" the token (the same table proposed for gaining access to code field addresses of unlinked core Forth words),
it then executes the indicated Forth word
It returns after it finds the EXIT word

Such an interpreter is very similar to the inner interpreter for a DTC Forth, e.g. the one of the original eForth (page 10). It would be "on top of STC" because both systems can coexist. The inner interpreter would be just something akin to doVAR. Of course, it would be possible to rewrite the whole STM8EF system in 8 bit ITC but such a system would face problems with the following features:

number of words (the limit is 255)
mix-and-merge with native code
background tasks, and interrupt code in Forth (the inner interpreter needs to be re-entrant)

My proposal is to make it simple: the inner interpreter should do the bare minimum:

fetch code, interpret
- exit: return from interpreting
- doLitC: push char
- doLit: push word
- branch: relative jump in current bytecode
- ?branch: conditional relative jump in current bytecode
- everything else: lookup, execute

No index loops, just relative addressing (100 Forth words between IF and THEN should be sufficient), and hand coding (well only the structure words, e.g. IF..THEN, BEGIN..WHILE..REPEAT, the other can be "compiled" easy enough). If longer "linear" code segments are to be expected it might work without branch and ?branch.

TG9541 commented 7 years ago

The discussion has moved to issue #27

TG9541 commented 7 years ago

Closed with commit 065744e (see issue #27 ). It sure took some time to get this right. See comment.

TG9541 / stm8ef

COMPILE doesn't handle optimisation from CALL to CALLR #28