Closed RigTig closed 7 years ago
Sophisticated optimization of CALL
to CALLR
only yields sufficient savings when one has the means to shoving code around (which can be a non-trivial optimization task). I'm sure that there are many cases when back-references have to be resolved, and compiled code would have to be moved by (-1) bytes to "harvest" a saving. In Forth code it's also difficult to take advantage of positive address offsets (which is frequently the case in hand optimized code).
I think that it would be better to travel the "internal metacompilation to ITC" road, i.e. put an inner interpreter for "bytecode" on top of STC.
ITC is effectively a list of addresses rather than a series of calls. ITC may be slightly slower to execute than a subroutine-based forth, but can be significantly smaller. It seems to me that replacing the subroutine-based mechanism with ITC is non-trivial. I am not sure how I can help, but I'll certainly do what I can. Please can you expand somewhat on the meaning of "put an inner interpreter for "bytecode" on top of STC"? In particular, the concept of "on top of".
Maybe I am just going back to Bill Muench's idea of a minimal core and do the rest in Forth. If so, then perhaps a better approach might be to start over with a simple ITC core, and just implement on STM8. If so, then perhaps better to be another repository, probably pinching whatever is useful from STM8EF. Just some thoughts and ramblings. STM8EF has been good for me to get going again and I do not want to throw out the baby with the bathwater!!
Back to STM8EF and COMPILE, I've been trying to figure out exactly what the CALLR addressing actually means. The STM8 programming manual is not as clear as I'd like, at least the bits that I have read. Am I correct that the range of relative addressing is + or - 127 bytes, or equivalent to the following Forth code snippet assuming the input is the address of the CALLR opcode?
1+ C@ DUP $80 < NOT IF $FF00 + THEN +
Note I used '$FF00 +' rather than the preferred '$FF00 OR', simply because + is still in core, but OR is not (and is after this code). Effectively, the second byte of CALLR has its high bit extended to full cell length, and is then added to the effective address (actually being the byte address of next instruction following the CALLR). Thanks for advice and confirmation.
Hi RigTig, you're right about the interpretation of the CALLR
/JRxx
offset. The problem with posiand tive offsets is that we don't know whether the address offset is within a range of +127 bytes (thus the problem with moving code).
You say that "ITC is effectively a list of addresses rather than a series of calls" but that's the case with DTC. ITC is a list of values that translate to code addresses, e.g. by table lookup. What I mean to say is that using 8 bit tokens (index values, byte code in UCSD p-Code or ) it would be possible to pack basic Forth code rather dense.
EXIT
wordSuch an interpreter is very similar to the inner interpreter for a DTC Forth, e.g. the one of the original eForth (page 10). It would be "on top of STC" because both systems can coexist. The inner interpreter would be just something akin to doVAR
. Of course, it would be possible to rewrite the whole STM8EF system in 8 bit ITC but such a system would face problems with the following features:
My proposal is to make it simple: the inner interpreter should do the bare minimum:
exit
: return from interpretingdoLitC
: push chardoLit
: push wordbranch
: relative jump in current bytecode?branch
: conditional relative jump in current bytecodeNo index loops, just relative addressing (100 Forth words between IF
and THEN
should be sufficient), and hand coding (well only the structure words, e.g. IF..THEN
, BEGIN..WHILE..REPEAT
, the other can be "compiled" easy enough). If longer "linear" code segments are to be expected it might work without branch
and ?branch
.
The discussion has moved to issue #27
Here is the .asm for +LOOP:
So, I coded the equivalent in Forth to get:
Now, (+loop) is just been compiled, so the reference is close enough to be optimised to CALLR (+loop). That breaks the operation of the newly compiled +LOOP, since COMPILE is looking for the 3-byte CALL rather than the 2-byte CALLR. Note that the .asm documentation for COMPILE says
However, COMPILE is really supposed to treat the next word in the definition as the item to be COMPILEd, rather than the implementation-dependent jsr.
This issue raises a question about optimisation: does the savings in space from optimisation save enough to cover the extra coding for COMPILE to be made to work correctly?