Closed rtrussell closed 1 year ago
In order to minimise stack usage and leave the maximum amount of memory free for the BASIC program, my default builds do not use bbeval.c (or bbexec.c). Instead they use re-coded versions bbeval2.c and bbexec2.c in which some large case statements have been split into multiple subroutines. To use the original code, the program would have to be re-compiled using the option MIN_STACK=N.
As you observed, the routine math_divide was reporting a "Number too big" error rather than "Division by zero". I have now committed a change to test for a zero divide and issue the correct error message in this case.
I will point out that the code in bbeval.c will incorrectly report "Division by zero" for PRINT 1E200 / 1E-200, whereas the revised code in bbeval2.c still reports "Number too big" for this case.
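For readers following along, roughly the shape of such a test (a sketch only, not the committed PicoBB code; the function and variable names are illustrative, while the error() call style and the error numbers follow the usual BBC BASIC conventions seen elsewhere in this thread):
#include <math.h>
#include <stddef.h>

extern void error (int, const char *) ;      // interpreter error routine, as used elsewhere in this thread

// Sketch only: report 'Division by zero' (error 18) when the divisor really is
// zero, and 'Number too big' (error 20) when the quotient merely overflows.
static double checked_divide (double x, double y)
{
    if (y == 0.0)
        error (18, NULL) ;                   // 'Division by zero'
    double q = x / y ;
    if (!isfinite (q))
        error (20, NULL) ;                   // 'Number too big'
    return q ;
}
With this ordering, 1E200 / 1E-200 overflows without the divisor being zero, so it reports "Number too big" rather than "Division by zero".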
I have not yet posted updated UF2 files.
my default builds do not use bbeval.c (or bbexec.c). Instead they use re-coded versions bbeval2.c and bbexec2.c
I see. My understanding was that you were using BBCSDL as a submodule, so I could be confident that any changes I made to my repository (such as the recent revamping of the way errno is used, to make it possible for BASIC programs to read that value) would be automatically incorporated in the Pico build. Have you been careful to track changes I have made and incorporate them in your versions?
My users are entitled to expect that all the Console Mode editions of BBC BASIC available from my web site are built from the same source code. Indeed I have unwittingly stated that explicitly in release notes, believing it to be the case. So can you tell me exactly what I need to change in my build process to ensure that I am, indeed, using the original source files? I'm sure any increased stack usage won't matter in the vast majority of BASIC programs.
I will point out that the code in bbeval.c will incorrectly report "Division by zero" for PRINT 1E200 / 1E-200, whereas the revised code in bbeval2.c still reports "Number too big" for this case.
I don't consider that an "incorrect" report. There are no practical circumstances when it matters whether the divisor is precisely zero or not, all that matters is that the quotient overflows the number range of the output. Indeed in versions of BBC BASIC which use 40-bit floats 1E-200 is precisely zero, but in versions which use 64-bit floats it isn't. Similarly 1E-400 is precisely zero in versions which use 64-bit floats, but isn't when 80-bit floats are used.
my default builds do not use bbeval.c (or bbexec.c). Instead they use re-coded versions bbeval2.c and bbexec2.c
So can you tell me exactly what I need to change in my build process to ensure that I am, indeed, using the original source files? I'm sure any increased stack usage won't matter in the vast majority of BASIC programs.
To build using your versions of bbeval.c and bbexec.c, use the following command:
make MIN_STACK=N
For confirmation the CMake output should show
-- BBC BASIC Upstream source from ../../BBCSDL
-- Using upstream expression evaluation code -- CONFIRMATION
-- Using memory protection to detect stack overrun
-- Pico W GPIO (LED) only support
-- Generate high quality sound using PWM on core 1
-- Including LFS filesystem for flash storage
-- BBC Basic console I/O will be on USB or UART
-- Link with 2KB multicore stack
-- Configuring done
-- Generating done
I see. My understanding was that you were using BBCSDL as a submodule, so I could be confident that any changes I made to my repository (such as the recent revamping of the way errno is used, to make it possible for BASIC programs to read that value) would be automatically incorporated in the Pico build. Have you been careful to track changes I have made and incorporate them in your versions?
Yes I have been tracking your changes.
My users are entitled to expect that all the Console Mode editions of BBC BASIC available from my web site are built from the same source code. Indeed I have unwittingly stated that explicitly in release notes, believing it to be the case.
The Pico code (even the console builds) is clearly not the same as BBCSDL:
I will point out that the code in bbeval.c will incorrectly report "Division by zero" for PRINT 1E200 / 1E-200, whereas the revised code in bbeval2.c still reports "Number too big" for this case.
I don't consider that an "incorrect" report. There are no practical circumstances when it matters whether the divisor is precisely zero or not, all that matters is that the quotient overflows the number range of the output.
If the output overflows the number range then surely the correct error is "Number too big". And infinity could be considered to be too big.
Indeed in versions of BBC BASIC which use 40-bit floats 1E-200 is precisely zero, but in versions which use 64-bit floats it isn't. Similarly 1E-400 is precisely zero in versions which use 64-bit floats, but isn't when 80-bit floats are used.
Which illustrates that not all versions of BBC BASIC behave identically.
Whether 1E-200 can be considered to be effectively zero depends upon the application, which you do not know. With 64-bit floats there are more than 10^17 representable values between zero and 1E-200, spanning more than 100 orders of magnitude.
The Pico code (even the console builds) is clearly not the same as BBCSDL:
I hope you are not being deliberately perverse just to be irritating! :-(
When I say that I expect different editions to use the same source code I am of course referring to the generic source code of the BBC BASIC interpreter. If you look at the README of my BBCSDL repository you will find an architecture diagram which shows the generic modules and those that are specific to each platform and type of build.
I would expect all versions (at least, those which call themselves 'BBC BASIC') to use the files shown in green, and all the Console versions to use the files shown in the red box. The Pico edition is specifically catered for by the inclusion of the file bbpico.c in grey. Of course there may well be a few instances of #ifdef PICO directives in what are otherwise generic modules, but that's no different in principle from the #ifdef __WINDOWS__ or #ifdef __aarch64__ etc. directives that are inevitable.
The issue of stack growth is a tricky one, not least because (as I understand it) it arises because of poor compiler design, in lumping the stack requirements of all the clauses in a switch statement into one large stack frame reserved at the start of the function rather than individually for each clause. One of the things always emphasised in C coding guidelines is never to try to 'outwit' the compiler, so for example always use the native switch statement rather than a jump table.
Compilers have got better in their stack handling, for example a function with an 'early out' may well now not reserve its full stack frame until after that return which is an obvious efficiency benefit that any assembly-language programmer would automatically take advantage of. If the claim that modern compilers can outperform human assembly language programmers is to be justified, they will also have to allocate stack more intelligently in functions containing large switch statements.
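To make the point concrete, here is a contrived stand-alone example (not interpreter code). Each clause has a sizeable buffer whose address escapes, so it has to live on the stack; whether the compiler shares one slot between the two clauses or reserves space for both in the frame set up at function entry is exactly the behaviour at issue, and it varies with compiler and options:
#include <stddef.h>
#include <string.h>

static void fill (char *buf, size_t len)     // hypothetical helper, forces the buffers onto the stack
{
    memset (buf, 'x', len) ;
}

int big_switch (int op)
{
    switch (op)
    {
        case 0:
        {
            char buf_a[128] ;                // needed only by this clause
            fill (buf_a, sizeof buf_a) ;
            return buf_a[0] ;
        }
        case 1:
        {
            char buf_b[128] ;                // needed only by this clause
            fill (buf_b, sizeof buf_b) ;
            return buf_b[1] ;
        }
        default:
            return -1 ;
    }
}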
But meanwhile I'm not sure what the best approach is. If the changes to reduce stack usage (at the expense of speed) are sufficiently minor that they can be covered with an #ifdef FAVOUR_STACK_SIZE_OVER_SPEED clause then I'd consider incorporating them in the generic modules.
The way I always do my releases is that I rebuild all the editions from the revised source(s) except the Pico, test them on their respective platforms, upload the binaries to my website and only then commit the changes to my GitHub repository (so I don't commit buggy code). I then clone and build the Pico edition, knowing that it will pick up the changes in my commit; but that means there may well be only minutes between committing the revised source files and the build, so obviously there is no opportunity for you to 'step in' and incorporate any changes I may have made.
I don't see how else this can work, so if I've made changes to any of the generic modules I need to know that they will be incorporated in the Pico build as well as all the others. Perhaps you can give some thought to this and advise on how you think we should best proceed. I don't want to have to omit the Pico edition from those that are available as binary downloads from my site.
allocate stack more intelligently in functions containing large switch statements.
Here's an idea; I don't know if it will work but if it does it's potentially a very elegant way of forcing the compiler to allocate a stack frame 'local' to the individual case clause rather than during the initialisation of the parent function containing the switch statement.
GCC supports nested functions (Clang/LLVM doesn't); using this extension it's possible to declare a function inside another function. So, using a simple #if condition we can convert the existing case clause to a function call thus:
        case TFN:
#if defined(REDUCE_STACK_SIZE) && defined(__GNUC__) && !defined(__llvm__)
            nestedTFN();
            void __attribute__ ((noinline)) nestedTFN(void)
#endif
            {
                int errcode = 0 ;
                jmp_buf savenv ;
                heapptr* savesp ;
                ...
            }
(not shown here is the function prototype necessary to avoid an error as a result of the forward-reference).
If the #if condition evaluates true the existing conventional inline case clause is converted to a function call. Hopefully, as a result, any stack requirements of that clause will be allocated locally, not in the parent function.
I've demonstrated that this compiles and runs, but what I don't know is whether it reduces the stack usage for real. I also don't know whether it optimises out the actual function call (but even if it doesn't, unconditional function calls with no parameters often take zero cycles on a modern pipelined CPU).
If it works well, I would be entirely happy to incorporate it in my 'master' source files because the code would be completely unchanged except for the REDUCE_STACK_SIZE case.
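For anyone wanting to try the idea in isolation, here is a minimal stand-alone demonstration of the same technique (GCC only; none of this is interpreter code and the names are made up). The nested function is defined before the call, so no forward declaration is needed, which is a slightly different ordering from the sketch above:
#include <stdio.h>
#include <string.h>

static int dispatch (int op, const char *src)
{
    switch (op)
    {
        case 0:
        {
            // The bulky buffer lives in a noinline nested function, so its
            // 256 bytes are needed only while this clause is executing,
            // not in the frame of dispatch() itself.
            int __attribute__ ((noinline)) bulky (void)
            {
                char buf[256] ;
                strncpy (buf, src, sizeof buf - 1) ;
                buf[sizeof buf - 1] = 0 ;
                return (int) strlen (buf) ;
            }
            return bulky () ;
        }
        case 1:
            return 1 ;
        default:
            return -1 ;
    }
}

int main (void)
{
    printf ("%d\n", dispatch (0, "hello")) ;     // prints 5
    return 0 ;
}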
The way I always do my releases is that I rebuild all the editions from the revised source(s) except the Pico, test them on their respective platforms, upload the binaries to my website and only then commit the changes to my GitHub repository (so I don't commit buggy code). I then clone and build the Pico edition, knowing that it will pick up the changes in my commit; but that means there may well be only minutes between committing the revised source files and the build, so obviously there is no opportunity for you to 'step in' and incorporate any changes I may have made.
As I have remarked previously, you don't necessarily have to push your source to GitHub before building the Pico version. You can do:
make BBC_SRC=<Path to local BBCSDL folder>
This will use the source in the specified folder instead of that from GitHub.
The issue of stack growth is a tricky one, not least because (as I understand it) it arises because of poor compiler design, in lumping the stack requirements of all the clauses in a switch statement into one large stack frame reserved at the start of the function rather than individually for each clause. One of the things always emphasised in C coding guidelines is never to try to 'outwit' the compiler, so for example always use the native switch statement rather than a jump table.
The problem is what target the compiler is optimising for. Certainly GCC seems to be aimed primarily at systems with large amounts of memory. Optimising for size gets less attention. Indeed when Eric tried using -Os, compiler bugs caused the program to crash. I have not tried using -Os recently.
Here's an idea; I don't know if it will work but if it does it's potentially a very elegant way of forcing the compiler to allocate a stack frame 'local' to the individual case clause rather than during the initialisation of the parent function containing the switch statement.
The only way to find out would be to modify all of at least one large case statement in this fashion, compile it then look at the disassembly to see how much stack space is being allocated.
In a similar way, I was wondering whether it would be possible to automate the generation of bbeval2.c and bbexec2.c from the original bbeval.c and bbexec.c. Mostly each case clause is simply copied into a separate function body. By putting some specific comment statements in the original source to mark key locations it may be possible to achieve this.
As I have remarked previously, you don't necessarily have to push your source to GitHub before building the Pico version.
The 'local' BBCSDL folder isn't on the Raspberry Pi that I use to build the Pico, nor on a network drive. But anyway it only makes the issue at hand even worse - you then wouldn't have even a few minutes to update your custom versions with whatever changes I had made, because you wouldn't even know I had made any changes! :-)
Optimising for size gets less attention. Indeed when Eric tried using -Os, compiler bugs caused the program to crash. I have not tried using -Os recently.
I'm pretty certain that's to reduce code size, and doesn't reduce stack usage at all.
The only way to find out would be to modify all of at least one large case statement in this fashion
I'm happy to do that, but I would need to know which case clauses need the treatment and which don't (many have no local stack requirements at all, over and above what's needed for the function preamble, I would expect). Indeed one advantage of this method is that one can choose any subset to be turned into function calls without affecting the rest.
Do you have any statistics, based on tests you might have done, on which are the greatest culprits in stack use?
I have attempted to revise routine item () along the lines you suggest.
#define REDUCE_STACK_SIZE
VAR item (void)
{
    VAR v ;
    signed char al = nxt () ;
    esi++ ;
    switch (al)
    {
/************************* Parenthesised expression ****************************/
        case '(':
#if defined(REDUCE_STACK_SIZE) && defined(__GNUC__) && !defined(__llvm__)
            VAR __attribute__ ((noinline)) item_OBKT(void)
            {
#endif
            v = expr () ;
            if (nxt () != ')')
                error (27, NULL) ; // 'Missing )'
            esi++ ;
            return v ;
#if defined(REDUCE_STACK_SIZE) && defined(__GNUC__) && !defined(__llvm__)
            }
            return item_OBKT ();
#endif
/************************************* FN **************************************/
        case TFN:
#if defined(REDUCE_STACK_SIZE) && defined(__GNUC__) && !defined(__llvm__)
            VAR __attribute__ ((noinline)) item_TFN(void)
            {
#endif
            {
                int errcode = 0 ;
                jmp_buf savenv ;
                heapptr* savesp ;
                procfn (TFN) ;
                memcpy (&savenv, &env, sizeof(env)) ;
                savesp = esp;
                errcode = (setjmp (env)) ; // <0 : QUIT, >0 : error, 256: END
                if (errcode)
                {
                    if ((errcode < 0) || (errcode > 255) || (errtrp == 0) ||
                        (onersp == NULL) || (onersp > savesp))
                    {
                        memcpy (&env, &savenv, sizeof(env)) ;
                        longjmp (env, errcode) ;
                    }
                    esi = errtrp + (signed char *) zero ;
                    esp = onersp ;
                }
                v = xeq () ;
                memcpy (&env, &savenv, sizeof(env)) ;
            }
            return v ;
#if defined(REDUCE_STACK_SIZE) && defined(__GNUC__) && !defined(__llvm__)
            }
            return item_TFN ();
#endif
Full version attached.
However I am obtaining build errors:
/home/pi/pico/PicoBB/BBCSDL/src/bbeval.c:997:3: error: a label can only be part of a statement and a declaration is not a statement
VAR __attribute__ ((noinline)) item_OBKT(void)
^~~
/home/pi/pico/PicoBB/BBCSDL/src/bbeval.c:998:13: error: expected '=', ',', ';', 'asm' or '__attribute__' before '{' token
{
^
/home/pi/pico/PicoBB/BBCSDL/src/bbeval.c:1007:10: warning: implicit declaration of function 'item_OBKT' [-Wimplicit-function-declaration]
return item_OBKT ();
^~~~~~~~~
/home/pi/pico/PicoBB/BBCSDL/src/bbeval.c:1007:10: error: incompatible types when returning type 'int' but 'VAR' {aka 'union tagVAR'} was expected
return item_OBKT ();
^~~~~~~~~~~~
/home/pi/pico/PicoBB/BBCSDL/src/bbeval.c:1014:3: error: a label can only be part of a statement and a declaration is not a statement
VAR __attribute__ ((noinline)) item_TFN(void)
^~~
/home/pi/pico/PicoBB/BBCSDL/src/bbeval.c:1015:13: error: expected '=', ',', ';', 'asm' or '__attribute__' before '{' token
{
^
/home/pi/pico/PicoBB/BBCSDL/src/bbeval.c:1045:10: warning: implicit declaration of function 'item_TFN'; did you mean 'itemi'? [-Wimplicit-function-declaration]
return item_TFN ();
^~~~~~~~
itemi
/home/pi/pico/PicoBB/BBCSDL/src/bbeval.c:1045:10: error: incompatible types when returning type 'int' but 'VAR' {aka 'union tagVAR'} was expected
return item_TFN ();
^~~~~~~~~~~
Not sure what I am doing wrong.
Edit: Found the problem. Need a brace after the case statement and before the nested function.
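For the record, the shape that compiles is sketched below (a structural illustration only, not the final source): a case label must be followed by a statement, and a declaration is not a statement, so the clause needs an opening brace before the nested-function definition. Placing the definition before the call also avoids the implicit-declaration warnings.
        case TFN:
            {                                // <-- brace added after the case label
#if defined(REDUCE_STACK_SIZE) && defined(__GNUC__) && !defined(__llvm__)
            VAR __attribute__ ((noinline)) item_TFN (void)
                {
#endif
                // ... existing body of the TFN clause, ending with 'return v ;' ...
#if defined(REDUCE_STACK_SIZE) && defined(__GNUC__) && !defined(__llvm__)
                }
            return item_TFN () ;
#endif
            }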
I'm happy to do that, but I would need to know which case clauses need the treatment and which don't (many have no local stack requirements at all, over and above what's needed for the function preamble, I would expect). Indeed one advantage of this method is that one can choose any subset to be turned into function calls without affecting the rest.
Do you have any statistics, based on tests you might have done, on which are the greatest culprits in stack use?
The routines which I have split into multiple functions are:
I don't have any statistics on the stack usage of individual case clauses. All that the disassembly gives is the total stack usage of the function. I suppose that the disassembly of my revised code will give the stack usage of the separated functions, which might give an indication of the stack used by the original case clauses.
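As an aside, and untried here so treat it only as a suggestion: GCC can report the static stack usage of every function with the -fstack-usage option, which writes a .su file alongside each object file, e.g.
arm-none-eabi-gcc -O1 -fstack-usage -c bbeval.c
It still only gives whole-function totals rather than per-clause figures, but it may be quicker for comparing builds than reading the disassembly.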
It looks as though this works. For the original code, the disassembly shows:
100137f4 <item>:
{
100137f4: b5f0 push {r4, r5, r6, r7, lr}
100137f6: 46ce mov lr, r9
100137f8: 4647 mov r7, r8
100137fa: b580 push {r7, lr}
100137fc: b0f5 sub sp, #468 ; 0x1d4
100137fe: 9007 str r0, [sp, #28]
So 468 bytes of stack reserved.
With the revised code (attached):
10014898 <item>:
{
10014898: b5f0 push {r4, r5, r6, r7, lr}
1001489a: b08b sub sp, #44 ; 0x2c
1001489c: 0004 movs r4, r0
So only 44 bytes. There is then some extra stack used by the separate cases, the largest I have found is an additional 68 bytes:
100158f8 <item_TRAD.7314>:
{
100158f8: b530 push {r4, r5, lr}
100158fa: b091 sub sp, #68 ; 0x44
100158fc: 0005 movs r5, r0
xrad.i.n = 0x3F91DF46A2529D39L ;
I leave math() and xeq() to you.
I have attempted to revise routine item () along the lines you suggest.
Oh, it sounds as though we have both been working on it! Please try my version which has the top 20-or-so biggest culprits for stack usage in item() changed to function calls (for test purposes only REDUCE_STACK_SIZE is defined). Sorry you have wasted your time, I had no expectation you would work on it yourself.
I leave math() and xeq() to you.
I don't believe there is anything worth doing in math(), the most you could save would be 8 bytes of stack. And because math() doesn't call item() or expr() or anything similar the stack usage is not cumulative in a re-entrant function.
I'll have a look at xeq(), but I doubt that there are many statements that are likely to need treatment.
pi@raspberrypi:~/pico/PicoBB/BBCSDL/src $ wget http://www.rtr.myzen.co.uk/bbeval.c
--2023-03-25 17:22:59-- http://www.rtr.myzen.co.uk/bbeval.c
Resolving www.rtr.myzen.co.uk (www.rtr.myzen.co.uk)... 212.23.8.80
Connecting to www.rtr.myzen.co.uk (www.rtr.myzen.co.uk)|212.23.8.80|:80... connected.
HTTP request sent, awaiting response... 404 Not Found
2023-03-25 17:22:59 ERROR 404: Not Found.
ERROR 404: Not Found.
Strange. I've re-uploaded it as a zip, perhaps that will work better.
10012a34 <math>:
VAR math (VAR x, signed char op, VAR y)
{
10012a34: b084 sub sp, #16
10012a36: b5f0 push {r4, r5, r6, r7, lr}
10012a38: 46ce mov lr, r9
10012a3a: 4647 mov r7, r8
10012a3c: b580 push {r7, lr}
10012a3e: b097 sub sp, #92 ; 0x5c
10012a40: 0007 movs r7, r0
10012a42: 911f str r1, [sp, #124] ; 0x7c
10012a44: 9220 str r2, [sp, #128] ; 0x80
10012a46: 9321 str r3, [sp, #132] ; 0x84
10012a48: ab23 add r3, sp, #140 ; 0x8c
10012a4a: 2600 movs r6, #0
10012a4c: 579e ldrsb r6, [r3, r6]
errno = 0 ;
math() reserves 92 bytes
1000e9b8 <xeq>:
// Execute the program:
VAR xeq (void)
{
1000e9b8: b5f0 push {r4, r5, r6, r7, lr}
1000e9ba: 46ce mov lr, r9
1000e9bc: 4647 mov r7, r8
1000e9be: b580 push {r7, lr}
1000e9c0: b0cb sub sp, #300 ; 0x12c
1000e9c2: 900b str r0, [sp, #44] ; 0x2c
1000e9c4: e141 b.n 1000ec4a <xeq+0x292>
void *tmpesi ;
while (1) // for each statement
And xeq() reserves 300 !
Many errors of the same form as I had earlier:
VAR __attribute__ ((noinline)) item_OBKT(void)
^~~
/home/pi/pico/PicoBB/BBCSDL/src/bbeval.c:998:13: error: expected '=', ',', ';', 'asm' or '__attribute__' before '{' token
{
^
/home/pi/pico/PicoBB/BBCSDL/src/bbeval.c:1007:10: warning: implicit declaration of function 'item_OBKT' [-Wimplicit-function-declaration]
return item_OBKT ();
^~~~~~~~~
/home/pi/pico/PicoBB/BBCSDL/src/bbeval.c:1007:10: error: incompatible types when returning type 'int' but 'VAR' {aka 'union tagVAR'} was expected
return item_OBKT ();
^~~~~~~~~~~~
/home/pi/pico/PicoBB/BBCSDL/src/bbeval.c:1014:3: error: a label can only be part of a statement and a declaration is not a statement
VAR __attribute__ ((noinline)) item_TFN(void)
^~~
/home/pi/pico/PicoBB/BBCSDL/src/bbeval.c:1015:13: error: expected '=', ',', ';', 'asm' or '__attribute__' before '{' token
{
^
/home/pi/pico/PicoBB/BBCSDL/src/bbeval.c:1045:10: warning: implicit declaration of function 'item_TFN'; did you mean 'itemi'? [-Wimplicit-function-declaration]
return item_TFN ();
^~~~~~~~
itemi
/home/pi/pico/PicoBB/BBCSDL/src/bbeval.c:1045:10: error: incompatible types when returning type 'int' but 'VAR' {aka 'union tagVAR'} was expected
return item_TFN ();
^~~~~~~~~~~
math() reserves 92 bytes
But it doesn't matter! All that matters is the stack usage of functions which can be called recursively, because they in turn call a function (most commonly item() or expr()) which can end up calling themselves, and so on. math() is a self-contained function that cannot end up calling itself, so it will just free the stack it used on return and that's it.
And xeq() reserves 300 !
I haven't done anything with xeq() yet, it's in bbexec.c which I'm looking at now.
And xeq() reserves 300 !
I've uploaded bbexec.zip in case you want to try it. It compiles (tested with GCC 8.1 and 10.3) and runs, although I haven't yet done extensive testing.
I'm also unsure whether I've caught all the large stack-guzzling culprits, so I will be interested to know how it compares with the 300 bytes of the old version.
Performance-wise it is measurably slower than the full-fat stack, as one would expect, but not by so much that it should be of concern on the Pico. I would hope it's no slower than your approach.
This 'nested function' technique (albeit that it's GCC-only) does show promise as a relatively safe and elegant way of reducing the stack usage under control of a simple macro. If you're happy with it, it will hopefully eliminate the necessity to have custom versions of generic interpreter source files.
Certainly an improvement (84 bytes):
100113cc <xeq>:
{
100113cc: b5f0 push {r4, r5, r6, r7, lr}
100113ce: 46ce mov lr, r9
100113d0: 4647 mov r7, r8
100113d2: b580 push {r7, lr}
100113d4: b095 sub sp, #84 ; 0x54
100113d6: 9003 str r0, [sp, #12]
VAR xeq (void)
100113d8: ab1c add r3, sp, #112 ; 0x70
100113da: 930f str r3, [sp, #60] ; 0x3c
However my version does better (28 bytes):
10011ecc <xeq>:
xeq_default, // 127
};
// Execute the program:
VAR xeq (void)
{
10011ecc: b5f0 push {r4, r5, r6, r7, lr}
10011ece: b087 sub sp, #28
10011ed0: 9000 str r0, [sp, #0]
bFlgChk = true;
Also, the case statement in your version is implemented as nested IF statements:
switch (al)
10011410: 0018 movs r0, r3
10011412: 302c adds r0, #44 ; 0x2c
10011414: d101 bne.n 1001141a <xeq+0x4e>
10011416: f000 fed4 bl 100121c2 <xeq+0xdf6>
1001141a: 0018 movs r0, r3
1001141c: 302c adds r0, #44 ; 0x2c
1001141e: dd24 ble.n 1001146a <xeq+0x9e>
10011420: 0018 movs r0, r3
10011422: 3021 adds r0, #33 ; 0x21
10011424: d101 bne.n 1001142a <xeq+0x5e>
10011426: f000 fd77 bl 10011f18 <xeq+0xb4c>
1001142a: 001a movs r2, r3
1001142c: 3221 adds r2, #33 ; 0x21
1001142e: dc00 bgt.n 10011432 <xeq+0x66>
10011430: e18c b.n 1001174c <xeq+0x380>
10011432: 001a movs r2, r3
10011434: 321b adds r2, #27
10011436: dd00 ble.n 1001143a <xeq+0x6e>
10011438: e1f1 b.n 1001181e <xeq+0x452>
1001143a: 001a movs r2, r3
1001143c: 321c adds r2, #28
1001143e: db00 blt.n 10011442 <xeq+0x76>
10011440: e36b b.n 10011b1a <xeq+0x74e>
10011442: 001a movs r2, r3
10011444: 321f adds r2, #31
10011446: d100 bne.n 1001144a <xeq+0x7e>
10011448: e1ba b.n 100117c0 <xeq+0x3f4>
1001144a: 001a movs r2, r3
1001144c: 321f adds r2, #31
1001144e: da00 bge.n 10011452 <xeq+0x86>
10011450: e328 b.n 10011aa4 <xeq+0x6d8>
10011452: 001a movs r2, r3
10011454: 321e adds r2, #30
10011456: d101 bne.n 1001145c <xeq+0x90>
10011458: f000 fed4 bl 10012204 <xeq+0xe38>
1001145c: 331d adds r3, #29
1001145e: d001 beq.n 10011464 <xeq+0x98>
10011460: f000 ff55 bl 1001230e <xeq+0xf42>
... plus more sections like the above.
Whereas the indirect function call in my version is much more efficient:
xeq_list[(int)al_token + 0x80]();
10011f0a: 3080 adds r0, #128 ; 0x80
10011f0c: 0080 lsls r0, r0, #2
10011f0e: 4b45 ldr r3, [pc, #276] ; (10012024 <xeq+0x158>)
10011f10: 58c3 ldr r3, [r0, r3]
10011f12: 4798 blx r3
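For anyone skimming the thread, the C shape that produces dispatch code like this is roughly as follows (a self-contained sketch with invented token values and handler names, not the actual PicoBB table; it maps the signed token with a cast rather than the +0x80 offset used above):
#include <stdio.h>

typedef void (*handler_t) (void) ;

static void do_print (void) { printf ("PRINT handler\n") ; }
static void do_let   (void) { printf ("LET handler\n") ; }
static void do_other (void) { printf ("default handler\n") ; }

static handler_t dispatch_tab[256] ;

int main (void)
{
    for (int i = 0; i < 256; i++)            // default every slot
        dispatch_tab[i] = do_other ;
    dispatch_tab[0xF1] = do_print ;          // token values here are invented
    dispatch_tab[0xE9] = do_let ;

    signed char token = (signed char) 0xF1 ;         // token as read from the program
    dispatch_tab[(unsigned char) token] () ;         // one indirect call, no compare chain
    return 0 ;
}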
Whereas the indirect function call in my version is much more efficient:
I have sufficient confidence in GCC to think it is highly unlikely that your version is "much more efficient". Indeed if I was a betting man (fortunately I'm not!) my money would be on the switch statement.
In my view it's particularly important not to 'second guess' the compiler when targeting multiple platforms and processors, when the code generator will know far more about how to optimise performance on a specific CPU than you or I would.
This is a random comment that I found at Stack Overflow: "Trust the compiler, it will do such micro optimisations much better than you". Even if a compiler doesn't produce the best possible code now, a future version may do better.
Anyway, what matters is whether I've got the stack usage down to a level where you'd be happy to use the standard source files in the Pico build, which is very important to me. It doesn't mean efforts to reduce the stack usage further shouldn't be made, but I would at least have the confidence of knowing that when I (or you) create a Pico build it's using the latest source.
Certainly an improvement (84 bytes):
I can't see where that much stack is going. I'd expect enough to save the non-volatile registers but not much else (the only case clauses which I haven't converted to function calls ought to be able to manage with registers rather than spilling variables to the stack).
I assume that if one case has for example an int a and another an int b (both scoped locally by being in braces) the compiler will use the same register or stack location for both - it would be pretty stupid if not.
I've updated bbexec.zip; this has the TSOUND case additionally converted to a function call, although I wouldn't expect that to make much if any difference.
I have sufficient confidence in GCC to think it is highly unlikely that your version is "much more efficient". Indeed if I was a betting man (fortunately I'm not!) my money would be on the switch statement.
In my view it's particularly important not to 'second guess' the compiler when targeting multiple platforms and processors, when the code generator will know far more about how to optimise performance on a specific CPU than you or I would.
I am not 'second guessing' the compiler, I am looking at the disassembly of what the compiler has actually produced. That is a large mess of comparisons and branches. I think it is a loop unrolled version of a bisection search for the matching case value. That is less efficient than a direct indexed look up in a jump table, which is what my revised version is.
Changing the algorithm at the code level is a 'macro' optimisation, not a 'micro' one.
I can't see where that much stack is going. I'd expect enough to save the non-volatile registers but not much else (the only case clauses which I haven't converted to function calls ought to be able to manage with registers rather than spilling variables to the stack). I assume that if one case has for example an int a and another an int b (both scoped locally by being in braces) the compiler will use the same register or stack location for both - it would be pretty stupid if not.
I currently believe that the compiler is exactly that stupid. Which is why I think it worthwhile to convert every case clause into a separate function, no matter how small.
I am currently working on my alternative approach of a program that will automatically convert the case statement into multiple functions.
That is less efficient than a direct indexed look up in a jump table, which is what my revised version is.
You can't possibly say that, with a modern processor. Branch prediction and the like can make naïve assessments of that kind highly misleading. I have no doubt that GCC, especially relatively new versions, will be doing an excellent job of optimising the switch statement.
As an assembly language programmer for something like 50 years, I know that the kind of hand-optimisation which I used to do in the old days (and I think I was pretty good at) just doesn't apply to modern CPUs, and I accept that a good compiler will do a better job than I ever could.
I am currently working on my alternative approach of a program that will automatically convert the case statement into multiple functions.
Please don't waste your time. The only acceptable solution as far as I'm concerned (and I'm sure you would feel the same in my position) is for the Pico build to use the same source files as are used for all the others. I have put a lot of effort into adding an option to reduce stack size, and even if it's not quite as efficient as yours in that regard I am sure it will be fine.
I would ask you please to agree to using my latest sources. I will then do a much more comprehensive set of tests than I have to date and commit them to the repository. Thanks.
Very well. I accept your revisions.
If I may ask for one further conditional block in bbexec.c, as follows:
/******************************* TIME and TIME$ ********************************/
        case TTIMEL:
#ifdef PICO_RTC
            if ( *esi == '$' )
            {
                ++esi;
                equals ();
                VAR v = exprs ();
                putims (v.s.p + zero);
            }
            else
#endif
            {
                long long n ;
                equals () ;
                n = expri () ;
                putime (n) ;
            }
            break ;
If you could either post your final versions of bbeval.c and bbexec.c here, or let me know when you have updated your repository, so that I can revise my makefile accordingly.
Very well. I accept your revisions.
Thank you, that is much appreciated.
If I may ask for one further conditional block in bbexec.c, as follows:
Of course, only too happy to. Could we choose something a little less specific than PICO_RTC, perhaps SET_RTC, because I can envisage other systems wanting this capability too.
If you could either post your final versions of bbeval.c and bbexec.c here, or let me know when you have updated your repository, so that I can revise my makefile accordingly.
Certainly. It may take a little while, because I want to do as comprehensive a set of tests as I can. I am very jittery about modifying such a key part of the interpreter.
Just for your interest, and perhaps to help you understand a little better where I am coming from, here's the equivalent code to xeq() from my original Z80 BBC BASIC of 1981:
ADD A,A
LD C,A
LD B,0
LD HL,CMDTAB
ADD HL,BC
LD A,(HL) ;TABLE ENTRY
INC HL
LD H,(HL)
LD L,A
JP (HL) ;EXECUTE STATEMENT
And here is the equivalent code from BBC BASIC for Windows, 20 years later:
movzx ebx,al
jmp [ebx*4+cmdtab]
So I don't need any persuading of the benefits of dispatching via a jump table! Indeed, when I was converting the code to C, ten years or so ago, my initial inclination was to do exactly the same. But because I was very concerned to get the best possible performance I did some research into whether a switch statement might be even better.
The great majority of the references I could find at the time said yes: trust the compiler. If a dispatch table was the best way, that's what it would use. I have never had cause to regret that decision.
If I may ask for one further conditional block in bbexec.c, as follows:
/******************************* TIME and TIME$ ********************************/
case TTIMEL:
#ifdef PICO_RTC
    if ( *esi == '$' )
    {
        ++esi;
        equals ();
        VAR v = exprs ();
        putims (v.s.p + zero);
    }
    else
#endif
I wonder, do you realise that the string at v.s.p + zero is not NUL-terminated, or if so only by chance? To get a terminated string you would have to call fixs() (which puts a CR-terminated string in the 'string accumulator') or similar; alternatively pass both the pointer v.s.p and the length v.s.l of the string to putims(), or indeed pass the entire structure. What shall I do?
Of course, only too happy to. Could we choose something a little less specific than PICO_RTC, perhaps SET_RTC, because I can envisage other systems wanting this capability too.
I suggest HAVE_RTC instead of PICO_RTC.
I wonder, do you realise that the string at v.s.p + zero is not NUL-terminated, or if so only by chance? To get a terminated string you would have to call fixs() (which puts a CR-terminated string in the 'string accumulator') or similar; alternatively pass both the pointer v.s.p and the length v.s.l of the string to putims(), or indeed pass the entire structure. What shall I do?
Yes, I know this. However since the permitted formats for the time string are specified I don't need the NUL terminator to know the string length. And as per the manual, if the string does not correspond to the defined format, the results are undefined.
The parsing code is:
void putims (const char *psTime)
{
    datetime_t dt;
    rtc_get_datetime (&dt);
    const char *ps = psTime;
    if ( ps[3] == '.' ) ps += 4;
    if ( ps[2] != ':' )
    {
        dt.day = 10 * ( ps[0] - '0' ) + ps[1] - '0';
        ps += 3;
        for (int i = 1; i <= 12; ++i)
        {
            if ( strncasecmp (ps, psMon[i], 3) == 0 )
            {
                dt.month = i;
                break;
            }
        }
        ps += 4;
        dt.year = 1000 * ( ps[0] - '0' ) + 100 * ( ps[1] - '0' ) + 10 * ( ps[2] - '0' ) + ps[3] - '0';
        int iDay = dt.year + dt.year / 4 - dt.year / 100 + dt.year / 400 + iDMon[dt.month-1] + dt.day - 1;
        if (( dt.month < 3 ) && ( dt.year % 4 == 0 ) && (( dt.year % 100 != 0 ) || ( dt.year % 400 == 0 ))) --iDay;
        dt.dotw = iDay % 7;
        ps += 5;
    }
    if ( ps[2] == ':' )
    {
        dt.hour = 10 * ( ps[0] - '0' ) + ps[1] - '0';
        ps += 3;
        dt.min = 10 * ( ps[0] - '0' ) + ps[1] - '0';
        if ( ps[2] == ':' )
        {
            ps += 3;
            dt.sec = 10 * ( ps[0] - '0' ) + ps[1] - '0';
        }
        else
        {
            dt.sec = 0;
        }
    }
    vldtim (&dt);
    rtc_set_datetime (&dt);
}
You can't possibly say that, with a modern processor. Branch prediction and the like can make naïve assessments of that kind highly misleading. I have no doubt that GCC, especially relatively new versions, will be doing an excellent job of optimising the switch statement.
For modern high-spec processors you are probably correct. However, the Pico has ARM M0+ cores. I don't believe they have branch prediction. Based upon the disassembly of the object code, the compiler has generated ~350 instructions to implement the switch statement and select which case to execute. That compares to just 5 for the jump table. The only way I can imagine the switch code being faster is if all 350 instructions are in cache (and they are not all consecutive) while the jump table is not.
I suggest HAVE_RTC instead of PICO_RTC.
OK, although since reading TIME$ (which also must 'have' a real-time clock) is not conditional on such a test, it's a rather inconsistent usage.
Yes, I know this. However since the permitted formats for the time string are specified I don't need the NUL terminator to know the string length.
The 'permitted' formats may be specified, but if somebody enters a shorter string than they should your code could read beyond the end, potentially reading garbage or even unallocated memory. Indeed if the length of the string is zero the pointer can legitimately be NULL and your code will crash with a NULL-pointer exception!
I propose that I add a call to fixs() and pass a CR-terminated string; that would retain compatibility with your existing parsing code.
Based upon the disassembly of the object code, the compiler has generated ~350 instructions to implement the switch statement and select which case to execute.
Which optimisation level were you using?
It makes no sense to me that GCC, which if -O2 is specified usually goes to such lengths to optimise switch statements (some such optimisations being quite amazing, look it up) and will in any case default to using a jump-table (again, look it up), would fail so spectacularly on the Cortex-M0+.
But if for some extraordinary reason it does, it's not for us to work around what is clearly a fault in the compiler or code generator. Report it by all means, if it's really doing that somebody responsible for the compiler needs to know about it, but it's not our problem.
I suggest HAVE_RTC instead of PICO_RTC.
OK, although since reading TIME$ (which also must 'have' a real-time clock) is not conditional on such a test, it's a rather inconsistent usage.
In which case use CAN_SET_RTC.
I propose that I add a call to fixs() and pass a CR-terminated string; that would retain compatibility with your existing parsing code.
Instead, please pass the string length. It is easier for me to use that for checking.
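A minimal self-contained illustration of the 'pass the length' approach (the names and exact interface here are assumptions, not what was finally agreed): the parser checks the length before reading each field instead of relying on a terminator.
#include <stdio.h>

// Hypothetical bounds-checked parse of an "hh:mm" field from an unterminated string.
static int parse_hhmm (const char *ps, int len, int *hour, int *min)
{
    if ((ps == NULL) || (len < 5) || (ps[2] != ':'))
        return 0 ;                           // reject short or malformed input
    *hour = 10 * (ps[0] - '0') + ps[1] - '0' ;
    *min  = 10 * (ps[3] - '0') + ps[4] - '0' ;
    return 1 ;
}

int main (void)
{
    int h, m ;
    if (parse_hhmm ("17:22", 5, &h, &m))
        printf ("%02d:%02d\n", h, m) ;
    return 0 ;
}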
Which optimisation level were you using?
All the PicoBB builds use -O1. That is what Eric concluded was best. I believe that choice was based upon restricting stack utilisation.
All the PicoBB builds use -O1
Try -O2 with the new stack-reduction code. The noinline attribute should override the -O2 and help keep stack growth down.
All the PicoBB builds use -O1
Try -O2 with the new stack-reduction code. The noinline attribute should override the -O2 and help keep stack growth down.
Still the same mess of compare and branch for the main switch statement in xeq(). It would take a lot of work to determine what all the differences are.
It would take a lot of work to determine what all the differences are.
In every other build I use -O2 for the core interpreter because it provides a worthwhile performance improvement, and especially on the Pico every little helps.
If the only reason for using -O1 was because of excessive stack growth (presumably through inlining), it might be worth tracing the critical recursive call chain and adding some noinline attributes, conditional on the new REDUCE_STACK_SIZE option.
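One possible way to express that, sketched under the assumption that a single guard macro is acceptable (the macro name here is invented): functions on the recursive item()/expr()/xeq() chain would carry the attribute only when REDUCE_STACK_SIZE is in force.
#if defined(REDUCE_STACK_SIZE) && defined(__GNUC__)
#define NOINLINE_IF_SMALL_STACK __attribute__ ((noinline))
#else
#define NOINLINE_IF_SMALL_STACK
#endif

// Hypothetical usage on a helper in the recursive chain:
static int NOINLINE_IF_SMALL_STACK helper (int n)
{
    return n + 1 ;
}

int caller (int n)
{
    return helper (n) ;
}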
This would probably be a fair amount of work, but if we could get the stack down to something comparable with what -O1 gives, the -O2 optimisations would almost certainly provide a worthwhile benefit, especially in those functions - the vast majority - which never get called recursively.
I expect very few, if any, other BASIC interpreters bother with stack reduction in this way, so it would give BBC BASIC another string to its bow to offer improved compatibility with platforms with limited RAM.
Do you still have a way of measuring overall stack usage (e.g. by counting levels of recursion in a BASIC program before it blows up) that could be used to compare your existing version, the REDUCE_STACK_SIZE version compiled with -O1 and the same compiled with -O2? That would give us some statistics to judge the potential value of more work.
Seems I've discovered why you've been using -O1! Running speed.bbc:
Existing version from this repository: Combined average 335 MHz
My new bbexec.c & bbeval.c with -O1: Combined average 327 MHz
My new bbexec.c & bbeval.c with -O2: Combined average 298 MHz
My new bbexec.c & bbeval.c with -O3: Combined average 317 MHz
Definite room for improvement in the Cortex-M0+ code generator I would say. ;-)
I've been doing a more detailed analysis of the stack usage issue, and it turns out that reducing the size of the stack frame reserved by xeq() or item() will not necessarily increase the maximum depth of recursion! It depends very much on the specific recursive BASIC code concerned.
For example the LET (assignment) statement uses quite a lot of CPU stack, so moving it into a subroutine may well reduce the size of the stack frame allocated by xeq(). But if, in the BASIC program, the recursion happens in an assignment statement, which is quite likely, the amount of stack used for each level of recursion may actually increase, not decrease!
This is because the stack used per level of recursion will include both the stack used by xeq() and the stack used by the assignment statement, and the latter has been increased by moving it into a subroutine!
It will always be beneficial to move into a subroutine statements which use a lot of stack but never call item() or expr() etc. and thus cannot be in the recursive 'chain'. But moving statements that do call item() etc. into a subroutine may actually make things worse, not better.
This is because the stack used per level of recursion will include both the stack used by xeq() and the stack used by the assignment statement, and the latter has been increased by moving it into a subroutine!
I don't understand this.
With my version of the code, xeqTLET uses 112 bytes of stack. And the cut-down xeq uses 48 bytes, so a total of 160 bytes. That compares with 328 bytes used by your original xeq. The xeq* routine with the largest stack usage I could find is xeq_TSYS which uses 196 bytes of stack. Adding that used by the residual xeq gives a total of 244 bytes of stack, only 75% of your original.
Considering the case of recursive calls once the individual cases have been moved into subroutines those case routines will use a fixed amount of stack. But each recursion will use the case routine, plus the residual base xeq routine. Hence the stack used by the residual routine should be made as small as possible.
The only cases which should be left inline in xeq are those small ones which result in no increase in the stack usage of the xeq routine itself, but would require a few bytes of stack for the subroutine call if moved out.
This is because the stack used per level of recursion will include both the stack used by xeq() and the stack used by the assignment statement, and the latter has been increased by moving it into a subroutine!
I don't understand this.
It's true, nevertheless.
It's easily demonstrated (although time consuming) by writing some test programs that are recursive in different ways - e.g. recursive in the 'end function' (=) statement, or recursive in an assignment, or recursive in a PRINT or whatever - and then experimenting with making each of those statements into a subroutine or not.
I did that yesterday using this program, which is recursive in the 'end function' statement:
PRINT FN1(0)
END
DEF FN1(A%)
PRINT A%
= FN1(A% + 1)
Moving the ENDFN clause back to inline code in xeq() rather than a subroutine increased the maximum nesting depth!
There's no perfect answer to this. You could try to analyse lots of recursive programs to see what statement(s) they are most commonly recursive in, but it's not very scientific.
ENDFN is somewhat a special case, in that it finishes with a return, rather than continuing around the while loop. This meant that my bbexec2.c had to deal with that case separately. As a result the compiler has inlined the function anyway.
I would be interested in seeing the disassembly, with source code, of your two versions with and without ENDFN inlined.
The SDK does not include source code in the disassembly it produces. However you can produce a disassembly with source code using the command:
arm-none-eabi-objdump -Sd build/bbcbasic.elf >build/bbcbasic.dis
Alternately, please will you post your current version of bbexec.c here so that I can do the experiment.
ENDFN is somewhat a special case, in that it finishes with a return
That's unimportant (my 'nested function' approach can handle a non-void function just as easily as a void one). What matters is that it calls expr() and is therefore in the recursive 'chain': its stack frame contributes to the 'stack usage per level of recursion'.
To prove the point, change the recursive statement to something other than ENDFN, for example an assignment:
PRINT FN1(0)
END
DEF FN1(A%)
PRINT A%
A% = FN1(A% + 1)
= A%
This is functionally equivalent to the previous example, but now inlining ENDFN won't reduce the overall stack usage as it did before, it will either increase it or make no difference. However inlining LET, which would have had a negative effect on the previous example will now be beneficial!
Without knowing whether recursion is more common in ENDFN or an assignment (or something else) we can't determine which set of cases to make a subroutine and which to leave inline for optimal results. :-(
I would be interested in seeing the disassembly, with source code, of your two versions with and without ENDFN inlined.
I'm inclined to think that looking at a disassembly can be misleading; it's what led you to believe that reducing the size of the xeq() stack frame is always beneficial, when it isn't. I'd rather use the more direct and meaningful measure of the number of recursions, of a specific test program, before it runs out of stack.
Alternately, please will you post your current version of bbexec.c here so that I can do the experiment.
There is no "current version", I am actively modifying it to see what the trade-offs are. There's an unfortunate complication that an innocuous change to bbexec.c which can't possibly have any direct effect on speed.bbc is in practice having a very significant impact. I've seen this kind of thing before with 32-bit x86 (especially AMD CPUs which fetch code in 32-byte chunks and are therefore exquisitely sensitive to code alignment).
I don't know whether a similar thing is happening with the ARM Cortex-M0+ but it means that a tiny change to the code which may be beneficial in respect of stack usage may have as much as a 20% impact on speed. This is very annoying, especially as the GCC code generator should know about things like alignment and ensure that this kind of side-effect doesn't occur.
In the absence of real code I will have to use hypothetical values.
Suppose that xeq_ENDFN as a separate routine uses 16 bytes of stack. Suppose that inlining the code in the base xeq routine saves those 16 bytes at the expense of increasing the stack usage of xeq by 8 bytes. Then if you run your ENDFN test program you save a net 8 bytes per iteration, and can perform more iterations before running out of stack space, as your test shows.
However, suppose you now use your ENDFN optimised code to run any other test iteration (not using ENDFN). You now just get the extra 8 bytes of stack utilisation per iteration from the enlarged xeq routine, so can perform fewer iterations before running out of stack than with the 'unoptimised' code.
The same is true in general. Inlining one case may increase the number of iterations you can do with code that tests that case, but if it makes any increase in the stack utilised by the base xeq routine this will come at the expense of reducing the performance of all iterations that do not use that case.
Any 'optimisation' which improves one specific case at the expense of making all others worse does not seem like a good optimisation.
Any 'optimisation' which improves one specific case at the expense of making all others worse does not seem like a good optimisation.
I agree, but it's not really a choice of 'one specific case out of many'; the great majority of recursive programs are likely to recurse in only a tiny subset of the possible statements. To take a (silly) example, how many programs are likely to recurse in the STEP clause of a FOR statement:
FOR i% = 1 TO 1000 STEP FNrecurse
It's possible, but so unlikely that I don't think it's worthy of consideration or optimisation. I reckon nearly all recursive programs will recurse either in an assignment or in the end-function statement.
With that assumption I've written two test programs (they are the ones I've listed before), one which recurses in an assignment:
PRINT FN1(0)
END
DEF FN1(A%)
PRINT A%
A% = FN1(A% + 1)
= A%
and one which recurses in the end-function statement:
PRINT FN1(0)
END
DEF FN1(A%)
PRINT A%
= FN1(A% + 1)
With only two likely possibilities you don't have to make a choice of which you optimise, you can attempt to achieve a compromise which achieves good results for both. This is what I am currently working on.
The main advance which I think I've made is that the 'nested function' approach (unlike your table dispatch approach) provides the opportunity to put part of a case clause in a subroutine whilst keeping part of it inline. That gives me more flexibility, especially with the LET case.
I assume that if one case has for example an int a and another an int b (both scoped locally by being in braces) the compiler will use the same register or stack location for both - it would be pretty stupid if not.
I currently believe that the compiler is exactly that stupid.
I'm pleased to say it's not. Inlining a case clause doesn't result in any increase in the size of the stack-frame, so long as its local variable storage can 'overlay' that of another case. The compiler uses the same space for both.
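A tiny stand-alone illustration of the overlap just described (not interpreter code): a and b are each confined to their own braces, so the compiler is free to give them the same register or stack slot.
int pick (int op, int x)
{
    switch (op)
    {
        case 1:
        {
            int a = x * 3 ;                  // lives only inside these braces
            return a + 1 ;
        }
        case 2:
        {
            int b = x * 5 ;                  // may reuse the storage used for 'a'
            return b - 1 ;
        }
        default:
            return 0 ;
    }
}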
This is good because it means I can inline the ENDFN clause without any negative impact on stack use, and indeed doing so increases the maximum nesting depth of the program which recurses in that statement, without having any effect at all on the one which recurses in an assignment.
So it's a win-win situation, or at least it would be if that change didn't also reduce the performance reported by speed.bbc by about 8%! There's no way that inlining that clause should legitimately have such an impact on speed, if anything I would expect removing the overhead of a function call and return to increase the speed.
It's very frustrating!
I've committed the latest versions of bbexec.c and bbeval.c to my repository. The performance I'm getting with these is as follows (figures in brackets are for your latest bbexec2.c and bbeval2.c):
So the stack usage is a little more than yours, but it's about 3% faster - which I think puts to bed the idea that the table-dispatch method is "much more efficient".
The really big advantage, for me, is that the code is identical to that which all the other editions use, except that when REDUCE_STACK_SIZE is specified some of it gets moved into nested functions to reduce stack usage. I was pleased to find that even goto still works, because it's legal to jump out of a nested function to a local label.
I think (hope) the issue can be closed now.
Very well.
I have already committed the changes to the Make and CMake files so that, by default, the build compiles your versions of bbeval.c and bbexec.c with REDUCE_STACK_SIZE defined.
I'm not sure whether this is your issue or mine, but the Pico version of BBC BASIC is reporting 'Number too big' from this statement:
PRINT 1/0
whereas I would have expected 'Division by zero'. The other editions (including ARM editions) built from the same source code work as expected, so something strange is happening. The relevant code is this in bbeval.c:
There's no way that op can be anything other than '/' that I can see from the code.