Memotech-Bill / PicoBB

BBC BASIC for Raspberry Pi Pico
zlib License

'Division by zero' giving 'Number too big' #15

Closed · rtrussell closed this issue 1 year ago

rtrussell commented 1 year ago

I'm not sure whether this is your issue or mine, but the Pico version of BBC BASIC is reporting 'Number too big' from this statement:

PRINT 1/0

whereas I would have expected 'Division by zero'. The other editions (including ARM editions) built from the same source code work as expected, so something strange is happening. The relevant code is this in bbeval.c:

if (isinf(x.f) || isnan(x.f) || (x.i.t == -1) || errno)
    {
    if (op == '/')
        error (18, NULL) ; // 'Division by zero'
    else if (op == '^')
        error (22, NULL) ; // 'Logarithm range'
    else
        error (20, NULL) ; // 'Number too big'
    }

There's no way that op can be anything other than '/' that I can see from the code.

Memotech-Bill commented 1 year ago

In order to minimise stack usage and leave the maximum amount of memory free for the BASIC program, my default builds do not use bbeval.c (or bbexec.c). Instead they use re-coded versions bbeval2.c and bbexec2.c in which some large case statements have been split into multiple subroutines. To use the original code, the program would have to be re-compiled using the option MIN_STACK=N.

As you observed, the routine math_divide was reporting a "Number too big" error rather than "Division by zero". I have now committed a change to test for a zero divide and issue the correct error message in this case.

I will point out that the code in bbeval.c will incorrectly report "Division by zero" for PRINT 1E200 / 1E-200, whereas the revised code in bbeval2.c still reports "Number too big" for this case.
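
In outline, the distinction works like this simplified, self-contained sketch (an illustration only, not the actual math_divide change): an overflow to infinity is reported as 'Division by zero' only when the divisor really is zero, and as 'Number too big' otherwise.

#include <math.h>
#include <stdio.h>

static const char *divide_check (double num, double den)
    {
    double q = num / den ;
    if (isinf (q) || isnan (q))
        return (den == 0.0) ? "Division by zero" : "Number too big" ;
    return "OK" ;
    }

int main (void)
    {
    printf ("%s\n", divide_check (1.0, 0.0)) ;      // Division by zero
    printf ("%s\n", divide_check (1e200, 1e-200)) ; // Number too big
    return 0 ;
    }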

I have not yet posted updated UF2 files.

rtrussell commented 1 year ago

my default builds do not use bbeval.c (or bbexec.c). Instead they use re-coded versions bbeval2.c and bbexec2.c

I see. My understanding was that you were using BBCSDL as a submodule, so I could be confident that any changes I made to my repository (such as the recent revamping of the way errno is used, to make it possible for BASIC programs to read that value) would be automatically incorporated in the Pico build. Have you been careful to track changes I have made and incorporate them in your versions?

My users are entitled to expect that all the Console Mode editions of BBC BASIC available from my web site are built from the same source code. Indeed I have unwittingly stated that explicitly in release notes, believing it to be the case. So can you tell me exactly what I need to change in my build process to ensure that I am, indeed, using the original source files? I'm sure any increased stack usage won't matter in the vast majority of BASIC programs.

I will point out that the code in bbeval.c will incorrectly report "Division by zero" for PRINT 1E200 / 1E-200, whereas the revised code in bbeval2.c still reports "Number too big" for this case.

I don't consider that an "incorrect" report. There are no practical circumstances when it matters whether the divisor is precisely zero or not, all that matters is that the quotient overflows the number range of the output. Indeed in versions of BBC BASIC which use 40-bit floats 1E-200 is precisely zero, but in versions which use 64-bit floats it isn't. Similarly 1E-400 is precisely zero in versions which use 64-bit floats, but isn't when 80-bit floats are used.

Memotech-Bill commented 1 year ago

my default builds do not use bbeval.c (or bbexec.c). Instead they use re-coded versions bbeval2.c and bbexec2.c

So can you tell me exactly what I need to change in my build process to ensure that I am, indeed, using the original source files? I'm sure any increased stack usage won't matter in the vast majority of BASIC programs.

To build using your versions of bbeval.c and bbexec.c, use the following command:

make MIN_STACK=N

For confirmation the CMake output should show

-- BBC BASIC Upstream source from ../../BBCSDL
-- Using upstream expression evaluation code                             -- CONFIRMATION
-- Using memory protection to detect stack overrun
-- Pico W GPIO (LED) only support
-- Generate high quality sound using PWM on core 1
-- Including LFS filesystem for flash storage
-- BBC Basic console I/O will be on USB or UART
-- Link with 2KB multicore stack
-- Configuring done
-- Generating done

I see. My understanding was that you were using BBCSDL as a submodule, so I could be confident that any changes I made to my repository (such as the recent revamping of the way errno is used, to make it possible for BASIC programs to read that value) would be automatically incorporated in the Pico build. Have you been careful to track changes I have made and incorporate them in your versions?

Yes I have been tracking your changes.

My users are entitled to expect that all the Console Mode editions of BBC BASIC available from my web site are built from the same source code. Indeed I have unwittingly stated that explicitly in release notes, believing it to be the case.

The Pico code (even the console builds) is clearly not the same as BBCSDL:

I will point out that the code in bbeval.c will incorrectly report "Division by zero" for PRINT 1E200 / 1E-200, whereas the revised code in bbeval2.c still reports "Number too big" for this case.

I don't consider that an "incorrect" report. There are no practical circumstances when it matters whether the divisor is precisely zero or not, all that matters is that the quotient overflows the number range of the output.

If the output overflows the number range then surely the correct error is "Number too big". And infinity could be considered to be too big.

Indeed in versions of BBC BASIC which use 40-bit floats 1E-200 is precisely zero, but in versions which use 64-bit floats it isn't. Similarly 1E-400 is precisely zero in versions which use 64-bit floats, but isn't when 80-bit floats are used.

Which illustrates that not all versions of BBC BASIC behave identically.

Whether 1E-200 can be considered to be effectively zero depends upon the application, which you do not know. With 64-bit floats there are more than 10^17 representable values between zero and 1E-200, spanning more than 100 orders of magnitude.

rtrussell commented 1 year ago

The Pico code (even the console builds) is clearly not the same as BBCSDL:

I hope you are not being deliberately perverse just to be irritating! :-(

When I say that I expect different editions to use the same source code I am of course referring to the generic source code of the BBC BASIC interpreter. If you look at the README of my BBCSDL repository you will find an architecture diagram which shows the generic modules and those that are specific to each platform and type of build.

I would expect all versions (at least, those which call themselves 'BBC BASIC') to use the files shown in green, and all the Console versions to use the files shown in the red box. The Pico edition is specifically catered for by the inclusion of the file bbpico.c in grey. Of course there may well be a few instances of #ifdef PICO directives in what are otherwise generic modules, but that's no different in principle from the #ifdef __WINDOWS__ or #ifdef __aarch64__ etc. directives that are inevitable.

The issue of stack growth is a tricky one, not least because (as I understand it) it arises because of poor compiler design, in lumping the stack requirements of all the clauses in a switch statement into one large stack frame reserved at the start of the function rather than individually for each clause. One of the things always emphasised in C coding guidelines is never to try to 'outwit' the compiler, so for example always use the native switch statement rather than a jump table.

Compilers have got better in their stack handling; for example, a function with an 'early out' may well now not reserve its full stack frame until after that return, which is an obvious efficiency benefit that any assembly-language programmer would automatically take advantage of. If the claim that modern compilers can outperform human assembly language programmers is to be justified, they will also have to allocate stack more intelligently in functions containing large switch statements.

But meanwhile I'm not sure what the best approach is. If the changes to reduce stack usage (at the expense of speed) are sufficiently minor that they can be covered with an #ifdef FAVOUR_STACK_SIZE_OVER_SPEED clause then I'd consider incorporating them in the generic modules.

The way I always do my releases is that I rebuild all the editions from the revised source(s) except the Pico, test them on their respective platforms, upload the binaries to my website and only then commit the changes to my GitHub repository (so I don't commit buggy code). I then clone and build the Pico edition, knowing that it will pick up the changes in my commit; but that means there may well be only minutes between committing the revised source files and the build, so obviously there is no opportunity for you to 'step in' and incorporate any changes I may have made.

I don't see how else this can work, so if I've made changes to any of the generic modules I need to know that they will be incorporated in the Pico build as well as all the others. Perhaps you can give some thought to this and advise on how you think we should best proceed. I don't want to have to omit the Pico edition from those that are available as binary downloads from my site.

rtrussell commented 1 year ago

allocate stack more intelligently in functions containing large switch statements.

Here's an idea; I don't know if it will work but if it does it's potentially a very elegant way of forcing the compiler to allocate a stack frame 'local' to the individual case clause rather than during the initialisation of the parent function containing the switch statement.

GCC supports nested functions (Clang/LLVM doesn't); using this extension it's possible to declare a function inside another function. So, using a simple #if condition we can convert the existing case clause to a function call thus:

        case TFN:
#if defined(REDUCE_STACK_SIZE) && defined(__GNUC__) && !defined(__llvm__)
        nestedTFN();
        void __attribute__ ((noinline)) nestedTFN(void)
#endif
        {
            int errcode = 0 ;
            jmp_buf savenv ;
            heapptr* savesp ;
                ...
        }

(not shown here is the function prototype necessary to avoid an error as a result of the forward-reference).

If the #if condition evaluates true the existing conventional inline case clause is converted to a function call. Hopefully, as a result, any stack requirements of that clause will be allocated locally, not in the parent function.

I've demonstrated that this compiles and runs, but what I don't know is whether it reduces the stack usage for real. I also don't know whether it optimises out the actual function call (but even if it doesn't, unconditional function calls with no parameters often take zero cycles on a modern pipelined CPU).

If it works well, I would be entirely happy to incorporate it in my 'master' source files because the code would be completely unchanged except for the REDUCE_STACK_SIZE case.

Memotech-Bill commented 1 year ago

The way I always do my releases is that I rebuild all the editions from the revised source(s) except the Pico, test them on their respective platforms, upload the binaries to my website and only then commit the changes to my GitHub repository (so I don't commit buggy code). I then clone and build the Pico edition, knowing that it will pick up the changes in my commit; but that means there may well be only minutes between committing the revised source files and the build, so obviously there is no opportunity for you to 'step in' and incorporate any changes I may have made.

As I have remarked previously, you don't necessarily have to push your source to GitHub before building the Pico version. You can do:

make BBC_SRC=<Path to local BBCSDL folder>

This will use the source in the specified folder instead of that from GitHub.

The issue of stack growth is a tricky one, not least because (as I understand it) it arises because of poor compiler design, in lumping the stack requirements of all the clauses in a switch statement into one large stack frame reserved at the start of the function rather than individually for each clause. One of the things always emphasised in C coding guidelines is never to try to 'outwit' the compiler, so for example always use the native switch statement rather than a jump table.

The problem is what target the compiler is optimising for. Certainly GCC seems to be aimed primarily at systems with large amounts of memory. Optimising for size gets less attention. Indeed when Eric tried using -Os, compiler bugs caused the program to crash. I have not tried using -Os recently.

Here's an idea; I don't know if it will work but if it does it's potentially a very elegant way of forcing the compiler to allocate a stack frame 'local' to the individual case clause rather than during the initialisation of the parent function containing the switch statement.

The only way to find out would be to modify all of at least one large case statement in this fashion, compile it then look at the disassembly to see how much stack space is being allocated.

In a similar way, I was wondering whether it would be possible to automate the generation of bbeval2.c and bbexec2.c from the original bbeval.c and bbexec.c. Mostly each case clause is simply copied into a separate function body. By putting some specific comment statements in the original source to mark key locations it may be possible to achieve this.

rtrussell commented 1 year ago

As I have remarked previously, you don't necessarily have to push your source to GitHub before building the Pico version.

The 'local' BBCSDL folder isn't on the Raspberry Pi that I use to build the Pico, nor on a network drive. But anyway it only makes the issue at hand even worse - you then wouldn't have even a few minutes to update your custom versions with whatever changes I had made, because you wouldn't even know I had made any changes! :-)

Optimising for size gets less attention. Indeed when Eric tried using -Os, compiler bugs caused the program to crash. I have not tried using -Os recently.

I'm pretty certain that's to reduce code size, and doesn't reduce stack usage at all.

The only way to find out would be to modify all of at least one large case statement in this fashion

I'm happy to do that, but I would need to know which case clauses need the treatment and which don't (many have no local stack requirements at all, over and above what's needed for the function preamble, I would expect). Indeed one advantage of this method is that one can choose any subset to be turned into function calls without affecting the rest.

Do you have any statistics, based on tests you might have done, on which are the greatest culprits in stack use?

Memotech-Bill commented 1 year ago

I have attempted to revise routine item () along the lines you suggest.

#define REDUCE_STACK_SIZE

VAR item (void)
{
    VAR v ;
    signed char al = nxt () ;
    esi++ ;
    switch (al)
        {

/************************* Parenthesised expression ****************************/

        case '(':
#if defined(REDUCE_STACK_SIZE) && defined(__GNUC__) && !defined(__llvm__)
        VAR __attribute__ ((noinline)) item_OBKT(void)
            {
#endif
            v = expr () ;
            if (nxt () != ')')
                error (27, NULL) ; // 'Missing )'
            esi++ ;
            return v ;
#if defined(REDUCE_STACK_SIZE) && defined(__GNUC__) && !defined(__llvm__)
            }
        return item_OBKT ();
#endif

/************************************* FN **************************************/

        case TFN:
#if defined(REDUCE_STACK_SIZE) && defined(__GNUC__) && !defined(__llvm__)
        VAR __attribute__ ((noinline)) item_TFN(void)
            {
#endif
            {
            int errcode = 0 ;
            jmp_buf savenv ;
            heapptr* savesp ;

            procfn (TFN) ;

            memcpy (&savenv, &env, sizeof(env)) ;
            savesp = esp;
            errcode = (setjmp (env)) ; // <0 : QUIT, >0 : error, 256: END
            if (errcode)
                {
                if ((errcode < 0) || (errcode > 255) || (errtrp == 0) ||
                    (onersp == NULL) || (onersp > savesp))
                    {
                    memcpy (&env, &savenv, sizeof(env)) ;
                    longjmp (env, errcode) ;
                    }
                esi = errtrp + (signed char *) zero ;
                esp = onersp ;
                }

            v = xeq () ;
            memcpy (&env, &savenv, sizeof(env)) ;
            }
            return v ;
#if defined(REDUCE_STACK_SIZE) && defined(__GNUC__) && !defined(__llvm__)
            }
        return item_TFN ();
#endif

Full version attached.

However I am obtaining build errors:

/home/pi/pico/PicoBB/BBCSDL/src/bbeval.c:997:3: error: a label can only be part of a statement and a declaration is not a statement
   VAR __attribute__ ((noinline)) item_OBKT(void)
   ^~~
/home/pi/pico/PicoBB/BBCSDL/src/bbeval.c:998:13: error: expected '=', ',', ';', 'asm' or '__attribute__' before '{' token
             {
             ^
/home/pi/pico/PicoBB/BBCSDL/src/bbeval.c:1007:10: warning: implicit declaration of function 'item_OBKT' [-Wimplicit-function-declaration]
   return item_OBKT ();
          ^~~~~~~~~
/home/pi/pico/PicoBB/BBCSDL/src/bbeval.c:1007:10: error: incompatible types when returning type 'int' but 'VAR' {aka 'union tagVAR'} was expected
   return item_OBKT ();
          ^~~~~~~~~~~~
/home/pi/pico/PicoBB/BBCSDL/src/bbeval.c:1014:3: error: a label can only be part of a statement and a declaration is not a statement
   VAR __attribute__ ((noinline)) item_TFN(void)
   ^~~
/home/pi/pico/PicoBB/BBCSDL/src/bbeval.c:1015:13: error: expected '=', ',', ';', 'asm' or '__attribute__' before '{' token
             {
             ^
/home/pi/pico/PicoBB/BBCSDL/src/bbeval.c:1045:10: warning: implicit declaration of function 'item_TFN'; did you mean 'itemi'? [-Wimplicit-function-declaration]
   return item_TFN ();
          ^~~~~~~~
          itemi
/home/pi/pico/PicoBB/BBCSDL/src/bbeval.c:1045:10: error: incompatible types when returning type 'int' but 'VAR' {aka 'union tagVAR'} was expected
   return item_TFN ();
          ^~~~~~~~~~~

Not sure what I am doing wrong.

Edit: Found the problem: a brace is needed after the case statement and before the nested function.
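
To illustrate the fix with a stripped-down, stand-alone example (made-up names, GCC only, nothing to do with the attached file): the brace gives the nested-function declaration a block to live in, so it no longer follows the case label directly.

#include <stdio.h>

int dispatch (int op, int a, int b)
    {
    switch (op)
        {
        case '+':
            {   // without this brace GCC reports "a label can only be part
                // of a statement and a declaration is not a statement"
            int __attribute__ ((noinline)) do_add (void)
                {
                return a + b ;      // the nested function can see a and b
                }
            return do_add () ;
            }
        default:
            return 0 ;
        }
    }

int main (void)
    {
    printf ("%d\n", dispatch ('+', 2, 3)) ;  // prints 5
    return 0 ;
    }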

bbeval.zip

Memotech-Bill commented 1 year ago

I'm happy to do that, but I would need to know which case clauses need the treatment and which don't (many have no local stack requirements at all, over and above what's needed for the function preamble, I would expect). Indeed one advantage of this method is that one can choose any subset to be turned into function calls without affecting the rest.

Do you have any statistics, based on tests you might have done, on which are the greatest culprits in stack use?

The routines which I have split into multiple functions are:

I don't have any statistics on the stack usage of individual case clauses. All that the disassembly gives is the total stack usage of the function. I suppose that the disassembly of my revised code will give the stack usage of the separated functions, which might give an indication of the stack used by the original case clauses.

Memotech-Bill commented 1 year ago

It looks as though this works. For the original code, the disassembly shows:

100137f4 <item>:
{
100137f4:   b5f0        push    {r4, r5, r6, r7, lr}
100137f6:   46ce        mov lr, r9
100137f8:   4647        mov r7, r8
100137fa:   b580        push    {r7, lr}
100137fc:   b0f5        sub sp, #468    ; 0x1d4
100137fe:   9007        str r0, [sp, #28]

So 468 bytes of stack reserved.

With the revised code (attached):

10014898 <item>:
{
10014898:   b5f0        push    {r4, r5, r6, r7, lr}
1001489a:   b08b        sub sp, #44 ; 0x2c
1001489c:   0004        movs    r4, r0

So only 44 bytes. There is then some extra stack used by the separate cases, the largest I have found is an additional 68 bytes:

100158f8 <item_TRAD.7314>:
            {
100158f8:   b530        push    {r4, r5, lr}
100158fa:   b091        sub sp, #68 ; 0x44
100158fc:   0005        movs    r5, r0
            xrad.i.n = 0x3F91DF46A2529D39L ;

I leave math() and xeq() to you.

bbeval.zip

rtrussell commented 1 year ago

I have attempted to revise routine item () along the lines you suggest.

Oh, it sounds as though we have both been working on it! Please try my version which has the top 20-or-so biggest culprits for stack usage in item() changed to function calls (for test purposes only REDUCE_STACK_SIZE is defined). Sorry you have wasted your time, I had no expectation you would work on it yourself.

I leave math() and xeq() to you.

I don't believe there is anything worth doing in math(), the most you could save would be 8 bytes of stack. And because math() doesn't call item() or expr() or anything similar the stack usage is not cumulative in a re-entrant function.

I'll have a look at xeq(), but I doubt that there are many statements that are likely to need treatment.

Memotech-Bill commented 1 year ago

pi@raspberrypi:~/pico/PicoBB/BBCSDL/src $ wget http://www.rtr.myzen.co.uk/bbeval.c
--2023-03-25 17:22:59--  http://www.rtr.myzen.co.uk/bbeval.c
Resolving www.rtr.myzen.co.uk (www.rtr.myzen.co.uk)... 212.23.8.80
Connecting to www.rtr.myzen.co.uk (www.rtr.myzen.co.uk)|212.23.8.80|:80... connected.
HTTP request sent, awaiting response... 404 Not Found
2023-03-25 17:22:59 ERROR 404: Not Found.

rtrussell commented 1 year ago

ERROR 404: Not Found.

Strange. I've re-uploaded it as a zip, perhaps that will work better.

Memotech-Bill commented 1 year ago

10012a34 <math>:

VAR math (VAR x, signed char op, VAR y)
{
10012a34:   b084        sub sp, #16
10012a36:   b5f0        push    {r4, r5, r6, r7, lr}
10012a38:   46ce        mov lr, r9
10012a3a:   4647        mov r7, r8
10012a3c:   b580        push    {r7, lr}
10012a3e:   b097        sub sp, #92 ; 0x5c
10012a40:   0007        movs    r7, r0
10012a42:   911f        str r1, [sp, #124]  ; 0x7c
10012a44:   9220        str r2, [sp, #128]  ; 0x80
10012a46:   9321        str r3, [sp, #132]  ; 0x84
10012a48:   ab23        add r3, sp, #140    ; 0x8c
10012a4a:   2600        movs    r6, #0
10012a4c:   579e        ldrsb   r6, [r3, r6]
    errno = 0 ;

math() reserves 92 bytes

1000e9b8 <xeq>:

// Execute the program:
VAR xeq (void)
{
1000e9b8:   b5f0        push    {r4, r5, r6, r7, lr}
1000e9ba:   46ce        mov lr, r9
1000e9bc:   4647        mov r7, r8
1000e9be:   b580        push    {r7, lr}
1000e9c0:   b0cb        sub sp, #300    ; 0x12c
1000e9c2:   900b        str r0, [sp, #44]   ; 0x2c
1000e9c4:   e141        b.n 1000ec4a <xeq+0x292>
    void *tmpesi ;
    while (1) // for each statement

And xeq() reserves 300 !

Memotech-Bill commented 1 year ago

Many errors of the same form as I had earlier:

   VAR __attribute__ ((noinline)) item_OBKT(void)
   ^~~
/home/pi/pico/PicoBB/BBCSDL/src/bbeval.c:998:13: error: expected '=', ',', ';', 'asm' or '__attribute__' before '{' token
             {
             ^
/home/pi/pico/PicoBB/BBCSDL/src/bbeval.c:1007:10: warning: implicit declaration of function 'item_OBKT' [-Wimplicit-function-declaration]
   return item_OBKT ();
          ^~~~~~~~~
/home/pi/pico/PicoBB/BBCSDL/src/bbeval.c:1007:10: error: incompatible types when returning type 'int' but 'VAR' {aka 'union tagVAR'} was expected
   return item_OBKT ();
          ^~~~~~~~~~~~
/home/pi/pico/PicoBB/BBCSDL/src/bbeval.c:1014:3: error: a label can only be part of a statement and a declaration is not a statement
   VAR __attribute__ ((noinline)) item_TFN(void)
   ^~~
/home/pi/pico/PicoBB/BBCSDL/src/bbeval.c:1015:13: error: expected '=', ',', ';', 'asm' or '__attribute__' before '{' token
             {
             ^
/home/pi/pico/PicoBB/BBCSDL/src/bbeval.c:1045:10: warning: implicit declaration of function 'item_TFN'; did you mean 'itemi'? [-Wimplicit-function-declaration]
   return item_TFN ();
          ^~~~~~~~
          itemi
/home/pi/pico/PicoBB/BBCSDL/src/bbeval.c:1045:10: error: incompatible types when returning type 'int' but 'VAR' {aka 'union tagVAR'} was expected
   return item_TFN ();
          ^~~~~~~~~~~

rtrussell commented 1 year ago

math() reserves 92 bytes

But it doesn't matter! All that matters is the stack usage of functions which can be called recursively, because they in turn call a function (most commonly item() or expr()) which can end up calling themselves, and so on. math() is a self-contained function that cannot end up calling itself, so it will just free the stack it used on return and that's it.
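
To put it another way, here is a toy illustration (invented names, nothing to do with the real interpreter): only the frames on the recursive chain are multiplied by the depth of recursion; a leaf function's frame, however large, is only ever paid once.

#include <stdio.h>

static int leaf (int x)             // stands in for math(): big frame, but not on the recursive chain
    {
    volatile int workspace[23] ;    // ~92 bytes of locals
    workspace[0] = x ;
    return workspace[0] + 1 ;       // the frame is released as soon as it returns
    }

static int eval (int depth)         // stands in for xeq()/expr(): its frame is paid once per level
    {
    if (depth == 0)
        return 0 ;
    return leaf (depth) + eval (depth - 1) ;
    }

int main (void)
    {
    printf ("%d\n", eval (10)) ;    // peak stack ~ 10 eval() frames plus at most one leaf() frame
    return 0 ;
    }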

And xeq() reserves 300 !

I haven't done anything with xeq() yet, it's in bbexec.c which I'm looking at now.

rtrussell commented 1 year ago

And xeq() reserves 300 !

I've uploaded bbexec.zip in case you want to try it. It compiles (tested with GCC 8.1 and 10.3) and runs, although I haven't yet done extensive testing.

I'm also unsure whether I've caught all the large stack-guzzling culprits, so I will be interested to know how it compares with the 300 bytes of the old version.

Performance-wise it is measurably slower than the full-fat stack, as one would expect, but not by so much that it should be of concern on the Pico. I would hope it's no slower than your approach.

This 'nested function' technique (albeit that it's GCC-only) does show promise as a relatively safe and elegant way of reducing the stack usage under control of a simple macro. If you're happy with it, it will hopefully eliminate the necessity to have custom versions of generic interpreter source files.

Memotech-Bill commented 1 year ago

Certainly an improvement (84 bytes):

100113cc <xeq>:
{
100113cc:   b5f0        push    {r4, r5, r6, r7, lr}
100113ce:   46ce        mov lr, r9
100113d0:   4647        mov r7, r8
100113d2:   b580        push    {r7, lr}
100113d4:   b095        sub sp, #84 ; 0x54
100113d6:   9003        str r0, [sp, #12]
VAR xeq (void)
100113d8:   ab1c        add r3, sp, #112    ; 0x70
100113da:   930f        str r3, [sp, #60]   ; 0x3c

However my version does better (28 bytes):

10011ecc <xeq>:
    xeq_default,    //              127
    };

// Execute the program:
VAR xeq (void)
    {
10011ecc:   b5f0        push    {r4, r5, r6, r7, lr}
10011ece:   b087        sub sp, #28
10011ed0:   9000        str r0, [sp, #0]
    bFlgChk = true;

Also, the case statement in your version is implemented as nested IF statements:

        switch (al)
10011410:   0018        movs    r0, r3
10011412:   302c        adds    r0, #44 ; 0x2c
10011414:   d101        bne.n   1001141a <xeq+0x4e>
10011416:   f000 fed4   bl  100121c2 <xeq+0xdf6>
1001141a:   0018        movs    r0, r3
1001141c:   302c        adds    r0, #44 ; 0x2c
1001141e:   dd24        ble.n   1001146a <xeq+0x9e>
10011420:   0018        movs    r0, r3
10011422:   3021        adds    r0, #33 ; 0x21
10011424:   d101        bne.n   1001142a <xeq+0x5e>
10011426:   f000 fd77   bl  10011f18 <xeq+0xb4c>
1001142a:   001a        movs    r2, r3
1001142c:   3221        adds    r2, #33 ; 0x21
1001142e:   dc00        bgt.n   10011432 <xeq+0x66>
10011430:   e18c        b.n 1001174c <xeq+0x380>
10011432:   001a        movs    r2, r3
10011434:   321b        adds    r2, #27
10011436:   dd00        ble.n   1001143a <xeq+0x6e>
10011438:   e1f1        b.n 1001181e <xeq+0x452>
1001143a:   001a        movs    r2, r3
1001143c:   321c        adds    r2, #28
1001143e:   db00        blt.n   10011442 <xeq+0x76>
10011440:   e36b        b.n 10011b1a <xeq+0x74e>
10011442:   001a        movs    r2, r3
10011444:   321f        adds    r2, #31
10011446:   d100        bne.n   1001144a <xeq+0x7e>
10011448:   e1ba        b.n 100117c0 <xeq+0x3f4>
1001144a:   001a        movs    r2, r3
1001144c:   321f        adds    r2, #31
1001144e:   da00        bge.n   10011452 <xeq+0x86>
10011450:   e328        b.n 10011aa4 <xeq+0x6d8>
10011452:   001a        movs    r2, r3
10011454:   321e        adds    r2, #30
10011456:   d101        bne.n   1001145c <xeq+0x90>
10011458:   f000 fed4   bl  10012204 <xeq+0xe38>
1001145c:   331d        adds    r3, #29
1001145e:   d001        beq.n   10011464 <xeq+0x98>
10011460:   f000 ff55   bl  1001230e <xeq+0xf42>

... plus more sections like the above.

Whereas the indirect function call in my version is much more efficient:

        xeq_list[(int)al_token + 0x80]();
10011f0a:   3080        adds    r0, #128    ; 0x80
10011f0c:   0080        lsls    r0, r0, #2
10011f0e:   4b45        ldr r3, [pc, #276]  ; (10012024 <xeq+0x158>)
10011f10:   58c3        ldr r3, [r0, r3]
10011f12:   4798        blx r3
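
In outline the table dispatch amounts to something like this stand-alone sketch (the handler names and token values are made up, and the real xeq_list is of course far larger):

#include <stdio.h>

typedef void (*handler_t) (void) ;

static void stmt_print (void) { puts ("PRINT") ; }  // hypothetical handlers
static void stmt_let   (void) { puts ("LET") ; }
static void stmt_other (void) { puts ("?") ; }

static handler_t handlers[256] ;                    // one entry per possible token byte

static void dispatch (signed char token)
    {
    handlers[(int) token + 0x80] () ;               // bias the signed token, then one indexed indirect call
    }

int main (void)
    {
    for (int i = 0 ; i < 256 ; i++)
        handlers[i] = stmt_other ;
    handlers[(int) (signed char) 0xF1 + 0x80] = stmt_print ;   // made-up token values
    handlers[(int) (signed char) 0xE9 + 0x80] = stmt_let ;
    dispatch ((signed char) 0xF1) ;                 // prints PRINT
    return 0 ;
    }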

rtrussell commented 1 year ago

Whereas the indirect function call in my version is much more efficient:

I have sufficient confidence in GCC to think it is highly unlikely that your version is "much more efficient". Indeed if I was a betting man (fortunately I'm not!) my money would be on the switch statement.

In my view it's particularly important not to 'second guess' the compiler when targeting multiple platforms and processors, when the code generator will know far more about how to optimise performance on a specific CPU than you or I would.

This is a random comment that I found at Stack Overflow: "Trust the compiler, it will do such micro optimisations much better than you". Even if a compiler doesn't produce the best possible code now, a future version may do better.

Anyway, what matters is whether I've got the stack usage down to a level where you'd be happy to use the standard source files in the Pico build, which is very important to me. It doesn't mean efforts to reduce the stack usage further shouldn't be made, but I would at least have the confidence of knowing that when I (or you) create a Pico build it's using the latest source.

rtrussell commented 1 year ago

Certainly an improvement (84 bytes):

I can't see where that much stack is going. I'd expect enough to save the non-volatile registers but not much else (the only case clauses which I haven't converted to function calls ought to be able to manage with registers rather than spilling variables to the stack).

I assume that if one case has for example an int a and another an int b (both scoped locally by being in braces) the compiler will use the same register or stack location for both - it would be pretty stupid if not.

I've updated bbexec.zip; this has the TSOUND case additionally converted to a function call, although I wouldn't expect that to make much if any difference.

Memotech-Bill commented 1 year ago

I have sufficient confidence in GCC to think it is highly unlikely that your version is "much more efficient". Indeed if I was a betting man (fortunately I'm not!) my money would be on the switch statement.

In my view it's particularly important not to 'second guess' the compiler when targeting multiple platforms and processors, when the code generator will know far more about how to optimise performance on a specific CPU than you or I would.

I am not 'second guessing' the compiler, I am looking at the disassembly of what the compiler has actually produced. That is a large mess of comparisons and branches. I think it is a loop unrolled version of a bisection search for the matching case value. That is less efficient than a direct indexed look up in a jump table, which is what my revised version is.

Changing the algorithm at the code level is a 'macro' optimisation, not a 'micro' one.

I can't see where that much stack is going. I'd expect enough to save the non-volatile registers but not much else (the only case clauses which I haven't converted to function calls ought to be able to manage with registers rather than spilling variables to the stack).

I assume that if one case has for example an int a and another an int b (both scoped locally by being in braces) the compiler will use the same register or stack location for both - it would be pretty stupid if not.

I currently believe that the compiler is exactly that stupid. Which is why I think it worthwhile to convert every case clause into a separate function, no matter how small.

I am currently working on my alternative approach of a program that will automatically convert the case statement into multiple functions.

rtrussell commented 1 year ago

That is less efficient than a direct indexed look up in a jump table, which is what my revised version is.

You can't possibly say that, with a modern processor. Branch prediction and the like can make naïve assessments of that kind highly misleading. I have no doubt that GCC, especially relatively new versions, will be doing an excellent job of optimising the switch statement.

As an assembly language programmer for something like 50 years, I know that the kind of hand-optimisation which I used to do in the old days (and I think I was pretty good at) just doesn't apply to modern CPUs, and I accept that a good compiler will do a better job than I ever could.

I am currently working on my alternative approach of a program that will automatically convert the case statement into multiple functions.

Please don't waste your time. The only acceptable solution as far as I'm concerned (and I'm sure you would feel the same in my position) is for the Pico build to use the same source files as are used for all the others. I have put a lot of effort into adding an option to reduce stack size, and even if it's not quite as efficient as yours in that regard I am sure it will be fine.

I would ask you please to agree to using my latest sources. I will then do a much more comprehensive set of tests than I have to date and commit them to the repository. Thanks.

Memotech-Bill commented 1 year ago

Very well. I accept your revisions.

If I may ask for one further conditional block in bbexec.c, as follows:

/******************************* TIME and TIME$ ********************************/

            case TTIMEL:
#ifdef PICO_RTC
                if ( *esi == '$' )
                    {
                    ++esi;
                    equals ();
                    VAR v = exprs ();
                    putims (v.s.p + zero);
                    }
                else
#endif
                {
                long long n ;
                equals () ;
                n = expri () ;
                putime (n) ;
                }
                break ;

If you could either post your final versions of bbeval.c and bbexec.c here, or let me know when you have updated your repository, so that I can revise my makefile accordingly.

rtrussell commented 1 year ago

Very well. I accept your revisions.

Thank you, that is much appreciated.

If I may ask for one further conditional block in bbexec.c, as follows:

Of course, only too happy to. Could we choose something a little less specific than PICO_RTC, perhaps SET_RTC? I can envisage other systems wanting this capability too.

If you could either post your final versions of bbeval.c and bbexec.c here, or let me know when you have updated your repository, so that I can revise my makefile accordingly.

Certainly. It may take a little while, because I want to do as comprehensive a set of tests as I can. I am very jittery about modifying such a key part of the interpreter.

Just for your interest, and perhaps to help you understand a little better where I am coming from, here's the equivalent code to xeq() from my original Z80 BBC BASIC of 1981:

    ADD A,A
    LD  C,A
    LD  B,0
    LD  HL,CMDTAB
    ADD HL,BC
    LD  A,(HL)      ;TABLE ENTRY
    INC HL
    LD  H,(HL)
    LD  L,A
    JP  (HL)        ;EXECUTE STATEMENT

And here is the equivalent code from BBC BASIC for Windows, 20 years later:

        movzx   ebx,al 
        jmp     [ebx*4+cmdtab]

So I don't need any persuading of the benefits of dispatching via a jump table! Indeed, when I was converting the code to C, ten years or so ago, my initial inclination was to do exactly the same. But because I was very concerned to get the best possible performance I did some research into whether a switch statement might be even better.

The great majority of the references I could find at the time said yes: trust the compiler. If a dispatch table was the best way, that's what it would use. I have never had cause to regret that decision.

rtrussell commented 1 year ago

If I may ask for one further conditional block in bbexec.c, as follows:

/******************************* TIME and TIME$ ********************************/

          case TTIMEL:
#ifdef PICO_RTC
                if ( *esi == '$' )
                    {
                    ++esi;
                    equals ();
                    VAR v = exprs ();
                    putims (v.s.p + zero);
                    }
                else
#endif

I wonder, do you realise that the string at v.s.p + zero is not NUL-terminated, or if so only by chance? To get a terminated string you would have to call fixs() (which puts a CR-terminated string in the 'string accumulator') or similar; alternatively, pass both the pointer v.s.p and the length v.s.l of the string to putims(), or indeed pass the entire structure. What shall I do?

Memotech-Bill commented 1 year ago

Of course, only too happy to. Could we choose something a little less specific than PICO_RTC, perhaps SET_RTC, because I can envisage other systems wanting this capability too.

I suggest HAVE_RTC instead of PICO_RTC.

I wonder, do you realise that the string at v.s.p + zero is not NUL-terminated, or if so only by chance? To get a terminated string you would have to call fixs() (which puts a CR-terminated string in the 'string accumulator') or similar, alternatively pass both the pointer v.s.p and the length v.s.l of the string to putims(), or indeed pass the entire structure. What shall I do?

Yes, I know this. However since the permitted formats for the time string are specified I don't need the NUL terminator to know the string length. And as per the manual, if the string does not correspond to the defined format, the results are undefined.

The parsing code is:

void putims (const char *psTime)
    {
    datetime_t  dt;
    rtc_get_datetime (&dt);
    const char *ps = psTime;
    if ( ps[3] == '.' ) ps += 4;                // skip an optional "Www." day-of-week prefix
    if ( ps[2] != ':' )                         // not a bare time, so parse the date part first
        {
        dt.day = 10 * ( ps[0] - '0' ) + ps[1] - '0';
        ps += 3;
        for (int i = 1; i <= 12; ++i)           // match the three-letter month name
            {
            if ( strncasecmp (ps, psMon[i], 3) == 0 )
                {
                dt.month = i;
                break;
                }
            }
        ps += 4;
        dt.year = 1000 * ( ps[0] - '0' ) + 100 * ( ps[1] - '0' ) + 10 * ( ps[2] - '0' ) + ps[3] - '0';
        // derive the day of the week from the date
        int iDay = dt.year + dt.year / 4 - dt.year / 100 + dt.year / 400 + iDMon[dt.month-1] + dt.day - 1;
        if (( dt.month < 3 ) && ( dt.year % 4 == 0 ) && (( dt.year % 100 != 0 ) || ( dt.year % 400 == 0 ))) --iDay;
        dt.dotw = iDay % 7;
        ps += 5;
        }
    if ( ps[2] == ':' )                         // time part: hh:mm with optional :ss
        {
        dt.hour = 10 * ( ps[0] - '0' ) + ps[1] - '0';
        ps += 3;
        dt.min = 10 * ( ps[0] - '0' ) + ps[1] - '0';
        if ( ps[2] == ':' )
            {
            ps += 3;
            dt.sec = 10 * ( ps[0] - '0' ) + ps[1] - '0';
            }
        else
            {
            dt.sec = 0;
            }
        }
    vldtim (&dt);                               // validate, then set the hardware RTC
    rtc_set_datetime (&dt);
    }

You can't possibly say that, with a modern processor. Branch prediction and the like can make naïve assessments of that kind highly misleading. I have no doubt that GCC, especially relatively new versions, will be doing an excellent job of optimising the switch statement.

For modern high-spec processors you are probably correct. However, the Pico has ARM M0+ cores. I don't believe they have branch prediction. Based upon the disassembly of the object code, the compiler has generated ~350 instructions to implement the switch statement and select which case to execute. That compares to just 5 for the jump table. The only way I can imagine the switch code being faster is if all 350 instructions are in cache (and they are not all consecutive) while the jump table is not.

rtrussell commented 1 year ago

I suggest HAVE_RTC instead of PICO_RTC.

OK, although since reading TIME$ (which also must 'have' a real-time clock) is not conditional on such a test, it's a rather inconsistent usage.

Yes, I know this. However since the permitted formats for the time string are specified I don't need the NUL terminator to know the string length.

The 'permitted' formats may be specified, but if somebody enters a shorter string than they should your code could read beyond the end, potentially reading garbage or even unallocated memory. Indeed if the length of the string is zero the pointer can legitimately be NULL and your code will crash with a NULL-pointer exception!

I propose that I add a call to fixs() and pass a CR-terminated string, that would retain compatibility with your existing parsing code.

Based upon the disassembly of the object code, the compiler has generated ~350 instructions to implement the switch statement and select which case to execute.

Which optimisation level were you using?

It makes no sense to me that GCC, which if -O2 is specified usually goes to such lengths to optimise switch statements (some such optimisations being quite amazing, look it up) and will in any case default to using a jump-table (again, look it up), would fail so spectacularly on the Cortex-M0+.

But if for some extraordinary reason it does, it's not for us to work around what is clearly a fault in the compiler or code generator. Report it by all means, if it's really doing that somebody responsible for the compiler needs to know about it, but it's not our problem.

Memotech-Bill commented 1 year ago

I suggest HAVE_RTC instead of PICO_RTC.

OK, although since reading TIME$ (which also must 'have' a real-time clock) is not conditional on such a test, it's a rather inconsistent usage.

In which case use CAN_SET_RTC.

I propose that I add a call to fixs() and pass a CR-terminated string, that would retain compatibility with your existing parsing code.

Instead, please pass the string length. It is easier for me to use that for checking.
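
A small stand-alone sketch of what I mean (a hypothetical helper, not the code I will commit): with the length available, a short or missing string can be rejected before anything is read beyond its end.

#include <stdio.h>

static int parse_hhmm (const char *ps, int len, int *hour, int *min)
    {
    if (ps == NULL || len < 5 || ps[2] != ':')      // too short or wrong shape: refuse it
        return 0 ;
    *hour = 10 * (ps[0] - '0') + ps[1] - '0' ;
    *min  = 10 * (ps[3] - '0') + ps[4] - '0' ;
    return 1 ;
    }

int main (void)
    {
    int h, m ;
    const char t[] = { '1', '7', ':', '4', '5' } ;  // not NUL-terminated, like a BASIC string
    if (parse_hhmm (t, 5, &h, &m))
        printf ("%02d:%02d\n", h, m) ;              // 17:45
    if (!parse_hhmm ("17", 2, &h, &m))
        puts ("rejected") ;
    return 0 ;
    }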

Which optimisation level were you using?

All the PicoBB builds use -O1. That is what Eric concluded was best. I believe that choice was based upon restricting stack utilisation.

rtrussell commented 1 year ago

All the PicoBB builds use -O1

Try -O2 with the new stack-reduction code. The noinline attribute should override the -O2 and help keep stack growth down.

Memotech-Bill commented 1 year ago

All the PicoBB builds use -O1

Try -O2 with the new stack-reduction code. The noinline attribute should override the -O2 and help keep stack growth down.

Still the same mess of compare and branch for the main switch statement in xeq(). It would take a lot of work to determine what all the differences are.

rtrussell commented 1 year ago

It would take a lot of work to determine what all the differences are.

In every other build I use -O2 for the core interpreter because it provides a worthwhile performance improvement, and especially on the Pico every little helps.

If the only reason for using -O1 was because of excessive stack growth (presumably through inlining), it might be worth tracing the critical recursive call chain and adding some noinline attributes, conditional on the new REDUCE_STACK_SIZE option.

This would probably be a fair amount of work, but if we could get the stack down to something comparable with what -O1 gives, the -O2 optimisations would almost certainly provide a worthwhile benefit, especially in those functions - the vast majority - which never get called recursively.
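
For example, it could be expressed as a conditional attribute applied to functions known to be on the recursive chain (just a sketch, and the macro name is invented):

#if defined(REDUCE_STACK_SIZE) && defined(__GNUC__)
#define KEEP_OUT_OF_LINE __attribute__ ((noinline))
#else
#define KEEP_OUT_OF_LINE
#endif

// applied to a function known to be on the recursive call chain
int KEEP_OUT_OF_LINE chain_helper (int x)
    {
    return x + 1 ;
    }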

I expect very few, if any, other BASIC interpreters bother with stack reduction in this way, so it would give BBC BASIC another string to its bow to offer improved compatibility with platforms with limited RAM.

Do you still have a way of measuring overall stack usage (e.g. by counting levels of recursion in a BASIC program before it blows up) that could be used to compare your existing version, the REDUCE_STACK_SIZE version compiled with -O1 and the same compiled with -O2? That would give us some statistics to judge the potential value of more work.

rtrussell commented 1 year ago

Seems I've discovered why you've been using -O1! Running speed.bbc:

Existing version from this repository: Combined average 335 MHz.
My new bbexec.c & bbeval.c with -O1: Combined average 327 MHz
My new bbexec.c & bbeval.c with -O2: Combined average 298 MHz
My new bbexec.c & bbeval.c with -O3: Combined average 317 MHz

Definite room for improvement in the Cortex-M0+ code generator I would say. ;-)

rtrussell commented 1 year ago

I've been doing a more detailed analysis of the stack usage issue, and it turns out that reducing the size of the stack frame reserved by xeq() or item() will not necessarily increase the maximum depth of recursion! It depends very much on the specific recursive BASIC code concerned.

For example the LET (assignment) statement uses quite a lot of CPU stack, so moving it into a subroutine may well reduce the size of the stack frame allocated by xeq(). But if, in the BASIC program, the recursion happens in an assignment statement, which is quite likely, the amount of stack used for each level of recursion may actually increase, not decrease!

This is because the stack used per level of recursion will include both the stack used by xeq() and the stack used by the assignment statement, and the latter has been increased by moving it into a subroutine!

It will always be beneficial to move into a subroutine statements which use a lot of stack but never call item() or expr() etc. and thus cannot be in the recursive 'chain'. But moving statements that do call item() etc. into a subroutine may actually make things worse, not better.

Memotech-Bill commented 1 year ago

This is because the stack used per level of recursion will include both the stack used by xeq() and the stack used by the assignment statement, and the latter has been increased by moving it into a subroutine!

I don't understand this.

With my version of the code, xeqTLET uses 112 bytes of stack. And the cut-down xeq uses 48 bytes, so a total of 160 bytes. That compares with 328 bytes used by your original xeq. The xeq* routine with the largest stack usage I could find is xeq_TSYS which uses 196 bytes of stack. Adding that used by the residual xeq gives a total of 244 bytes of stack, only 75% of your original.

Considering the case of recursive calls: once the individual cases have been moved into subroutines, those case routines will use a fixed amount of stack. But each recursion will use the case routine plus the residual base xeq routine. Hence the stack used by the residual routine should be made as small as possible.

The only cases which should be left inline in xeq are those small ones which result in no increase in the stack usage of the xeq routine itself, but would require a few bytes of stack for the subroutine call if moved out.

rtrussell commented 1 year ago

This is because the stack used per level of recursion will include both the stack used by xeq() and the stack used by the assignment statement, and the latter has been increased by moving it into a subroutine!

I don't understand this.

It's true, nevertheless.

It's easily demonstrated (although time consuming) by writing some test programs that are recursive in different ways - e.g. recursive in the 'end function' (=) statement, or recursive in an assignment, or recursive in a PRINT or whatever - and then experimenting with making each of those statements into a subroutine or not.

I did that yesterday using this program, which is recursive in the 'end function' statement:

      PRINT FN1(0)
      END

      DEF FN1(A%)
      PRINT A%
      = FN1(A% + 1)

Moving the ENDFN clause back to inline code in xeq() rather than a subroutine increased the maximum nesting depth!

There's no perfect answer to this. You could try to analyse lots of recursive programs to see what statement(s) they are most commonly recursive in, but it's not very scientific.

Memotech-Bill commented 1 year ago

ENDFN is something of a special case, in that it finishes with a return rather than continuing around the while loop. This meant that my bbexec2.c had to deal with that case separately. As a result the compiler has inlined the function anyway.

I would be interested in seeing the disassembly, with source code, of your two versions with and without ENDFN inlined.

The SDK does not include source code in the disassembly it produces. However you can produce a disassembly with source code using the command:

arm-none-eabi-objdump -Sd build/bbcbasic.elf >build/bbcbasic.dis

Alternatively, please will you post your current version of bbexec.c here so that I can do the experiment.

rtrussell commented 1 year ago

ENDFN is somewhat a special case, in that it finishes with a return

That's unimportant (my 'nested function' approach can handle a non-void function just as easily as a void one). What matters is that it calls expr() and is therefore in the recursive 'chain': its stack frame contributes to the 'stack usage per level of recursion'.

To prove the point, change the recursive statement to something other than ENDFN, for example an assignment:

      PRINT FN1(0)
      END

      DEF FN1(A%)
      PRINT A%
      A% = FN1(A% + 1)
      = A%

This is functionally equivalent to the previous example, but now inlining ENDFN won't reduce the overall stack usage as it did before; it will either increase it or make no difference. However inlining LET, which would have had a negative effect on the previous example, will now be beneficial!

Without knowing whether recursion is more common in ENDFN or an assignment (or something else) we can't determine which set of cases to make a subroutine and which to leave inline for optimal results. :-(

I would be interested in seeing the disassembly, with source code, of your two versions with and without ENDFN inlined.

I'm inclined to think that looking at a disassembly can be misleading; it's what led you to believe that reducing the size of the xeq() stack frame is always beneficial, when it isn't. I'd rather use the more direct and meaningful measure of the number of recursions, of a specific test program, before it runs out of stack.

Alternatively, please will you post your current version of bbexec.c here so that I can do the experiment.

There is no "current version"; I am actively modifying it to see what the trade-offs are. There's an unfortunate complication: an innocuous change to bbexec.c, which can't possibly have any direct effect on speed.bbc, is in practice having a very significant impact. I've seen this kind of thing before with 32-bit x86 (especially AMD CPUs, which fetch code in 32-byte chunks and are therefore exquisitely sensitive to code alignment).

I don't know whether a similar thing is happening with the ARM Cortex-M0+ but it means that a tiny change to the code which may be beneficial in respect of stack usage may have as much as a 20% impact on speed. This is very annoying, especially as the GCC code generator should know about things like alignment and ensure that this kind of side-effect doesn't occur.

Memotech-Bill commented 1 year ago

In the absence of real code I will have to use hypothetical values.

Suppose that xeq_ENDFN as a separate routine uses 16 bytes of stack. Suppose that inlining the code in the base xeq routine saves those 16 bytes at the expense of increasing the stack usage of xeq by 8 bytes. Then if you run your ENDFN test program you save a net 8 bytes per iteration, and can perform more iterations before running out of stack space, as your test shows.

However, suppose you now use your ENDFN optimised code to run any other test iteration (not using ENDFN). You now just get the extra 8 bytes of stack utilisation per iteration from the enlarged xeq routine, so can perform fewer iterations before running out of stack than with the 'unoptimised' code.

The same is true in general. Inlining one case may increase the number of iterations you can do with code that tests that case, but if it makes any increase in the stack utilised by the base xeq routine this will come at the expense of reducing the performance of all iterations that do not use that case.

Any 'optimisation' which improves one specific case at the expense of making all others worse does not seem like a good optimisation.

rtrussell commented 1 year ago

Any 'optimisation' which improves one specific case at the expense of making all others worse does not seem like a good optimisation.

I agree, but it's not really a choice of 'one specific case out of many'; the great majority of recursive programs are likely to recurse in only a tiny subset of the possible statements. To take a (silly) example, how many programs are likely to recurse in the STEP clause of a FOR statement such as:

FOR i% = 1 TO 1000 STEP FNrecurse

It's possible, but so unlikely that I don't think it's worthy of consideration or optimisation. I reckon nearly all recursive programs will recurse either in an assignment or in the end-function statement.

With that assumption I've written two test programs (they are the ones I've listed before), one which recurses in an assignment:

      PRINT FN1(0)
      END

      DEF FN1(A%)
      PRINT A%
      A% = FN1(A% + 1)
      = A%

and one which recurses in the end-function statement:

      PRINT FN1(0)
      END

      DEF FN1(A%)
      PRINT A%
      = FN1(A% + 1)

With only two likely possibilities you don't have to choose which one to optimise; you can aim for a compromise which gives good results for both. This is what I am currently working on.

The main advance which I think I've made is that the 'nested function' approach (unlike your table dispatch approach) provides the opportunity to put part of a case clause in a subroutine whilst keeping part of it inline. That gives me more flexibility, especially with the LET case.

rtrussell commented 1 year ago

I assume that if one case has for example an int a and another an int b (both scoped locally by being in braces) the compiler will use the same register or stack location for both - it would be pretty stupid if not.

I currently believe that the compiler is exactly that stupid.

I'm pleased to say it's not. Inlining a case clause doesn't result in any increase in the size of the stack-frame, so long as its local variable storage can 'overlay' that of another case. The compiler uses the same space for both.

This is good because it means I can inline the ENDFN clause without any negative impact on stack use, and indeed doing so increases the maximum nesting depth of the program which recurses in that statement, without having any effect at all on the one which recurses in an assignment.

So it's a win-win situation, or at least it would be if that change didn't also reduce the performance reported by speed.bbc by about 8%! There's no way that inlining that clause should legitimately have such an impact on speed, if anything I would expect removing the overhead of a function call and return to increase the speed.

It's very frustrating!

rtrussell commented 1 year ago

I've committed the latest versions of bbexec.c and bbeval.c to my repository. The performance I'm getting with these is as follows (figures in brackets are for your latest bbexec2.c and bbeval2.c):

So the stack usage is a little more than yours, but it's about 3% faster - which I think puts to bed the idea that the table-dispatch method is "much more efficient".

The really big advantage, for me, is that the code is identical to that which all the other editions use, except that when REDUCE_STACK_SIZE is specified some of it gets moved into nested functions to reduce stack usage. I was pleased to find that even goto still works, because it's legal to jump out of a nested function to a local label.
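
For anyone curious, the shape is essentially the one in the GCC manual's nested-function example (this is a stand-alone toy with invented names, not interpreter code): declare the target as a local label with __label__, and a goto inside the nested function unwinds straight back into the containing function.

#include <stdio.h>

static int find_negative (const int *a, int n)
    {
    __label__ found ;                       // local label, visible to the nested function
    int i ;
    void __attribute__ ((noinline)) check (int v)
        {
        if (v < 0)
            goto found ;                    // unwinds out of check() back into find_negative()
        }
    for (i = 0 ; i < n ; i++)
        check (a[i]) ;
    return -1 ;                             // no negative element
found:
    return i ;                              // index of the first negative element
    }

int main (void)
    {
    int data[] = { 3, 7, -2, 5 } ;
    printf ("%d\n", find_negative (data, 4)) ;  // prints 2
    return 0 ;
    }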

I think (hope) the issue can be closed now.

Memotech-Bill commented 1 year ago

Very well.

I have already committed the changes to the Make and CMake files so that, by default, they compile using your versions of bbeval.c and bbexec.c with REDUCE_STACK_SIZE defined.