DeqingSun / ch55xduino

An Arduino-like programming API for the CH55X
GNU Lesser General Public License v2.1
448 stars 86 forks source link

consider uint16_t for timer0 and millis #6

Closed nerdralph closed 4 years ago

nerdralph commented 4 years ago

https://github.com/DeqingSun/ch55xduino/blob/ch55xduino/ch55xduino/ch55x/cores/ch55xduino/main.c#L33

In my Arduino core for AVR, I used uint16_t to save on code size and interrupt time. 65.5 seconds for delay() would likely be fine for any programs using ch55xduino.

An even more efficient way of implementing it would be to dedicate one of the 4 register banks for ISRs (a common technique in 8051 programming), and dedicate two of the registers to the counter.

DeqingSun commented 4 years ago

You are right, Timer0Interrupt takes about 60\~100 instructions to run, which is about 1\~3% of CPU time. It will be better to use private register banks to reduce 50%\~70% CPU time for millis. Also, timer0_millis and timer0_overflow_count take exactly 8 bytes.

However, I don't think it is a good idea to reduce the variable size. An overflow of 65s will be a big problem for projects with logging or multitasking.

nerdralph commented 4 years ago

It's not the iRAM usage that is the problem, it's all the extra code for manipulating uint32_t 8 bits at a time. Here's a rough outline of a 16-bit counter ISR in 10 instructions and 16 bytes:

push PSW ; 2B
push A
setb RS1 ; 2B bank switch
inc R6   ; 1B
mov A, R6; 1B
jnz endt0; 2B
inc R7   ; 1B
endt0:
pop A
pop PSW
reti     ; 1B

You could extend it to 32 bits with 6 more instructions if you think 65s is too low. I think 65s is plenty even if you want to log data every hour, as the user could write a simple wait_minutes(uint8_t) that calls delay(60000) in a loop.

DeqingSun commented 4 years ago

Yes, I get your idea. I'm saying using a private bank will not use extra iRAM. It will take 8 bytes either way.

For the "BlinkWithoutDelay" type of code. The short overflow may be an issue. Users won't be able to do a timer longer than overflow time. Another concern is, when the user uses code like if (currentMillis - previousMillis >= interval) after the overflow, it will give the wrong result without explicit signed casting (Not tested on SDCC yet, tested on AVR-GCC). So a very long overflow is necessary for beginners.

DeqingSun commented 4 years ago

Did https://github.com/DeqingSun/ch55xduino/commit/2dd35c2c57d9ca72b054871c454b69c9076fe055 in https://github.com/DeqingSun/ch55xduino/tree/playground look good?

nerdralph commented 4 years ago

Did 2dd35c2 in https://github.com/DeqingSun/ch55xduino/tree/playground look good?

That's a big improvement. Does the compiler automatically push/pop the Accumulator? I don't think it does. After I posted the code above, I realized you can avoid using the Accumulator by using:

inc R6
cjne R6, #0, endt0
DeqingSun commented 4 years ago

Did https://github.com/DeqingSun/ch55xduino/commit/60b7a193f90f4d8d336fad265c375f47be5380d5 in https://github.com/DeqingSun/ch55xduino/tree/playground look good?

The code looks cleaner but it didn't save code space or run time. I am using inline assembly instead of assembly function, so prologue and epilogue are handled by the compiler. SDCC did do a good job in main.c.lst .

In "CH55X指令周期.PDF" (Instruction cycle), it mentioned:

CH55X assembly instruction overview:
The number of instruction cycles of non-jump instructions is the same as the number of instruction bytes;
Jump instructions containing MOVC/RET/CALL are usually several cycles longer than the number of bytes;
4 or 5 more cycles of MOVC instructions (5 more when the next instruction address is odd);
3 or 4 more cycles of RET/RETI instruction (4 more when the return address is odd);
2 or 3 more cycles for the remaining instructions (3 more when the target address is odd);
If no jump occurs in the conditional jump instruction, the number of cycles is the same as the number of instruction bytes;

I did a test

void loop() {
  // put your main code here, to run repeatedly:
  EA = 0;

  __asm__ ("    nop    \n");//adjust alignment

  __asm__ ("    mov r4,#0                                \n"
           "   clr P1.4                                   \n"
           "    inc r4                                   \n"
           "    cjne r4,#0,test01 \n"
           "test01:               \n"
           "    inc r4                                   \n"
           "    cjne r4,#0,test02 \n"
           "test02:               \n"
           "    inc r4                                   \n"
           "    cjne r4,#0,test03 \n"
           "test03:               \n"
           "    inc r4                                   \n"
           "    cjne r4,#0,test04 \n"
           "test04:               \n"
           "    setb P1.4                                   \n"
          );

  __asm__ ("    mov r4,#0                                \n"
           "   clr P1.4                                   \n"
           "    inc r4                                   \n"
           "    mov a, r4                                 \n"
           "    jnz test11      \n"
           "test11:               \n"
           "    inc r4                                   \n"
           "    mov a, r4                                 \n"
           "    jnz test12      \n"
           "test12:               \n"
           "    inc r4                                   \n"
           "    mov a, r4                                 \n"
           "    jnz test13      \n"
           "test13:               \n"
           "    inc r4                                   \n"
           "    mov a, r4                                 \n"
           "    jnz test14      \n"
           "test14:               \n"
           "    setb P1.4                                   \n"
          );
  EA = 1;
  delay(1);
}

So each inc and branch takes 6 or 7 instruction cycles, depending on alignment. cjne is 3 bytes, move and jnz are 3 bytes, too. They are the same.

nerdralph commented 4 years ago

Thanks for details about instruction timing. I hadn't seen them before, but had found from testing that jumps usually take 3 extra cycles. I hadn't tested alignment though. Does the compiler generate a prologue and epilogue for the ISR that does a push A/pop A, even though the newer version of the asm code doesn't clobber A?

DeqingSun commented 4 years ago

Yes.

                                    739 ;------------------------------------------------------------
                                    740 ;Allocation info for local variables in function 'Timer0Interrupt'
                                    741 ;------------------------------------------------------------
                                    742 ;   /Users/sundeqing/Library/Arduino15/packages/CH55xDuino/hardware/mcs51/0.0.3/cores/ch55xduino/main.c:36: void Timer0Interrupt(void) __interrupt (INT_NO_TMR0) __using(1) //using register bank 1
                                    743 ;   -----------------------------------------
                                    744 ;    function Timer0Interrupt
                                    745 ;   -----------------------------------------
      00003F                        746 _Timer0Interrupt:
                           00000F   747     ar7 = 0x0f
                           00000E   748     ar6 = 0x0e
                           00000D   749     ar5 = 0x0d
                           00000C   750     ar4 = 0x0c
                           00000B   751     ar3 = 0x0b
                           00000A   752     ar2 = 0x0a
                           000009   753     ar1 = 0x09
                           000008   754     ar0 = 0x08
      00003F C0 E0            [24]  755     push    acc
      000041 C0 D0            [24]  756     push    psw
      000043 75 D0 08         [24]  757     mov psw,#0x08
                                    758 ;   /Users/sundeqing/Library/Arduino15/packages/CH55xDuino/hardware/mcs51/0.0.3/cores/ch55xduino/main.c:64: );
                                    759 ;Increase   timer0_overflow_count on R4~R7     
      000046 0C               [12]  760     inc r4                                   
      000047 BC 00 09         [24]  761     cjne    r4,#0,incTimer0_overflow_countOver$ 
      00004A 0D               [12]  762     inc r5                                   
      00004B BD 00 05         [24]  763     cjne    r5,#0,incTimer0_overflow_countOver$ 
      00004E 0E               [12]  764     inc r6                                   
      00004F BE 00 01         [24]  765     cjne    r6,#0,incTimer0_overflow_countOver$ 
      000052 0F               [12]  766     inc r7                                   
      000053                        767     incTimer0_overflow_countOver$:
                                    768 ;Has    timer0_overflow_count inc by 8?         
      000053 74 07            [12]  769     mov a, #7                                
      000055 5C               [12]  770     anl a, r4                                
      000056 70 0D            [24]  771     jnz incTimer0_millisOver$                
                                    772 ;Increase   timer0_millis on R0~R3             
      000058 08               [12]  773     inc r0                                   
      000059 B8 00 09         [24]  774     cjne    r0,#0,incTimer0_millisOver$         
      00005C 09               [12]  775     inc r1                                   
      00005D B9 00 05         [24]  776     cjne    r1,#0,incTimer0_millisOver$         
      000060 0A               [12]  777     inc r2                                   
      000061 BA 00 01         [24]  778     cjne    r2,#0,incTimer0_millisOver$         
      000064 0B               [12]  779     inc r3                                   
      000065                        780     incTimer0_millisOver$:
                                    781 ;   /Users/sundeqing/Library/Arduino15/packages/CH55xDuino/hardware/mcs51/0.0.3/cores/ch55xduino/main.c:65: }
      000065 D0 D0            [24]  782     pop psw
      000067 D0 E0            [24]  783     pop acc
      000069 32               [24]  784     reti
                                    785 ;   eliminated unneeded push/pop dpl
                                    786 ;   eliminated unneeded push/pop dph
                                    787 ;   eliminated unneeded push/pop b
DeqingSun commented 4 years ago

Fixed in release 0.0.4

nerdralph commented 4 years ago

If you make the ISR naked, the compiler won't include the superfluous push acc/pop acc, saving 4 bytes and 4 cycles. And if you set/clear the bank1 bit in psw, you'll save the 3-byte mov instruction. You'd need to add a nop to maintain the word alignment of the cjne instructions.

DeqingSun commented 4 years ago

@nerdralph The interrupt handler does need to convert timer0_overflow_count to timer0_millis. So acc is needed for every interrupt call. Also, using SETB and CLR to change RS0 and RS1 seems to take 4 bytes.

nerdralph commented 4 years ago

I didn't notice the accumulator use. I think you might be able to remove that, and just count overflows. The adjustment from overflows to micros/millis could be done in the functions themselves. You only need one SETB call to set RS0, and a CLR to clear it at the end. You don't need to touch RS1, as ch55xduino only uses banks 0 & 1.

DeqingSun commented 4 years ago

If we move adjustment to functions, the overflow time will be significantly decreased. I would be more comfortable keeping it similar to the original Arduino algorithm.

Timer 0 interrupt has a pretty high priority. If the user uses RS1, there will be a problem if we don't touch it.

nerdralph commented 4 years ago

I don't think you understand what I mean about only counting overflows in the ISR. The original Arduino algorithm is poorly written, as it makes the ISR more complex in order to simplify the implementation of millis()/micros(). I'm currently writing a millis timer replacement for the AVR core that uses only a timer overflow counter, and doesn't disable interrupts in the millis and micros functions. If you understand AVR asm, it should be easy for you to copy the algorithm for ch55xduino. If not, I'll port it to 8051 assembler after I finish the AVR version.

As for the register banks, if the sketch code can use custom banks, you should document that or else there could be a conflict with bank1 between the timer ISR and the sketch code.

nerdralph commented 4 years ago

You can copy my modified version of delay to reduce the size of ch55xduino. No precision is lost by using 16 bits for micros since it is only incremented by 1000 (10 bits) at a time. https://github.com/nerdralph/ArduinoShrink/blob/master/src/wiring.c#L17

DeqingSun commented 4 years ago

I get the idea of using a smaller data type for the delay. How would you suggest doing the millis function? If we only keep timer0_overflow_count, should we do a 3-bit shift in millis? We will save 7 clockes every 1/8 ms, but there will be 50 clocks more for millis to do the shift. Depending on how frequently millis is called, is it worthwhile?

nerdralph commented 4 years ago

I get the idea of using a smaller data type for the delay. How would you suggest doing the millis function? If we only keep timer0_overflow_count, should we do a 3-bit shift in millis? We will save 7 clockes every 1/8 ms, but there will be 50 clocks more for millis to do the shift. Depending on how frequently millis is called, is it worthwhile?

Saving time in an ISR is more important than saving time in user code. In embedded systems interrupt response time is more important than non-interrupt code. You can also significantly improve the millis() and micros() functions from the Arduino versions by leaving interrupts enabled. https://github.com/nerdralph/ArduinoShrink/blob/master/src/micros.S

For the correction from overflows to millis, multiplying overflows by 6 and then divide by 256 will give the same accuracy as using the Arduino version. This assumes 16Mhz, so for other speeds the math needs to be adjusted.

DeqingSun commented 4 years ago

@nerdralph I made https://github.com/DeqingSun/ch55xduino/commit/0036b1897f4cc1a3a45d4bca96df720b6f4fd2b2 and https://github.com/DeqingSun/ch55xduino/commit/07bcda6436eaf2c64cee613586f6d2969233046e . Not in main branch yet.

So in order to get 4 bytes result in millis, timer0_overflow_count is now 5 bytes. millis() function shift the 5 byte timer0_overflow_count by 3 bits to get millis. Since 51 is not efficient at shifting. multiply by 32 is used instead.

How do you think the 2 commits?

nerdralph commented 4 years ago

@nerdralph I made 0036b18 and 07bcda6 . Not in main branch yet.

So in order to get 4 bytes result in millis, timer0_overflow_count is now 5 bytes. millis() function shift the 5 byte timer0_overflow_count by 3 bits to get millis. Since 51 is not efficient at shifting. multiply by 32 is used instead.

How do you think the 2 commits?

The 5 bytes in the AVR version should not be needed for the 8051. In the AVR version of the timer overflow ISR does not stop incrementing at 4 (or even 5) bytes. It will increment the 5th byte after 2^32 overflows, which will take 49.71 days to happen. It will clobber the 6th byte after 34.8 years, so my ArduinoShrink library assumes the AVR will not run continuously without reset for longer than that.

From my first skim of the code, I have a couple ideas: 1) You don't need to disable interrupts when reading the overflow count. You can read the least significant byte twice, and if it has changed, go back and read the 4 bytes again. https://github.com/nerdralph/ArduinoShrink/blob/master/src/millis.S#L23 3) You might be able to shift efficiently like this:

mov count, #3
loop:
clr C
mov A, R0
rlc
mov R0, A
mov A, R1
rlc
mov R1, A
mov A, R2
rlc
mov R2, A
mov A, R3
rlc
mov R3, A
djnz count, loop
nerdralph commented 4 years ago

I thought of a faster way to increment the t0 overflow count in the ISR:

mov r7, #0
setb C
mov A, r0
addc A, r7
mov r0, A
mov A, r1
addc A, r7
mov r1, A
mov A, r2
addc A, r7
mov r2, A
mov A, r3
addc A, r7
mov r3, A

And here's another version optimized for size (with the counter in r7:r2 (but you'll only need to read r5:r2 for the uint32 count):

mov r0, #0
mov r1, #ar2
setb C
inct0:
mov A, @r1
addc A, r0
mov @r1, A
jz inct0
DeqingSun commented 4 years ago

@nerdralph timer0_overflow_count increase 8 times per second. so the 4-byte variable will overflow after 6 days. If we don't use 5 bytes for it, millis will overflow at 2^29, which makes the overflow handling much more complicated. AVR can make timer trigger every 1 millisecond so 4 bytes are enough, but CH55x can not. I think one extra byte is worthy.

Yes, it is possible to do repeated reading to avoid disabling interrupt. But is it necessary to avoid that for only 10 clocks?

using multiply to do shifting takes about 40 bytes of code for 5 bytes, probably 40 clocks. But rlc takes 13 clocks each bit for 4 bytes. It is not faster.

using addc instead of branching is faster in the worse case, but slow is the best cases. Since 255 of 256 times we only need to add one byte, branching is better I think.

nerdralph commented 4 years ago

@nerdralph timer0_overflow_count increase 8 times per second. so the 4-byte variable will overflow after 6 days. If we don't use 5 bytes for it, millis will overflow at 2^29, which makes the overflow handling much more complicated. AVR can make timer trigger every 1 millisecond so 4 bytes are enough, but CH55x can not. I think one extra byte is worthy.

OK, I see now that the only prescaler options for T0 are /12 (which you are using), /4, or /1(no prescale). If you think overflow in 6 days is going to cause problems vs 50 days for the Arduino AVR core, then use the 5th byte.

Yes, it is possible to do repeated reading to avoid disabling interrupt. But is it necessary to avoid that for only 10 clocks?

Why wouldn't you want to improve interrupt latency when it comes at virtually no cost? Checking the lowest byte and reloading takes about the same amount of cycles and code as disabling and re-enabling interrupts.

using multiply to do shifting takes about 40 bytes of code for 5 bytes, probably 40 clocks. But rlc takes 13 clocks each bit for 4 bytes. It is not faster.

Inside interrupts, speed & determinism is most important. Outside interrupts code size is usually more important than speed.

using addc instead of branching is faster in the worse case, but slow is the best cases. Since 255 of 256 times we only need to add one byte, branching is better I think.

The main difference is mov/addc/mov is deterministic - it always takes the same number of cycles. It's not a big difference, so I can see why you may want to keep the inc/cjne version.

DeqingSun commented 4 years ago

Why wouldn't you want to improve interrupt latency when it comes at virtually no cost? Checking the lowest byte and reloading takes about the same amount of cycles and code as disabling and re-enabling interrupts.

Checking the lowest byte only is not going to work well. There are other interrupts such as USB one. It may take more than 32ms, although not likely but it will be a bug. And such a bug will be extremely difficult to debug. Checking the lowest 3 bytes will be much safer.

Inside interrupts, speed & determinism is most important. Outside interrupts code size is usually more important than speed.

The main difference is mov/addc/mov is deterministic - it always takes the same number of cycles. It's not a big difference, so I can see why you may want to keep the inc/cjne version.

Inside interrupts I prefer speed. So far I don't see how Arduino style code can be benefited from deterministic. On the code side, I think there is still a lot of other stuff to optimize, such as linking. There is still a lot of unnecessary functions linked to Arduino sketch.

nerdralph commented 4 years ago

Why wouldn't you want to improve interrupt latency when it comes at virtually no cost? Checking the lowest byte and reloading takes about the same amount of cycles and code as disabling and re-enabling interrupts.

Checking the lowest byte only is not going to work well. There are other interrupts such as USB one. It may take more than 32ms, although not likely but it will be a bug. And such a bug will be extremely difficult to debug. Checking the lowest 3 bytes will be much safer.

Now you don't seem to know what you are talking about. Any interrupt taking 32ms is a bug. If I had an embedded programmer working for me that wrote an ISR that took even 1ms on a MCU running at 24Mhz, he'd be fired. Most of the ISR code I write takes less than 100 cycles.

Do realize that any ISR taking more than 125uS will stop the millis counter from working? If you don't understand realtime embedded software development then I don't see much point in trying to help you with ch55xduino.

DeqingSun commented 4 years ago

For a professional programmer, a 32ms is a bug. I never wrote such a long interrupt. But this is an Arduino board support with attachInterrupt. You never know what the user will do. Comparing more than 1 byte eliminates the possibility. Or we just add another flag using 1 bit for an interrupt to inform the main code the value is updated.

No, long ISR will not stop millis counter from working. INT_NO_TMR0 has priority only lower to INT_NO_INT0 interrupt. Interrupt nesting will occur.

nerdralph commented 4 years ago

No, long ISR will not stop millis counter from working. INT_NO_TMR0 has priority only lower to INT_NO_INT0 interrupt. Interrupt nesting will occur.

You are proving my suspicion that you don't understand how 8051 interrupts work. To have T0 interrupt all other ISRs, you'd have to make it a high priority interrupt by setting PT0 bit in the IP register. I'm pretty sure you're not doing that. Github doesn't support searching in forks, so can't quickly search all your repository for PT0, so there's a slight chance I missed it.

I don't have the time to be teaching you this stuff, especially when you think you already know it all. I may check back on the project in a couple months to see if you have progressed.