Closed nerdralph closed 4 years ago
You are right, Timer0Interrupt takes about 60\~100 instructions to run, which is about 1\~3% of CPU time. It will be better to use private register banks to reduce 50%\~70% CPU time for millis. Also, timer0_millis and timer0_overflow_count take exactly 8 bytes.
However, I don't think it is a good idea to reduce the variable size. An overflow of 65s will be a big problem for projects with logging or multitasking.
It's not the iRAM usage that is the problem, it's all the extra code for manipulating uint32_t 8 bits at a time. Here's a rough outline of a 16-bit counter ISR in 10 instructions and 16 bytes:
push PSW ; 2B
push A
setb RS1 ; 2B bank switch
inc R6 ; 1B
mov A, R6; 1B
jnz endt0; 2B
inc R7 ; 1B
endt0:
pop A
pop PSW
reti ; 1B
You could extend it to 32 bits with 6 more instructions if you think 65s is too low. I think 65s is plenty even if you want to log data every hour, as the user could write a simple wait_minutes(uint8_t) that calls delay(60000) in a loop.
Yes, I get your idea. I'm saying using a private bank will not use extra iRAM. It will take 8 bytes either way.
For the "BlinkWithoutDelay" type of code. The short overflow may be an issue. Users won't be able to do a timer longer than overflow time. Another concern is, when the user uses code like if (currentMillis - previousMillis >= interval)
after the overflow, it will give the wrong result without explicit signed casting (Not tested on SDCC yet, tested on AVR-GCC). So a very long overflow is necessary for beginners.
Did 2dd35c2 in https://github.com/DeqingSun/ch55xduino/tree/playground look good?
That's a big improvement. Does the compiler automatically push/pop the Accumulator? I don't think it does. After I posted the code above, I realized you can avoid using the Accumulator by using:
inc R6
cjne R6, #0, endt0
Did https://github.com/DeqingSun/ch55xduino/commit/60b7a193f90f4d8d336fad265c375f47be5380d5 in https://github.com/DeqingSun/ch55xduino/tree/playground look good?
The code looks cleaner but it didn't save code space or run time. I am using inline assembly instead of assembly function, so prologue and epilogue are handled by the compiler. SDCC did do a good job in main.c.lst .
In "CH55X指令周期.PDF" (Instruction cycle), it mentioned:
CH55X assembly instruction overview:
The number of instruction cycles of non-jump instructions is the same as the number of instruction bytes;
Jump instructions containing MOVC/RET/CALL are usually several cycles longer than the number of bytes;
4 or 5 more cycles of MOVC instructions (5 more when the next instruction address is odd);
3 or 4 more cycles of RET/RETI instruction (4 more when the return address is odd);
2 or 3 more cycles for the remaining instructions (3 more when the target address is odd);
If no jump occurs in the conditional jump instruction, the number of cycles is the same as the number of instruction bytes;
I did a test
void loop() {
// put your main code here, to run repeatedly:
EA = 0;
__asm__ (" nop \n");//adjust alignment
__asm__ (" mov r4,#0 \n"
" clr P1.4 \n"
" inc r4 \n"
" cjne r4,#0,test01 \n"
"test01: \n"
" inc r4 \n"
" cjne r4,#0,test02 \n"
"test02: \n"
" inc r4 \n"
" cjne r4,#0,test03 \n"
"test03: \n"
" inc r4 \n"
" cjne r4,#0,test04 \n"
"test04: \n"
" setb P1.4 \n"
);
__asm__ (" mov r4,#0 \n"
" clr P1.4 \n"
" inc r4 \n"
" mov a, r4 \n"
" jnz test11 \n"
"test11: \n"
" inc r4 \n"
" mov a, r4 \n"
" jnz test12 \n"
"test12: \n"
" inc r4 \n"
" mov a, r4 \n"
" jnz test13 \n"
"test13: \n"
" inc r4 \n"
" mov a, r4 \n"
" jnz test14 \n"
"test14: \n"
" setb P1.4 \n"
);
EA = 1;
delay(1);
}
So each inc and branch takes 6 or 7 instruction cycles, depending on alignment. cjne is 3 bytes, move and jnz are 3 bytes, too. They are the same.
Thanks for details about instruction timing. I hadn't seen them before, but had found from testing that jumps usually take 3 extra cycles. I hadn't tested alignment though. Does the compiler generate a prologue and epilogue for the ISR that does a push A/pop A, even though the newer version of the asm code doesn't clobber A?
Yes.
739 ;------------------------------------------------------------
740 ;Allocation info for local variables in function 'Timer0Interrupt'
741 ;------------------------------------------------------------
742 ; /Users/sundeqing/Library/Arduino15/packages/CH55xDuino/hardware/mcs51/0.0.3/cores/ch55xduino/main.c:36: void Timer0Interrupt(void) __interrupt (INT_NO_TMR0) __using(1) //using register bank 1
743 ; -----------------------------------------
744 ; function Timer0Interrupt
745 ; -----------------------------------------
00003F 746 _Timer0Interrupt:
00000F 747 ar7 = 0x0f
00000E 748 ar6 = 0x0e
00000D 749 ar5 = 0x0d
00000C 750 ar4 = 0x0c
00000B 751 ar3 = 0x0b
00000A 752 ar2 = 0x0a
000009 753 ar1 = 0x09
000008 754 ar0 = 0x08
00003F C0 E0 [24] 755 push acc
000041 C0 D0 [24] 756 push psw
000043 75 D0 08 [24] 757 mov psw,#0x08
758 ; /Users/sundeqing/Library/Arduino15/packages/CH55xDuino/hardware/mcs51/0.0.3/cores/ch55xduino/main.c:64: );
759 ;Increase timer0_overflow_count on R4~R7
000046 0C [12] 760 inc r4
000047 BC 00 09 [24] 761 cjne r4,#0,incTimer0_overflow_countOver$
00004A 0D [12] 762 inc r5
00004B BD 00 05 [24] 763 cjne r5,#0,incTimer0_overflow_countOver$
00004E 0E [12] 764 inc r6
00004F BE 00 01 [24] 765 cjne r6,#0,incTimer0_overflow_countOver$
000052 0F [12] 766 inc r7
000053 767 incTimer0_overflow_countOver$:
768 ;Has timer0_overflow_count inc by 8?
000053 74 07 [12] 769 mov a, #7
000055 5C [12] 770 anl a, r4
000056 70 0D [24] 771 jnz incTimer0_millisOver$
772 ;Increase timer0_millis on R0~R3
000058 08 [12] 773 inc r0
000059 B8 00 09 [24] 774 cjne r0,#0,incTimer0_millisOver$
00005C 09 [12] 775 inc r1
00005D B9 00 05 [24] 776 cjne r1,#0,incTimer0_millisOver$
000060 0A [12] 777 inc r2
000061 BA 00 01 [24] 778 cjne r2,#0,incTimer0_millisOver$
000064 0B [12] 779 inc r3
000065 780 incTimer0_millisOver$:
781 ; /Users/sundeqing/Library/Arduino15/packages/CH55xDuino/hardware/mcs51/0.0.3/cores/ch55xduino/main.c:65: }
000065 D0 D0 [24] 782 pop psw
000067 D0 E0 [24] 783 pop acc
000069 32 [24] 784 reti
785 ; eliminated unneeded push/pop dpl
786 ; eliminated unneeded push/pop dph
787 ; eliminated unneeded push/pop b
Fixed in release 0.0.4
If you make the ISR naked, the compiler won't include the superfluous push acc/pop acc, saving 4 bytes and 4 cycles. And if you set/clear the bank1 bit in psw, you'll save the 3-byte mov instruction. You'd need to add a nop to maintain the word alignment of the cjne instructions.
@nerdralph The interrupt handler does need to convert timer0_overflow_count to timer0_millis. So acc is needed for every interrupt call. Also, using SETB and CLR to change RS0 and RS1 seems to take 4 bytes.
I didn't notice the accumulator use. I think you might be able to remove that, and just count overflows. The adjustment from overflows to micros/millis could be done in the functions themselves. You only need one SETB call to set RS0, and a CLR to clear it at the end. You don't need to touch RS1, as ch55xduino only uses banks 0 & 1.
If we move adjustment to functions, the overflow time will be significantly decreased. I would be more comfortable keeping it similar to the original Arduino algorithm.
Timer 0 interrupt has a pretty high priority. If the user uses RS1, there will be a problem if we don't touch it.
I don't think you understand what I mean about only counting overflows in the ISR. The original Arduino algorithm is poorly written, as it makes the ISR more complex in order to simplify the implementation of millis()/micros(). I'm currently writing a millis timer replacement for the AVR core that uses only a timer overflow counter, and doesn't disable interrupts in the millis and micros functions. If you understand AVR asm, it should be easy for you to copy the algorithm for ch55xduino. If not, I'll port it to 8051 assembler after I finish the AVR version.
As for the register banks, if the sketch code can use custom banks, you should document that or else there could be a conflict with bank1 between the timer ISR and the sketch code.
You can copy my modified version of delay to reduce the size of ch55xduino. No precision is lost by using 16 bits for micros since it is only incremented by 1000 (10 bits) at a time. https://github.com/nerdralph/ArduinoShrink/blob/master/src/wiring.c#L17
I get the idea of using a smaller data type for the delay. How would you suggest doing the millis function? If we only keep timer0_overflow_count, should we do a 3-bit shift in millis? We will save 7 clockes every 1/8 ms, but there will be 50 clocks more for millis to do the shift. Depending on how frequently millis is called, is it worthwhile?
I get the idea of using a smaller data type for the delay. How would you suggest doing the millis function? If we only keep timer0_overflow_count, should we do a 3-bit shift in millis? We will save 7 clockes every 1/8 ms, but there will be 50 clocks more for millis to do the shift. Depending on how frequently millis is called, is it worthwhile?
Saving time in an ISR is more important than saving time in user code. In embedded systems interrupt response time is more important than non-interrupt code. You can also significantly improve the millis() and micros() functions from the Arduino versions by leaving interrupts enabled. https://github.com/nerdralph/ArduinoShrink/blob/master/src/micros.S
For the correction from overflows to millis, multiplying overflows by 6 and then divide by 256 will give the same accuracy as using the Arduino version. This assumes 16Mhz, so for other speeds the math needs to be adjusted.
@nerdralph I made https://github.com/DeqingSun/ch55xduino/commit/0036b1897f4cc1a3a45d4bca96df720b6f4fd2b2 and https://github.com/DeqingSun/ch55xduino/commit/07bcda6436eaf2c64cee613586f6d2969233046e . Not in main branch yet.
So in order to get 4 bytes result in millis, timer0_overflow_count is now 5 bytes. millis() function shift the 5 byte timer0_overflow_count by 3 bits to get millis. Since 51 is not efficient at shifting. multiply by 32 is used instead.
How do you think the 2 commits?
@nerdralph I made 0036b18 and 07bcda6 . Not in main branch yet.
So in order to get 4 bytes result in millis, timer0_overflow_count is now 5 bytes. millis() function shift the 5 byte timer0_overflow_count by 3 bits to get millis. Since 51 is not efficient at shifting. multiply by 32 is used instead.
How do you think the 2 commits?
The 5 bytes in the AVR version should not be needed for the 8051. In the AVR version of the timer overflow ISR does not stop incrementing at 4 (or even 5) bytes. It will increment the 5th byte after 2^32 overflows, which will take 49.71 days to happen. It will clobber the 6th byte after 34.8 years, so my ArduinoShrink library assumes the AVR will not run continuously without reset for longer than that.
From my first skim of the code, I have a couple ideas: 1) You don't need to disable interrupts when reading the overflow count. You can read the least significant byte twice, and if it has changed, go back and read the 4 bytes again. https://github.com/nerdralph/ArduinoShrink/blob/master/src/millis.S#L23 3) You might be able to shift efficiently like this:
mov count, #3
loop:
clr C
mov A, R0
rlc
mov R0, A
mov A, R1
rlc
mov R1, A
mov A, R2
rlc
mov R2, A
mov A, R3
rlc
mov R3, A
djnz count, loop
I thought of a faster way to increment the t0 overflow count in the ISR:
mov r7, #0
setb C
mov A, r0
addc A, r7
mov r0, A
mov A, r1
addc A, r7
mov r1, A
mov A, r2
addc A, r7
mov r2, A
mov A, r3
addc A, r7
mov r3, A
And here's another version optimized for size (with the counter in r7:r2 (but you'll only need to read r5:r2 for the uint32 count):
mov r0, #0
mov r1, #ar2
setb C
inct0:
mov A, @r1
addc A, r0
mov @r1, A
jz inct0
@nerdralph timer0_overflow_count increase 8 times per second. so the 4-byte variable will overflow after 6 days. If we don't use 5 bytes for it, millis will overflow at 2^29, which makes the overflow handling much more complicated. AVR can make timer trigger every 1 millisecond so 4 bytes are enough, but CH55x can not. I think one extra byte is worthy.
Yes, it is possible to do repeated reading to avoid disabling interrupt. But is it necessary to avoid that for only 10 clocks?
using multiply to do shifting takes about 40 bytes of code for 5 bytes, probably 40 clocks. But rlc takes 13 clocks each bit for 4 bytes. It is not faster.
using addc instead of branching is faster in the worse case, but slow is the best cases. Since 255 of 256 times we only need to add one byte, branching is better I think.
@nerdralph timer0_overflow_count increase 8 times per second. so the 4-byte variable will overflow after 6 days. If we don't use 5 bytes for it, millis will overflow at 2^29, which makes the overflow handling much more complicated. AVR can make timer trigger every 1 millisecond so 4 bytes are enough, but CH55x can not. I think one extra byte is worthy.
OK, I see now that the only prescaler options for T0 are /12 (which you are using), /4, or /1(no prescale). If you think overflow in 6 days is going to cause problems vs 50 days for the Arduino AVR core, then use the 5th byte.
Yes, it is possible to do repeated reading to avoid disabling interrupt. But is it necessary to avoid that for only 10 clocks?
Why wouldn't you want to improve interrupt latency when it comes at virtually no cost? Checking the lowest byte and reloading takes about the same amount of cycles and code as disabling and re-enabling interrupts.
using multiply to do shifting takes about 40 bytes of code for 5 bytes, probably 40 clocks. But rlc takes 13 clocks each bit for 4 bytes. It is not faster.
Inside interrupts, speed & determinism is most important. Outside interrupts code size is usually more important than speed.
using addc instead of branching is faster in the worse case, but slow is the best cases. Since 255 of 256 times we only need to add one byte, branching is better I think.
The main difference is mov/addc/mov is deterministic - it always takes the same number of cycles. It's not a big difference, so I can see why you may want to keep the inc/cjne version.
Why wouldn't you want to improve interrupt latency when it comes at virtually no cost? Checking the lowest byte and reloading takes about the same amount of cycles and code as disabling and re-enabling interrupts.
Checking the lowest byte only is not going to work well. There are other interrupts such as USB one. It may take more than 32ms, although not likely but it will be a bug. And such a bug will be extremely difficult to debug. Checking the lowest 3 bytes will be much safer.
Inside interrupts, speed & determinism is most important. Outside interrupts code size is usually more important than speed.
The main difference is mov/addc/mov is deterministic - it always takes the same number of cycles. It's not a big difference, so I can see why you may want to keep the inc/cjne version.
Inside interrupts I prefer speed. So far I don't see how Arduino style code can be benefited from deterministic. On the code side, I think there is still a lot of other stuff to optimize, such as linking. There is still a lot of unnecessary functions linked to Arduino sketch.
Why wouldn't you want to improve interrupt latency when it comes at virtually no cost? Checking the lowest byte and reloading takes about the same amount of cycles and code as disabling and re-enabling interrupts.
Checking the lowest byte only is not going to work well. There are other interrupts such as USB one. It may take more than 32ms, although not likely but it will be a bug. And such a bug will be extremely difficult to debug. Checking the lowest 3 bytes will be much safer.
Now you don't seem to know what you are talking about. Any interrupt taking 32ms is a bug. If I had an embedded programmer working for me that wrote an ISR that took even 1ms on a MCU running at 24Mhz, he'd be fired. Most of the ISR code I write takes less than 100 cycles.
Do realize that any ISR taking more than 125uS will stop the millis counter from working? If you don't understand realtime embedded software development then I don't see much point in trying to help you with ch55xduino.
For a professional programmer, a 32ms is a bug. I never wrote such a long interrupt. But this is an Arduino board support with attachInterrupt. You never know what the user will do. Comparing more than 1 byte eliminates the possibility. Or we just add another flag using 1 bit for an interrupt to inform the main code the value is updated.
No, long ISR will not stop millis counter from working. INT_NO_TMR0 has priority only lower to INT_NO_INT0 interrupt. Interrupt nesting will occur.
No, long ISR will not stop millis counter from working. INT_NO_TMR0 has priority only lower to INT_NO_INT0 interrupt. Interrupt nesting will occur.
You are proving my suspicion that you don't understand how 8051 interrupts work. To have T0 interrupt all other ISRs, you'd have to make it a high priority interrupt by setting PT0 bit in the IP register. I'm pretty sure you're not doing that. Github doesn't support searching in forks, so can't quickly search all your repository for PT0, so there's a slight chance I missed it.
I don't have the time to be teaching you this stuff, especially when you think you already know it all. I may check back on the project in a couple months to see if you have progressed.
https://github.com/DeqingSun/ch55xduino/blob/ch55xduino/ch55xduino/ch55x/cores/ch55xduino/main.c#L33
In my Arduino core for AVR, I used uint16_t to save on code size and interrupt time. 65.5 seconds for delay() would likely be fine for any programs using ch55xduino.
An even more efficient way of implementing it would be to dedicate one of the 4 register banks for ISRs (a common technique in 8051 programming), and dedicate two of the registers to the counter.