SpenceKonde / megaTinyCore

Arduino core for the tinyAVR 0/1/2-series - Ones's digit 2,4,5,7 (pincount, 8,14,20,24), tens digit 0, 1, or 2 (featureset), preceded by flash in kb. Library maintainers: porting help available!
Other
544 stars 141 forks source link

Little improvement on TCA/TCD ISR #963

Closed MX682X closed 1 year ago

MX682X commented 1 year ago

Don't ask me why or how, but I tried to reduce ISR size by reusing the 32-bit increment/decrement of overflow and millis by using jumps, but this resulted in an increase of clocks, so I came up with a way to get rid of another push/pop of a register, by reordering the timer_struct. The problem, however, is that this is still slower (2clocks) compared to the old version (with a reduction of about 10 words in size), so I decided to backport the iprovements to the "unrolled" version and ended up with a slight reduction of size and clock count.

In the end, both versions are present in the code, but I decided to roll with the reduced footprint.

Also, the clock count refers to the worst case. If MILLIS_INC is 0 (versions above 10MHz I think) then millis are only incremented when FRACT_MAX is reached, thus decreasing average time in ISR further. But, to do so, I need the pre-processor, which didn't like the uint16_t and uint32_t in the macros at all, as it doesn't have to know what it is, so I removed them. Pretty sure that gcc will figure out the correct type itself. If not, a tst+branch instruction must be added

SpenceKonde commented 1 year ago

Not going into 2.6.8; will merge after release of 2.6.8

SpenceKonde commented 1 year ago

2.6.8 is now released, merging

SpenceKonde commented 1 year ago

Well, the harddrive (or my sliver of it at AWS us-east-2) on the webserver that serves the board manager json file became corrupted today, and the /var/www/html directory became a file (among other scattered file system damage.

So I've got to rebuild the webserver pronto so need to rebuild drazzy.com to some extent before anyone can install the core. And before any further development is done by me on the core because CI is broken.

But looking at those code changes, I see two things are I am very concerned about

First, you are violating contraints. If you are using a pointer with LD Z+, LD Y+ or LD X+, LD -Z, LD -Y, LD -Z or the ST versions of those same instructions, however YOU ARE CHANGING THE POINTER. Hence, the constraint must be read-write. not read only, nothing good comes from lying to the compiler.

Secondly What are you doing in here?

//this
#define clockCyclesPerMicrosecond() ((uint16_t)(F_CPU / 1000000L))
#define clockCyclesToMicroseconds(a) ((uint32_t)((a) / clockCyclesPerMicrosecond()))
#define microsecondsToClockCycles(a) ((uint32_t)((a) * clockCyclesPerMicrosecond()))
//changed to this
#define clockCyclesPerMicrosecond()       ((F_CPU / 1000000UL))
#define clockCyclesToMicroseconds(__a__)  (__a__ / clockCyclesPerMicrosecond())
#define microsecondsToClockCycles(__a__)  (__a__ * clockCyclesPerMicrosecond())
#ifdef MILLIS_USE_TIMERD0
  #if (F_CPU == 20000000UL || F_CPU == 10000000UL ||F_CPU == 5000000UL)
    #define millisClockCyclesPerMicrosecond() ((uint16_t) (20))  // this always runs off the 20MHz oscillator
    #define millisClockCyclesPerMicrosecond() (20)  // this always runs off the 20MHz oscillator
  #else
    #define millisClockCyclesPerMicrosecond() ((uint16_t) (16))
    #define millisClockCyclesPerMicrosecond() (16)
  #endif
#else
  #define millisClockCyclesPerMicrosecond() ((uint16_t) ((F_CPU) / 1000000L))
  #define millisClockCyclesPerMicrosecond() (F_CPU / 1000000UL)
#endif
#define millisClockCyclesToMicroseconds(a) ((uint32_t)((a) / millisClockCyclesPerMicrosecond()))
#define microsecondsToMillisClockCycles(a) ((uint32_t)((a) * millisClockCyclesPerMicrosecond()))
#define millisClockCyclesToMicroseconds(__a__) (__a__ / millisClockCyclesPerMicrosecond())
#define microsecondsToMillisClockCycles(__a__) (__a__ * millisClockCyclesPerMicrosecond())

What is this __a__ syntax? Where is it documented? Obviously it's ungooglable. I've never seen anything like that in my life.

The reason I went and put all those casts in there (they were first done on the things that are used to look up pin params, and then I rolled it out to all macros I wasn't certain were safe even if (like in this case) I wasn't sure) The problem with the pin-related ones was that digitalPinTo___() assumed that pin number was a uint8_t, and when we call those macros within the core, the core itself always passes it uint8_t's so the comparison becomes an unsigned comparison, and the NOT_A_PIN (255 aka -1) value fails if (pin < NUM_DIGITAL_PINS) which is a way to detect valid pins with 1 rather than 2 comparison instructions. But if the author of a third party library were to do so but used an int instead of a uint8_t - since the default type for a immediate value is an int, suddenly, we are performing a SIGNED COMPARE instead of an unsigned compare. -1 (255) now PASSES the test if (pin < NUM_DIGITAL_PINS. Hence , negative pin values would appear to the core to be valid pins and Bad Things could happen

Oh - do you happen to have a faster or smaller trick than the best I can come up with for a common task in AVR - namely given bitmask m, calculate bit position p.

// assume we passed in one read write variable, %0, holding the mask and want that register to hold the position at the end. 
// Notice the reuse of r1 as a ready-to-use zero. 
mov r0, %0
dec r0;
and r0, %0; // do the N & (N-1) test to see that 
cpse r0, r1; // Only rjmp to the place where we load it with 255 if we aren't passed a pinmask and hence the function is meaningless.  
rjmp .+18; 
clc;
lsr %0;
brcs .+4
inc r1 // r1 will have the bit pos
rjmp .-8
mov %0, r1  // move it to %0 
eor r1          //restore r1 to 0 before we reenter C code. 
rjmp .+2
ldi r18, 255 // 
: "+d"(uint8_t) mask_or_pos

But this is kinda slow. 5 clocks in the loop, then 7 setup and 4 teardown. So it'll take 16, 21, 26, 31, 36, 41, 46, or 51 clocks per call. I dont like the delay and I don't like the inconsistency. There has got be be a better way to do this. right? I'm not sure the time efficiency of havinga 129-member array in the mapped PROGMEM section - but i'm not confident that it would be any faster.

There is no meaningful content below this line, just musing and absurd logic

I feel like with how heavily they lean on bitmasks, they should really have a bmtbp instruction, takes a bitmask in rd, converts it to a bit position, and places the result in rd. Or sets rd to 255 if it wasn't given a valid bitmask. So that would return 0-7, or 255/-1 Would only take 32 opcodes to implement compared with the rest of my wishlist, most entries on which require more opcodes than exist. It's too bad 2^16 only comes out to 65535.

Total nonsense below We would have so much more freedom and power if 2^16 went up to even 131071.... but that would require some changes to the basic rules of mathematics, and I'm pretty sure those are inherent and immutable in the universe as as we know it. So that's off the table. Pity, too, I need some basic changes of my own made to the nature of the universe. First off I need some immovable objects an a half dozen magnetic monopoles for my perpetual motion machine, and secondly, we gotta do something about this conservation of momentum thing - it's a killer for space travel, and while we're at it, this speed of light bullshit? How are we going to visit other planets like THIS? We're almost out of earth left to pollute and despoil, then what will we do?! Most of the rest of the solar system, physics got there first and trashed it all before we evolved, so we'd have to go find some other habitable planet. Humans have been destroying their surroundings for millenia - I'm not sure it's a habit you can just quit cold turkey, and I don't think we have long to go before we render earth "barely inhabitable" but with that slowass speed of light limit, how we gonna find another planet to crap up with invasive species, non-degrading rubbish, and chemical contaminants? I mean sure we could move some dirty industries into the planets in this solar system. Maybe we could get some process that produces flourine as a waste product, All we'dneed to to then is react that with short hydrocarbons and we'd have some of those super greenhouse gasses that everyone warns about (GWP = >15000 compounds exist) and vent em into the martian atmosphere. They also say there are perchlorates in the soil, We could use those to get some oxygen for the atomosphere, and if we could get CO2 out of some of the rocks, we'd only need to figure out how to get some inert gas to make up the rest of the nominal air pressure a human would need living on mars.... But like, that's just not the same as going onto some extrasolar planet where life has just oxygenated the atmosphere, then setting up all your filthiest industries there and pollut the hell of of the other planet before anything with a nervous system has even evolved.... I mean, that is what a lot of peoples goal is right? Maximizing pollution? Because it if's not, there's something about the world that doesn't add up.

MX682X commented 1 year ago

First, you are violating contraints. If you are using a pointer with LD Z+, LD Y+ or LD X+, LD -Z, LD -Y, LD -Z or the ST versions of those same instructions, however YOU ARE CHANGING THE POINTER. Hence, the constraint must be read-write. not read only, nothing good comes from lying to the compiler.

the read/write and read-only constraints are for the compiler to know what to expect at the end of the inline asm. As this code is inside a (naked) ISR, it doesn't matter as we restore everything to what it was.

What is this a syntax? Where is it documented? Obviously it's ungooglable. I've never seen anything like that in my life.

somewhere, somewhen I've read that there might be collisions with variable names, if a user would declare an a variable in the same scope, or something along those lines, so I've learned to always use the underscores for #define variables.

The reason I went and put all those casts in there...

I needed an #if MILLIS_INC != 0 for the assembly to save some clocks and words, but the Pre-Processor doesn't understand uintX_t, so I decided to remove them. I've done it for the reason of cleanliness and because this macros are exclusively positive, except of course you are using AVRs for your time-machine.

Oh - do you happen to have a faster or smaller trick than the best I can come up with for a common task in AVR - namely given bitmask m, calculate bit position p.

Maybe this might wok? (not sure about the branch counts though). Also you sure with the ldi r18, or should it be %0?

mov  r0, %0     // Move the bm to temp reg.
subi %0, 0x01   // Two options: %0 >= 1, Carry gets cleared. %0 < 1, detected by next insn
and %0, r0      // do the N & (N-1) test. If bm is valid, %0 will be 0 and Z = 1
brne .+10       // If Z == 0, we got a bad value, jump to the end
lsr r0
brcs .+6
inc %0          // %0 will start at 0 due to the AND and will have the bit pos
rjmp .-8
ldi r18, 255 // 
: "+d"(uint8_t) mask_or_pos

How are we going to visit other planets like THIS?

The advanage, however, is that by travaling near the speed of light, we would require less life support, as the time on the ship would be passing slower. Let's n0t talk about the energy required to accelerate and decelerate though....

a human would need living on mars

Don't forget about the space radiation. There is a Kurzgesagt video that talks about that

EDIT: P.S.: Of course it would be better code quality wise to fix the constraints to what they should be. If someone decides to study asm based on the provided "examples" it would certanly be better to have everything according to the standards

MX682X commented 1 year ago

alternative to reduce clock count by swaping:

mov  r0, %0     // Move the bm to temp reg.
dec r0       //
and r0, %0      // do the N & (N-1) test. If bm is valid, r0 will be 0 and Z = 1
brne .+20       // If Z == 0, we got a bad value, jump to the end
cpi %0, 0x10    // C = 1 if bit is in low nibble
mov r0, %0      // move the bm to r0 again
ldi %0, 0x00    // init counter at 0
brcs .+6        // C == 1, skip over swap
swap r0         // C == 0, swap and skip the low nibble
ldi %0, 0x04    // if we're swap-ing, init counter at 4
lsr r0          // left-shift, bit0 is moved to C
brcs .+6        // if bit0 was set, we have our bit position, finish loop
inc %0          // %0 will start at 0 due to the AND or at 4 after SWAP
rjmp .-8        //
ldi r18, 255    //
: "+d"(uint8_t) mask_or_pos

0x01: 12 (+ 3 5) clocks 0x10: 13 (+ 3 5) clocks

Alternative 2: unrolled loop:

mov  r0, %0     // Move the bm to temp reg.
dec r0       //
and r0, %0      // do the N & (N-1) test. If bm is valid, r0 will be 0 and Z = 1
brne .+20       // If Z == 0, we got a bad value, jump to the end
cpi %0, 0x10    // C = 1 if bit is in low nibble
mov r0, %0      // move the bm to r0 again
ldi %0, 0x00    // init counter at 0
brcs .+6        // C == 1, skip over swap
swap r0         // C == 0, swap and skip the low nibble
ldi %0, 0x04    // if we're swap-ing, init counter at 4
sbrc r0, 0
subi %0, 0xFF
sbrc r0, 1
subi %0, 0xFE
sbrc r0, 2
subi %0, 0xFD
sbrc r0, 3
subi %0, 0xFC
rjmp .+2       //
ldi r18, 255    //
: "+d"(uint8_t) mask_or_pos

19/20 clocks for low/high nibble

Alternative 3 (improved timing compared to Nr 1)

mov  r0, %0     // Move the bm to temp reg.
dec r0       //
and r0, %0      // do the N & (N-1) test. If bm is valid, r0 will be 0 and Z = 1
breq .+6         // If Z == 1, we got a good value, skip the abort
ldi %0, 255    // we can save ourself a clock by having this here
rjmp .+18      //
cpi %0, 0x10    // C = 1 if bit is in low nibble
mov r0, %0      // move the bm to r0 again
ldi %0, 0xFF    // init counter at 0-1=255
brcs .+6        // C == 1, skip over swap
swap r0         // C == 0, swap and skip the low nibble
ldi %0, 0x03    // if we're swap-ing, init counter at 4-1=3
lsr r0          //  left-shift, bit0 is moved to C
inc %0          // using pre-inc to save a clock per loop. inc does not affect carry from lsr
brcc .-6        // if bit0 was clear, we need to continue shifting.
: "+d"(uint8_t) mask_or_pos

13/14 + ((0 to 2) * 4) clocks

ALSO: Do not forget that the branches will take +1 clock for 128k Parts, so I'd recomend to take the unrolled loop for consistency. The skip/subi combination always uses two clocks (as it probably just nops the next insn)

SpenceKonde commented 1 year ago

Wait what? Branches take a hit at 128k? I didn't think the br__ instructions ever took longer, and it was only ret/call/ret-like/call-like things that were a problem on large flash, and that happened only on parts with more than 128k of flash, because the program counter addresses the flash by word and runs out of space only at 65536 words - 128kb (notice that the largest modern AVRs so far released have: The maximum flash before we get kicked in the balls with the extra byte of program counter) and 7 ports (the 8th port would have to be a second-class port, because we have no more low IO registers to use for VPORTs.

So I think the DA/DB went to the maximum of what has seen development on the AVRxt instruction set variant. and I'd wager the engineers desperately don't want to stick their hand into that can of "worms" [actually leeches], but at some point management is gonna make em do it, because the currrent m2560 customers are going to demand an upgrade path, or go to some other semiconductor company.

The >128k flash size parts might be located outside of AVR's Zone of Effectiveness, and after seeing how well XMega did there, I'd understand some hesitation to go there.

As far as I can tell, the only punishment that 128k parts receive is the RAMPZ for reading flash with LPM, which we only need to do if we stupidly put data we needed at runtime on the cursed ground between 0x10000 and 0x18000

I believe you're right about how the skipifs skip the next isn (as it explains in the CPU core chapter of the datasheets, AVR's fetch an instruction in one clock cycle, and execute it the next unless control flow has changes. So any time there is a jump , call, skip, branch - anything that causes the the instruction executed the word at address X to be anything other than X + 2 (in bytes) or X + 1 (in words), in which case that pre-fetched instruction is not executed and it needs to point to program counter somewhere else. I suspect that in the case of the skipifs (why are they slower when they skip a 2 word instruction?) they have to look at the first word of the instruction following to figure out that it;'s a double-word one. and at that point they've spent enough time that there're no way to keep it from going to threee clocks.

MX682X commented 1 year ago

Wait what? Branches take a hit at 128k? I didn't think the br__ instructions ever took longer, and it was only ret/call/ret-like/call-like things

oops, my fault. I mixed something up.

(why are they slower when they skip a 2 word instruction?)

Because they don't change the PC, it is just incrementing as if the instructions were executed. And a 2 word instruction has to be incremented twice otherwise it would read the address as an instruction. Making the instruction decoder change the fetched data to a nop is probably the easiest/smallest way to implement that.