Open ElectricRCAircraftGuy opened 9 months ago
Also look into how 64-bit timestamps are obtained on ChibiOS (Ardupilot).
I did the more-complicated solution above so that the get_count()
function would work properly, with counting interrupts and all, inside ISRs!
I don't want to break that. So, what I should probably do instead is make two functions:
get_count()
- normal, fast (use my new approach above)get_count_from_isr()
- ISR-safe; uses my original more-complicated solution above to maintain counting overflows even when called inside an ISR and interrupts are disabled!Ardupilot/ChibiOS uses the _from_ISR()
type functions too. See: https://github.com/ArduPilot/ardupilot/blob/master/libraries/AP_HAL_ChibiOS/hwdef/common/hrt.h#L14:
void hrt_init(void);
uint64_t hrt_micros64(void);
uint64_t hrt_micros64_from_ISR(void); // from an ISR
uint32_t hrt_micros32(void);
uint32_t hrt_millis32(void);
uint32_t hrt_millis32I(void); // from locked context
uint32_t hrt_millis32_from_ISR(void); // from an ISR
uint64_t hrt_millis64(void);
Function definitions are here: https://github.com/ArduPilot/ardupilot/blob/master/libraries/AP_HAL_ChibiOS/hwdef/common/hrt.c
Here's the commit from Andrew Tridgell that speeds up these timestamp calls: https://github.com/ArduPilot/ardupilot/commit/6dbc3b6a70ad15a5d0d80411f408d1fb0e7d6ea9
Here's the pull request: https://github.com/ArduPilot/ardupilot/pull/25790
Your “potential new code” suffers from a race condition, highlighted below:
unsigned long eRCaGuy_Timer2_Counter::get_count()
{
uint8_t SREG_old = SREG;
noInterrupts();
// <-- race condition here! -->
uint8_t tcnt2_save = TCNT2;
unsigned long overflow_count = _overflow_count;
SREG = SREG_old;
_total_count = overflow_count*256 + tcnt2_save;
return _total_count;
}
Consider what happens if the timer overflows right after executing noInterrupts()
. The timer overflow ISR will not run (it will be pending until the end of this critical section), then:
tcnt2_save
will store the post-overflow value of TCNT2
(zero or very close to zero)overflow_count
will still have the pre-overflow count of overflows_total_count
will be off by 256.The vulnerability window would be tiny, maybe even a single CPU cycle. This means the bug would manifest itself very rarely, and would be practically impossible to find in testing.
@edgar-bonet, you are right! It's been 10 years since I first wrote this code, and I think that was part of my original logic in how I arrived at the bug-free and ISR-safe version in-place now and as shown in the top of the first post above.
I am so glad you are watching this! How did you find this post anyway? How long have you been a "watcher" of this repo?
What do you think about this fix for the non-ISR-safe version of get_count()
? I think it would work.
unsigned long eRCaGuy_Timer2_Counter::get_count()
{
uint8_t tcnt2_save1 = TCNT2; // grab the counter value from Timer2
uint8_t SREG_old = SREG;
noInterrupts();
unsigned long overflow_count = _overflow_count; // critical section: 4-byte read on 8-bit processor
// NB: keep this next line inside the interrupt-guarded critical section, not because it needs it
// by itself, but because if someone writes an awful ISR that takes so long *another* rollover could
// occur right after restoring the interrupt state below, I want to not have to worry about counting
// that rollover manually too. 2^8 = 256 counts * 0.5us/count = 128us per overflow.
uint8_t tcnt2_save2 = TCNT2;
SREG = SREG_old; // restore interrupt state
if (tcnt2_save2 < tcnt2_save1)
{
// An overflow just occurred, so manually count it for this calculation.
// The overflow ISR will run and handle it for future calculations.
overflow_count++;
}
// get total Timer2 count
_total_count = (overflow_count << 8) | tcnt2_save2;
return _total_count;
}
Note to self:
For a 32-bit mcu, such as PIC32, counting core timer overflows, the following code might work.
32-bit reads and writes are already atomic, and a 32-bit 100 MHZ core timer overflow occurs only every 2^32 ticks / 100000000 ticks/sec = 42.9 seconds, which is long enough that no interrupt would possibly take that long (unlike my 8-bit timer above where rollover is every 128us).
Potential example code:
// For 32-bit mcus such as PIC32
uint64_t get_counts()
{
uint32_t count1 = get_core_timer();
// No critical section needed because we are reading a 32-bit value, and
// turning off interrupts doesn't stop the core timer from counting
// anyway.
uint32_t rollover_count = get_rollover_count();
uint32_t count2 = get_core_timer();
if (count2 < count1)
{
// rollover just occurred between reading count1 and count2, so manually
// account for it
rollover_count++;
}
uint64_t total_count = ((uint64_t)rollover_count << 32) | count2;
return total_count;
}
My mind is frazzled trying to process this. Here's a question I asked: Stack Overflow: How to properly count timer overflows to convert a 32-bit timer into a 64-bit timer
Note: If there is a tick ISR called at a low tick rate, such as 1ms, this would work too. From my friend JaMc:
uint32_t timer_hi:
uint32_t timer_lo;
// ISR called at every 1ms tick
void timer_isr(void)
{
if (++timer_lo == 0)
{
timer_hi++;
}
}
uint64_t timer_counts(void)
{
uint64_t counts;
__disable_irqs();
counts = timer_lo | ((uint64_t)timer_hi << 32);
__enable_irqs();
return count;
}
On AVR, remove interrupt guards and use my doAtomicRead()
function instead. See: https://stackoverflow.com/a/71625693/4561887
Speed test all this stuff and come up with some automated bench tests for it to look for corrupted timestamps.
I should be able to catch problems within seconds to a few minutes, and a successful overnight run should be conclusive.
How long have you been a "watcher" of this repo?
Probably since I proposed some pull requests.
What do you think about this fix for the non-ISR-safe version of
get_count()
?
This one still has a race condition:
unsigned long eRCaGuy_Timer2_Counter::get_count()
{
uint8_t tcnt2_save1 = TCNT2;
/*--- race here... ---*/
uint8_t SREG_old = SREG;
/*--- ...and here ---*/
noInterrupts();
unsigned long overflow_count = _overflow_count;
uint8_t tcnt2_save2 = TCNT2;
SREG = SREG_old;
if (tcnt2_save2 < tcnt2_save1)
overflow_count++;
_total_count = (overflow_count << 8) | tcnt2_save2;
return _total_count;
}
If the timer overflows either right before or right after you read SREG
, then tcnt2_save1
will be big (255 or close to that), and _overflow_count
will be incremented by the timer overflow ISR. Then you will add an extra increment to overflow_count
.
Yes, this is more tricky that it looks at first sight... :-/
I suggest taking inspiration on the implementation of micros
, which seems pretty solid to me. Based on it, I would suggest this:
unsigned long eRCaGuy_Timer2_Counter::get_count()
{
unsigned long overflow_count;
uint8_t tcnt2_save, tifr2_save;
ATOMIC_BLOCK(ATOMIC_RESTORESTATE) {
overflow_count = _overflow_count;
tcnt2_save = TCNT2;
tifr2_save = TIFR2;
}
if ((tifr2_save & _BV(TOV2)) && (tcnt2_save < 128))
overflow_count++;
return (overflow_count << 8) | tcnt2_save;
}
Note that I have adapted that a little bit:
ATOMIC_BLOCK()
macro from <util/atomic.h>
. It does the same SREG
and interrupt stuff you are already doing, but it makes the programmer's intent clearer.t < 255
by a less stringent test, which will likely be optimized into testing a single bit.come up with some automated bench tests for it to look for corrupted timestamps.
That would be far from trivial. Since a race condition can have a vulnerability window as small as a single CPU cycle, it can easily be missed by any systematic test. If the test calls get_count()
periodically, you must ensure that the period of that call and the timer overflow period are relatively prime. This means that the test period should be an odd number of CPU cycles. Beware that other interrupts may throw these timings off.
Check for rollovers during the reading of the count. Remove the overflow interrupt. Push back an interrupt to perform a count read as far back as possible. Each time a read is performed, push back the interrupt again. This way, if reads are performed frequently enough, no overflow-type interrupt ever needs to run, and if they are not, fewer interrupts occur because they only occur as often as is necessary to not miss overflow counts.
The read function will need to be protected by atomic access guards now because the same overflow count and count_old variables will be written to both inside and outside of an ISR. So, we need to ensure they are atomically updated both inside and outside ISRs to avoid corruption.
Removing the 5us or so overflow ISR which runs every 128us currently would be a win.
Write an empty sketch. Import my library and configure the timer/counter to count at 0.5us/tick.
Write a basic function that manually tracks the overflow and reads the 8-bit counter as fast as it can, printing the result to the serial terminal. Look for patterns of missing timestamps, or gaps in timestamps.
Modify the sketch to collect samples into an array, and only print the array when it is full, to get tighter timestamps closer together.
Add an ISR that calls the timestamp at a fixed interval. Look for race conditions between the main loop code above, and the ISR. There will be no overflow ISR. The overflows will be detected by the count read function.
See if there's some way to overcome the race condition using a "double read" type strategy, rather than disabling interrupts.
See my doAtomicRead()
function here: https://stackoverflow.com/a/71625693/4561887.
Note: for a single producer single consumer, it would work. But, for a multi-producer multi-consumer, which is our case now that both the ISR and main loop context are updating and reading the same variables, I think this will not work, but I want to play with it and prove that to myself anyway.
Addendum: I suggested
// ...
tcnt2_save = TCNT2;
tifr2_save = TIFR2;
// ...
Note that the order of these statements is important. If they get switched, and the timer overflows in-between them... you get the picture.
Current code:
Potential new code (should result in less jitter due to a shorter period of time where interrupts are off):
I need to dig back into the git history though and see if that was a prior, broken solution I already had before.
Update: I don't have the git history to know. I'll have to dig up the copy/pasted dirs in my old computer's files instead. The first commit I have with the code is
d6987d8
, and it already had the bulk of the function content above in it.The new code will produce a slightly "older" timestamp, but who cares. It should have less jitter, shorter time with interrupts off, and probably even less latency due to being faster.