contrem / arduino-timer

Non-blocking library for delaying function calls
BSD 3-Clause "New" or "Revised" License
331 stars 50 forks source link

Timer cancels itself? #76

Closed adelin-mcbsoft closed 1 year ago

adelin-mcbsoft commented 1 year ago

Is there any circumstance or technical possibility that could lead to a timer canceling itself?

I have few .every() timers and sometimes (very rarely indeed), with no possible explanation a random timer simply stops. I just can't figure out why.

Their function always return true; , there's plenty of RAM available, I had set plenty of timers (Timer<48> MainTimer;), I run the loop with MainTimer.tick<void>(); // to avoid ticks() calculation , I count them and I have a total of 15... but one of them, very rarely, just suddenly stops.

Can't figure out why.

Board is ATMega4808 if it helps.

philj404 commented 1 year ago

I'm not sure I can help, sorry.

It looks like you have covered the main point of failure I know about since your tasks return true; (for all return paths).

Typically I just call timer.tick(); and don't worry about the extra ticks() calculation. You might want to just try that (since it is easy), but it probably will not make a difference.

Is your timer dealing with dynamic temporary tasks a lot?

Is there a particular task that blocks or just takes a lot of time? Afterwards the MainTimer may need to flush a lot of pending tasks. That's not necessarily a problem, but a lot of tasks will be busy catching up all at once.

The order pending tasks will run is not specified (it's not a priority queue). Maybe a race condition causes some trouble sometimes.

Other bugs I have seen (in my own code) are mostly with general C/C++ issues.

It might help to add a "watchdog" task which monitors the other repetitive tasks and prints debug info if they disappear. What other task finished just before timer.size() lost a task?

adelin-mcbsoft commented 1 year ago

The one that I have no idea how to test is if a bad pointer is corrupting the task table... However, I have studied the library's source code myself, on and on, and I can't find a bug either.

I'm using MegaCoreX with Nano-4808.

If I do an overall analyze... I observed that it tends to happen more often when the Serial connection is bad though I don't know how can that be connected in any way with the timers.

By "bad" I mean a long/low-quality USB cable or bad connection when using USB over IP. But once again, can't figure out how can that be connected in any way to the issue.

Now I'm monitoring the timer IDs to see if a timer changes to ID 0 whenever it gets canceled (which would mean that the execution of a function gets canceled midway and no return true is provided to the library -- though I have no idea how it can be possible to get canceled midway)...

"Sadly" in the last 3 days, a single timer ceased to function and that one was printing the debug information (over Serial)... so, I got nothing. Fail.

This is a complete mystery.

philj404 commented 1 year ago

For the record I hate intermittent problems.

Hmm. The debug task failed.

If Serial often has trouble at the same time, that could be a clue -- maybe this is not a coincidence.

Other general techniques to try:

adelin-mcbsoft commented 1 year ago

So, it happened that another timer got canceled.

I can confirm that when a timer "gets canceled", it is not actually deleted from the tasks lists, meaning it's not literally canceled using the .remove() private method. The timer's ID is still preserved, does NOT get ID = 0.

However the function that handles the timer is no longer called. I guess it does not hang either, since all the other timers and the program continues... so I suppose the _timer_foreach_task structure simply ignores it somehow?

#define _timer_foreach_task(T, task) \
    for (T task = tasks; task < tasks + max_tasks; ++task)

Still trying to figure out how that's possible...

philj404 commented 1 year ago

I suspect it is not the timer library that is what makes tasks stop.

mla2 commented 1 year ago

I think I know why this happens. The task_id function does this: return id ? id ^ t->id : id; , where id is really the pointer to the task and t->id is a counter that gets incremented. If 2 tasks are close to each other (in memory), there can be 2 tasks of the same id, if the counter diff matches the pointer offset. Then the wrong task gets cancelled at times. It was driving me insane until I realised this today. I just changed it to return id and it seems stable after that. That's my theory for now at least

adelin-mcbsoft commented 1 year ago

Sorry to put it like this -- but I'm happy that someone else has the same problem I have 😆 I was beginning to lose faith, starting to think that it may be something with the hardware circuit board which corrupts the memory or something.

Anyway, the bad thing is that is very hard to simulate this bug. I added some watchdogs which restart certain timers when they fail... sometimes they go for days without failing, and other times fail few minutes after program start.

If the other bug I reported is resolved ( #77 ) I may be able to confirm if the task fails due to the case you described, or if it's something else.

@mla2 do you think you have a chance to make a fork and commit your changes, so I can also test them? Thanks!

mla2 commented 1 year ago

@adelin-mcbsoft No problem, this is what I did https://github.com/mla2/arduino-timer/commit/8dc1f1d0286668cc39a64798f1c6bd08259d91f7 I'm not sure it fits the design of how the task-id's is supposed to be used, so I'm a bit hesitant to make a PR before the author of the library steps in. But I'm pretty sure the current code is problematic, and this resolved my issues.

philj404 commented 1 year ago

@mla2 you may have found an interesting root cause there.

The timer library uses a size_t ctr task counter which increments each time a task is added. Since it is size_t, it will wrap periodically -- probably after 64k tasks are added (for 16 bit size_t) for an Arduino UNO. Processors with 32 or 64 bits size_t will have fewer chances for collision.

So if you are adding/cancelling tasks 100 times/second... you might have an ID collision every 650 seconds -- or about every eleven minutes.

Are you adding/cancelling tasks a lot?

Perhaps the returned Task id (for cancel) needs more entropy than size_t can really provide. Is there a way to do this without increasing the size of the timer's RAM footprint?

I think this has not been a problem for me because I am not adding that many tasks, and cancelling them even less often.

mla2 commented 1 year ago

I think it might be a lot worse than that. The tasks are just an array in the timer-lib.

So if the first task have an address of say 0x00010000 and the size if something like 24 bytes (depending on machine type) so the next task will have an address off 0x00010018.

If the task at 0x00010018 is setup to fire continuously .every( ... ) and the ctr at that point is say 8, the task_id() will be 0x00010020. And it will run happily..

Then we start to arm and cancel the first timer, it maybe have an id of say 0x00010002 on the first pass.. it get's cancelled. Next time it gets armed the task_id() will be 0x00010009 (because the ctr is 9).. A few more arm/cancel cycles and it will end up with an id of 0x00010020 and boom, the wrong task is cancelled. This is because the task_id() is Pointer XOR Counter.

Does this makes any sense to you, or does it just sound like incoherent mumbling to you? :)

The reason it will happen to so few projects is that it requires a special pattern of cancelling and arming tasks (which my code do)

philj404 commented 1 year ago

@mla2 It makes sense to me.

The ctr++ for the next hash value is probably going to change only 1 + ( 1/2+1/4+1/8+1/16...+1/32768) bits, or on average only the two LSBs change.

And since the bits likely to change are the LSBs, they're very similar to the same bits used to access the tasks[] array.

philj404 commented 1 year ago

One workaround could be to replace ctr++; with a pseudo-random number generator. This would reduce the probability of an incorrect match, but not eliminate it.

Also it adds more execution overhead than ctr++; has.

Note the only thing you can do with a saved Task is call timer.cancel() later.

Another alternative would be that the returned Task has enough information to recognize it is exactly referring to a particular task[] entry.

The task to delete would need to check:

...before timer.cancel(aTask); recognizes it found the correct Task to cancel.

This adds some RAM overhead when saving a Task to be cancelled. The id value might still collide sometimes... but only when really thrashing on the same handler and opaque.

mla2 commented 1 year ago

It still sound contrived to me.. but I don't really know what the design goals are. To me just using the pointer of the task as an id is enough (which is what my patch do, or that's what I think it does). Of course there cases where you might want some randomization of id:s but rarely in a microcontroller. I also prefer that things fails early and not in a 100 days.. In the systems I work with, you never know how long they are going to be powered on, a day, a month, a year? Hard to tell.

Anyways, I'm happy with my patch, and is only sharing my findings as a semi good internet citizen. :)

contrem commented 1 year ago

The reason for the current task id implementation is described here (114e4046dcfac557daec620d9b3697a34617b70b).

@philj404 Random generation of id has the same probability of collision as the current implementation for the second task, but increasing probability of collision with every added active task, and without the guarantee of protecting active id's. Checking for handler is a good thought, but most of the time they're constant for a program. handler addresses usually don't change, so a new timer with the same handler would pass the check. opaque values may be worth adding.

@mla2 Your patch will fail under the scenario described in the commit.

I think it might be a lot worse than that. The tasks are just an array in the timer-lib.

This shouldn't be an issue. A task_id is the xor of a counter and the slot address. An example of what you describe:

task_id_to_check = 0 ^ 1 = 1 // slot = 0, ctr = 1

say slot[1], slot 0 + 1, is occupied and the counter wrapped so it also has id 1 => the worst case.
slot_1_task_id = 1 ^ 1 = 0 // slot = 1, ctr = 1

task_id_to_check == slot_1_task_id => 1 == 0 => false, no collision

ctr can never be 0. Even if the counter was only 1 bit, 0 and 1, it can never be 0, and so is always 1, there wouldn't be collision because the slot values are different. This essentially degrades to your implementation, and would then fail in the scenario described in the commit. Fortunately the counter size is not going to be 1 bit.

Maybe I'm mistaken here, please provide an example if you can.

Now, as described in the commit message, there is a chance (1 / 2^BITS) that at the limit of the counter size, a wrap of the addition results in a collision, but only for stale task id's (tasks that already ended / are already canceled). valid task id's should never collide. If you call cancel with a stale id, then you're at risk for this collision.

I avoided adding cancellation to the library as long as I could to prevent these complications. Even with its addition, it is still up to the user to maintain a valid list of non stale task ids. This is the bane of resource management in lower level languages.

It could be improved in the lib, but at the cost of memory and simplicity.

What would really help here is some minimal code that reproduces the issue you're facing.

philj404 commented 1 year ago

I think I have some example code which shows... interesting symptoms anyway.

I made a local copy of <arduino-timer.h> so I could modify it. The biggest change was that I commented out the Timer private: declaration so I could access internal structures from my test code.

Basically my code just creates a few static tasks, then it creates a few more tasks to destroy and re-create over and over. This is to see if there are any Task collisions. I'm hoping the first few tasks catch long-term issues, and the cancel/recreate cycles find short-term issues.

Here is my code:

#include "arduino-timer.h" // (my local copy to access internal variables)

auto timer = timer_create_default();  // create a timer with default settings

// a generic task
bool toggle_led(void*) {
  digitalWrite(LED_BUILTIN, !digitalRead(LED_BUILTIN));  // toggle the LED
  return true;                                           // repeat? true
}

struct TaskInfo {
  Timer<>::Task id;  // id to use to cancel
  unsigned long instanceNumber;
};

static const int numTaskIds = 20;
TaskInfo tasksToCancel[numTaskIds] = { 0 };

unsigned long creationAttempts = 0;

void createTask(int index) {
  TaskInfo& taskInfo = tasksToCancel[index];

  if (taskInfo.id != (Timer<>::Task)NULL) {
    timer.cancel(taskInfo.id);
  }
  auto newId = timer.every(250, toggle_led);
  ++creationAttempts;

  if (newId == 0) {
    Serial.print(F("timer task creation failure on creation number "));
    Serial.println(creationAttempts);
    //while (true) {};

  } else {

    // check for collisions before saving taskInfo
    for (int i = 0; i < numTaskIds; i++) {
      const TaskInfo& ti = tasksToCancel[i];
      if (ti.id == newId) {
        Serial.print(F("COLLISION FOUND! instance number: "));
        Serial.print(creationAttempts);
        Serial.print(F(" hash 0x"));
        Serial.print(newId, HEX);
        Serial.print(F(" "));
        Serial.print(newId, BIN);

        Serial.print(F(" matches hash for instance number: "));
        Serial.println(ti.instanceNumber);
      }
    }
    taskInfo.id = newId;
    taskInfo.instanceNumber = creationAttempts;

    static const unsigned long reportCountTime = 10000;
    if (creationAttempts % reportCountTime == 0) {
      Serial.print(creationAttempts / 1000);
      Serial.println(F("k tasks created."));
    }
  }
}

void setup() {
  delay(1000);         // wait for stability on some boards to prevent garbage Serial
  Serial.begin(9600);  // ESP8266 default of 74880 not supported on Linux
  //Serial.begin(115200);  // ESP8266 default of 74880 not supported on Linux
  while (!Serial)
    ;  // for the Arduino Leonardo/Micro only

  Serial.println(F("Running " __FILE__ ",\nBuilt " __DATE__));

  pinMode(LED_BUILTIN, OUTPUT);  // set LED pin to OUTPUT
  // call the toggle_led function every 1000 millis (1 second)
  //timer.every(1000, toggle_led);

  Serial.print(F("Size of task: "));
  Serial.print(sizeof(Timer<>::task));
  Serial.println(F(" bytes"));

  Serial.print(F("Size of tasks[] array: "));
  Serial.print(sizeof(Timer<>::tasks));
  Serial.println(F(" bytes"));

  Serial.print(F("location of tasks[] array: "));
  Serial.print((unsigned long)&timer.tasks, BIN);
  Serial.print(F(" hex: "));
  Serial.println((unsigned long)&timer.tasks, HEX);

  // timer capacity is 0x10 -- stay below
  // load up some static tasks
  for (int i = 0; i < 6; i++) {
    createTask(i);
  }

  // cancel/recreate tasks
  for (unsigned long groups = 0; groups < 20000000UL; groups++) {
    for (int i = 9; i < 0x10; i++) {
      createTask(i);
    }
  }
  Serial.println(F("Done."));
}

void loop() {
  timer.tick();  // tick the timer
}

Here is the output my code generates on an Arduino Nano:

Running C:\Users\philk\OneDrive\Documents\Arduino\arduTimerCancelTest\arduTimerCancelTest.ino,
Built Feb  4 2023
Size of task: 16 bytes
Size of tasks[] array: 256 bytes
location of tasks[] array: 1000111010 hex: 23A
COLLISION FOUND! instance number: 118 hash 0x28C 1010001100 matches hash for instance number: 6
COLLISION FOUND! instance number: 161 hash 0x23B 1000111011 matches hash for instance number: 1
COLLISION FOUND! instance number: 164 hash 0x26E 1001101110 matches hash for instance number: 4
COLLISION FOUND! instance number: 165 hash 0x27F 1001111111 matches hash for instance number: 5
COLLISION FOUND! instance number: 182 hash 0x28C 1010001100 matches hash for instance number: 6
COLLISION FOUND! instance number: 212 hash 0x26E 1001101110 matches hash for instance number: 4
timer task creation failure on creation number 634
timer task creation failure on creation number 650
10k tasks created.
20k tasks created.
30k tasks created.
40k tasks created.
50k tasks created.
60k tasks created.
COLLISION FOUND! instance number: 65586 hash 0x259 1001011001 matches hash for instance number: 3
COLLISION FOUND! instance number: 65604 hash 0x27F 1001111111 matches hash for instance number: 5
COLLISION FOUND! instance number: 65637 hash 0x28C 1010001100 matches hash for instance number: 6
COLLISION FOUND! instance number: 65714 hash 0x259 1001011001 matches hash for instance number: 3
COLLISION FOUND! instance number: 65731 hash 0x26E 1001101110 matches hash for instance number: 4
COLLISION FOUND! instance number: 65732 hash 0x27F 1001111111 matches hash for instance number: 5
COLLISION FOUND! instance number: 65874 hash 0x259 1001011001 matches hash for instance number: 3
70k tasks created.
80k tasks created.
90k tasks created.
100k tasks created.
110k tasks created.
120k tasks created.
130k tasks created.
COLLISION FOUND! instance number: 131219 hash 0x27F 1001111111 matches hash for instance number: 5
timer task creation failure on creation number 131800
140k tasks created.
150k tasks created.
160k tasks created.
170k tasks created.
180k tasks created.
190k tasks created.
COLLISION FOUND! instance number: 196656 hash 0x259 1001011001 matches hash for instance number: 3
COLLISION FOUND! instance number: 196750 hash 0x23B 1000111011 matches hash for instance number: 1
COLLISION FOUND! instance number: 196834 hash 0x27F 1001111111 matches hash for instance number: 5
timer task creation failure on creation number 197207
timer task creation failure on creation number 197214
timer task creation failure on creation number 197221
timer task creation failure on creation number 197228
timer task creation failure on creation number 197235
timer task creation failure on creation number 197242
timer task creation failure on creation number 197249
timer task creation failure on creation number 197256
timer task creation failure on creation number 197263
...

OK, so unexplained issues:

philj404 commented 1 year ago

So I made a change to Timer::task_id():

  inline Task
  task_id(const struct task *const t) {
    const Task id = reinterpret_cast<Task>(t);

#define NEW_HASH
#ifdef NEW_HASH
    return id ? t->id : id;
#else
    return id ? id ^ t->id : id;
#endif
  }

This made the output more consistent.

Running C:\Users\philk\OneDrive\Documents\Arduino\arduTimerCancelTest\arduTimerCancelTest.ino,
Built Feb  4 2023
Size of task: 16 bytes
Size of tasks[] array: 256 bytes
location of tasks[] array: 1000111010 hex: 23A
10k tasks created.
20k tasks created.
30k tasks created.
40k tasks created.
50k tasks created.
60k tasks created.
COLLISION FOUND! instance number: 65536 hash 0x1 1 matches hash for instance number: 1
COLLISION FOUND! instance number: 65537 hash 0x2 10 matches hash for instance number: 2
COLLISION FOUND! instance number: 65538 hash 0x3 11 matches hash for instance number: 3
COLLISION FOUND! instance number: 65539 hash 0x4 100 matches hash for instance number: 4
COLLISION FOUND! instance number: 65540 hash 0x5 101 matches hash for instance number: 5
COLLISION FOUND! instance number: 65541 hash 0x6 110 matches hash for instance number: 6
70k tasks created.
80k tasks created.
90k tasks created.
100k tasks created.
110k tasks created.
120k tasks created.
130k tasks created.
COLLISION FOUND! instance number: 131071 hash 0x1 1 matches hash for instance number: 1
COLLISION FOUND! instance number: 131072 hash 0x2 10 matches hash for instance number: 2
COLLISION FOUND! instance number: 131073 hash 0x3 11 matches hash for instance number: 3
COLLISION FOUND! instance number: 131074 hash 0x4 100 matches hash for instance number: 4
COLLISION FOUND! instance number: 131075 hash 0x5 101 matches hash for instance number: 5
COLLISION FOUND! instance number: 131076 hash 0x6 110 matches hash for instance number: 6
140k tasks created.
150k tasks created.
160k tasks created.
170k tasks created.
180k tasks created.
190k tasks created.
COLLISION FOUND! instance number: 196606 hash 0x1 1 matches hash for instance number: 1
COLLISION FOUND! instance number: 196607 hash 0x2 10 matches hash for instance number: 2
COLLISION FOUND! instance number: 196608 hash 0x3 11 matches hash for instance number: 3
COLLISION FOUND! instance number: 196609 hash 0x4 100 matches hash for instance number: 4
COLLISION FOUND! instance number: 196610 hash 0x5 101 matches hash for instance number: 5
COLLISION FOUND! instance number: 196611 hash 0x6 110 matches hash for instance number: 6
200k tasks created.
210k tasks created.
220k tasks created.
...

-- no creation failures, and the collisions are consistently 64k creations apart. Increasing the size of Task from size_t to unsigned long would probably help reduce the probability of collision.

It would be nice to eliminate any chance of collision, as after all there are relatively few task IDs really in play at any time (compared to running 65000 tasks running at the same time).

One way to avoid a collision would be to check if the generated ID is already in use. A new task ID candidate could search through timer.tasks[] to see if it is in use, and if so increment and try again. That's adding extra overhead to each new task though.

mla2 commented 1 year ago

Ok, I had to make some small hacks to make it compile on the host. I force the tasks to be aligned on a certain boundary to easily show the problem and removed some include path see here : https://github.com/mla2/arduino-timer/commit/38455334f7ce3d9b53aae017e592d6f5fa039999 Then I created a small test program that looks like this:

#include <stdio.h>
#include <stdlib.h>
#include <cstdint>
#include "arduino-timer.h"

auto timer = Timer<2, []() -> unsigned long {
  static int milli = 0;
  return ++milli;
}>();

int main(int argc, char *argv[]) {
  // Create task 1 to steal the first slot first
  auto t1 = timer.at(1000L, [](void*) { printf("I am t1\n"); return true; });
  // task 2 ends up in second slot
  auto t2 = timer.at(1000L, [](void*) { printf("I am t2 OH NO!\n"); return true; });
  printf("t2 0x%016lx\n", t2);

  for (int i = 0; i < 64; i++) {
    // cancels the t1 timer which frees the first slot in the task list
    timer.cancel(t1);
    // Creates a new task in first slot which xor:s in ctr++
    t1 = timer.at(1000L, [](void*) {  printf("I am t1\n"); return true; });
    if (t1 == t2) {
      printf("on iteration %d: tasks t1 0x%016lx and t2 0x%016lx, id collision\n", i,  t1, t2);
      printf("cancelling t2\n");
      timer.cancel(t2);
      break;
    }
  }
  for (int i = 0; i < 5000; i++)
    timer.tick();
}

compile with g++ -std=c++20 -o main testprog.cpp When I run this program, this happens:

mla@LAPTOP-BGH5GDU3:~/bugstuff$ ./main
t2 0x0000555f63559032
on iteration 47: tasks t1 0x0000555f63559032 and t2 0x0000555f63559032, id collision
cancelling t2
I am t2 OH NO!

As you clearly can see we have an id-collision here, which exactly what I see on my embedded system at times.

Depending on what your toolchain generates, the task size will be different so the iteration count may wary. I hope this clearly demonstrate the current code is very problematic in some cases.

That said, I see why my patch is problematic as a general solution. It works in my case because of how I use the library..

contrem commented 1 year ago

@mla2 yep - the check in cancel is incorrect. it should be task_id ^ task == task->id which would make these non-collisions.

We've also discovered another problem -> 0 is a valid task id. This breaks the assumptions of the task_id function, and the return value of all the timer add functions.

We have two options - fix the check and make sure task_id is never 0, or revert the patch that added the counter.

At the moment, I'm for reverting the counter - I only added it to try and help prevent stale id cancellation (the example in the commit message), but clearly it's more problematic than it's worth.

mla2 commented 1 year ago

I'm in no rush, so for me my problem is solved.. I guess it depends what the "community" wants? I'm not a regular in arduino community, so either sounds fine to me. Reverting and and document that you need to clear the task-handle in the callback on one time firing callbacks (I think that is safe?) maybe?

philj404 commented 1 year ago

@contrem Yes, rather than try to make the timer bulletproof it might be best to just say "Do not call timer.cancel( ) with a stale Task (that was already cancelled or finished execution)."

It might still be nice to null out the Task passed in to timer.cancel(), and ignore the NULL if it is ever passed in. The only time a Task could legitimately be 0 is if timer.tasks[0] were allocated at address 0 -- is that possible for some architecture?

philj404 commented 1 year ago

An easy way to make sure task_id is not zero is to see if it is zero, increment ctr and try again.

But that doesn't solve the collision problem.

contrem commented 1 year ago

The two fix options are at fix/revert-task-counter and fix/cancel

contrem commented 1 year ago

The only time a Task could legitimately be 0 is if timer.tasks[0] were allocated at address 0 -- is that possible for some architecture?

@philj404 if the architecture allows this, then the compiler is supposed to make the NULL value something else.

philj404 commented 1 year ago

I tried https://github.com/contrem/arduino-timer/compare/master...fix/revert-task-counter and https://github.com/contrem/arduino-timer/tree/fix/cancel with my test code on an Arduino Nano.

(EDIT: I was mistaken -- I only really checked fix/cancel. See later comments.)

Since timer.cancel() returns after removing the first task it matches, I suspect it deleted the wrong task in my case (because I created the long-term tasks first).

philj404 commented 1 year ago

I think we want to use ctr++, or the slot address, but not xor mix them together.

Note slot pointers in the tasks[] array are really only going to have a few unique values -- for 16 slots by default. Slot addresses will have many identical bits. For example with 16-byte tasks, the bottom 4 bits will always be identical.

I suppose we could reserve a variable number of bits for an array index, and use the rest for a unique counter, just to make recognizing an incorrect timer.cancel() later... but it sounds like a lot of complexity for limited benefit. And it's really hard to test/prove it works correctly.

philj404 commented 1 year ago

(EDIT: fix/revert-task-counter looks promising.)

I tested fix/revert-task-counter and fix/cancel with my test code on an Arduino Nano.

The new test does a double-cancel. It calls timer.cancel() twice using the stale id for the second cancel().

On both branches, the second cancel() finds and removes another task. EDIT:

Below is output for fix/cancel.

Running C:\Users\philk\OneDrive\Documents\Arduino\arduTimerCancelTest\arduTimerCancelTest.ino,
Built Feb  6 2023
Timer now has 1 tasks
Timer now has 2 tasks
Timer now has 3 tasks
Timer now has 4 tasks
Timer now has 5 tasks
Timer now has 6 tasks
Timer now has 7 tasks
Timer now has 8 tasks
Timer now has 9 tasks
Timer now has 10 tasks
Timer now has 11 tasks
Timer now has 12 tasks
Timer now has 13 tasks

(At this point the timer should have a consistent 13 active tasks)

COLLISION FOUND! instance number: 118 hash 0x28A 1010001010 matches hash for instance number: 6
Timer now has 12 tasks
COLLISION FOUND! instance number: 161 hash 0x23D 1000111101 matches hash for instance number: 1
COLLISION FOUND! instance number: 164 hash 0x268 1001101000 matches hash for instance number: 4
COLLISION FOUND! instance number: 165 hash 0x279 1001111001 matches hash for instance number: 5
Timer now has 11 tasks
Timer now has 10 tasks
Timer now has 9 tasks
COLLISION FOUND! instance number: 182 hash 0x28A 1010001010 matches hash for instance number: 6
COLLISION FOUND! instance number: 226 hash 0x24E 1001001110 matches hash for instance number: 2
Timer now has 8 tasks
COLLISION FOUND! instance number: 243 hash 0x25F 1001011111 matches hash for instance number: 3
Timer now has 7 tasks
10k tasks created.
20k tasks created.
30k tasks created.
40k tasks created.
50k tasks created.
60k tasks created.

(number of tasks is constant -- all the static tasks have been removed by now)

For fix/revert-task-counter the output is better behaved:

Running C:\Users\philk\OneDrive\Documents\Arduino\arduTimerCancelTest\arduTimerCancelTest.ino,
Built Feb  6 2023
Timer now has 1 tasks
Timer now has 2 tasks
Timer now has 3 tasks
Timer now has 4 tasks
Timer now has 5 tasks
Timer now has 6 tasks
Timer now has 7 tasks
Timer now has 8 tasks
Timer now has 9 tasks
Timer now has 10 tasks
Timer now has 11 tasks
Timer now has 12 tasks
Timer now has 13 tasks
10k tasks created.
20k tasks created.
30k tasks created.
40k tasks created.
50k tasks created.
60k tasks created.
...

Here's my test for Arduino Nano:

//#include "arduino-timer.h"  // (my local copy to access internal variables)
#include <arduino-timer.h>

auto timer = timer_create_default();  // create a timer with default settings

// a generic task
bool toggle_led(void*) {
  digitalWrite(LED_BUILTIN, !digitalRead(LED_BUILTIN));  // toggle the LED
  return true;                                           // repeat? true
}

struct TaskInfo {
  Timer<>::Task id;  // id to use to cancel
  unsigned long instanceNumber;
};

static const int numTaskIds = 20;
TaskInfo tasksToCancel[numTaskIds] = { 0 };

unsigned long creationAttempts = 0;

void createTask(int index) {
  TaskInfo& taskInfo = tasksToCancel[index];

  if (taskInfo.id != (Timer<>::Task)NULL) {
    auto staleId = taskInfo.id;
    timer.cancel(taskInfo.id);
    timer.cancel(staleId); // double cancel hits another task?
  }

  auto newId = timer.every(250, toggle_led);
  ++creationAttempts;

  static size_t timerSize = 0;
  auto newSize = timer.size();
  if (timerSize != newSize) {
    timerSize = newSize;
    Serial.print(F("Timer now has "));
    Serial.print(timerSize);
    Serial.println(F(" tasks"));
  }

  if (newId == 0) {
    Serial.print(F("timer task creation failure on creation number "));
    Serial.println(creationAttempts);
    //while (true) {};

  } else {

    // check for collisions before saving taskInfo
    for (int i = 0; i < numTaskIds; i++) {
      const TaskInfo& ti = tasksToCancel[i];
      if (ti.id == newId) {
        Serial.print(F("COLLISION FOUND! instance number: "));
        Serial.print(creationAttempts);
        Serial.print(F(" hash 0x"));
        Serial.print(newId, HEX);
        Serial.print(F(" "));
        Serial.print(newId, BIN);

        Serial.print(F(" matches hash for instance number: "));
        Serial.println(ti.instanceNumber);
      }
    }
    taskInfo.id = newId;
    taskInfo.instanceNumber = creationAttempts;

    static const unsigned long reportCountTime = 10000;
    if (creationAttempts % reportCountTime == 0) {
      Serial.print(creationAttempts / 1000);
      Serial.println(F("k tasks created."));
    }
  }
}

void setup() {
  delay(1000);         // wait for stability on some boards to prevent garbage Serial
  Serial.begin(9600);  // ESP8266 default of 74880 not supported on Linux
  //Serial.begin(115200);  // ESP8266 default of 74880 not supported on Linux
  while (!Serial)
    ;  // for the Arduino Leonardo/Micro only

  Serial.println(F("Running " __FILE__ ",\nBuilt " __DATE__));

  pinMode(LED_BUILTIN, OUTPUT);  // set LED pin to OUTPUT
  // call the toggle_led function every 1000 millis (1 second)
  //timer.every(1000, toggle_led);

#ifdef SHOW_INTERNAL_VARIABLES
  // show internal variables (when timer has nothing private)
  Serial.print(F("Size of task: "));
  Serial.print(sizeof(Timer<>::task));
  Serial.println(F(" bytes"));

  Serial.print(F("Size of tasks[] array: "));
  Serial.print(sizeof(Timer<>::tasks));
  Serial.println(F(" bytes"));

  Serial.print(F("location of tasks[] array: "));
  Serial.print((unsigned long)&timer.tasks, BIN);
  Serial.print(F(" hex: "));
  Serial.println((unsigned long)&timer.tasks, HEX);
#endif

  // timer capacity is 0x10 -- stay below
  // load up some static tasks
  for (int i = 0; i < 6; i++) {
    createTask(i);
  }

  // cancel/recreate tasks
  for (unsigned long groups = 0; groups < 20000000UL; groups++) {
    for (int i = 9; i < 0x10; i++) {
      createTask(i);
    }
  }
  Serial.println(F("Done."));
}

void loop() {
  timer.tick();  // tick the timer
}
philj404 commented 1 year ago

I beg your pardon! It looks like fix/revert-task-counter works better than I had claimed! (I checked out the branch incorrectly)

But fix/cancel still has id collisions.

contrem commented 1 year ago

Looks like we'll go with reverting. Thanks for testing it out @philj404. Thanks @mla2 and @adelin-mcbsoft for bringing up the issue and providing examples.

adelin-mcbsoft commented 1 year ago

Thank you as well, @contrem . Having you actively maintaining this library is blissful.

By the way, I don't want to hijack the topic with a new thing, but since my proposal is a small change, I don't know if it's worth opening a new issue.

I was thinking, since you're already making (somewhat breaking[1]) changes to the library now and issuing a new version, can we consider making the cancel(Task &task) method a bool which would return true if the task is found and got canceled, and false otherwise?

/* Cancel the timer task */
    bool
    cancel(Task &task)
    {
        struct task * const t = static_cast<struct task * const>(task);
        if (t) {
          remove(t);
          task = static_cast<Task>(NULL);
          return true;
        } else {
          return false;
        }

    }

It is useful to know if a timer was found (and got indeed canceled) or not. (e.g. perhaps you're trying to cancel and non-existing ID and so on...)


Notes: [1] - I said "somewhat breaking changes" as anybody who had a Serial.print(TaskID) in the code, will now receive a compiling error since the ID is no longer (casted as) an int.

contrem commented 1 year ago

Fix and suggested enhancement is in next and will be released with version 3.0.0

philj404 commented 1 year ago

Thanks for the update!

For fun I created a unit test which can detect that master and fix/cancel have some issues, but fix/revert-task-counter does not have them -- it does not (seem to) have ID collisions, return NULL ids or have problems with double-cancel. That's all I test for.

The test can detect problems, but it can be hard to interpret the what caused a failure. The biggest advantage is that even a build robot can recognize the failure.

I don't think I would want to include it in regular regression testing because:

... But... if you want to play with a regression test, here's an example. 😁

// arduino-timer unit tests
// Arduino "AUnit" library required

// Required for UnixHostDuino emulation
#include <Arduino.h>

#if defined(UNIX_HOST_DUINO)
#ifndef ARDUINO
#define ARDUINO 100
#endif
#endif

#include <AUnit.h>
#include <arduino-timer.h>

//#define DEBUG
#ifdef DEBUG
#define DEBUG_PRINT(x) Serial.print(x)
#define DEBUG_PRINTLN(x) Serial.println(x)
#else
#define DEBUG_PRINT(x)
#define DEBUG_PRINTLN(x)
#endif

auto timer = timer_create_default();  // create a timer with default settings

// a generic task
bool dummyTask(void*) {
  //digitalWrite(LED_BUILTIN, !digitalRead(LED_BUILTIN));  // toggle the LED
  return true;  // repeat? true
}

struct TaskInfo {
  Timer<>::Task id;  // id to use to cancel
  unsigned long instanceNumber;
};

static const int numTaskIds = 20;

TaskInfo tasksToCancel[numTaskIds] = { 0 };

unsigned long creationAttempts = 0;

bool createTask(int index) {
  bool successful = true;  // OK so far
  TaskInfo& taskInfo = tasksToCancel[index];

  if (taskInfo.id != (Timer<>::Task)NULL) {
    // cancel task in slot
    auto staleId = taskInfo.id;
    int beforeSize = (int)timer.size();
    timer.cancel(taskInfo.id);
    int afterSize = (int)timer.size();
    successful &= (afterSize == beforeSize - 1);
    if (!successful) {
      Serial.println(F("could not cancel a task"));
    } else {

      timer.cancel(staleId);  // double cancel should not hit another task
      int afterSize2 = (int)timer.size();
      successful &= (afterSize2 == beforeSize - 1);
      if (!successful) {
        Serial.println(F("second cancel removed another task"));
      }
    }
  }

  auto newId = timer.every(250, dummyTask);
  ++creationAttempts;

  static size_t timerSize = 0;
  auto newSize = timer.size();
  if (timerSize != newSize) {
    timerSize = newSize;
    DEBUG_PRINT(F("Timer now has "));
    DEBUG_PRINT(timerSize);
    DEBUG_PRINTLN(F(" tasks"));
  }

  if (newId == 0) {
    Serial.print(F("timer task creation failure on creation number "));
    Serial.println(creationAttempts);
    successful = false;

  } else {

    // check for collisions before saving taskInfo
    for (int i = 0; i < numTaskIds; i++) {
      const TaskInfo& ti = tasksToCancel[i];
      if (ti.id == newId) {
        successful = false;
        Serial.print(F("COLLISION FOUND! instance number: "));
        Serial.print(creationAttempts);
        Serial.print(F(" hash 0x"));
        Serial.print(((size_t)newId, HEX));
        Serial.print(F(" "));
        Serial.print(((size_t)newId, BIN));

        Serial.print(F(" matches hash for instance number: "));
        Serial.println(ti.instanceNumber);
      }
    }
    taskInfo.id = newId;
    taskInfo.instanceNumber = creationAttempts;

    static const unsigned long reportCountTime = 10000;
    if (creationAttempts % reportCountTime == 0) {
      DEBUG_PRINT(creationAttempts / 1000);
      DEBUG_PRINTLN(F("k tasks created."));
    }
  }
  return successful;
}

test(timerMultiCancel) {
  timer.cancel();  // ensure timer starts empty
  assertEqual((int)timer.size(), 0);
  creationAttempts = 0;

  // timer capacity is 0x10 -- stay below
  // load up some static tasks
  for (int i = 0; i < 6; i++) {
    assertTrue(createTask(i));
  }

  assertEqual((int)timer.size(), 6);

  // cancel/recreate tasks
  for (unsigned long groups = 0; groups < 30000UL; groups++) {
    for (int i = 9; i < 0x10; i++) {
      assertTrue(createTask(i));
      if (groups > 0) {
        // should be steady-state task size now
        assertEqual((int)timer.size(), 13);
      }
    }
  }
  Serial.print(creationAttempts);
  Serial.println(F(" tasks created."));
}

void sketch(void) {
  Serial.println();
  Serial.println(F("Running " __FILE__ ", Built " __DATE__));
}

void setup() {
  ::delay(1000UL);         // wait for stability on some boards to prevent garbage Serial
  Serial.begin(115200UL);  // ESP8266 default of 74880 not supported on Linux
  while (!Serial)
    ;  // for the Arduino Leonardo/Micro only
  sketch();
}

void loop() {
  // Should get:
  // TestRunner summary:
  //    <n> passed, <n> failed, <n> skipped, <n> timed out, out of <n> test(s).
  aunit::TestRunner::run();
}
contrem commented 1 year ago

fix included in release 3.0.0 - closing