Device re-joins once per day after being silent for 2 hours

jpmeijers commented 3 years ago

Behaviour I observe

Looking at the frame counter of my device over a couple of days I see the following:

The device transmits an uplink every 10 minutes. Once per day the device goes silent for almost 2 hours, then performs a re-join, and then continues for another day before repeating the same behaviour.

Intended behaviour

The device should only join after startup and transmit a frame every 10 minutes. The frame counter should increase forever, until it overflows or the device is restarted.

My device needs to be low power, so it goes to sleep between transmissions. To make sure a transmission is done before going to sleep, I use a blocking transmit function that looks like this:

void tx_blocking(uint8_t* radioPacket, uint8_t packetLength)
{
  do_send(&sendjob, radioPacket, packetLength);

  while(LMIC.opmode & (OP_JOINING|OP_TXRXPEND)) {
    os_runstep();
    delay(10);
  }
  while( (LMIC.opmode & OP_TXDATA) /*&& (millis() - start < TX_TIMEOUT)*/ )
  {
    os_runstep();
    delay(10);
  }

  Serial.println("TX done");
}

My board wakes up every 8 seconds to perform periodic tasks. To make sure the millis() and micros() timers of Arduino continue counting correctly, I update them directly after each wakeup as follows:

void updateMillis(unsigned long new_millis){
  uint8_t oldSREG = SREG;
  cli();
  timer0_millis += new_millis;
  //timer0_overflow_count += 7812; // 4us per timer tick, overflow at 256, 1024us per overflow, 8000000us/1024 = 7812.5
  timer0_overflow_count+= 8 * 64 * clockCyclesPerMicrosecond();
  sei();

  SREG = oldSREG;
}

I have not yet been able to capture debug logs when the rejoin happens. I currently have two theories:

BasicMac performs a re-join for some reason.
My WDT resets the board because something delays for more than 24 seconds.

I will update this issue when I have discovered the reason for the re-join.

In the meantime, is there a low power best practice for using BasicMac? And am I somehow doing something wrong?

matthijskooijman commented 3 years ago

Interesting. I have been doing some practical tests with BasicMAC for our new MJS board as well and saw something different: The board sends data for some 10-20 hourss, then is silent for 10 hours, sends data for 10 hours, is silent for 10 hours, sends data for 10 hours, etc. In my case, no rejoin happens, though. I've only tested up to a few days, since then my battery was empty (still need to work on deeper sleeping). This might very well be a completely separate problem, but it's similar enough to mention.

BasicMac performs a re-join for some reason. My WDT resets the board because something delays for more than 24 seconds.

Yup, that's also the options that I would think of. I do wonder why, especially in the WDT reset case, there is 2 hour silence before the rejoin...

In the meantime, is there a low power best practice for using BasicMac? And am I somehow doing something wrong?

Your code looks good, it's pretty much what I'd do (though my current test setup actually waits for an event rather than checking LMIC.opmode, for legacy reasons).

matthijskooijman commented 3 years ago

I just realized that 2³¹ basicmac ticks of 16μs each is pretty much exactly 9,5 hours, so that suggests that my problem is also caused by BasicMAC.

I also remember that I didn't actually implement the osxticks stuff in the Arduino hal yet (64-bit timestamps), so it does not seem unlikely that that is actually the culprit here. IIRC these longer timestamps are rarely used (and I rather dislike them, so I'm secretly pondering if we cannot just remove them altogether), but maybe this does cause problems now...

matthijskooijman commented 3 years ago

I made a small addition to the debug printing to print the raw (hex) values of timestamps in addition to the human-readable version that is now shown in debug output. In case it's also useful for you, here's the diff:

--- a/lmic/debug.c
+++ b/lmic/debug.c
@@ -199,7 +199,19 @@ static int debug_vsnprintf(char *str, int size, const char *format, va_list arg)
                     }
                 #endif
                     char buf[12], *p = buf;
-                    uint64_t t = ((c == 'T') ? va_arg(arg, uint64_t) : va_arg(arg, uint32_t)) * 1000 / OSTICKS_PER_SEC;
+                    char num[33];
+
+                    uint64_t val = ((c == 'T') ? va_arg(arg, uint64_t) : va_arg(arg, uint32_t));
+                    uint64_t t = val * 1000 / OSTICKS_PER_SEC;
+
+                    if (c == 'T') {
+                        int len = debug_itoa(num, val >> 32, 16, 0, 0, 0, 0);
+                        dst += strpad(dst, end - dst, num, len, 8, 0, '0');
+                    }
+                    int len = debug_itoa(num, val, 16, 0, 0, 0, 0);
+                    dst += strpad(dst, end - dst, num, len, 8, 0, '0');
+                    *dst++ = ' ';
+
                     int ms = t % 1000;
                     t /= 1000;
                     int sec = t % 60;

jpmeijers commented 3 years ago

I was logging the output from my device the past 24 hours, and I finally managed to catch a frame counter reset:

Around the time of the re-join I see the following in my log output (green line in graph above):

Loops = 11924
96148797ms
1998313160us

Loops = 11925
96156801ms
2006705904us
Should be every 10 minutes
08:54:31.529: engineUpdate[opmode=0x808]
Packet queued
Compiled: Nov 18 2020, 16:21:31, 7.3.0
Clock cycles per us: 8
App EUI: <redacted>
Dev EUI: <redacted>
App Key: <redacted>
63898: EV_JOINING

What we see in the above logs is the number of loops/sleeps that have been done, along with the millis() and micros() values, being printed out every 8 seconds.

Then we see that every 75 loops, ie 10 minutes, a TX is queued.

The next line reads "Compiled: ..." which is the first log output after the device starts up.

From this I deduce that the board restarted. This is most likely due to a WDT reset, because the blocking TX took more than 24 seconds.

The gap between the last message and the first one after restart is 14 minutes. Subtract the 10/11 minutes of sleep before the restart leaves us with about 3 minutes. We can see the join took almost 2 minutes. And the WDT timeout before the board is reset is 32s. It feels like a bit of time is missing somewhere, but we're close enough that I believe all we see here is a WDT reset. The other device (yellow line in the graph) is at a remote location, and likely it struggles to join the network, therefore the 2 hours of silence.

The last couple of TXs before restart were done at SF9, which is also strange as I have plenty of gateways around and the device should stick to SF7. I however do see that it keeps on missing the downlink windows. I believe this is also the reason joins take so long (https://github.com/LacunaSpace/basicmac/issues/24). My devices are based on Atmel AVRs with external 8MHz crystal oscillators, so timing should be good.

Full log attached. basicmac-lockup.txt

pe1mew commented 3 years ago

In Apeldoorn we are running a number of P1-nodes that use BasicMAC and we observe the same behaviour as described before. The nodes stop transmitting for hours and than suddenly continue.

The P1-nodes are running a LILYGO TTGO LORA32 868 Mhz ESP32 LoRa OLED node. See: https://github.com/pe1mew/esp32-p1-node .

afbeelding

From 17-11 to 22-11 we have logged the output of one node and found that in 4 days the node (re)joined 7 times. The current logging is not providing good timing details as with JP. I will update the code with the timestap modification earlier presented in the issue.

This is full log of the test device.

P1-Log_001.txt

matthijskooijman commented 3 years ago

@jpmeijers, agreed, a WDT reset sounds plausible in your case. To confirm, you can try figuring out the MCUSR startup value (which might be tricky, since bootloaders often clear it because they use it to reboot after upload and leaving the WDT bit set forces the WDT to be enabled). Sometimes the reset reason is stored in another register or memory location, though.

What's weird is that there's no meaningful debug output from just before the reset, just the Packet queued that is printed by the sketch, I'm assuming. Maybe you could try enabling verbose debug output (in target-config.h), which I think is not enabled now?

Another thought is that maybe an assertion triggers? That runs into an infinite loop and would trigger the WDT. HOwever, it shoudl also print something, and on AVR, I think you shouldn't run into problems with interrupts and serial printing on assert (should be fixed everywhere, but the AVR core is especialy tolerant of e.g. printing with interrupts disabled).

@pe1mew, looking at your image, your problem seems different: Both what @jpmeijers and myself are seeing is very regular, while in your image I see some nodes without gaps, and some nodes with gaps, but they are not always the same duration and not equally spaced?

As for my own problem, I've now found that it is indeed related to the missing support for the osxticks (64-bit ticks) in the HAL. What happens in my sketches is that as soon as the 32-bit ticks overflows back to 0, the "avail" timers of all channels are around 0xF0000000, so close to the maximum. Then, nextTx_dyn compares the current time, and concludes that the channels are all unavailable for the next 18 hours or so. This is because it converts the current and available times to osxtime and compares that, but since the current osxtime is not implemented, both times overflow in 32-bits, but only 64-bit overflow is properly handled.

However, the TX time returned by nextTx_dyn is again converted to 32-bits, so engineUpdate again does 32-bit comparison of timestamps. handling 32-bit overflow properly, so even nextTx_dyn thinks no channels are available, engineUpdate still proceeds with the TX, so this does not immediately break (though during this time, I think my node is sending on the same channel).

However, it seems the available times are now no longer updated, probably because of further 32 vs 64 bit conversions, combined with the special way that available times are stored.

This continues until the current time is around 2³¹ again (by then, engineUpdate still transmits, thinking that the available times are around 9.5 hours in the past). Once it crosses the 2³¹, it now looks like the available times are 9.5 hours in the future, and engineUpdate stops transmitting for the next 9.5 hourse. This last bit keeps repeating 9.5 hours transmitting (on a single channel), 9.5 hours silent.

To fix this, I think I'm going to try and rip out all this osxtime support, since it seems a little broken and convoluted by itself, and even more in the current Arduino HAL and, I think, not strictly needed. Looking at channel/band available times, at 0.1% DC, a max of 2³¹ ticks allows for messages up to 34s airtime, which I think is not possible (256 bytes at SF12 is 14s). And regardless, available times are already stored in a compacted way (~~losing precision on longer intervals~~ as 16-bit seconds, so without too much precision), so this is probably a moot point there. Maybe join delays might be longer than this at some point, but I don't actually think so.

A related issue might be that these timestamps use signed types, and signed overflow is undefined in C/C++ AFAIK. This probably works ok in practice, but in theory, the compiler could make assumptions or optimizations mthat break due to this, so it's probably a good idea to convert to unsigned (and only convert to signed for comparing timestamps after subtraction).

I'm not entirely sure that my issue is actually the same as the one seen by @jpmeijers and @pe1mew, but I'll see if I can fix this issue first, and then we'll see if that maybe also fixes things for you guys?

pe1mew commented 3 years ago

If you have identified an error it is always good to solve that one first and then look further. one thing at a time.

We have registered frame counters too and in addition to the prior observations you find here the graph:

afbeelding

jpmeijers commented 3 years ago

I've ran another experiment, printing out how long the two while loops in my blocking TX function takes. Also increased the WDT timeout to 10 minutes.

The second last successful TX looks like this:

>>>=== Start TX blocking
08:42:59.486: engineUpdate[opmode=0x808]
1961219173: EV_TXSTART
08:42:59.508: engineUpdate[opmode=0x888]
08:42:59.511: TX[mod=LoRa,sf=8,bw=125,cr=4/5,nocrc=0,ih=0,fcnt=158,freq=867.1,pow=16,len=36]: 4059610126809E00019D4FCFF567D6CD227173FA8EC79086CF2DF6ABE4B6F804B381C889
Packet queued
1961231507: EV_TXDONE
08:42:59.705: engineUpdate[opmode=0x888]
08:43:00.688: RX_MODE[mod=LoRa,sf=8,bw=125,cr=4/5,nocrc=1,ih=0,freq=867.1,rxtime=1961292760]
08:43:00.697: WARNING: rxtime is 821 ticks in the past! (ramp-up time 1 ms / 116 ticks)
08:43:00.723: RX: TIMEOUT (525 us)
08:43:01.687: RX_MODE[mod=LoRa,sf=9,bw=125,cr=4/5,nocrc=1,ih=0,freq=869.5,rxtime=1961355324]
08:43:01.695: WARNING: rxtime is 632 ticks in the past! (ramp-up time 1 ms / 116 ticks)
08:43:01.742: RX: TIMEOUT (525 us)
1961443096: EV_TXCOMPLETE (includes waiting for RX windows)
08:43:03.091: engineUpdate[opmode=0x800]
Time waiting for OP_JOINING|OP_TXRXPEND to clear: 3576ms
Time waiting for OP_TXDATA to clear: 0ms
<<<=== End TX blocking

The following TX 10 minutes later, where all goes wrong, looks like this:

...
>>>=== Start TX blocking
08:53:33.109: engineUpdate[opmode=0x808]
Packet queued
Time waiting for OP_JOINING|OP_TXRXPEND to clear: 2ms
<Loop waiting for OP_TXDATA to clear never ends and WDT resets after 10 minutes>
<Device startup>
<Device joins>
<TX with frame counter 0>
...

I think @matthijskooijman has a very good point, and I will wait for the issues he see to be fixed before I continue testing. Likely those are the cause of what I am experiencing.

If you need a different log with extra debug messages, let me know please.

FlorianLudwig commented 3 years ago

Sorry for the offtopic post. I just had to say @pe1mew nice visualizations! They inspire me to update our Grafana dashboards as well :)

matthijskooijman commented 3 years ago

Ok, I just pushed https://github.com/LacunaSpace/basicmac/commits/no-xtime which I think should implement available timing properly without using the (missing) 64-bit timer, which I think should fix my problem (I've done a quick test with a 16x speedup timer, but since that shows availability handling is now ok, it also breaks TX, so I'm not sure it really works. I should know in 2 days or so :-p). Would be good if you could give this a go.

I'm not 100% sure whether this is the most appropriate/best fix, see also https://github.com/mkuyper/basicmac/issues/10 for some additional discussion (I opened that issue there since I'm hoping to merge the work from this repo there soon, and wanted to get @mkuyper's opinion, as he has worked with the code before). But for now, it at least allows moving forward, hopefully.

matthijskooijman commented 3 years ago

Ok, that branch has been running for 3 × 2³¹ ticks now and is still running, so I think my version of this problem is fixed by it. As I said, I'm not sure I like the fix enough to merge it directly, but it would certainly work to see if @pe1mew and @jpmeijers have been seeing the same problem, or maybe some other unrelated cause.

jpmeijers commented 3 years ago

Yes I am seeing positive results too. Normally my device would have rejoined at around 29h, but it is now running for 38h.

jpmeijers commented 3 years ago

~~I'm afraid I have bad news. The device has been running for a while, and started exhibiting the same original behaviour:~~ ![image](https://user-images.githubusercontent.com/948278/101610507-03f25400-3a11-11eb-8a7a-ebb32a713ac2.png)

matthijskooijman commented 3 years ago

@jpmeijers, hm, bummer. But your issue already looked a bit different, so it doesn't really surprise me.

Looking at your graph, it seems that the initial session is just over 8.5 days, and then the subsequent sessions are all pretty much exactly 0.5 days long (though in your earlier tests the sessions were a little more erratic in duration I think). If there is an actual MCU / watchdog reset going on, it seems weird that it would switch from one mode (long session) into another (short sessions), you'd expect nothing to be remembered. So maybe it's not a reset but some other cause?

Some thoughts:

Maybe the reset is related to fcnt? Can you see the highest framecount for each session?
Are you using confirmed uplinks / ADR? Maybe BasicMAC is deciding the session died so it rejoins? Looking at the SF of incoming messages might help here?
Could you capture the serial output (with verbose debug enabled in BasicMAC config)?

jpmeijers commented 3 years ago

I just realised what happened to cause the behaviour in the graph above. By accident I programmed a different firmware version into the test device. Please ignore my last comment (I'll try and remove it).

Two other boards that were running the fixed version of BasicMac has run for two weeks without a problem. Unfortunately I restarted them now, so I won't have longer running results.

pe1mew commented 3 years ago

Hi @matthijskooijman I have updated the library to the latest changes and changed to this fork. Currently, I suffer from this error:

00:00:00.062: E:\arduino_portable\arduino-1.8.12_ESP32\portable\sketchbook\libraries\BasicMAC\src\lmic\radio-sx126x.c:778: assertion failed

Sync bit was not asserted with the right byte. the hardware is OK as it runs without problem with lmic. hardware is a TTGO LORA32 with Oled display V1. The error occurs currently also with the main branch.

I let it rest for now as there are other priorities. sorry I cannot help with this quick test.

LacunaSpace / basicmac

Device re-joins once per day after being silent for 2 hours #25

Behaviour I observe

Intended behaviour