RIOT-OS / RIOT

RIOT - The friendly OS for IoT
https://riot-os.org
GNU Lesser General Public License v2.1
4.91k stars 1.98k forks source link

tests/pkg_semtech-loramac: hardfault on lobaro-lorabox #11626

Closed kYc0o closed 5 years ago

kYc0o commented 5 years ago

Description

Apparently, #11552 broke compatibility with the lobaro-lorabox board.

According to bisect:

b4d905471fac960a79d7defd9faaeb3f0a9f9d2c is the first bad commit
commit b4d905471fac960a79d7defd9faaeb3f0a9f9d2c
Author: Alexandre Abadie <alexandre.abadie@inria.fr>
Date:   Mon May 20 15:17:50 2019 +0200

    pkg/semtech-loramac: persist OTAA join information in eeprom

    This allows to continue sending data even after a reboot of a device that provide non volatile storage, e.g. eeprom

:040000 040000 6d043ac05b1187ed42adc41386f71e3e8674dd29 d815ff9c1ee99e0b21316d54c871f5a654d7f712 M  pkg

Steps to reproduce the issue

Build tests/pkg_semtech-loramac on current master with the lobaro-lorabox board, and flash it:

BOARD=lobaro-lorabox LORA_DRIVER=sx1272 make -C tests/pkg_semtech-loramac/ flash term

Expected results

INFO # �main(): This is RIOT! (Version: 2019.04-devel-1360-g3e33c-HEAD)
INFO # All up, running the shell now

Actual results

INFO # Context before hardfault:
INFO #    r0: 0x00000000
INFO #    r1: 0x00000001
INFO #    r2: 0x00000001
INFO #    r3: 0x00000000
INFO #   r12: 0x00000c02
INFO #    lr: 0x0800205f
INFO #    pc: 0x00000000
INFO #   psr: 0x60000000
INFO # 
INFO # FSR/FAR:
INFO #  CFSR: 0x00020000
INFO #  HFSR: 0x40000000
INFO #  DFSR: 0x00000000
INFO #  AFSR: 0x00000000
INFO # Misc
INFO # EXC_RET: 0xfffffffd
INFO # Attempting to reconstruct state for debugging...
INFO # In GDB:
INFO #   set $pc=0x0
INFO #   frame 0
INFO #   bt
INFO # 
INFO # ISR stack overflowed by at least 16 bytes.

Versions

Operating System Environment
-----------------------------
       Operating System: "Debian GNU/Linux" "9 (stretch)"
                 Kernel: Linux 4.9.0-9-amd64 x86_64 unknown

Installed compiler toolchains
-----------------------------
             native gcc: gcc (Debian 6.3.0-18+deb9u1) 6.3.0 20170516
      arm-none-eabi-gcc: arm-none-eabi-gcc (GNU Tools for Arm Embedded Processors 8-2018-q4-major) 8.2.1 20181213 (release) [gcc-8-branch revision 267074]
                avr-gcc: avr-gcc (GCC) 8.2.0
       mips-mti-elf-gcc: missing
             msp430-gcc: missing
   riscv-none-embed-gcc: missing
   xtensa-esp32-elf-gcc: missing
   xtensa-lx106-elf-gcc: missing
                  clang: missing

Installed compiler libs
-----------------------
   arm-none-eabi-newlib: "3.0.0"
    mips-mti-elf-newlib: missing
riscv-none-embed-newlib: missing
xtensa-esp32-elf-newlib: missing
xtensa-lx106-elf-newlib: missing
               avr-libc: "2.0.0" ("20150208")

Installed development tools
---------------------------
                  cmake: cmake version 3.7.2
               cppcheck: Cppcheck 1.76.1
                doxygen: 1.8.13
                 flake8: 3.2.1 (mccabe: 0.5.3, pyflakes: 1.3.0, pycodestyle: 2.2.0) CPython 3.5.3 on Linux
                    git: git version 2.11.0
                   make: GNU Make 4.1
                openocd: Open On-Chip Debugger 0.10.0+dev-00692-g1c22f5b7d (2019-02-15-13:53)
                 python: Python 2.7.13
                python2: Python 2.7.13
                python3: Python 3.5.3
             coccinelle: spatch version 1.0.4 with Python support and with PCRE support
kYc0o commented 5 years ago

Maybe @aabadie or @fjmolinas have more insight.

kYc0o commented 5 years ago

I think I found the reason... After erasing the whole EEPROM the test behaves more or less correctly, although I get always a kind of lost interrupt:

2019-06-04 14:52:21,151 - INFO # �main(): This is RIOT! (Version: 2019.04-devel-1392-gfcd7f2)
2019-06-04 14:52:21,154 - INFO # [sx127x] netdev: sx127x_on_dio1: unknown state
2019-06-04 14:52:21,155 - INFO # All up, running the shell now

I'm still investigating the real cause.

fjmolinas commented 5 years ago

@kYc0o since lobaro-lorabox is similar to im880b cpu wise, could you see if #11314 fixes your issue? If not I'll try to reproduce.

kYc0o commented 5 years ago

Unfortunately it doesn't solve the problem. I have some weird behaviour on this lobaro board. I tested my application rebased to current master on a b-l072z-lrwan1 and it works correctly. There's something specific on that board which is not compatible with what it was done on #11552.

fjmolinas commented 5 years ago

@kYc0o I'll try to reproduce on IM880b tomorrow, it's the closes hardware I have to lobaro.

Also, does the application work if you don't provide eeprom?

kYc0o commented 5 years ago

Thanks for the advice @fjmolinas ! It actually worked if I disable the eeprom. I'll investigate from there.

fjmolinas commented 5 years ago

@kYc0o I was able to reproduce your issue, although it only appears after the first loramac save.

I can fix the issue by initiating uint8_t dr = 0 at:

https://github.com/RIOT-OS/RIOT/blob/c81ff3862c2acb7e240a102758052cd2816f0434/pkg/semtech-loramac/contrib/semtech_loramac.c#L272

no idea why... I'm guessing some kind of optimization problem

kYc0o commented 5 years ago

Thanks for the pointer! Indeed that solves the hardfault problem, however, I have the impression that reading and saving data into EEPROM might take some time on some "slow" devices, like the stm32l151, or I'm missing something, since the device doesn't get downlink messages.

With periph_eeprom disabled, the device behaves correctly and sends the confirmable messages without any problem, but when it's enabled the device is unable to receive the ACK:

2019-06-05 11:50:42,538 - INFO # �[semtech-loramac] initializing loramac
2019-06-05 11:50:42,539 - INFO # [semtech-loramac] reading configuration from EEPROM
2019-06-05 11:50:42,540 - INFO # [semtech-loramac] reading uplink counter: 0 
2019-06-05 11:50:42,541 - INFO # [semtech-loramac] reading rx2 freq: 869525000
2019-06-05 11:50:42,542 - INFO # [semtech-loramac] reading rx2 dr: 0
2019-06-05 11:50:42,543 - INFO # [semtech-loramac] set join state ? 1
2019-06-05 11:50:42,545 - INFO # main(): This is RIOT! (Version: 2019.04-devel-1399-g8fa40-lora_cayennelpp)
2019-06-05 11:50:42,546 - INFO # [semtech-loramac] initializing loramac
2019-06-05 11:50:42,547 - INFO # [semtech-loramac] reading configuration from EEPROM
2019-06-05 11:50:42,547 - INFO # [semtech-loramac] reading uplink counter: 0 
2019-06-05 11:50:42,549 - INFO # [semtech-loramac] reading rx2 freq: 869525000
2019-06-05 11:50:42,550 - INFO # [semtech-loramac] reading rx2 dr: 0
2019-06-05 11:50:42,551 - INFO # [semtech-loramac] set join state ? 1
2019-06-05 11:50:42,552 - INFO # [semtech-loramac] Starting join procedure: 0
2019-06-05 11:50:42,552 - INFO # [semtech-loramac] network is already joined
2019-06-05 11:50:42,553 - INFO # Current Time:   2019-07-03 17:56:00
2019-06-05 11:50:42,554 - INFO # Next Alarm:   2019-07-03 17:59:00
2019-06-05 11:53:41,949 - INFO # Current Alarm:   2019-07-03 17:59:00
2019-06-05 11:53:41,952 - INFO # Current time:   2019-07-03 17:59:00
2019-06-05 11:53:41,957 - INFO # Next Alarm:   2019-07-03 18:02:00
2019-06-05 11:53:41,958 - INFO # Sending temperature...
2019-06-05 11:53:42,720 - INFO # [semtech-loramac] loramac cmd msg
2019-06-05 11:53:42,723 - INFO # [semtech-loramac] send frame g
2019-06-05 11:53:42,728 - INFO # [semtech-loramac] MCPS request: confirmed TX
2019-06-05 11:53:42,735 - INFO # [semtech-loramac] MCPS request: OK
2019-06-05 11:53:43,737 - INFO # [semtech-loramac] MAC timer timeout
2019-06-05 11:53:44,055 - INFO # [semtech-loramac] Transmission completed
2019-06-05 11:53:44,744 - INFO # [semtech-loramac] MAC timer timeout
2019-06-05 11:53:44,861 - INFO # [semtech-loramac] MAC timer timeout
2019-06-05 11:53:45,522 - INFO # [semtech-loramac] RX timer timeout
2019-06-05 11:53:45,751 - INFO # [semtech-loramac] MAC timer timeout
2019-06-05 11:53:45,865 - INFO # [semtech-loramac] MAC timer timeout
2019-06-05 11:53:46,525 - INFO # [semtech-loramac] RX timer timeout
2019-06-05 11:53:46,757 - INFO # [semtech-loramac] MAC timer timeout
2019-06-05 11:53:47,658 - INFO # [semtech-loramac] MAC timer timeout
2019-06-05 11:53:47,764 - INFO # [semtech-loramac] MAC timer timeout
2019-06-05 11:53:48,771 - INFO # [semtech-loramac] MAC timer timeout
2019-06-05 11:53:49,777 - INFO # [semtech-loramac] MAC timer timeout
2019-06-05 11:53:50,784 - INFO # [semtech-loramac] MAC timer timeout
2019-06-05 11:53:51,790 - INFO # [semtech-loramac] MAC timer timeout
2019-06-05 11:53:52,797 - INFO # [semtech-loramac] MAC timer timeout
2019-06-05 11:53:53,803 - INFO # [semtech-loramac] MAC timer timeout
2019-06-05 11:53:54,810 - INFO # [semtech-loramac] MAC timer timeout
2019-06-05 11:53:55,816 - INFO # [semtech-loramac] MAC timer timeout
2019-06-05 11:53:56,823 - INFO # [semtech-loramac] MAC timer timeout
2019-06-05 11:53:57,829 - INFO # [semtech-loramac] MAC timer timeout
2019-06-05 11:53:58,836 - INFO # [semtech-loramac] MAC timer timeout
2019-06-05 11:53:59,842 - INFO # [semtech-loramac] MAC timer timeout
2019-06-05 11:54:00,849 - INFO # [semtech-loramac] MAC timer timeout
.
.
.

It just keeps the timeout indefinitely. Did you experience something similar?

aabadie commented 5 years ago

Did you experience something similar?

Are you using ABP activation mode ? Are you using TTN network ? If yes to both, then this means you must set RX2 datarate to 3. Otherwise the RX2 frame might miss messages sent by the server.

kYc0o commented 5 years ago

Are you using ABP activation mode ? Are you using TTN network ?

No to both, I'm using my own gateway with a RPi and an iC880a concentrator. I'll try anyways to set the RX2 datarate to 3 to see what happens. What attire my attention is that this is not needed when the EEPROM operations don't take place, thus it seems to me more a timing problem. I also noticed that such operations are faster on a b-l072z-lrwan1, due to its faster processor.

fjmolinas commented 5 years ago

@kYc0o I usually avoid using DEBUG messages here, it can mess up the timing of the windows. For examples using IM880b I'm not able to successfully join when DEBUG messages are enabled (the same ones you have), the message is delivered but the downlinnk isnt captured. Can you try without DEBUG?

kYc0o commented 5 years ago

This is a completed transmission with confirmable messages reaching my server:

2019-06-05 12:06:17,237 - INFO # �[semtech-loramac] initializing loramac
2019-06-05 12:06:17,238 - INFO # main(): This is RIOT! (Version: 2019.04-devel-1399-g8fa40-lora_cayennelpp)
2019-06-05 12:06:17,239 - INFO # [semtech-loramac] initializing loramac
2019-06-05 12:06:17,240 - INFO # [semtech-loramac] Starting join procedure: 0
2019-06-05 12:06:17,241 - INFO # [semtech-loramac] loramac cmd msg
2019-06-05 12:06:17,242 - INFO # [semtech-loramac] starting OTAA join
2019-06-05 12:06:17,666 - INFO # [semtech-loramac] MAC timer timeout
2019-06-05 12:06:18,148 - INFO # [semtech-loramac] Transmission completed
2019-06-05 12:06:18,673 - INFO # [semtech-loramac] MAC timer timeout
2019-06-05 12:06:19,680 - INFO # [semtech-loramac] MAC timer timeout
2019-06-05 12:06:20,687 - INFO # [semtech-loramac] MAC timer timeout
2019-06-05 12:06:21,694 - INFO # [semtech-loramac] MAC timer timeout
2019-06-05 12:06:22,701 - INFO # [semtech-loramac] MAC timer timeout
2019-06-05 12:06:22,969 - INFO # [semtech-loramac] MAC timer timeout
2019-06-05 12:06:23,708 - INFO # [semtech-loramac] MAC timer timeout
2019-06-05 12:06:23,842 - INFO # [semtech-loramac] unexpected netdev event received: 1
2019-06-05 12:06:23,972 - INFO # [semtech-loramac] MAC timer timeout
2019-06-05 12:06:24,496 - INFO # [semtech-loramac] MLME confirm event
2019-06-05 12:06:24,503 - INFO # [semtech-loramac] MLME confirm msg received
2019-06-05 12:06:24,504 - INFO # [semtech-loramac] join succeeded
2019-06-05 12:06:24,506 - INFO # Current Time:   2019-07-03 17:56:00
2019-06-05 12:06:24,509 - INFO # Next Alarm:   2019-07-03 17:59:07
2019-06-05 12:06:24,513 - INFO # [semtech-loramac] Starting join procedure: 0
2019-06-05 12:06:24,517 - INFO # [semtech-loramac] network is already joined
2019-06-05 12:06:24,520 - INFO # Current Time:   2019-07-03 17:56:00
2019-06-05 12:06:24,524 - INFO # Next Alarm:   2019-07-03 17:59:07
2019-06-05 12:09:24,505 - INFO # Current Alarm:   2019-07-03 17:59:07
2019-06-05 12:09:24,509 - INFO # Current time:   2019-07-03 17:59:07
2019-06-05 12:09:24,514 - INFO # Next Alarm:   2019-07-03 18:02:07
2019-06-05 12:09:24,515 - INFO # Sending temperature...
2019-06-05 12:09:25,276 - INFO # [semtech-loramac] loramac cmd msg
2019-06-05 12:09:25,279 - INFO # [semtech-loramac] send frame g
2019-06-05 12:09:25,284 - INFO # [semtech-loramac] MCPS request: confirmed TX
2019-06-05 12:09:25,291 - INFO # [semtech-loramac] MCPS request: OK
2019-06-05 12:09:26,292 - INFO # [semtech-loramac] MAC timer timeout
2019-06-05 12:09:26,611 - INFO # [semtech-loramac] Transmission completed
2019-06-05 12:09:27,299 - INFO # [semtech-loramac] MAC timer timeout
2019-06-05 12:09:27,416 - INFO # [semtech-loramac] MAC timer timeout
2019-06-05 12:09:28,303 - INFO # [semtech-loramac] unexpected netdev event received: 1
2019-06-05 12:09:28,306 - INFO # [semtech-loramac] MAC timer timeout
2019-06-05 12:09:28,419 - INFO # [semtech-loramac] MAC timer timeout
2019-06-05 12:09:28,795 - INFO # [semtech-loramac] MCPS confirm event
2019-06-05 12:09:28,797 - INFO # [semtech-loramac] MCPS indication event
2019-06-05 12:09:28,802 - INFO # [semtech-loramac] MCPS confirm msg received
2019-06-05 12:09:28,805 - INFO # [semtech-loramac] MCPS confirm event OK
2019-06-05 12:09:28,809 - INFO # [semtech-loramac] MCPS confirm event: CONFIRMED
2019-06-05 12:09:28,815 - INFO # [semtech-loramac] forward TX status to sender thread
2019-06-05 12:09:28,819 - INFO # [semtech-loramac] MCPS indication msg received
2019-06-05 12:09:28,822 - INFO # [semtech-loramac] MCPS indication Unconfirmed
2019-06-05 12:09:28,826 - INFO # [semtech-loramac] MCPS indication: ACK received
2019-06-05 12:09:28,833 - INFO # [semtech-loramac] received something
2019-06-05 12:09:28,834 - INFO # Received ACK from network
kYc0o commented 5 years ago

Can you try without DEBUG?

As you see with DEBUG things still working. Using periph_eeprom fails with or without DEBUG. I assume it keeps looping on the timeout anyways.

fjmolinas commented 5 years ago

@kYc0o Regarding the confirmed message problem I'm able to get confirmed messages on im880b:

> loramajoin otaa
2019-06-05 12:37:13,893 - INFO #  loramac join otaa
2019-06-05 12:37:23,257 - INFO # Join procedure succeeded!
> loramac tx asdf cnf
2019-06-05 12:37:31,164 - INFO #  loramac tx asdf cnf
2019-06-05 12:37:34,662 - INFO # Received ACK from network
2019-06-05 12:37:34,664 - INFO # Message sent with success

And also with debug messages:

 loramatx asdf cnf
2019-06-05 12:40:12,575 - INFO #  loramac tx asdf cnf
2019-06-05 12:40:12,578 - INFO # [semtech-loramac] loramac cmd msg
2019-06-05 12:40:12,580 - INFO # [semtech-loramac] send frame asdf
2019-06-05 12:40:12,584 - INFO # [semtech-loramac] MCPS request: confirmed TX
2019-06-05 12:40:12,592 - INFO # [semtech-loramac] MCPS request: OK
2019-06-05 12:40:13,591 - INFO # [semtech-loramac] MAC timer timeout
2019-06-05 12:40:13,912 - INFO # [semtech-loramac] Transmission completed
2019-06-05 12:40:14,594 - INFO # [semtech-loramac] MAC timer timeout
2019-06-05 12:40:14,716 - INFO # [semtech-loramac] MAC timer timeout
2019-06-05 12:40:15,376 - INFO # [semtech-loramac] RX timer timeout
2019-06-05 12:40:15,597 - INFO # [semtech-loramac] MAC timer timeout
2019-06-05 12:40:15,845 - INFO # [semtech-loramac] MAC timer timeout
2019-06-05 12:40:15,999 - INFO # [semtech-loramac] unexpected netdev event received: 1
2019-06-05 12:40:16,060 - INFO # [semtech-loramac] MCPS confirm event
2019-06-05 12:40:16,064 - INFO # [semtech-loramac] MCPS indication event
2019-06-05 12:40:16,067 - INFO # [semtech-loramac] MCPS confirm msg received
2019-06-05 12:40:16,071 - INFO # [semtech-loramac] MCPS confirm event OK
2019-06-05 12:40:16,075 - INFO # [semtech-loramac] saving uplink counter: 1 
2019-06-05 12:40:16,106 - INFO # [semtech-loramac] MCPS confirm event: CONFIRMED
2019-06-05 12:40:16,110 - INFO # [semtech-loramac] forward TX status to sender thread
2019-06-05 12:40:16,114 - INFO # [semtech-loramac] MCPS indication msg received
2019-06-05 12:40:16,119 - INFO # [semtech-loramac] MCPS indication Unconfirmed
2019-06-05 12:40:16,123 - INFO # [semtech-loramac] MCPS indication: ACK received
2019-06-05 12:40:16,126 - INFO # [semtech-loramac] received something
2019-06-05 12:40:16,128 - INFO # Received ACK from network
2019-06-05 12:40:16,131 - INFO # Message sent with success
fjmolinas commented 5 years ago

Anyway regarding the optimization issue causing the HARDFAULT, I have been trying to find the reason for it but I can't. Initializing the variables fixes the issue but I don't know why, any ideas @aabadie? A fix could be just initializing them all.

    /* Read RX2 datarate */
    uint8_t dr = 0;
    pos += eeprom_read(pos, &dr, 1);
    DEBUG("[semtech-loramac] reading rx2 dr: %d\n", dr);
    semtech_loramac_set_rx2_dr(mac, dr);
kYc0o commented 5 years ago

@fjmolinas can you try to send the confirmable messages after a reset with settings saved to eeprom? (when the settings are saved and no join is necessary).

fjmolinas commented 5 years ago

@kYc0o I'm getting the same behaviour as you. I'll keep looking into it.

aabadie commented 5 years ago

The only thing that is stored on EEPROM in the MAC context is the uplink counter value, see here. All other information is read once during initialization or written when the user explicitly does this (when calling loramac save from the shell for example).

Can you try to comment out the call to _save_uplink_counter from the event loop ? Then erase the information on EEPROM, join the network, save, reboot and send.

kYc0o commented 5 years ago

I did it and I have the same behaviour. I suspect now #11541, which changed completely the receiving mechanism.

aabadie commented 5 years ago

I suspect now #11541, which changed completely the receiving mechanism.

But you said initially that it was working without periph_eeprom feature. And your bisect doesn't point to #11541.

kYc0o commented 5 years ago

True, I'll need to make more tests by removing some of the newly introduced commits to check what's exactly triggering the problems.

fjmolinas commented 5 years ago

@kYc0o where you able to isolate the problem from your new commits? Are you still facing this problem?

aabadie commented 5 years ago

Some good news with this issue: I was able to reproduce the issue on im880b board (same CPU) and could fix it (hopefully).

In fact, I found several issues:

I still need to do more testing before opening PRs.

kYc0o commented 5 years ago

@fjmolinas yes the issue is still present.

@aabadie thanks for your investigations! I'll test your fixes and try to get a bit deeper into the reasons. I'll post if I find something.

aabadie commented 5 years ago

@kYc0o #11783 should solve your issue. I also opened #11777 to address the last point.