Lora-net / packet_forwarder

A LoRa packet forwarder is a program running on the host of a LoRa gateway that forwards RF packets received by the concentrator to a server through an IP/UDP link, and emits RF packets that are sent by the server. This project is associated with the lora_gateway repository for the SX1301 chip. For SX1302/1303, the sx1302_hal repository must be used.

No packets with large time offset #35

Closed a-andreyev closed 7 years ago

a-andreyev commented 7 years ago

Hello! I'm using the git versions of packet_forwarder and lora_gateway with a DIY gateway: an RPi2 (Arch Linux ARM distro) with an iC880A concentrator (without GPS). The transmitter is RFM95W-based and uses OTAA with a private server. After random periods of time I stop receiving packets altogether, and the only fix is to restart the packet forwarder (my systemd lora_pkt_fwd service also runs the reset_lgw.sh script from lora_gateway with a custom pin). According to the logs, the packet loss is accompanied by a time offset larger than 60000ms:

INFO: host/sx1301 time offset=(1491594643s:916918µs) - drift=60000286µs
...
host/sx1301 time offset=(1491597498s:536800µs) - drift=-1693691040µs

Could you help me resolve the issue? Should I look at the lora_gateway code, or does it look like a hardware problem?
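For anyone watching for the same symptom, the drift value can be pulled out of log lines like the two quoted above. The following is a minimal sketch, not part of packet_forwarder: the function names and the 60 s threshold constant are assumptions taken from the "larger than 60000ms" observation, and the regular expression is only checked against the two sample lines in this thread.

```python
import re

# Matches lines such as:
#   INFO: host/sx1301 time offset=(1491594643s:916918µs) - drift=60000286µs
# Both the micro sign (U+00B5) and the Greek mu (U+03BC) are accepted.
DRIFT_RE = re.compile(
    r"time offset=\((\d+)s:(\d+)[\u00b5\u03bc]s\) - drift=(-?\d+)[\u00b5\u03bc]s"
)

# Assumed threshold: 60000 ms expressed in microseconds.
DRIFT_LIMIT_US = 60_000_000


def parse_drift_us(line):
    """Return the drift in microseconds, or None if the line does not match."""
    match = DRIFT_RE.search(line)
    return int(match.group(3)) if match else None


def is_suspicious(line, limit_us=DRIFT_LIMIT_US):
    """True when the line reports a drift whose magnitude exceeds the limit."""
    drift = parse_drift_us(line)
    return drift is not None and abs(drift) > limit_us
```

Both sample lines from the report would be flagged by `is_suspicious`, since the check uses the absolute value of the drift.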

mcoracin commented 7 years ago

Hello,

Which version of lora_gateway and packet_forwarder are you using?

a-andreyev commented 7 years ago

I've tried the current git versions: v4.0.1 (d0226ea) and v4.0.0 (c05eb0e)

mcoracin commented 7 years ago

How is the iC880A connected to the RPi? You can try to activate debug logs in the HAL (libloragw/library.cfg), by setting DEBUG_HAL to 1.

You can also test the robustness of your SPI connection with the util_spi_stress application provided with the HAL (with the -t4 option). Let it run for a few hours to make sure you get no errors.

a-andreyev commented 7 years ago

Thank you for the tips! The iC880A is connected to the RPi2 directly via the SPI pins. I've run util_spi_stress for a few minutes -- no errors. I will run it for longer today with the driver debug flag and the -t4 option.

a-andreyev commented 7 years ago

I've decided to put off util_spi_stress and instead ran the debug build (DEBUG_HAL set to 1) for several hours with a real node and packet_forwarder. I found that at some point the SX1301 time (PPS) value stopped changing. That is exactly the moment when the packet forwarder stopped receiving packets and the logs started showing a large time offset. Should I move this to the lora_gateway issues, or is there something else I should check?

mcoracin commented 7 years ago

Hmm, it seems that the SX1301 stops working for some reason... A similar issue was seen some time ago and was due to concurrent access to the SX1301 through SPI by 2 pkt fwd threads. This was fixed by adding a mutex in thread_timersync when getting the sx1301 counter.
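The idea behind that fix can be illustrated with a toy model. This is only a sketch in Python, not the C code of the packet forwarder: `FakeSpiBus`, `worker` and `run_demo` are hypothetical names, and the "bus" is a stand-in where one read consists of two steps (select an address, then fetch). Holding a lock for the whole transaction keeps a second thread from changing the selection in between.

```python
import threading


class FakeSpiBus:
    """Toy bus: a read is two steps (select an address, then fetch a value)."""

    def __init__(self):
        self._selected = None
        self._lock = threading.Lock()

    def _fetch(self):
        # Stand-in value derived from whatever address is currently selected.
        return self._selected * 2

    def read(self, addr):
        # The lock makes select + fetch one atomic transaction.
        with self._lock:
            self._selected = addr
            return self._fetch()


def worker(bus, addr, rounds, errors):
    """Read the same address repeatedly and record any wrong answer."""
    for _ in range(rounds):
        if bus.read(addr) != addr * 2:
            errors.append(addr)


def run_demo(rounds=2000):
    """Run two threads against one bus and return the list of bad reads."""
    bus = FakeSpiBus()
    errors = []
    threads = [
        threading.Thread(target=worker, args=(bus, addr, rounds, errors))
        for addr in (1, 2)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return errors
```

With the lock in place `run_demo()` should return an empty list; without it, one thread could fetch after the other thread's select and get the wrong value.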

Maybe running the test with DEBUG_SPI set to 1 could help, though it will be very verbose.

jmlemetayer commented 7 years ago

I have seen something similar between a packet forwarder (calling lgw_start) and another binary using the lora_gateway (calling lgw_connect).

In fact the second binary was resetting the sx1301 by calling lgw_connect(false, ...). Everything seemed fine except the page register, which was set back to 0 (in the sx1301 registers, but not in the lgw_regpage global variable, as the two binaries have two different memory spaces).

As a result, when the packet forwarder wanted to read the TIMESTAMP register (register 70, page 2), it was actually reading CORR5_DETECT_EN (register 70, page 0). By the way, this register has a value similar to 0x7E000000 = 2113929216...
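The stale page cache described above can be reproduced with a small simulation. This is a toy model under stated assumptions, not the lora_gateway code: `FakeSx1301`, `HostProcess` and `demo` are hypothetical names, the (page, register) pairs are the ones quoted in the comment, 123456 is an arbitrary counter value, and 0x7E000000 is used as the stand-in content of the page 0 register.

```python
# (page, register) pairs as quoted in the comment above.
TIMESTAMP = (2, 70)
CORR5_DETECT_EN = (0, 70)


class FakeSx1301:
    """Toy chip: one page register plus a few paged registers."""

    def __init__(self):
        self.page = 0
        self.regs = {CORR5_DETECT_EN: 0x7E000000, TIMESTAMP: 123456}

    def reset(self):
        # A reset puts the chip's page register back to 0.
        self.page = 0


class HostProcess:
    """Each process keeps its own cached copy of the selected page."""

    def __init__(self, chip):
        self.chip = chip
        self.cached_page = 0

    def read(self, page, addr):
        # The page register is only written when the cache says it changed.
        if page != self.cached_page:
            self.chip.page = page
            self.cached_page = page
        return self.chip.regs[(self.chip.page, addr)]


def demo():
    chip = FakeSx1301()
    forwarder = HostProcess(chip)
    before = forwarder.read(*TIMESTAMP)  # page switched to 2, real counter
    chip.reset()                         # done by another process
    after = forwarder.read(*TIMESTAMP)   # cache still says 2, chip is on 0
    return before, after
```

In this model the second read returns the page 0 value instead of the counter, which matches the constant 2113929216 (0x7E000000) reported in this thread.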

A solution to this issue can be to change the software architecture and to use a dedicated daemon to drive the LoRa stack. This is something very common. The wpa_supplicant for the Wi-Fi is an example:

[Image: wpa_supplicant_arch]

The data bus used in the wpa_supplicant can be dbus or an internal control interface.

mcoracin commented 7 years ago

Yes, the current lora_gateway library is definitely not designed to be used by concurrent processes.

a-andreyev commented 7 years ago

@jmlemetayer, thank you for the response! Am I right that if no other binaries are using lora_gateway, then your case does not apply to me?

Today I discovered that the sx1301 stopped working and started returning the value 2113929216 from the moment I plugged my laptop into a socket in the same room and turned it on to view the logs. So it looks like a hardware problem to me; I will try to solve it and update the status of the issue. Thank you, @mcoracin, for your help and your project!

mcoracin commented 7 years ago

You're welcome. I'll close the issue for now, you can reopen it if needed.

a-andreyev commented 7 years ago

Just a note: I haven't resolved the hardware issue where the sx1301 stops, and it is somehow connected with the power sockets in my room. I've created a software patch that waits for several drift values larger than 60000ms (I made the threshold a constant in the timersync thread) and then restarts my packet forwarder with the reset script. It's not great at all, but it works.
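The patch itself is not included in this thread. The logic described above can be sketched as a small counter of consecutive large drift values. This is an illustration in Python rather than the original C change: `DriftWatchdog`, the default of three strikes, and the reset-on-normal-sample behaviour are all assumptions.

```python
class DriftWatchdog:
    """Counts consecutive drift samples above a limit (hypothetical helper)."""

    def __init__(self, limit_us=60_000_000, max_strikes=3):
        self.limit_us = limit_us        # 60000 ms in microseconds
        self.max_strikes = max_strikes  # assumed value for "several"
        self.strikes = 0

    def update(self, drift_us):
        """Feed one drift sample; return True when a restart is warranted."""
        if abs(drift_us) > self.limit_us:
            self.strikes += 1
        else:
            # A normal sample clears the count, so isolated spikes are ignored.
            self.strikes = 0
        return self.strikes >= self.max_strikes
```

A caller could exit with a non-zero status once `update` returns True and leave the restart to the service manager, for example a systemd unit with `Restart=on-failure` that runs the reset script before starting the forwarder.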

wateras commented 5 years ago

I also encountered a similar problem: the sx1301 stopped working and the PPS value read from it did not change. Has this problem been solved?

pauldeng commented 3 years ago

I encountered this issue.

There is no Rx after a large time offset happens. A reboot brings Rx back, but there is still no Tx. A power cycle is perhaps needed.

@a-andreyev Do you still have your reset script? Would you mind sharing it?

Thanks.

a-andreyev commented 3 years ago

Hello, @pauldeng. Unfortunately, I don't have the script anymore. The logic was probably to check for suspicious repeating values (SX1301 time (PPS)) in src/lora_pkt_fwd.c and to exit the app with an error, then to restart it with systemd (I used a systemd script to handle the startup). I'm not sure; this comment should describe it better, but I haven't worked with the project for a long time and don't remember the details.

Anyway, it was a hardware issue in my case, and I was able to reproduce it by plugging an additional device (like a laptop) into the power socket. I once heard from friends an electrical term that might explain the effect, but unfortunately I don't remember it (something about high impedance, or crosstalk, not sure).

pauldeng commented 3 years ago

Hi @a-andreyev ,

Thanks for the additional info. I will discuss this with the manufacturer.

It seems to be a very rare case, as few people have reported it here over the years.

In my case, the chip still cannot Tx after a Linux system reboot. I will check again to see if a power cycle fixes it.

Thanks again.