nicolas-juteau commented 10 months ago

What happened?

Using Multitech Conduit (MTCDT) gateway:

At some point during normal operation, the mqtt-forwarder will stop processing downlink correctly to end-devices (i.e: end-devices will no longer receive any downlink). However, uplinks from end-devices that already joined the network will continue to work as expected until they issue another join request, in which case they will never be able the join back since downlink stopped working.

From there, for each downlink received by the mqtt-forwarder, the following can be seen in logs (log truncated to keep only relevant messages, see Log Output section for full log):

2023-12-07T18:58:18.339111+00:00 mtcdt chirpstack-mqtt-forwarder[7255]: Received uplink frame, uplink_id: 926004756 2023-12-07T18:58:18.339607+00:00 mtcdt chirpstack-mqtt-forwarder[7255]: Sending uplink event, uplink_id: 926004756, topic: us915_0/gateway/XXXXXXXXXXXXXXXX/event/up 2023-12-07T18:58:18.736102+00:00 mtcdt chirpstack-mqtt-forwarder[7255]: Received message, topic: us915_0/gateway/XXXXXXXXXXXXXXXX/command/down, qos: 0 2023-12-07T18:58:18.736680+00:00 mtcdt chirpstack-mqtt-forwarder[7255]: Received downlink command, downlink_id: 416456622, topic: us915_0/gateway/XXXXXXXXXXXXXXXX/command/down 2023-12-07T18:58:18.737035+00:00 mtcdt chirpstack-mqtt-forwarder[7255]: Sending downlink frame, downlink_id: 416456622 2023-12-07T18:58:18.740470+00:00 mtcdt chirpstack-concentratord-sx1301[7183]: Enqueueing timestamped packet, downlink_id: 416456622, counter_us: 2156716995, current_counter_us: 2156125804 2023-12-07T18:58:18.838115+00:00 mtcdt chirpstack-mqtt-forwarder[7255]: Handling message error, error: Could not read down response 2023-12-07T18:58:19.300778+00:00 mtcdt chirpstack-concentratord-sx1301[7183]: Scheduled packet for TX, downlink_id: 416456622, count_us: 2156716995, freq: 924500000, bw: 500000, mod: LoRa, dr: SF7 2023-12-07T18:59:12.029305+00:00 mtcdt chirpstack-concentratord-sx1301[7183]: Frame received, uplink_id: 2621097563, count_us: 2209411052, freq: 903700000, bw: 125000, mod: LoRa, dr: SF7 2023-12-07T18:59:12.032160+00:00 mtcdt chirpstack-mqtt-forwarder[7255]: Received uplink frame, uplink_id: 2621097563 2023-12-07T18:59:12.032640+00:00 mtcdt chirpstack-mqtt-forwarder[7255]: Sending uplink event, uplink_id: 2621097563, topic: us915_0/gateway/XXXXXXXXXXXXXXXX/event/up 2023-12-07T18:59:12.393714+00:00 mtcdt chirpstack-mqtt-forwarder[7255]: Received message, topic: us915_0/gateway/XXXXXXXXXXXXXXXX/command/down, qos: 0 2023-12-07T18:59:12.394289+00:00 mtcdt chirpstack-mqtt-forwarder[7255]: Received downlink command, downlink_id: 939217614, topic: us915_0/gateway/XXXXXXXXXXXXXXXX/command/down 2023-12-07T18:59:12.394646+00:00 mtcdt chirpstack-mqtt-forwarder[7255]: Sending downlink frame, downlink_id: 939217614 2023-12-07T18:59:12.406672+00:00 mtcdt chirpstack-mqtt-forwarder[7255]: Handling message error, error: Operation cannot be accomplished in current state

The MQTT forwarder will remain in this state, until the process is manually restarted.

NOTE: i've noticed this behavior only with the MTCDT (Multitech Conduit).

What did you expect

I'm expecting this error not to occur and downlinks continue to be processed correctly as it should.

Steps to reproduce this issue

Unfortunately, I haven't been able to systematically reproduce this issue all the time. It happens from time to time, randomly. Since uplinks continue to work as it should, it becomes only noticeable when we manually issue a downlink to a end-device and realizes that the downlink is not received by the end-device itself.

Log output

2023-12-07T18:58:18.339111+00:00 mtcdt chirpstack-mqtt-forwarder[7255]: Received uplink frame, uplink_id: 926004756 2023-12-07T18:58:18.339607+00:00 mtcdt chirpstack-mqtt-forwarder[7255]: Sending uplink event, uplink_id: 926004756, topic: us915_0/gateway/XXXXXXXXXXXXXXXX/event/up 2023-12-07T18:58:18.360461+00:00 mtcdt kernel: Alignment trap: chirpstack-mqtt (7261) PC=0x001f1ee8 Instr=0xe79a7009 Address=0xb6f50e52 FSR 0x001 2023-12-07T18:58:18.360595+00:00 mtcdt kernel: Alignment trap: chirpstack-mqtt (7261) PC=0x001f1f38 Instr=0xe7962007 Address=0xb6f50ee1 FSR 0x001 2023-12-07T18:58:18.371507+00:00 mtcdt kernel: Alignment trap: chirpstack-mqtt (7261) PC=0x001f1f40 Instr=0xe5992004 Address=0xb6f50ee5 FSR 0x001 2023-12-07T18:58:18.736102+00:00 mtcdt chirpstack-mqtt-forwarder[7255]: Received message, topic: us915_0/gateway/XXXXXXXXXXXXXXXX/command/down, qos: 0 2023-12-07T18:58:18.736680+00:00 mtcdt chirpstack-mqtt-forwarder[7255]: Received downlink command, downlink_id: 416456622, topic: us915_0/gateway/XXXXXXXXXXXXXXXX/command/down 2023-12-07T18:58:18.737035+00:00 mtcdt chirpstack-mqtt-forwarder[7255]: Sending downlink frame, downlink_id: 416456622 2023-12-07T18:58:18.740470+00:00 mtcdt chirpstack-concentratord-sx1301[7183]: Enqueueing timestamped packet, downlink_id: 416456622, counter_us: 2156716995, current_counter_us: 2156125804 2023-12-07T18:58:18.826949+00:00 mtcdt chirpstack-concentratord-sx1301[7183]: GPS time reference is not valid, age: 1701975498.8265858s 2023-12-07T18:58:18.838115+00:00 mtcdt chirpstack-mqtt-forwarder[7255]: Handling message error, error: Could not read down response 2023-12-07T18:58:19.300778+00:00 mtcdt chirpstack-concentratord-sx1301[7183]: Scheduled packet for TX, downlink_id: 416456622, count_us: 2156716995, freq: 924500000, bw: 500000, mod: LoRa, dr: SF7 2023-12-07T18:59:12.029305+00:00 mtcdt chirpstack-concentratord-sx1301[7183]: Frame received, uplink_id: 2621097563, count_us: 2209411052, freq: 903700000, bw: 125000, mod: LoRa, dr: SF7 2023-12-07T18:59:12.032160+00:00 mtcdt chirpstack-mqtt-forwarder[7255]: Received uplink frame, uplink_id: 2621097563 2023-12-07T18:59:12.032640+00:00 mtcdt chirpstack-mqtt-forwarder[7255]: Sending uplink event, uplink_id: 2621097563, topic: us915_0/gateway/XXXXXXXXXXXXXXXX/event/up 2023-12-07T18:59:12.056587+00:00 mtcdt kernel: Alignment trap: chirpstack-mqtt (7261) PC=0x001f1ee8 Instr=0xe79a7009 Address=0xb6f51912 FSR 0x001 2023-12-07T18:59:12.056723+00:00 mtcdt kernel: Alignment trap: chirpstack-mqtt (7261) PC=0x001f1f38 Instr=0xe7962007 Address=0xb6f51992 FSR 0x001 2023-12-07T18:59:12.067794+00:00 mtcdt kernel: Alignment trap: chirpstack-mqtt (7261) PC=0x001f1f40 Instr=0xe5992004 Address=0xb6f51996 FSR 0x001 2023-12-07T18:59:12.393714+00:00 mtcdt chirpstack-mqtt-forwarder[7255]: Received message, topic: us915_0/gateway/XXXXXXXXXXXXXXXX/command/down, qos: 0 2023-12-07T18:59:12.394289+00:00 mtcdt chirpstack-mqtt-forwarder[7255]: Received downlink command, downlink_id: 939217614, topic: us915_0/gateway/XXXXXXXXXXXXXXXX/command/down 2023-12-07T18:59:12.394646+00:00 mtcdt chirpstack-mqtt-forwarder[7255]: Sending downlink frame, downlink_id: 939217614 2023-12-07T18:59:12.406672+00:00 mtcdt chirpstack-mqtt-forwarder[7255]: Handling message error, error: Operation cannot be accomplished in current state

In this log transcript, two events can be seen and chances are they might something have to do with the current issue:

kernel alignment trap error (this occurs frequently, as soon as mqtt-forwarder process is started)
"GPS time reference is not valid" message from concentratord (this messages occurs at every ~1s after concentratord is started)

Environment

Component	Version
Gateway HW	MTCDT-0.1
Gateway OS	mLinux 5.3.31 (custom image)
MQTT Forwarder	4.1.3 (installed from here)
Concentratord	4.3.4 (installed from here)
Chirpstack	4.6.0 (cloud based)

EDIT: added custom image detail to Gateway OS

bconway commented 10 months ago

Adjacent: Do you have the option to use mPower firmware on that model? I have not run into anything similar on my MTCAP models.

https://www.multitech.net/developer/downloads/

It looks like all the tested models are using mPower (which is also the recommended install when I've spoken with MultiTech in the past):

https://www.chirpstack.io/docs/chirpstack-mqtt-forwarder/install/multitech.html

nicolas-juteau commented 10 months ago

As far as I know, mPower is just the Multitech Application Enablement Platform (AEP) that sits on top of mLinux. I'd expect the kernel alignment trap issue to occur as well on mPower (with same mLinux version). I might give it a try to confirm this.

Also, we are using a custom built mlinux image (that's why we don't use mPower). Might try with a fresh standalone mLinux as well to confirm this.

I have not encountered this issue on MTCAP/MTCAP3 as well, though we've not exprimented that much with those models so we can't state this will never occur in the future. It is encouraging to learn you have never witnessed this issue on MTCAPs

nicolas-juteau commented 10 months ago

I confirm this happens with latest version of mPower as well (6.3.1)

brocaar commented 10 months ago

Handling message error, error: Operation cannot be accomplished in current state

This is a ZeroMQ error, so it seems that the Concentratord <> MQTT Forwarder communication somehow breaks.

I'm looking into this right now. I have a Multitech Conduit which I'm now upgrading to 6.3.1 after which I will try to reproduce the issue.

brocaar commented 10 months ago

I can reproduce the error:

kernel: Alignment trap: chirpstack-mqtt (5772) PC=0x001f1ee8 Instr=0xe79a7009 Address=0xb6a42303 FSR 0x001

brocaar commented 10 months ago

I have tried with the latest master version, and I no longer see the error.

The latest master version replaces paho-mqtt by rumqttc. The first links against the Paho C library, which also depends on OpenSSL. rumqttc is a pure-Rust MQTT client, for TLS it depends on rustls. I suspect that the alignment trap issue was caused by the cross-compiling and linking to C code.

I will create a MQTT Forwarder v4.2.0-test.1 build for you today for testing.

brocaar commented 10 months ago

Could you try again with this build:

chirpstack-mqtt-forwarder.zip

It is the latest master version built for the Conduit (you need to unzip first, as GitHub doesn't allow .ipk uploads in comments). I have also added this change: https://github.com/chirpstack/chirpstack-mqtt-forwarder/commit/95439f1edb7c07fdfac1afc6012a1e37bfdd2d54. In case there is a ZeroMQ error on the REQ socket, it will automatically reconnect as after an error the socket might be in a 'dirty' state.

nicolas-juteau commented 10 months ago

The build you provided seems to resolve the kernel panic issue. I'll let the gateway run for a while and see if the issue ever occurs again. If so, i'll open a new ticket and refer to this one.

bconway commented 10 months ago

I'm able to reproduce after upgrading my Conduit AP to 6.3.1, would be happy to see the 4.2.0-test.1 build for that device.

brocaar commented 10 months ago

Hi @bconway , please find attached the test build for the Conduit AP.

chirpstack-mqtt-forwarder.zip

I was hoping to create an "official" test build, but during the refactor from Paho (with OpenSSL dependency) to Rumqttc (with Rustls dependency, based on ring) I missed that ring does not yet support mips- targets. There was a MIPS related pull-request which got merged, but this was only for mipsel- targets (little endian). There is a new PR for mips- targets which hopefully will get merged soon.

bconway commented 10 months ago

Great, thanks! With the change in MQTT client, are there any configuration changes under [mqtt], or was everything existing still supported? Thanks.

brocaar commented 10 months ago

The MQTT config options remained the same :)

brocaar commented 7 months ago

As there hasn't been any activity on this issue, I assume the proposed fix that is part of the latest version(s) did solve the issue. Please let me know if this was not the case.

chirpstack / chirpstack-mqtt-forwarder

Multitech Conduit: mqtt-forwarder randomly enters an invalid state in which downlink are no longer sent sent to end-devices #47

What happened?

What did you expect

Steps to reproduce this issue

Log output

Environment