brocaar / chirpstack-network-server

ChirpStack Network Server is an open-source LoRaWAN network-server.
https://www.chirpstack.io
MIT License

Failed join when end node too close to gateway #557

Open chopmann opened 2 years ago

chopmann commented 2 years ago

FROM: https://forum.chirpstack.io/t/failed-join-when-end-node-too-close-to-gateway/12537/3

What happened?

I’m working with the end node mere meters away from our test gateway here at the office. The join fails most of the time because the Join-Request message is received on multiple channels.

What did you expect?

Join succeeding. For this corner case to be fixed, de-duplication needs to be a bit smarter and only replace an ongoing join request with a new one if less than X time has passed (1s would work) and the new request has a higher RSSI than the previous one.

Steps to reproduce this issue

Steps:

Surely you see the problem: to the network server, the assigned network address and session key are the ones it sent last on 868.300, whereas to the end node they are the ones it received on 868.500.

As a result, both happily think the join was successful but communication is impossible because network address and session keys are different.

Could you share your log output?

Your Environment

| Component | Version |
| --- | --- |
| Application Server | v?.?.? |
| Network Server | |
| Gateway Bridge | |
| Chirpstack API | |
| Geolocation | |
| Concentratord | |
iggarpe commented 2 years ago

I'm the poster of the message on the forum. Some extra info:

I'm using network server 3.15.3

The problem is clearly that in the network server the second join request received overwrites the first after the join accept has already been sent. If the gateway, as in my case, first sends the network server the true join request and then the false join request on the adjacent channel with much lower RSSI, both are answered with a join accept, but only the false one remains as the valid join on the network-server side, whereas the end node actually receives the first join accept and happily thinks it has successfully joined.

The fix would be super easy: when more than one join request is received in a small time window (1s?) from the same end node, overwrite the previous one only if the RSSI is way higher.

This would work no matter the order in which the gateway sends the join requests, true one first / false one first.
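
For illustration only, here is a minimal sketch of such a rule in Go; the type and function names are invented and this is not how the ChirpStack internals are actually structured. The idea is to keep the strongest copy of a join-request seen within a short window and let a later copy replace it only when its RSSI is clearly higher (the 1 s window and the 6 dB margin are arbitrary assumptions):

```go
package dedup

import (
	"sync"
	"time"
)

// pendingJoin holds the strongest copy of a join-request seen so far.
// All names here are hypothetical; this is not the ChirpStack implementation.
type pendingJoin struct {
	receivedAt time.Time
	rssi       int
	frequency  uint32
}

type JoinDeduplicator struct {
	mu      sync.Mutex
	window  time.Duration // e.g. 1 * time.Second
	pending map[string]pendingJoin
}

func NewJoinDeduplicator(window time.Duration) *JoinDeduplicator {
	return &JoinDeduplicator{window: window, pending: make(map[string]pendingJoin)}
}

// Observe reports whether this copy should become the "winning" join-request.
// A later copy only replaces the current winner when it arrives within the
// window and its RSSI is clearly higher (assumed margin: 6 dB).
func (d *JoinDeduplicator) Observe(devEUI, devNonce string, rssi int, freq uint32) bool {
	d.mu.Lock()
	defer d.mu.Unlock()

	key := devEUI + "/" + devNonce
	now := time.Now()

	if prev, ok := d.pending[key]; ok && now.Sub(prev.receivedAt) < d.window {
		if rssi < prev.rssi+6 {
			return false // keep the previous, stronger copy
		}
	}
	d.pending[key] = pendingJoin{receivedAt: now, rssi: rssi, frequency: freq}
	return true
}
```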

brocaar commented 2 years ago

This issue has been brought up a couple of times. What happens is that when a device is really close to the gateway, it will over-drive the gateway hardware, causing a "ghost" packet. Thus in the end the uplink is reported on two frequencies.

Currently the de-duplication logic does not inspect the LoRaWAN payload. It starts the de-duplication logic based on the raw payload + frequency, meaning that when the same payload is reported on two frequencies, there are two de-duplication functions running simultaneously and the first one "wins".
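
To make the described behaviour concrete, here is a hypothetical sketch of such a key (name and format invented for illustration, not copied from the source). Because the frequency is part of the key, the real uplink and its ghost copy on the adjacent channel start two independent de-duplication rounds:

```go
package dedup

import "fmt"

// deduplicationKey is a hypothetical illustration, not the actual ChirpStack
// code: the key is derived from the raw PHYPayload plus the reported
// frequency, so the same join-request heard on 868.300 MHz and 868.500 MHz
// produces two different keys and therefore two de-duplication rounds.
func deduplicationKey(phyPayload []byte, frequency uint32) string {
	return fmt.Sprintf("lora:ns:uplink:collect:%x:%d", phyPayload, frequency)
}
```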

I'm not sure if over-driving the gateway radios can cause any permanent damage to the gateway. Maybe somebody else can comment on this. My assumption is that you should try to avoid this.

As well, I'm not sure what would be the best, secure and still performant solution for this assuming this scenario doesn't cause any harm to the gateway radios.

The reason why the de-duplication logic includes the frequency in its key is that there was a security issue reported a while ago, which would allow a replay with a better RSSI / SNR to "steal" the downlink path. One could replay the uplink within the configured de-duplication duration using a different frequency, and with that break the downlink path (e.g. letting the LNS respond using a different frequency or time).

I'm open to suggestions.

(also posted on the forum)

urbie-mk2 commented 2 years ago

I have access to multiple gateways in development and device production where LoRaWAN connectivity is tested. Devices are about 5 m from the gateway; so far the gateway, which uses the Semtech hardware reference implementation, has been running for 3 years while about 10,000 nodes have been produced. So damaging the hardware is not likely, but I have never seen this behaviour before where the gateway receives and forwards a ghost packet. Are there steps to reproduce this reliably? I updated production to the latest version set a month ago.

mmrein commented 2 years ago

As I've written on the forum, I would not expect hardware damage (given the power levels used for LoRa), but this close proximity is also not what "LOngRAnge" is designed for.

You should be able to reproduce this simply by moving the device closer to the gateway. I currently have a test device 2-3 m from a test gateway and I have been receiving quite a lot of ghost packets.

Solved simply by using a 50 ohm RF load instead of the gateway's original antenna. It also gives me RF numbers closer to real-world conditions (like -110 dBm RSSI, for example).

csanso-limit commented 2 years ago

Hello @brocaar, are there any plans to implement this fix and when can we expect it?

Although it is mentioned multiple times that it only occurs in close proximity, our client is stating, and proving, otherwise. He is upset because he has to deploy hundreds of nodes in the next two weeks that communicate uplinks and downlinks every 5 minutes and send a join request once every 5 hours, which means an incorrect join will result in hours of lost information and lost downlinks.

I am confused, however, as to why ChirpStack approves both duplicated join requests when they both have the same DevNonce; perhaps if ChirpStack did not accept the repeated-DevNonce join request, in most cases this error would not occur at all.

In our case it seems like the network server first receives the real join request on the correct frequency and then, before having time to reply with a join accept, it immediately receives the second join request (a duplicate on a different frequency) and then proceeds to reply with a join accept for the first join request received.

And it seems like what ends up happening is that the end device receives the new DevAddr from the first join request, but the Application Server ends up with the DevAddr generated from the second join request.

This morning the time difference between the duplicated join requests was 8 microseconds, which is why there is no time to reply before the duplicate is received. Times: 2021-12-09T08:31:58.439906Z and 2021-12-09T08:31:58.439914Z

Perhaps the quickest solution would be to reject the second, duplicate join request because its DevNonce is the same as the previous join request's DevNonce. I'm not sure why it's not currently being rejected; it could have to do with the small time difference.
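
As a rough sketch of that suggestion (hypothetical names, not ChirpStack code, and leaving out the locking and expiry a real implementation would need), the check could look like this:

```go
package join

import "time"

// lastAccepted remembers when each (DevEUI, DevNonce) pair was last accepted.
var lastAccepted = map[string]time.Time{}

// guardInterval is an assumed value; anything slightly larger than the
// de-duplication window would do.
const guardInterval = 2 * time.Second

// acceptJoinRequest returns false when the same (DevEUI, DevNonce) pair was
// already accepted within the guard interval, i.e. the request is most likely
// the ghost duplicate of a join-request that has already been answered.
func acceptJoinRequest(devEUI, devNonce string) bool {
	key := devEUI + "/" + devNonce
	now := time.Now()
	if t, ok := lastAccepted[key]; ok && now.Sub(t) < guardInterval {
		return false
	}
	lastAccepted[key] = now
	return true
}
```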

brocaar commented 2 years ago

Yes, I'm planning to address this, but I don't have an ETA for this yet.

mmrein commented 2 years ago

> Perhaps the quickest solution would be to reject the second, duplicate join request because its DevNonce is the same as the previous join request's DevNonce. I'm not sure why it's not currently being rejected; it could have to do with the small time difference.

I wouldn't be sure that the first packet is always the right one. It is the first one demodulated and received by the server, and sure, it is the one with the strongest signal, but that doesn't mean the gateway could not demodulate the incorrect one first.

A quick workaround you can try is to set the default data rate of your device to a higher value.

cairb commented 2 years ago

@brocaar we encountered this problem too and this is my initial fix. Please advise if you find this solution problematic: [image of the proposed patch]

brocaar commented 2 years ago

@cairb please note that the mutex only applies to a single instance. In case of multiple instances it doesn't prevent another NS instance from handling the "ghost" join-request.
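
A cross-instance variant of such a guard would need a shared lock, for example in Redis, which the network server already depends on. A minimal sketch assuming the go-redis client; the key name and TTL are invented for illustration:

```go
package join

import (
	"context"
	"time"

	"github.com/go-redis/redis/v8"
)

// acquireJoinLock tries to take a short-lived lock for a (DevEUI, DevNonce)
// pair in Redis. Because the lock lives in Redis rather than in process
// memory, it also holds across multiple network-server instances.
func acquireJoinLock(ctx context.Context, rdb *redis.Client, devEUI, devNonce string) (bool, error) {
	key := "lora:ns:join:lock:" + devEUI + ":" + devNonce
	// SET key value NX EX 5: only the first instance to run this wins the lock.
	return rdb.SetNX(ctx, key, "locked", 5*time.Second).Result()
}
```

An instance that fails to acquire the lock would then simply drop its copy of the join-request instead of answering it.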

cairb commented 2 years ago

@brocaar Thanks for the quick reply. There is only one instance of NS running right now on an embedded ARM gateway. What will be the scenarios that require multiple NS instances?

brocaar commented 2 years ago

I have just pushed the following change: https://github.com/brocaar/chirpstack-network-server/commit/1b505944a63ebdb8b3556211d8f9e76e606f516a. I believe this should fix at least the duplicated OTAA accept in case of a ghost uplink.