Support downstream resending traffic out-of-order

johanstokking commented 4 years ago

Summary

Replaces https://github.com/TheThingsIndustries/lorawan-stack/issues/2114

Why do we need this?

To process (application) data that arrives after newer data has already been received.

This may already happen without the Network Server being aware of it. Currently, this triggers downlink tasks and even downlink transmissions, which shouldn't happen.

A typical scenario is a gateway that buffers traffic while the upstream is unavailable. This can be a terrestrial gateway or a satellite that sends buffered data to a ground station.

What is already there? What do you see now?

Upstream handling that requires FCnt increases in the session.

What is missing? What do you want to see?

- Flagging uplink messages that have been buffered
- Network Server processing traffic but not triggering downlink tasks
- Application Server processing traffic
- Whether buffered traffic may be processed would be a per-device setting owned by Network Server
- Traffic may arrive out-of-order

How do you propose to implement this?

- Add process_buffered_messages (BoolValue) field to EndDevice
- NS can only do a minimal check whether the message has already been received in the session; only recent uplinks are kept. This means that buffered traffic that has been seen by other gateways may result in duplicates, so it becomes at-least-once delivery towards applications. That side effect is acceptable
- The upstream handling shouldn't trigger downlink tasks
- The field needs to be included in the uplink message to applications
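
A minimal sketch of the proposed handling, in Python for brevity rather than the stack's own language. The `Session` class, the window size of 4, and the returned action strings are all hypothetical and only illustrate the idea: buffered uplinks are checked against a bounded list of recent counters, forwarded without touching the session, and never schedule a downlink.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Session:
    """Hypothetical per-device session state; not the real EndDevice proto."""
    last_f_cnt: int = 0
    process_buffered_messages: bool = False
    # Only the most recent live uplink counters are kept (window size is an assumption).
    recent_f_cnts: deque = field(default_factory=lambda: deque(maxlen=4))


def handle_uplink(session: Session, f_cnt: int, buffered: bool) -> str:
    """Return the action the NS would take for this uplink."""
    if f_cnt in session.recent_f_cnts:
        # Minimal duplicate check: only works while the counter is still in the window.
        return "drop-duplicate"
    if buffered:
        if session.process_buffered_messages:
            # Forward to the application; session state and downlink tasks stay untouched.
            return "forward-only"
        return "drop"
    if f_cnt > session.last_f_cnt:
        # Live path: advance the session and allow downlink scheduling.
        session.last_f_cnt = f_cnt
        session.recent_f_cnts.append(f_cnt)
        return "forward-and-schedule-downlink"
    return "drop"
```

With a window of 4, a buffered copy of a frame that was already delivered live but has since left the window is forwarded again, which is the at-least-once side effect described above.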

How do you propose to test this?

Unit testing handling an uplink message with a lower FCnt in the session

Can you do this yourself and submit a Pull Request?

Can review

cc @tftelkamp

adriansmares commented 4 years ago

The current behavior of uplink traffic between NS and AS (even after https://github.com/TheThingsNetwork/lorawan-stack/issues/2312) is that communication works in a lock-step mode:

1. NS sends an ApplicationUp and waits for the empty message
2. AS receives the ApplicationUp and synchronously processes the message, then sends the empty message
3. NS receives the empty message and continues with the next ApplicationUp

This lockstep occurs due to the possible downlink queue operations / invalidations, and makes sense with 'live traffic'. But this out-of-band traffic shouldn't cause this kind of concurrent access issue. Because of this, the NS can actually send-and-forget this kind of traffic.

On the AS side, this opens up the possibility to actually store all of the incoming out-of-band traffic into a message queue and process it asynchronously. This fits nicely with the already proposed internal AS PubSub infrastructure of https://github.com/TheThingsNetwork/lorawan-stack/issues/2312 .
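
To make the contrast concrete, here is a rough Python sketch of the two modes. `ApplicationServer`, `ns_forward`, and the in-process `queue.Queue` are stand-ins invented for illustration; the real NS-AS link is a network API and the PubSub infrastructure from #2312 is not modelled here.

```python
import queue
import threading


class ApplicationServer:
    """Hypothetical AS: live traffic is handled inline, buffered traffic is queued."""

    def __init__(self):
        self.processed = []
        self.backlog = queue.Queue()  # out-of-band traffic waits here

    def handle_live(self, up):
        # Synchronous: returning plays the role of the empty acknowledgement.
        self.processed.append(("live", up))

    def enqueue_buffered(self, up):
        # Returns immediately; processing happens later in drain().
        self.backlog.put(up)

    def drain(self):
        # Worker loop: process queued uplinks until a None sentinel arrives.
        while True:
            up = self.backlog.get()
            if up is None:
                break
            self.processed.append(("buffered", up))


def ns_forward(server, up, buffered):
    """NS side: lock-step for live traffic, send-and-forget for buffered traffic."""
    if buffered:
        server.enqueue_buffered(up)
    else:
        server.handle_live(up)
```

A worker thread running `drain()` can process the backlog while `ns_forward` keeps serving live uplinks in lock-step.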

Does this fit with what you had in mind @johanstokking ?

johanstokking commented 4 years ago

Yes exactly, this can only be implemented correctly with #2312. I wouldn’t touch the existing NS-AS hot path API behavior, I would wrap it in a box and throw the whole thing away.

rvolosatovs commented 4 years ago

So if a "buffered" uplink is received by NS:

1. match to the device
2. if process_buffered_messages
   - then directly forward to AS without mutating the device proto or downlink tasks in any way
   - else drop uplink
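
The two steps above can be sketched as a single function. This is a simplified Python illustration, not NS code: `Device` and `route_buffered_uplink` are hypothetical names, and matching on DevAddr alone is an assumption made to keep the example short (real matching also has to verify the frame against the session).

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: this path must not mutate the device
class Device:
    device_id: str
    dev_addr: str
    process_buffered_messages: bool


def route_buffered_uplink(devices, dev_addr):
    """Decide what to do with an uplink that is flagged as buffered."""
    # Step 1: match to the device (simplified to a DevAddr comparison).
    device = next((d for d in devices if d.dev_addr == dev_addr), None)
    if device is None:
        return ("drop", "no matching device")
    # Step 2: forward only if the device opted in; no state or downlink task is touched.
    if device.process_buffered_messages:
        return ("forward", device.device_id)
    return ("drop", "buffered traffic not enabled")
```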

The (1.) part is a bit problematic, since once #2837 is merged, NS will not attempt to match to devices, for which a higher FCnt is expected, but that can be made configurable, even though I would prefer not having to do that. In fact, after #2837 NS will also cache the matching result by the payload hash, which means that if the uplink was matched to a device successfully within the cache TTL (which is set to 1 minute for now), it will be immediately matched. Perhaps, we could set higher cache TTL for devices, which would accept buffered traffic. When we're talking about buffering, what time intervals are in mind? Is it on the scale of minutes, hours or days?

johanstokking commented 4 years ago

> NS will not attempt to match to devices, for which a higher FCnt is expected, but that can be made configurable, even though I would prefer not having to do that

Don't we need this anyway for devices allowing frame counter resets? Wouldn't process_buffered_messages be implicitly allowing frame counter resets (without persisting the reset value)?

> Is it on the scale of minutes, hours or days?

Days, worst case. In practice, hours.

rvolosatovs commented 4 years ago

> > NS will not attempt to match to devices, for which a higher FCnt is expected, but that can be made configurable, even though I would prefer not having to do that
>
> Don't we need this anyway for devices allowing frame counter resets? Wouldn't process_buffered_messages be implicitly allowing frame counter resets (without persisting the reset value)?

Indeed, NS can just handle this like the FCnt reset in registry

> > Is it on the scale of minutes, hours or days?
>
> Days, worst case. In practice, hours.

OK, I think we can then make the cache TTL 2 hours for devices handling buffered traffic. That means that if a duplicate buffered message is received within 2 hours of receiving the original, NS will not traverse the whole DevAddr space, but immediately know the device that matches.
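
As a rough illustration of the idea, a match cache keyed by payload hash with a per-device TTL could look like the Python sketch below. `MatchCache`, its methods, and the choice of SHA-256 are assumptions for the example; only the 1 minute and 2 hour values come from this thread. Time is passed in explicitly so the behavior is easy to follow.

```python
import hashlib

DEFAULT_TTL = 60         # seconds; the 1 minute mentioned earlier
BUFFERED_TTL = 2 * 3600  # seconds; the 2 hours suggested for buffered traffic


class MatchCache:
    """Hypothetical cache: payload hash -> (device ID, expiry time)."""

    def __init__(self):
        self._entries = {}

    @staticmethod
    def _key(payload: bytes) -> str:
        return hashlib.sha256(payload).hexdigest()

    def store(self, payload: bytes, device_id: str, accepts_buffered: bool, now: float):
        # Devices that accept buffered traffic get the longer TTL.
        ttl = BUFFERED_TTL if accepts_buffered else DEFAULT_TTL
        self._entries[self._key(payload)] = (device_id, now + ttl)

    def lookup(self, payload: bytes, now: float):
        """Return the cached device ID, or None if absent or expired."""
        key = self._key(payload)
        entry = self._entries.get(key)
        if entry is None:
            return None
        device_id, expires_at = entry
        if now >= expires_at:
            del self._entries[key]  # expired: fall back to full matching
            return None
        return device_id
```

A hit lets the NS skip traversing the DevAddr space; a miss after expiry falls back to normal matching.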

johanstokking commented 4 years ago

> that means that if a duplicate buffered message is received within 2 hours of receiving the original, NS will not traverse the whole DevAddr space, but immediately know the device that matches.

Fine.

Beyond a certain window, I'm also fine with at-least-once delivery.

Note that a cache of 2 hours of all uplink messages is already a few million entries with current TTN traffic.

jpmeijers commented 3 years ago

With V2 becoming read-only very soon, I am forced to use V3 for my Lacuna test devices. But V3 does not allow for receiving out-of-order frames yet, which results in all my satellite traffic being dropped.

Can we perhaps prioritise this, or maybe have a workaround like V2's "disable frame counter checks"?

jpmeijers commented 3 years ago

> Is it on the scale of minutes, hours or days?

Using Lacuna it's about 12 hours on average.

johanstokking commented 3 years ago

Lacuna Space should use Packet Broker.

tftelkamp commented 3 years ago

We are working right now to switch the production servers to Packet Broker. Is out-of-order traffic then accepted, or is there anything we or the device need to set?

jpmeijers commented 3 years ago

@johanstokking I am also unsure how "use Packet Broker" will solve "packets out of order being discarded". Does the Packet Broker do something special that will allow this?

johanstokking commented 3 years ago

The first problem is that when using a UDP bridge, our rate limiters kick in before the Gateway Server even decodes the frame. These rate limiters are based on origin (IP + remote port) and (later in the pipeline) the source gateway ID within the tenant. That won't happen with Packet Broker, at least not so soon.

The other issue is the scope of this issue: the Network Server accepting a lower FCnt as a valid uplink without updating the MAC state or triggering a downlink. Indeed, that isn't supported yet. So if a terrestrial gateway has already delivered a more recent frame than the one arriving later from a satellite gateway, the Network Server will not process the later packet for now. That is what we track in this issue.

jpmeijers commented 3 years ago

> That is what we track in this issue.

yes...

Is there any workaround available that I can use to make it work now? Or do I have to stick to V2?

johanstokking commented 3 years ago

V2 rejects lower FCnt as much as V3 does. Or do I misunderstand the issue here?

jpmeijers commented 3 years ago

V2 has a "Disable Frame Counter Checks" setting that we used to make it work. V3 does not have this feature. I tried the "Frame Counter Resets" option, but this doesn't work the same.

johanstokking commented 3 years ago

Yes, mac-settings.resets-f-cnt works the same way in V3. Or at least, it should. If it doesn't, there's probably a different issue.

tftelkamp commented 3 years ago

"Frame Counter Resets" might be the same in V2 and V3, but it is something different from the "Disable Frame Counter Checks" in V2, and that is what we need. Preferably something better/more secure.

jpmeijers commented 2 years ago

At the moment we are being reminded that we need to migrate our devices from V2 to V3 before 1 December. One of my applications - izinto_lacuna_prototypes - relies on disabling frame counter checks. The frame counter checks in V3 are preventing me from using these devices on V3.

What am I supposed to do?

johanstokking commented 2 years ago

@jpmeijers this issue is about gateways sending traffic out-of-order, not individual end devices that only use FCnt = 0.

We cannot reliably support end devices that only transmit with FCnt = 0. If you really can't do anything else, at least use a random FCnt.

jpmeijers commented 2 years ago

The Lacuna device has a single ABP session used for both terrestrial and space gateways. Every 30 minutes the device sends a message via the terrestrial network. When a satellite is overhead the device sends a couple of messages via the space network.

The space side of things buffers messages and forwards them to TTN when the satellite is over a ground station. This can happen anywhere between 30 minutes and 24 hours after sending.

The periodic terrestrial uplinks make TTN V3 discard all lower frame counters. That is, all frame counters used when uplinking via the satellite.

The issue is therefore due to packets arriving at TTN out of order.
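
The effect can be reproduced with a small Python sketch of a strictly increasing FCnt check. The function name and the FCnt values are invented for illustration; they are not taken from real device logs.

```python
def strict_fcnt_filter(arrivals):
    """Accept a frame only if its FCnt exceeds the highest one accepted so far."""
    highest = -1
    accepted, dropped = [], []
    for source, f_cnt in arrivals:
        if f_cnt > highest:
            highest = f_cnt
            accepted.append((source, f_cnt))
        else:
            dropped.append((source, f_cnt))
    return accepted, dropped


# Arrival order at the network, not transmission order: frames 11 and 12 were
# sent via satellite between terrestrial frames 10 and 13, but arrive hours later.
arrivals = [
    ("terrestrial", 10),
    ("terrestrial", 13),
    ("terrestrial", 14),
    ("satellite", 11),
    ("satellite", 12),
]
accepted, dropped = strict_fcnt_filter(arrivals)
```

Here `dropped` ends up holding both satellite frames, because the terrestrial uplinks have already advanced the counter past them.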

On V2 the workaround was to disable frame counter checks so that the "out of order" messages received and forwarded by the space network are not discarded.

johanstokking commented 2 years ago

I see the use case. Do note that V3 does support resetting FCnt, but from 0 to 0 is not considered a reset.

I just spoke with the Lacuna guys and what we're likely going to do is support multiple sessions within an end device.

jpmeijers commented 2 years ago

> multiple sessions

Yes that is indeed a workaround that has been used recently. It does however mean one loses the "seamless roaming" between the two networks like we had before, and how this was advertised.

johanstokking commented 2 years ago

> It does however mean one loses the "seamless roaming" between the two networks like we had before, and how this was advertised.

Advertised by whom? What we advertise is LoRaWAN compatibility.

jpmeijers commented 2 years ago

https://youtu.be/iwocSYupdIQ?list=PLM8eOeiKY7JV5KMwomW4cJrKB42ItPyey&t=28

tftelkamp commented 2 years ago

Regardless of whether "seamless roaming" implies a single session or not, ignoring frame counters on ABP sessions is something we need to move away from, even though it can be very convenient for testing. Let's focus on a secure solution, and make that seamless.

jpmeijers commented 2 years ago

> Let's focus on a secure solution, and make that seamless.

Yes indeed. That is what this issue is originally about. Lacuna is only one example where this is required.

A while back I also tried starting a discussion on the Basic Station front:

https://github.com/lorabasics/basicstation/issues/115
