TheThingsNetwork / lorawan-stack

The Things Stack, an Open Source LoRaWAN Network Server
https://www.thethingsindustries.com/stack/
Apache License 2.0
994 stars 309 forks source link

Confirm end device session based on network server messages #3649

Closed adriansmares closed 3 years ago

adriansmares commented 3 years ago

Summary

The application server should confirm the end device session (i.e. move dev.PendingSession into dev.Session) on the following events:

Why do we need this?

Currently, it is possible that the NS 'switches' the session without the AS knowing this switch occurred. @rvolosatovs reproduced this in v3.11 using the following sequence.

At this point the AS will never be able to schedule downlinks again in this session unless the NS sends another invalidation some time in the future, because it basically rejected the FCnt increase that happened when the NS sent a FPort=0 downlink (and now the FCnt is always too low).

What is already there? What do you see now?

The session won't recover unless an invalidation occurs in the future.

What is missing? What do you want to see?

Environment

v3.11

How do you propose to implement this?

How do you propose to test this?

Try to reproduce the sequence mentioned in the reproduction steps.

Can you do this yourself and submit a Pull Request?

Yes, but as this is a non-trivial change I'm asking tagging this issue first as discussion - do we want to introduce these changes ? The downlink queue invalidation one seems a requirement, but the other ones are good for consistency.

cc @rvolosatovs

rvolosatovs commented 3 years ago

I am in favor of this, as we have already discussed. Application Server should always trust Network Server, because it always has the most up-to-date data about the device session.

johanstokking commented 3 years ago
  • End Device sends FPort=0 uplink - AS won't receive this uplink

Shouldn't we change this so that NS does send this, but with empty payload and FPort 0, so that it won't be sent upstream?

rvolosatovs commented 3 years ago
  • End Device sends FPort=0 uplink - AS won't receive this uplink

Shouldn't we change this so that NS does send this, but with empty payload and FPort 0, so that it won't be sent upstream?

It's redundant, since NS->AS messaging is async, queue invalidation can arrive before the FPort==0 uplink is sent to AS to confirm the session, so we have to do this anyway. The only reason to send an uplink to AS in response to FPort==0 uplink to NS, would be to ensure AS is notified of session change as soon as possible, but we don't have such need. Even then it would make way more sense to introduce a SessionSwitch message, which NS would send to AS instead of the FPort 0 uplink.

adriansmares commented 3 years ago

I have upgraded the priority to prio/high as it is affecting v3.11.1 deployments.

The changes should be the following:

The following proto addition (field 3) should suffice.

message ApplicationInvalidatedDownlinks {
  repeated ApplicationDownlink downlinks = 1;
  uint32 last_f_cnt_down = 2;
  bytes session_key_id = 3 [(gogoproto.customname) = "SessionKeyID", (validate.rules).bytes.max_len = 2048];
}

https://github.com/TheThingsNetwork/lorawan-stack/blob/e2fa6c085eaaf1a0b70939020244875bd01e5857/pkg/applicationserver/applicationserver.go#L1020-L1038

Instead of using the dev.Session always, do a switch on SessionKeyID in order to establish which session to use. If required, update the current dev.Session.

https://github.com/TheThingsNetwork/lorawan-stack/blob/e2fa6c085eaaf1a0b70939020244875bd01e5857/pkg/applicationserver/applicationserver.go#L1060-L1073

As before, do a switch on the SessionKeyID in order to determine which session to use. If required, update the current dev.Session.

The following proto addition should be filled and be added as error details to the errFCntTooLow in the NS:

message UpdateDownlinkQueueErrorDetails {
  bytes session_key_id = 1 [(gogoproto.customname) = "SessionKeyID", (validate.rules).bytes.max_len = 2048];
  uint32 last_f_cnt_down = 2;
}

The AS can then take these details and update the current session.

The changes are backwards compatible and hopefully minimal on the NS side. The genie is already out of the bottle - the whole protocol between AS and NS slowly became asynchronous, and simply reverting the FPort=0 change won't be enough. I don't think that this transformation was wrong at the end of the day, but we must fix these quirks regarding session bisimulation.

cc @johanstokking, @rvolosatovs

johanstokking commented 3 years ago

Sounds good to me!

johanstokking commented 3 years ago

Anything I can do here? If so, please re-assign me and let me know what.

adriansmares commented 3 years ago

Anything I can do here? If so, please re-assign me and let me know what.

I'm already working on this, with a huge emphasis on the following point:

* (Optional) return error details on `DownlinkQueue{Push|Replace}` with the minimum `FCnt` and always update the `LastAFCntDown` to this value. This would ensure that the system converges if at any point we're for some reason out of sync between AS and NS.

The reason for this is that taking actions based on the events received from the NS, as part of the queue, is fundamentally not really enough:

Given these characteristics, there are two options:

johanstokking commented 3 years ago
  • Just trust the NS. What I mean by this, is that push/replace operations return the dev.Session.SessionKeyID, dev.PendingSession.SessionKeyID, and the LastAFCntDown, as part of the error details. We then use the error details to rebuild the session in the AS. Fundamentally this means that when we try to do a downlink queue operation, using outdated data (perhaps an outdated session, perhaps a FCnt too low), we eventually converge to the NS state. It may take one, two, three tries, I'll make it bounded in order to avoid infinitely spinning, but we're at least operating with information that's significantly newer than the one from the uplink messages queue.

I think this is the best option.