TheThingsNetwork / lorawan-stack

The Things Stack, an Open Source LoRaWAN Network Server
https://www.thethingsindustries.com/stack/
Apache License 2.0

Downlink nack behavior for 1.0.x devices #3498

Open adriansmares opened 4 years ago

adriansmares commented 4 years ago

Summary

The current behavior of the AS may lead to lost queues once a nack arrives, due to the way the downlink queue is recalculated.

Steps to Reproduce

Consider the following situation: three confirmed downlinks are sitting in the NS downlink queue, and none of them has been scheduled yet: [image]

The NS attempts to schedule the downlink with FCnt n: [image]

For whatever reason, an ack is not received, so the NS emits a nack to the AS: [image]

At this point, the NS is allowed to schedule a new downlink before the AS has handled the nack: [image]

The AS now handles the nack and attempts to recalculate the queue: [image]

But since the green downlink has already been scheduled, the queue recalculation is rejected because the FCnt of the first element is too low. As a result, we lose the blue and yellow downlinks.
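The rejection described above can be sketched as follows. This is a minimal illustration, not the actual AS/NS code: the type and function names (`Downlink`, `replaceQueue`) and the error message are hypothetical.

```go
package main

import "fmt"

// Downlink is a minimal stand-in for a queued application downlink;
// the real types live in the AS and NS packages.
type Downlink struct {
	FCnt    uint32
	Payload string
}

// replaceQueue sketches the guard described above: the NS rejects a
// recalculated queue whose first FCnt is not strictly above the FCnt
// it has already scheduled. Names and signature are illustrative.
func replaceQueue(lastScheduledFCnt uint32, queue []Downlink) error {
	if len(queue) > 0 && queue[0].FCnt <= lastScheduledFCnt {
		return fmt.Errorf("first FCnt %d is too low: NS already scheduled FCnt %d",
			queue[0].FCnt, lastScheduledFCnt)
	}
	return nil
}

func main() {
	// The AS rebuilds the queue starting again at the nacked FCnt 10,
	// but the NS has meanwhile scheduled FCnt 11. The whole replacement
	// is rejected, so the later items (FCnt 11 and 12) are lost as well.
	queue := []Downlink{{FCnt: 10}, {FCnt: 11}, {FCnt: 12}}
	if err := replaceQueue(11, queue); err != nil {
		fmt.Println("queue replacement rejected:", err)
	}
}
```

Note that the guard itself is reasonable in isolation; the problem is the race between the nack handling in the AS and the next scheduling attempt in the NS.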

What do you see now?

The recalculated queue has an FCnt that is too low.

What do you want to see instead?

The recalculated queue should have a higher FCnt than the currently scheduled element, but how should we go about this?

Environment

v3.10

How do you propose to implement this?

Not sure at the moment.

How do you propose to test this?

Add some delays in the nack validation and check that even after the second downlink was scheduled, the recalculation is successful.

Can you do this yourself and submit a Pull Request?

Yes, but I think we should clear up the behavior first. cc @johanstokking

johanstokking commented 4 years ago

At this point, the NS is allowed to already schedule a new downlink, before the AS handled the nack:

I think that the problem arises here. NS should not schedule a new downlink if a previous confirmed downlink is not acked. It should retry. The downlink frame is lost, so it can retry with the same FCntDown.
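The retry behavior proposed here could be sketched as below. This is an illustrative sketch, not the actual NS implementation: `frame`, `retryConfirmed`, and the `transmit` callback are hypothetical names.

```go
package main

import "fmt"

// frame is a minimal stand-in for a confirmed downlink frame.
type frame struct {
	FCntDown uint32
	Payload  []byte
}

// retryConfirmed sketches the proposal above: when no ack arrives, the
// NS retransmits the same frame with the same FCntDown instead of
// advancing to the next queued downlink.
func retryConfirmed(f frame, maxAttempts int, transmit func(frame) bool) bool {
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if transmit(f) { // transmit reports whether an ack came back
			return true
		}
		// The frame was lost, so retrying with the unchanged FCntDown
		// is safe: the device never saw this frame counter.
	}
	return false
}

func main() {
	attempts := 0
	// Simulate a link that drops the first two transmissions.
	transmit := func(f frame) bool {
		attempts++
		return attempts >= 3
	}
	ok := retryConfirmed(frame{FCntDown: 42}, 5, transmit)
	fmt.Println("acked:", ok, "after", attempts, "attempts")
}
```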

rvolosatovs commented 3 years ago

For whatever reason, an ack is not received and as such the NS emits an nack to the AS:

That does not happen: the NS does not send a nack if there's no uplink from the device. For class B and C, however, the NS moves to the next downlink if a class-based timeout is exceeded: https://github.com/rvolosatovs/lorawan-stack/blob/26dd9be2cd78f9b1cee4fcc4f85102d7b7f657fc/pkg/networkserver/mac/utils.go#L65-L91

I think that the problem arises here. NS should not schedule a new downlink if a previous confirmed downlink is not acked. It should retry. The downlink frame is lost, so it can retry with the same FCntDown.

@johanstokking that is not the current behavior. I think we have already discussed this offline and came to the conclusion that the NS just tries the next one, and that we would add an option to application downlinks which makes the NS wait for the particular confirmed downlink to be acked (or nacked) before proceeding to the next application downlink in the queue.

johanstokking commented 3 years ago

@johanstokking that is not the current behavior. I think we have already discussed this offline and came to the conclusion that the NS just tries the next one, and that we would add an option to application downlinks which makes the NS wait for the particular confirmed downlink to be acked (or nacked) before proceeding to the next application downlink in the queue.

Hmm, what is the benefit then of confirmed downlink, if there's no automatic retry mechanism?

rvolosatovs commented 3 years ago

The fact that the application is notified whether it has been delivered or not (via ack/nack).

industrialinternet commented 3 years ago

Hmm, what is the benefit then of confirmed downlink, if there's no automatic retry mechanism?

I've been seeing this issue. I use confirmed downlinks to control Vicki TRV radiator valves. When I look at the work needed at the application layer to ensure a downlink is received, I find that it's too much effort. I'm a low-code scripter, not a dev, and as more actuators appear in the ecosystem, the need for confirmed downlinks will grow. Just for reference, even the LNS running inside a Multitech AEP gateway re-queues a failed confirmed downlink.

johanstokking commented 3 years ago

I also think that the default behavior should be to retry. It's what we did in V2 and it's what other LNSes are doing, although it's not strictly specified in the LoRaWAN spec.

Besides moving the retry logic to the application layer, we may also get devices into an invalid state if some downlink messages make it and some do not, or, if the application retries, if they arrive in a different order than originally scheduled.

rvolosatovs commented 3 years ago

I also think the NS should retry confirmed downlinks and the end device ending up in an invalid state was exactly my point when we discussed this earlier @johanstokking. Glad that we're finally on the same page.

rvolosatovs commented 3 years ago

https://github.com/TheThingsNetwork/lorawan-stack/issues/3642