Document additional v3 to v3 migration details

adriansmares commented 2 years ago

Summary

The v3-to-v3 migration path can contain additional debug steps until we have better support in ttn-lw-migrate.

Why do we need this?

In order to ease the migrations between v3 distributions.

What is already there? What do you see now?

https://www.thethingsindustries.com/docs/getting-started/migrating/migrating-from-ce-to-ch/migrate-active-session/

What is missing? What do you want to see?

For situations in which the gateways cannot easily be migrated (i.e. they are not accessible in order to have them point to the new v3 environment), and the end devices rejoin automatically when the Network Server does not respond to MAC commands, we have a possible way to avoid data loss. We may want to document this path.

How do you propose to document this?

The idea is as follows: We would like to keep the end device session in the source v3 cluster, but this source cluster should not be able to send downlinks to the end device, and it should not be able to process join requests. Given these two prerequisites, we can process all of the traffic from the end device while also forcing it to rejoin.

This can be achieved as follows:

In the source cluster, change the end device's mac_settings.schedule_downlinks to false.
- This ensures that the Network Server does not attempt to schedule any downlinks (MAC commands, data downlinks, join accepts).

This can easily be automated using the CLI:

ttn-lw-cli dev set app1 dev1 \
   --mac-settings.schedule-downlinks=false

This requires The Things Stack v3.21 (both CLI and stack).
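
For more than a handful of devices, the same command can be wrapped in a loop. The following is a minimal sketch, not a tested migration script: it assumes ttn-lw-cli is installed and already logged in to the source cluster, and the application ID (app1), the device IDs (dev1, dev2, dev3), the DRY_RUN variable and the disable_downlinks helper are placeholders introduced here for illustration. By default it only prints the commands it would run.

```shell
#!/bin/sh
# Sketch: disable downlink scheduling for several end devices of one application.
# Assumption: ttn-lw-cli is installed and logged in to the SOURCE cluster.
# app1 and dev1..dev3 are placeholder IDs; replace them with real ones.

# DRY_RUN=1 (the default here) prints the commands instead of running them.
DRY_RUN="${DRY_RUN:-1}"

# disable_downlinks is a hypothetical helper, not part of the CLI.
disable_downlinks() {
  app_id="$1"
  dev_id="$2"
  if [ "$DRY_RUN" = "1" ]; then
    echo "ttn-lw-cli dev set $app_id $dev_id --mac-settings.schedule-downlinks=false"
  else
    ttn-lw-cli dev set "$app_id" "$dev_id" --mac-settings.schedule-downlinks=false
  fi
}

for id in dev1 dev2 dev3; do
  disable_downlinks app1 "$id"
done
```

Review the printed commands first, then run the script with DRY_RUN=0 to apply the change.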

Can you do this yourself and submit a Pull Request?

Can review. cc @johanstokking - what do you think about documenting this migration path? I think it should allow zero data loss for end devices that do rejoin when no MAC responses are sent.

johanstokking commented 2 years ago

We would like to keep the end device session in the source v3 cluster

Why?

adriansmares commented 2 years ago

We would like to keep the end device session in the source v3 cluster

Why?

This issue is strictly about situations in which the gateway cannot be moved from TTSCE to TTSC. Since the device address prefixes differ, you cannot use Packet Broker (PB) for roaming. In the interval of time between the re-registration and the rejoin, uplinks would be lost if we do not keep the session in the source v3 cluster. This issue describes how to keep the old session in the source v3 cluster for 'uplink only', while ensuring that the target v3 cluster will be the one answering the eventual Join Request.

johanstokking commented 2 years ago

I see. LoRa Alliance has device migration between networks also on the radar now, and session migration is one of the approaches. It would (only) work with (temporarily) routing traffic to both the old and the new destination.

That said, should we consider these two alternative approaches?

1. Route TTSCE to TTSC clusters via Packet Broker. NS currently does one ZRANGE WITHSCORES and one HGETALL per session (current and pending). In TTSCE eu1, the current rate is a bit over 400 pps, so 1600 Redis commands extra per second in TTSC eu1 and eu2. This will align best with what LoRa Alliance comes up with.
2. Add a field in Network Server to disable downlinks and activations, basically marking the device as pending emigration. This is a bit more elegant than corrupting the session and keys.

(1) and (2) can be complementary
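
For reference, the estimate in (1) can be reproduced as follows. This is a rough worked calculation, assuming that every TTSCE eu1 uplink is forwarded and that each forwarded uplink triggers both lookups for both the current and the pending session:

```latex
\[
\underbrace{400\ \tfrac{\text{uplinks}}{\text{s}}}_{\text{TTSCE eu1 rate}}
\times \underbrace{2}_{\text{sessions: current, pending}}
\times \underbrace{2}_{\text{commands: ZRANGE, HGETALL}}
= 1600\ \tfrac{\text{Redis commands}}{\text{s}}
\]
```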

adriansmares commented 2 years ago

We do not have a good experience with the word 'temporary' when it comes to migrations. Once the genie is out of the bottle, I don't think that we can turn off the traffic rule with ease, and this becomes debt. I'm ok with this kind of debt, but I think we should frame it as a permanent change from the start.

Matching-wise, we can handle the extra traffic since matching is done on the read-only replica. We will also pay some CPU cost on the Network Server side (mainly unmarshalling and RPC overhead, but I expect matching to be either a hit or 0 returned results from Redis).
See if https://github.com/TheThingsNetwork/lorawan-stack/pull/5634 fits what you had in mind.

johanstokking commented 2 years ago

Regarding (2), which I'll review next week, does it include disallowing activations?

adriansmares commented 2 years ago

Regarding (2), which I'll review next week, does it include disallowing activations?

It disables any form of scheduling on the Network Server side - be it a join accept, class A downlink or network initiated downlink.

Edit: The original issue body now contains the shortened instructions to disable downlinks.

TheThingsIndustries / lorawan-stack-docs