TheThingsNetwork / lorawan-stack

The Things Stack, an Open Source LoRaWAN Network Server
https://www.thethingsindustries.com/stack/
Apache License 2.0

Scalability of application packages traffic processing #3896

Closed adriansmares closed 3 years ago

adriansmares commented 3 years ago

Summary

The current application packages traffic processing pipeline is strictly serial, which can lead to lost traffic when a handler stalls.

Why do we need this?

To avoid losing traffic for all packages when a single package stalls its processing.

What is already there? What do you see now?

The current pipeline works as follows:

  • The application packages frontend holds a single subscription whose buffered channel carries the uplink traffic for all packages.
  • Uplinks are published to that channel via PublishUp, which returns a buffer_full error when the channel is full.
  • One loop drains the channel and runs the package handlers serially.

What is missing? What do you want to see?

If a certain handler stalls (for example, one that does HTTP requests that may time out on a network partition), the subscription channel fills up and stops accepting new uplinks. The follow-up effect is that traffic is lost for all packages if a single one stalls.
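To make the failure mode concrete, here is a rough sketch of the current shape (hypothetical names and signatures, not the actual frontend code):

// All packages share one subscription; a single serial loop drains it.
func handleUplinks(ctx context.Context, ch <-chan *ttnpb.ApplicationUp, handlers []func(context.Context, *ttnpb.ApplicationUp) error) {
  for up := range ch {
    for _, handle := range handlers {
      // If one handler blocks here (e.g. on an HTTP request that never
      // returns), the loop stops draining ch, the buffer fills up, and
      // new uplinks for every package are dropped with buffer_full.
      _ = handle(ctx, up) // errors are logged in practice
    }
  }
}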

Environment

v3.11.1, but the pipeline has been working like this since the beginning.

How do you propose to implement this?

I propose that the pipeline is parallelized in the following manner:

Initialization:

  • During the creation of application packages frontend, create a subscription for each package.

Runtime:

  • We fan out the uplink to each application package subscription.

The advantage of this approach is that one application package cannot make others miss traffic. It also parallelizes the processing.
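A rough sketch of the proposed shape (startWorkers, handleUp and bufferSize are hypothetical stand-ins, not the actual frontend code):

// Initialization: a buffered channel and a worker goroutine per package.
func startWorkers(names []string, bufferSize int) map[string]chan *ttnpb.ApplicationUp {
  subs := make(map[string]chan *ttnpb.ApplicationUp, len(names))
  for _, name := range names {
    ch := make(chan *ttnpb.ApplicationUp, bufferSize)
    subs[name] = ch
    go func(name string, ch <-chan *ttnpb.ApplicationUp) {
      for up := range ch {
        _ = handleUp(name, up) // a stall here only backs up this package's channel
      }
    }(name, ch)
  }
  return subs
}

// Runtime: fan each uplink out to every package subscription, non-blocking.
func fanOut(up *ttnpb.ApplicationUp, subs map[string]chan *ttnpb.ApplicationUp) {
  for _, ch := range subs {
    select {
    case ch <- up:
    default:
      // Only this package's buffer is full; the others keep receiving.
    }
  }
}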

One detail that we have to take into account is that the packages themselves receive more than just the io.ContextualApplicationUp message - they also receive the current state of the association.

My proposal would be to change io.Subscription to work with the following interface (i.e. to use it as the element type of the channel):

type SubscriptionMessage interface {
  // ApplicationUp returns the uplink to be processed.
  ApplicationUp() *ttnpb.ApplicationUp
  // Context returns the message context.
  Context() context.Context
}

io.ContextualApplicationUp can implement this interface by default, keeping the required change minimal. The application packages frontend, however, can use an extended message type that also contains the package state, to be delivered to each package handler.
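For example, the extended type could look like this (a sketch; PackageSubscriptionMessage and AssociationState are placeholder names):

// A frontend-specific message that also carries the package association state.
type PackageSubscriptionMessage struct {
  ctx   context.Context
  up    *ttnpb.ApplicationUp
  state *AssociationState
}

func (m *PackageSubscriptionMessage) ApplicationUp() *ttnpb.ApplicationUp { return m.up }
func (m *PackageSubscriptionMessage) Context() context.Context           { return m.ctx }

// State is the extra accessor that only the application packages frontend uses.
func (m *PackageSubscriptionMessage) State() *AssociationState { return m.state }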

How do you propose to test this?

Set the timeout to infinite in the LoRaCloud DMS http.Client, point the remote server at a slowloris, and start pumping traffic. If the buffer_full error occurs only for the LoRaCloud DMS frontend, while the other packages still process and receive their traffic, everything is fine.
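Roughly, the harness could look like this (a sketch; wiring the client and server into the DMS frontend is omitted):

// A client with no timeout: requests against a stalled server block forever.
dmsClient := &http.Client{Timeout: 0}

// A slowloris stand-in: accept requests and never answer.
slow := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
  <-r.Context().Done() // hold the connection open until the test ends
}))
defer slow.Close()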

Can you do this yourself and submit a Pull Request?

Yes. cc @johanstokking

This issue does not introduce breaking changes and may be delivered in a patch release as well. I'm just choosing v3.12 here since I can't pick something like v3.11.4 or .5.

johanstokking commented 3 years ago
  • During the creation of application packages frontend, create a subscription for each package.

[...]

The advantage of this approach is that one application package cannot make others miss traffic. It also parallelizes the processing.

Then there should be parallelization in the package too, right? Otherwise one application performing long DMS calls can still stall other applications using DMS too.

  • We fan out the uplink to each application package subscription.

If the fanout is a channel send: is it waited upon with a timeout? And if the sends are parallel, are they all awaited, to avoid a goroutine blow-up?

adriansmares commented 3 years ago

Then there should be parallelization in the package too, right? Otherwise one application performing long DMS calls can still stall other applications using DMS too.

We can do that indeed. The reason I didn't mention it is that packages can modify the state of the device / application, which would mean they race by default. Nonetheless, we don't have any packages that do this, so it should be fine.

We can also make the distribution stable (i.e. have a subscription per worker, and always hash the same end device UID to the same worker).
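A minimal sketch of such a stable distribution (the worker channel slice is hypothetical; hash/fnv is from the standard library):

// workerFor routes the same end device UID to the same worker every time,
// preserving per-device ordering.
func workerFor(devUID string, workers []chan *ttnpb.ApplicationUp) chan<- *ttnpb.ApplicationUp {
  h := fnv.New32a()
  h.Write([]byte(devUID))
  return workers[h.Sum32()%uint32(len(workers))]
}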

If the fanout is a channel send: is it waited upon with a timeout? And if the sends are parallel, are they all awaited, to avoid a goroutine blow-up?

PublishUp for io.Subscription does not wait: if the channel is full, it just returns an error. So pushing a message always takes constant time.
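i.e. roughly (a sketch, not the actual PublishUp implementation; errBufferFull stands in for the real buffer_full error):

func publishUp(ch chan<- *ttnpb.ApplicationUp, up *ttnpb.ApplicationUp) error {
  select {
  case ch <- up:
    return nil
  default:
    return errBufferFull // the caller sees this instead of blocking
  }
}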