Filter end device event types that are forwarded to applications

htdvisser commented 2 years ago

Summary

We need to filter which end device events are forwarded to the application event stream.

Why do we need this?

Because currently event streams can't keep up with large applications.

What is already there? What do you see now?

All end device events are forwarded to application event streams. This means that for applications with thousands of devices the event stream won't be able to keep up, and drop events. This makes the event stream pretty useless.

What is missing? What do you want to see?

We can think about a (possibly configurable) set of end device events that get forwarded to the application. My proposal would be to only forward application-layer events (starting with as.).

Environment

All deployments, but specifically The Things Stack Cloud.

How do you propose to implement this?

Define a list of end device event names that get forwarded to the application.

Can you do this yourself and submit a Pull Request?

Sure, but this first needs some discussion.

kschiffer commented 2 years ago

Related to #2887

htdvisser commented 2 years ago

The most chatty end device events are:

ns.up.data.receive (which is published for each duplicate)
ns.up.data.drop
ns.up.data.process
ns.up.data.forward
as.up.data.receive
as.up.data.forward
ns.down.data.schedule.attempt
ns.down.data.schedule.success
as.packages.storage.up.store
ns.mac.link_adr.request
ns.mac.dev_status.request
ns.mac.new_channel.request
ns.mac.link_adr.answer.accept
ns.mac.rx_timing_setup.request

Most of these are in my opinion not relevant to the application, and therefore don't need to be forwarded to application event streams.

kschiffer commented 2 years ago

I'm happy with any irrelevant event type that we can omit by default.

Also compare with what events we show by default in the Console:

  'as.*.drop',
  'as.down.data.forward',
  'as.up.location.forward',
  'as.up.data.forward',
  'as.up.service.forward',
  'gs.down.send',
  'gs.gateway.connect',
  'gs.gateway.disconnect',
  'gs.status.receive',
  'gs.up.receive',
  'js.join.accept',
  'js.join.reject',
  'ns.mac.*.answer.reject',

adriansmares commented 2 years ago

I'd like to see how things perform after https://github.com/TheThingsNetwork/lorawan-stack/pull/4831 . If this is still an issue, we can indeed start slashing a bit the NS events.

htdvisser commented 2 years ago

This is only partly related to #4831. While #4831 improves the performance on the event subscriber side, this issue is trying to improve the performance on the event publisher side, and of the pub/sub system (usually Redis).

adriansmares commented 2 years ago

My only concern is that this behavior is inconsistent - It's slightly weird that normal I would go to the application Live Data and see some events only, given that before I saw everything. But I fully agree that an incomplete and inconsistent stream won't help much (maybe only for programmatic consumers), and that this becomes worse with scale.

This means that for applications with thousands of devices the event stream won't be able to keep up, and drop events.

I can't fully find where this occurs in the publishing code - could you please give some pointers here ?

htdvisser commented 2 years ago

We currently don't actually drop them, which means that the event system will probably just grind to a halt, which would impact everyone instead of just the concerning application.

https://github.com/TheThingsNetwork/lorawan-stack/blob/bb54ef7af244e85b55c5311e64f41376fffbe919/pkg/events/basic/subscription.go#L67-L70

https://github.com/TheThingsNetwork/lorawan-stack/blob/bb54ef7af244e85b55c5311e64f41376fffbe919/pkg/events/redis/redis.go#L187-L193

https://github.com/TheThingsNetwork/lorawan-stack/blob/bb54ef7af244e85b55c5311e64f41376fffbe919/pkg/events/redis/store.go#L397-L405

KrishnaIyer commented 2 years ago

I also think only the following are relevant to application users, i.e, something that says "success" and something that helps debug "failure". Most of the time, I don't even look at the other events since they don't convey any information to me.

  'as.*.drop',
  'as.*.forward',

htdvisser commented 2 years ago

Okay so what about this:

EndDevice CRUD:

end_device.*
ns.end_device.*
as.end_device.*
js.end_device.*

AS Forwarded Uplinks/Downlinks:

as.*.forward

JS Joins:

js.join.accept

Errors and warnings:

*.fail
*.warning
*.drop
*.reject

NS MAC layer:

ns.class.switch.*

htdvisser commented 2 years ago

I'm going to implement this for the next milestone.

kschiffer commented 2 years ago

Looking into this now and remembered that the Console relies on certain event types to synchronize "hot data" such as

as.up.*.forward for Last activity (Currently not planned to be filtered out)
ns.up.data.process for uplink/downlink count

Filtering those out would result in the Console not being able to keep these data points updated. So we either need to find a replacement event or keep these events unfiltered. We already ran into this problem when moving to backend-enabled filters (#4831) in the backend and had to adjust the filtering accordingly. This caused a problem where we now have to show some events in the Console simply because we rely on it internally and we removed hiding of events on the frontend.

Also, keep in mind that changing what data is being displayed can cause confusion/frustration among users if not communicated properly. So we should make sure that users will understand why the event output changed, via the changelog and other communication channels.

htdvisser commented 2 years ago

The as.up.*.forward events will still be included in the stream of application events.
Do you really need ns.up.data.process in the stream of application events? It's still included in the stream of device events.
Obviously there will be a CHANGELOG item for this.

kschiffer commented 2 years ago

Do you really need ns.up.data.process in the stream of application events? It's still included in the stream of device events.

Yes, we could use the end device event stream to get this info, but it's currently still sourced from the app stream.

htdvisser commented 2 years ago

Re-assigning since my part is done. What's still TODO is to update the Console accordingly.

TheThingsNetwork / lorawan-stack