TheThingsNetwork / lorawan-stack

The Things Stack, an Open Source LoRaWAN Network Server
https://www.thethingsindustries.com/stack/
Apache License 2.0
975 stars 306 forks source link

Improve event backends resilience #3191

Closed adriansmares closed 4 years ago

adriansmares commented 4 years ago

Summary

Event backends (pub/subs) need to be more resilient in case of failures. Additionally, we need more logging presence, in order to be able to debug failures.

Why do we need this?

In order to automatically reconnect on connection failures, and in order to allow better debugging capabilities.

What is already there? What do you see now?

https://github.com/TheThingsNetwork/lorawan-stack/blob/7f5fb0457fb872e1a152a358916922b2701aa3c7/pkg/events/cloud/cloud.go#L45-L50

https://github.com/TheThingsNetwork/lorawan-stack/blob/7f5fb0457fb872e1a152a358916922b2701aa3c7/pkg/events/redis/redis.go#L62-L66

This has the following issues:

What is missing? What do you want to see?

Environment

v3.9.1

How do you propose to implement this?

Add the missing retry mechanism and improve the logging, in places where errors can occur.

How do you propose to test this?

Trigger 'partitions' by killing Redis / the gocloud Pub/Sub.

Can you do this yourself and submit a Pull Request?

Yes. cc @htdvisser @rvolosatovs

rvolosatovs commented 4 years ago

I agree with everything above. In terms of Redis implementation, is there a reason to use Pub/Sub, when we could easily achieve fail-safety and exactly-once delivery semantics with streams? It is already used in Event Server. Consumer part: https://github.com/TheThingsIndustries/lorawan-stack/blob/b795deb83158c7732a0fbd3e9b6dc6d941a844ff/pkg/eventserver/eventserver.tti.go#L34-L55 Publisher part: https://github.com/TheThingsIndustries/lorawan-stack/blob/b795deb83158c7732a0fbd3e9b6dc6d941a844ff/pkg/eventserver/eventserver.tti.go#L87-L105

Debugging is also a breeze, because we'd have an event log as long as we want - we set the limit to N and Redis will always store last N entries regardless of whether they're processed or not.

adriansmares commented 4 years ago

I agree with everything above. In terms of Redis implementation, is there a reason to use Pub/Sub, when we could easily achieve fail-safety and exactly-once delivery semantics with streams? It is already used in Event Server. Consumer part: https://github.com/TheThingsIndustries/lorawan-stack/blob/b795deb83158c7732a0fbd3e9b6dc6d941a844ff/pkg/eventserver/eventserver.tti.go#L34-L55 Publisher part: https://github.com/TheThingsIndustries/lorawan-stack/blob/b795deb83158c7732a0fbd3e9b6dc6d941a844ff/pkg/eventserver/eventserver.tti.go#L87-L105

Debugging is also a breeze, because we'd have an event log as long as we want - we set the limit to N and Redis will always store last N entries regardless of whether they're processed or not.

The Redis events PubSub was developed before Redis 5 was a thing - there were no streams in early 2018.

I'll pick up the task-based approach for subscribe in the fix PR, similar to how ES does it.

While I do see your point regarding storage (compared to normal Redis Pub/Sub), I think in the current format of the pkg/events PubSub interface, subscription groups are useless, since Subscribe has broadcast semantics anyway (so we cannot use XREADGROUP).