cap-js-community / event-queue

An event queue that enables secure multi-tenant enabled transactional processing of asynchronous events, featuring instant event processing with Redis Pub/Sub and load distribution across all application instances.
https://cap-js-community.github.io/event-queue/
Apache License 2.0

Handle Unsubscribed Tenants #160

Closed Kkoile closed 1 month ago

Kkoile commented 3 months ago

In our logging system we see a lot of logs like the following:

- Processing periodic events failed with unexpected error. Execution in new transaction failed: Acquiring client from pool timed out. Please review your system setup, transaction handling, and pool configuration. Pool State: borrowed: 0, pending: 0, size: 1, available: 0, max: 10 {
  eventType: 'EVENT_QUEUE_BASE_PERIODIC',
  eventSubType: 'DELETE_EVENTS',
  tenantId: 'abc',
  tenantIdBase: 'abc',
  globalTenantId: 'abc'
}
- Possibly stale credentials for tenant abc, re-trying with fresh credentials from BTP Service Manager
- Could not establish connection for tenant "abc" due to error: authentication failed: Detailed info for this error can be found with correlation ID '123'

The tenant in question has unsubscribed from our application. Hence, I guess the event-queue plugin does not recognize when a tenant unsubscribes and keeps trying to handle events for it until the server instance restarts.

At first, I thought simply hooking into the mtxs event handlers would be good enough, but in a multi-instance setup one would need to distribute the unsubscription to all instances.

So I think the best approach would be to catch the "authentication failed" error when connecting to the DB and check with the Service Manager whether the tenant still exists.
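A minimal sketch of that idea, under stated assumptions: `connect` and `tenantExists` are hypothetical callbacks standing in for the DB connection attempt and the Service Manager lookup, and the error is recognized only by the "authentication failed" text from the logs above. None of these names are taken from event-queue or cds-mtxs.

```typescript
// Hypothetical sketch: none of these names are confirmed event-queue or cds-mtxs APIs.
type Connect = (tenantId: string) => Promise<void>;
type TenantExists = (tenantId: string) => Promise<boolean>;
type Outcome = "connected" | "tenant-gone";

// Matches the "authentication failed" text seen in the logs above.
function isAuthenticationError(err: unknown): boolean {
  return err instanceof Error && /authentication failed/i.test(err.message);
}

async function connectOrDetectOffboarded(
  tenantId: string,
  connect: Connect,
  tenantExists: TenantExists
): Promise<Outcome> {
  try {
    await connect(tenantId);
    return "connected";
  } catch (err) {
    // Only consult the tenant lookup for authentication failures;
    // every other error (and auth errors for live tenants) is rethrown.
    if (isAuthenticationError(err) && !(await tenantExists(tenantId))) {
      return "tenant-gone";
    }
    throw err;
  }
}
```

A caller could treat `"tenant-gone"` as a signal to skip the tenant quietly instead of logging an error, while genuine authentication problems for existing tenants still surface.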

soccermax commented 3 months ago

The event-queue does detect this. It fetches all onboarded tenants from cds-mtxs and runs for each of them. Because the event-queue uses setTimeout to schedule jobs internally, a job for a tenant may already have been scheduled within the defined runInterval (default 25 min) before that tenant unsubscribed. These errors are all caught and only produce logs. They should not harm anything; it's actually a cosmetic issue. Does this match your observation? Or are there any other issues with this?
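To illustrate the timing window described here, a simplified sketch follows. It is not the event-queue's actual scheduler: `planRuns` and its `offsetMs` parameter are hypothetical, and only the use of setTimeout and the 25-minute default come from the comment above.

```typescript
const RUN_INTERVAL_MS = 25 * 60 * 1000; // default runInterval mentioned above

type Timer = ReturnType<typeof setTimeout>;

// Plans one run per currently known tenant at an offset inside the interval.
// `offsetMs` is a hypothetical parameter; how real offsets are chosen is not stated here.
function planRuns(
  tenantIds: string[],
  run: (tenantId: string) => void,
  offsetMs: (tenantId: string) => number
): Map<string, Timer> {
  const timers = new Map<string, Timer>();
  for (const tenantId of tenantIds) {
    const delay = Math.min(offsetMs(tenantId), RUN_INTERVAL_MS);
    timers.set(tenantId, setTimeout(() => run(tenantId), delay));
  }
  return timers;
}
```

If tenant `abc` is offboarded after `planRuns` has been called, its timer still fires unless something clears it. Only the next refresh of the tenant list would leave it out.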

Kkoile commented 3 months ago

> These errors are all caught and only produce logs. But should not harm anything, it's actually a cosmetic issue.

Exactly, it's not impacting anything, as they are just logs. But it spams our logs, which is unnecessary and could be avoided. It also gives the impression that something is not working as it should. Only after looking up the tenant IDs does it become clear that these logs can be ignored.

soccermax commented 3 months ago

I understand your point. For this to work, cds-mtxs would need to distribute the offboard event to all instances. However, they unfortunately do not support cross-instance messaging. In the meantime, I could intercept those errors and prevent them from being logged (as you said). I will investigate this further.

soccermax commented 2 months ago

The current plan is to register for the offboard events of mtxs and distribute them to all application instances (that's what mtxs should actually be doing; this will be removed as soon as cds has implemented it). All instances will react to this event and cancel all planned events. This won't eliminate all error messages but will reduce them to a bare minimum.
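The plan could look roughly like the sketch below. Everything in it is an assumption made for illustration: the `Broadcast` interface, the `InMemoryBroadcast` stand-in (used so the example runs without Redis), the channel name, and the `AppInstance` class are hypothetical and are not part of event-queue or cds-mtxs.

```typescript
// Hypothetical sketch: the channel name, class names and Broadcast interface are assumptions.
type Handler = (message: string) => void;
type Timer = ReturnType<typeof setTimeout>;

interface Broadcast {
  publish(channel: string, message: string): void;
  subscribe(channel: string, handler: Handler): void;
}

// In-memory stand-in for a cross-instance channel such as Redis Pub/Sub.
class InMemoryBroadcast implements Broadcast {
  private handlers = new Map<string, Handler[]>();

  publish(channel: string, message: string): void {
    (this.handlers.get(channel) ?? []).forEach((handler) => handler(message));
  }

  subscribe(channel: string, handler: Handler): void {
    const list = this.handlers.get(channel) ?? [];
    list.push(handler);
    this.handlers.set(channel, list);
  }
}

const OFFBOARD_CHANNEL = "tenant-offboarded";

class AppInstance {
  // Timers that already fired are not pruned in this sketch.
  private timers = new Map<string, Timer[]>();
  private broadcast: Broadcast;

  constructor(broadcast: Broadcast) {
    this.broadcast = broadcast;
    // Every instance listens, including the one that publishes.
    broadcast.subscribe(OFFBOARD_CHANNEL, (tenantId) => this.cancelTenant(tenantId));
  }

  plan(tenantId: string, run: () => void, delayMs: number): void {
    const list = this.timers.get(tenantId) ?? [];
    list.push(setTimeout(run, delayMs));
    this.timers.set(tenantId, list);
  }

  plannedCount(tenantId: string): number {
    return (this.timers.get(tenantId) ?? []).length;
  }

  cancelTenant(tenantId: string): void {
    (this.timers.get(tenantId) ?? []).forEach((timer) => clearTimeout(timer));
    this.timers.delete(tenantId);
  }

  // Called on the single instance that receives the offboard request.
  onTenantOffboarded(tenantId: string): void {
    this.broadcast.publish(OFFBOARD_CHANNEL, tenantId);
  }
}
```

Events that are already executing when the broadcast arrives are not affected by the cancellation, which is consistent with the expectation that some error messages remain.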