Closed by acn-sbuad 4 days ago
Next steps: add logging to the events component and deploy to yt01 to log all event IDs that are sent to the storage endpoint.
The last duplicate in tt02 was created at "2023-06-05 12:32:57.818753+00". Were any changes implemented around this time, @SandGrainOne?
The last duplicates in production were created at "2023-06-12 08:16:24.930741+00", which roughly matches the deployment schedule, I guess.
@acn-sbuad Can you check whether there have been any duplicates since the last check? If there are none, perhaps this can be closed?
8 events have been duplicated in the last 90 days. @annerisbakk FYI
FYI: As of 2024-08-05, there were 71 events with 2 or more entries during the past 90 days. This issue is still relevant...
Continued in #573
Things to check/understand (read docs?):
with duplicates as (
select cloudevent->>'id' as id from events.events
where cloudevent->>'resource' like 'urn:altinn:resource:app%'
group by cloudevent->>'id'
having count(cloudevent->>'id') > 1
)
select *
from events.events e
join duplicates on e.cloudevent->>'id' = duplicates.id
order by registeredtime
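To compare duplicates against the deployment schedule, the query above can be turned into a per-day summary. This is only a sketch: it assumes a PostgreSQL database (as the `->>` operator suggests) and that `registeredtime` is a timestamp column; the aliases `day`, `duplicated_ids` and `duplicate_rows` are made up for illustration.

```sql
-- Sketch: bucket the rows of duplicated event ids by registration day.
-- Assumes PostgreSQL and that registeredtime is a timestamp column.
with duplicates as (
    select cloudevent->>'id' as id
    from events.events
    where cloudevent->>'resource' like 'urn:altinn:resource:app%'
    group by cloudevent->>'id'
    having count(*) > 1
)
select date_trunc('day', e.registeredtime) as day,
       count(distinct d.id) as duplicated_ids,  -- duplicated ids with at least one row that day
       count(*) as duplicate_rows               -- all rows for those ids registered that day
from events.events e
join duplicates d on e.cloudevent->>'id' = d.id
group by 1
order by 1;
```

Days where `duplicated_ids` spikes can then be checked by hand against known deploy dates.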
Can't find any duplicates in prod or tt02. (Data older than 90 days is deleted.) Should we postpone further analysis until the problem is observed again?
Description
Analyze
Additional Information
No response
Tasks
No elements in poison queues, but a number of duplicates in db
Hypothesis
No need for the inbound endpoint in events if the function can push elements directly to the queue.
Conclusion: This did not fix the problem of duplicates in the database; however, it does save us one Key Vault lookup per processed cloud event. Can't quite remember why we implemented it like this in the first place. Is there a reason the function cannot return the cloud event directly to the next queue?
==> Directly using an out binding for the function resulted in some lost events. Will need to find out if we can change the function config.
[return: Queue("events-inbound", Connection = "QueueStorage")]
Hypothesis
Duplicates occur due to exhaustion of connections to Key Vault.
Conclusion: The connection to Key Vault fails far more often than we see duplicates.
Hypothesis 05.08.24
Duplicates are largely created during deploys. Defining a preStop hook in the Helm deployment would let us postpone the shutdown process, potentially allowing the pod to complete all ongoing requests before being shut down.
Functions log:
// PostInbound event with id 661cc13f-9b21-4af2-9639-6de0e845aead failed with status code GatewayTimeout
Acceptance Criteria
No response