Analyze cause of duplicate events in db

acn-sbuad commented 1 year ago

Description

Analyze

Additional Information

No response

Tasks

No elements in poison queues, but a number of duplicates in db

-- Count rows per cloud event id, most duplicated first
SELECT
    cloudevent->>'id' AS id,
    COUNT(cloudevent->>'id') AS noDuplicates
FROM events.events
WHERE sequenceno > 6717116
GROUP BY cloudevent->>'id'
ORDER BY noDuplicates DESC

SELECT 
    cloudevent->>'id' AS id, 
    COUNT(cloudevent->>'id') AS noDuplicates, 
    cloudevent->>'resource' AS resource,
    ARRAY_AGG(registeredtime) AS registered_times
FROM events.events
WHERE sequenceno > 24763608
GROUP BY cloudevent->>'id', cloudevent->>'resource'
HAVING COUNT(cloudevent->>'id') > 1

Hypothesis

No need for the inbound endpoint in events if the function can push elements directly to the queue.
Conclusion: This did not fix the problem of duplicates in the database, but it does save us one Key Vault lookup per processed cloud event. I can't quite remember why we implemented it like this in the first place. Is there a reason the function cannot return the cloud event directly to the next queue?

=> Directly using an out binding for the function resulted in some lost events. We will need to find out whether we can change the function config: [return: Queue("events-inbound", Connection = "QueueStorage")]

Hypothesis

Duplicates occur due to exhaustion of connections to Key Vault.
Conclusion: The connection to Key Vault fails far more often than we see duplicates, so this alone does not explain them.

Hypothesis 05.08.24

Duplicates are largely created during deploys. Defining a preStop hook in the Helm deployment can allow us to postpone the shutdown process, potentially allowing the pod to complete all ongoing requests before being shut down.
Functions log: PostInbound event with id 661cc13f-9b21-4af2-9639-6de0e845aead failed with status code GatewayTimeout.
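
One way to test this hypothesis offline is to check whether the registered times of duplicate rows fall inside known deployment windows. The following is a minimal Python sketch of that check. The deployment window and the helper names (`parse_ts`, `split_by_deploy`) are hypothetical and not taken from the issue; the two timestamps are the ones reported in the comments and serve only as sample input. In practice the input would come from the registered_times column of the duplicates query.

```python
import re
from datetime import datetime, timedelta, timezone


def parse_ts(text):
    """Parse a PostgreSQL-style timestamp such as '2023-06-05 12:32:57.818753+00'."""
    # Expand a short '+00' offset to '+00:00' so older Python versions accept it.
    # Assumes fractional seconds, if present, have 3 or 6 digits.
    if re.search(r":\d{2}(\.\d+)?[+-]\d{2}$", text):
        text += ":00"
    return datetime.fromisoformat(text)


def in_any_window(ts, windows):
    """Return True if ts lies inside at least one (start, end) window."""
    return any(start <= ts <= end for start, end in windows)


def split_by_deploy(duplicate_times, windows):
    """Split timestamp strings into those inside and outside deploy windows."""
    during, outside = [], []
    for text in duplicate_times:
        target = during if in_any_window(parse_ts(text), windows) else outside
        target.append(text)
    return during, outside


# Hypothetical deployment window: NOT taken from the issue.
deploy_start = datetime(2023, 6, 12, 8, 10, tzinfo=timezone.utc)
windows = [(deploy_start, deploy_start + timedelta(minutes=15))]

# Sample input: the two timestamps mentioned in the comments.
duplicate_times = [
    "2023-06-12 08:16:24.930741+00",
    "2023-06-05 12:32:57.818753+00",
]

during, outside = split_by_deploy(duplicate_times, windows)
print(len(during), "during deploy;", len(outside), "outside")
```

If most duplicates land inside real deployment windows, that would support the preStop hook approach; a large share outside them would point to another cause.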

Acceptance Criteria

No response

acn-sbuad commented 1 year ago

Next steps: Add logging to the events component and deploy to yt01 to log all event ids that are sent to the storage endpoint.

acn-sbuad commented 1 year ago

The last duplicate in tt02 was created at "2023-06-05 12:32:57.818753+00". Were any changes implemented around this time, @SandGrainOne?

acn-sbuad commented 1 year ago

The last duplicates in production were created at "2023-06-12 08:16:24.930741+00", which roughly matches the deployment schedule, I guess.

annerisbakk commented 7 months ago

@acn-sbuad Can you check whether there have been any duplicates since the last check? If there are none, perhaps this can be closed?

acn-sbuad commented 7 months ago

8 events have been duplicated in the last 90 days. @annerisbakk FYI

olebhansen commented 1 month ago

FYI: As of 2024-08-05, there were 71 events with 2 or more entries during the past 90 days. This issue is still relevant...

olebhansen commented 1 month ago

Continued in #573

olebhansen commented 1 month ago

Things to check/understand (read docs?):

Whether a pod under (graceful) shutdown is able to respond to the function with "200 OK"
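
The concern behind this check can be sketched as follows. If the pod stores the event but is shut down before its response reaches the function, the function sees a failure (for example GatewayTimeout) and, assuming it retries the same message, a second row is written. The Python simulation below is purely illustrative: `FakeStorage`, `post_inbound`, and the retry loop are hypothetical stand-ins, not the actual altinn-events code, and the retry behaviour is an assumption.

```python
class FakeStorage:
    """In-memory stand-in for the events table (hypothetical)."""

    def __init__(self):
        self.rows = []

    def save(self, event_id):
        self.rows.append(event_id)


def post_inbound(storage, event_id, response_lost):
    """Store the event, then return the status code the caller would observe."""
    storage.save(event_id)  # the write succeeds on the server side
    # If the pod shuts down before replying, the caller only sees a timeout.
    return 504 if response_lost else 200


def process_with_retry(storage, event_id, lost_responses, max_attempts=5):
    """Retry until a 200 is observed.

    lost_responses is the number of initial attempts whose reply is lost.
    Returns the attempt number that succeeded, or None if all attempts failed.
    """
    for attempt in range(1, max_attempts + 1):
        lost = attempt <= lost_responses
        if post_inbound(storage, event_id, response_lost=lost) == 200:
            return attempt
    return None


storage = FakeStorage()
attempts = process_with_retry(storage, "event-1", lost_responses=1)
print("attempts:", attempts, "rows stored:", storage.rows.count("event-1"))
```

In this model a single lost reply is enough to produce a duplicate row, which is why it matters whether a terminating pod can still answer "200 OK" for requests it has already processed.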

HenningNormann commented 3 weeks ago

with duplicates as (
    select cloudevent->>'id' as id from events.events
    where cloudevent->>'resource' like 'urn:altinn:resource:app%'
    group by cloudevent->>'id'
    having count(cloudevent->>'id') > 1
)
select * from events.events e join duplicates on e.cloudevent->>'id' = duplicates.id order by registeredtime

Can't find any duplicates in prod or tt02. (Data older than 90 days is deleted.) Should we postpone further analysis until the problem is observed again?

Altinn / altinn-events