Open tiina303 opened 5 months ago
For item 1,
Fix the person level data deletion system & improve its reliability/monitoring/alerting
The AsyncEventDeletion query breaks when the number of pending deletions gets too large, because it can grow well over any reasonable CH max_query_size. This affects Team, Group, Person and Cohort pending deletions.
Our plan is to write pending deletes to a CH table so that we can do the deletion via a JOIN, which should scale for the long term.
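A minimal sketch of why the JOIN approach keeps the query size bounded. It uses SQLite as a stand-in, since ClickHouse's deletion syntax differs and is not shown here. The table names (`events`, `pending_deletions`), the columns, and the size limit constant are all assumptions for illustration, not the actual schema or server setting.

```python
import sqlite3

# Illustrative limit only; check the max_query_size actually configured on your server.
ASSUMED_MAX_QUERY_SIZE = 262_144


def inline_delete_sql(pending):
    """Inline every pending (team_id, person_id) pair: the SQL text grows with len(pending)."""
    clauses = " OR ".join(
        f"(team_id = {team_id} AND person_id = '{person_id}')"
        for team_id, person_id in pending
    )
    return f"DELETE FROM events WHERE {clauses}"


# Join-style deletion: the SQL text has a fixed size, however many rows are pending.
JOIN_DELETE_SQL = """
DELETE FROM events
WHERE EXISTS (
    SELECT 1 FROM pending_deletions AS p
    WHERE p.team_id = events.team_id AND p.person_id = events.person_id
)
"""


def demo():
    """Delete events matching the pending_deletions table and return what is left."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (team_id INTEGER, person_id TEXT, event TEXT)")
    conn.execute("CREATE TABLE pending_deletions (team_id INTEGER, person_id TEXT)")
    conn.executemany(
        "INSERT INTO events VALUES (?, ?, ?)",
        [(1, "a", "pageview"), (1, "b", "pageview"), (2, "a", "click")],
    )
    conn.executemany("INSERT INTO pending_deletions VALUES (?, ?)", [(1, "a")])
    conn.execute(JOIN_DELETE_SQL)
    rows = conn.execute(
        "SELECT team_id, person_id FROM events ORDER BY team_id, person_id"
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    pending = [(1, f"person-{i}") for i in range(100_000)]
    print("inline query bytes:", len(inline_delete_sql(pending)))
    print("join query bytes:", len(JOIN_DELETE_SQL))
    print("remaining events:", demo())
```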
After that, I can write alerts for pending deletions where delete_verified_at is null and created_at is too old.
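The alert condition could be sketched roughly as below, reading "too large" as "the row was created too long ago". The row shape, the function name `overdue_deletions`, and the seven-day threshold are hypothetical; only the `delete_verified_at` and `created_at` field names come from the comment above.

```python
from datetime import datetime, timedelta, timezone


def overdue_deletions(rows, now, max_age=timedelta(days=7)):
    """Return rows that are still unverified and were created more than max_age ago."""
    return [
        row
        for row in rows
        if row["delete_verified_at"] is None and now - row["created_at"] > max_age
    ]


if __name__ == "__main__":
    now = datetime(2024, 1, 31, tzinfo=timezone.utc)
    rows = [
        # Old and never verified: should alert.
        {"id": 1, "created_at": datetime(2024, 1, 1, tzinfo=timezone.utc), "delete_verified_at": None},
        # Recent and not yet verified: still within the allowed window.
        {"id": 2, "created_at": datetime(2024, 1, 30, tzinfo=timezone.utc), "delete_verified_at": None},
        # Old but verified: nothing to alert on.
        {
            "id": 3,
            "created_at": datetime(2024, 1, 1, tzinfo=timezone.utc),
            "delete_verified_at": datetime(2024, 1, 2, tzinfo=timezone.utc),
        },
    ]
    for row in overdue_deletions(rows, now):
        print("overdue pending deletion:", row["id"])
```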
Also, I'm not sure if it makes sense for this to be part of this ticket or better kept as something separate (but related): event deletions for persons currently delete based on the person_id, which may be out of date if the person hasn't been fully squashed across all partitions, allowing events that should have been deleted to stick around.
Good call out, maybe the new JOIN shenanigans could use all of the Person's distinct_ids?
Yeah, seems like it'd be possible to use the overrides table here as well (at least in theory, I'm not familiar enough offhand with the order-of-operations during deletions here to know if that's totally accurate as-is.)
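To make the concern in this exchange concrete, here is a toy sketch in plain Python comparing the three matching strategies discussed: by person_id, by the person's distinct_ids, and by person_id resolved through an overrides mapping. The event shape, the function names, and the dict-based overrides mapping are assumptions for illustration and do not reflect the real tables or the actual order of operations during deletion.

```python
def delete_by_person_id(events, person_id):
    """Keep events whose stored person_id differs; misses events with a stale person_id."""
    return [e for e in events if e["person_id"] != person_id]


def delete_by_distinct_ids(events, distinct_ids):
    """Keep events whose distinct_id is not one of the person's distinct_ids."""
    ids = set(distinct_ids)
    return [e for e in events if e["distinct_id"] not in ids]


def delete_with_overrides(events, person_id, overrides):
    """Resolve each stored person_id through the overrides mapping before comparing."""
    return [
        e for e in events
        if overrides.get(e["person_id"], e["person_id"]) != person_id
    ]


if __name__ == "__main__":
    events = [
        {"uuid": "e1", "distinct_id": "d1", "person_id": "P"},
        # Not yet squashed: still carries the pre-merge person_id.
        {"uuid": "e2", "distinct_id": "d2", "person_id": "P_old"},
        # Belongs to a different person and must survive.
        {"uuid": "e3", "distinct_id": "d9", "person_id": "Q"},
    ]
    print([e["uuid"] for e in delete_by_person_id(events, "P")])
    print([e["uuid"] for e in delete_by_distinct_ids(events, ["d1", "d2"])])
    print([e["uuid"] for e in delete_with_overrides(events, "P", {"P_old": "P"})])
```

In this toy data, matching on person_id alone leaves the unsquashed event behind, while the other two strategies remove it.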
Is your feature request related to a problem?
PII (Personally Identifiable Information) can be sent to PostHog and end up being stored in our databases. Users need a way to request deletion of that data.
The API and UI allow triggering data deletions on a person level; however, the deletion system is currently experiencing difficulties and might not cover data in all the systems. For example, in addition to events data, the distinct_id itself could also contain PII.
Secondly, users sometimes want to delete PII based on a time range. For example, if an implementation error sent PII for a couple of minutes, we'd ideally delete data only from that time frame and not the full history.
Describe the solution you'd like
Describe alternatives you've considered
PostHog has events deduplication, so users could consider sending updated events with the same uuid without PII. The problem with this approach is that
Additional context
Data modification is a very complex operation, so we don't plan to offer that at this point in time.
Related internal threads: