PostHog / posthog

🦔 PostHog provides open-source product analytics, session recording, feature flagging and A/B testing that you can self-host.
https://posthog.com

Data Deletion from PostHog's databases #20221

Open tiina303 opened 5 months ago

tiina303 commented 5 months ago

Is your feature request related to a problem?

PII (Personally Identifiable Information) can be sent to PostHog and end up stored in our databases. Users need a way to request deletion of that data.

The API and UI allow triggering data deletions at the person level. However, the deletion system is currently not working reliably and might not cover data in all systems. For example, in addition to event data, the distinct_id itself could contain PII.

Second, users sometimes want to delete PII based on a time range. For example, if an implementation error sent PII for a couple of minutes, we'd ideally delete only the data from that time frame and not the full history.

Describe the solution you'd like

  1. Fix the person-level data deletion system and improve its reliability, monitoring, and alerting.
    1. Ensure related data gets deleted from all systems
  2. Provide a solution for time-based data deletion
  3. Business-hours alerting on whether the deletion system is working
  4. Document that unset doesn't remove PII
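
To make item 2 concrete, here is a minimal in-memory sketch of what "time-based deletion" means as a predicate: drop one team's events inside a half-open window and keep everything else. The `Event` fields, the `delete_in_time_range` helper, and the sample data are all hypothetical; a real implementation would run against the event store rather than a Python list.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Event:
    uuid: str
    team_id: int
    distinct_id: str
    timestamp: datetime


def delete_in_time_range(events, team_id, start, end):
    """Return the events that survive deleting team_id's events in [start, end)."""
    return [
        e for e in events
        if not (e.team_id == team_id and start <= e.timestamp < end)
    ]


def ts(minute):
    # Helper for readable sample timestamps (all on the same hour, in UTC).
    return datetime(2024, 2, 1, 12, minute, tzinfo=timezone.utc)


events = [
    Event("a", 1, "user-1", ts(0)),
    Event("b", 1, "user-1", ts(5)),   # sent during the faulty window
    Event("c", 1, "user-2", ts(6)),   # sent during the faulty window
    Event("d", 1, "user-1", ts(30)),
    Event("e", 2, "user-9", ts(5)),   # different team, must be untouched
]

remaining = delete_in_time_range(events, team_id=1, start=ts(4), end=ts(8))
print([e.uuid for e in remaining])  # ['a', 'd', 'e']
```

The window is scoped by team and is half-open so that adjacent windows never overlap; whether it should also be scoped to specific persons is an open design question not settled here.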

Describe alternatives you've considered

PostHog has event deduplication, so users could consider re-sending events with the same uuid but without the PII. The problems with this approach are:

  1. ClickHouse deduplication is eventual, so there isn't a guarantee for when the data would get deleted
  2. Re-submitting events could have unintended side effects, e.g. person properties are set at the time of ingestion, so sending a day-old event again might override person properties that had been updated afterwards
  3. It would be quite cumbersome to re-submit a lot of events, and doing so would increase load on our ingestion system
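
The side effect in point 2 can be illustrated with a toy model in which person properties are simply overwritten by each event's `$set` payload in ingestion order. The `ingest` function and the sample events are hypothetical and much simpler than the real ingestion pipeline.

```python
def ingest(person_properties, event):
    """Toy ingestion step: apply the event's $set payload at ingestion time."""
    updated = dict(person_properties)
    updated.update(event.get("$set", {}))
    return updated


day_old = {"uuid": "e1", "timestamp": "2024-02-01", "$set": {"plan": "free"},
           "properties": {"email": "pii@example.com"}}
recent = {"uuid": "e2", "timestamp": "2024-02-02", "$set": {"plan": "paid"}}

person = ingest({}, day_old)
person = ingest(person, recent)
before_resubmit = dict(person)  # {'plan': 'paid'}

# Re-send e1 with the same uuid and the PII stripped. Its $set payload is
# processed again, now *after* e2, so the newer value is lost.
resubmitted = {"uuid": "e1", "timestamp": "2024-02-01", "$set": {"plan": "free"},
               "properties": {}}
person = ingest(person, resubmitted)

print(before_resubmit, "->", person)  # {'plan': 'paid'} -> {'plan': 'free'}
```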

Additional context

Data modification is a very complex operation, so we don't plan to offer it at this point.

Related internal threads:

  1. https://posthog.slack.com/archives/C0460J93NBU/p1706730424537289?thread_ts=1706270512.980269&cid=C0460J93NBU
  2. https://posthog.slack.com/archives/C0460J93NBU/p1707140274057659
  3. https://posthog.slack.com/archives/C0185UNBSJZ/p1707027331494989?thread_ts=1707027291.434259&cid=C0185UNBSJZ
  4. https://posthog.slack.com/archives/C0460J93NBU/p1707142640395949?thread_ts=1707140274.057659&cid=C0460J93NBU
  5. https://posthog.slack.com/archives/C0374DA782U/p1707168312879699

bretthoerner commented 4 months ago

For item 1,

> Fix the person level data deletion system & improve its reliability/monitoring/alerting

The AsyncEventDeletion query breaks when the number of pending deletions gets too large, because the query text can grow well beyond any reasonable ClickHouse (CH) max_query_size. This affects Team, Group, Person and Cohort pending deletions.

Our plan is to write pending deletes to a CH table so that the deletion can be done via a JOIN, which should scale for the long term.
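
A rough sketch of why this helps: a query that inlines every pending key grows linearly with the backlog, while a query that reads the keys from a table has a fixed size. The SQL strings below are illustrative only and are never sent to ClickHouse; the `pending_deletions` table, its columns, and the `MAX_QUERY_SIZE` value are assumptions, and the `IN (SELECT ...)` form stands in for whatever JOIN the real implementation ends up using.

```python
import uuid

# Assumed limit in bytes, for illustration only; check the server's actual setting.
MAX_QUERY_SIZE = 262_144


def fake_person_ids(n):
    """Generate n UUID-shaped keys to stand in for pending person deletions."""
    return [str(uuid.UUID(int=i)) for i in range(n)]


def inline_delete_query(person_ids):
    # Every pending key is embedded in the query text, so size grows with the backlog.
    predicates = " OR ".join(f"person_id = '{pid}'" for pid in person_ids)
    return f"ALTER TABLE events DELETE WHERE {predicates}"


def table_driven_delete_query():
    # Keys live in a (hypothetical) table, so the query text has a fixed size.
    return (
        "ALTER TABLE events DELETE WHERE person_id IN "
        "(SELECT key FROM pending_deletions WHERE deletion_type = 'person')"
    )


def fits(query):
    return len(query.encode("utf-8")) <= MAX_QUERY_SIZE


for n in (100, 10_000):
    query = inline_delete_query(fake_person_ids(n))
    print(f"{n} inlined keys: {len(query.encode('utf-8'))} bytes, fits={fits(query)}")

table_query = table_driven_delete_query()
print(f"table-driven: {len(table_query.encode('utf-8'))} bytes, fits={fits(table_query)}")
```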

After that, I can write alerts for rows where delete_verified_at is null and created_at is too far in the past.
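
The alert condition could look roughly like the following sketch, written against in-memory rows. Only the `delete_verified_at` and `created_at` field names come from this thread; the `PendingDeletion` class, the `overdue` helper, and the 24-hour threshold are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass(frozen=True)
class PendingDeletion:
    key: str
    created_at: datetime
    delete_verified_at: Optional[datetime]


def overdue(deletions, now, max_age):
    """Return deletions that are still unverified and older than max_age."""
    return [
        d for d in deletions
        if d.delete_verified_at is None and now - d.created_at > max_age
    ]


now = datetime(2024, 2, 10, 9, 0, tzinfo=timezone.utc)
rows = [
    PendingDeletion("person-1", now - timedelta(days=3), None),                       # stuck
    PendingDeletion("person-2", now - timedelta(days=3), now - timedelta(days=2)),    # verified
    PendingDeletion("person-3", now - timedelta(hours=1), None),                      # still fresh
]

stuck = overdue(rows, now=now, max_age=timedelta(hours=24))
print([d.key for d in stuck])  # ['person-1']
```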

tkaemming commented 4 months ago

Also, I'm not sure if it makes sense for this to be part of this ticket or kept as something separate (but related): event deletions for persons currently delete based on person_id, which may be out of date if the person hasn't been fully squashed across all partitions, allowing events that should have been deleted to stick around.

bretthoerner commented 4 months ago

> Also, I'm not sure if it make sense to be part of this ticket or better kept as something separate (but related): event deletions for persons right now delete based on the person_id which is going to potentially be out-of-date if the person hasn't been fully squashed across all partitions, allowing events that should have been deleted to stick around.

Good call out, maybe the new JOIN shenanigans could use all of the Person's distinct_ids?

tkaemming commented 4 months ago

> Good call out, maybe the new JOIN shenanigans could use all of the Person's distinct_ids?

Yeah, seems like it'd be possible to use the overrides table here as well (at least in theory; I'm not familiar enough offhand with the order of operations during deletions to know if that's totally accurate as-is).
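
For illustration, here is a toy model of the stale person_id problem and the two ideas discussed above. The data shapes are assumptions: `person_distinct_ids` maps a person to all of its distinct_ids, and `overrides` maps an old person_id to the person it was merged into. The real tables and the order of operations during deletion may differ.

```python
events = [
    # Written before the merge; person_id was never squashed to "p-new".
    {"uuid": "e1", "distinct_id": "anon-123", "person_id": "p-old"},
    {"uuid": "e2", "distinct_id": "user-42", "person_id": "p-new"},
    {"uuid": "e3", "distinct_id": "someone-else", "person_id": "p-other"},
]
person_distinct_ids = {"p-new": {"anon-123", "user-42"}}  # hypothetical mapping
overrides = {"p-old": "p-new"}                            # hypothetical overrides table


def delete_by_person_id(events, person_id):
    """Current behaviour as described: match on the stored person_id only."""
    return [e for e in events if e["person_id"] != person_id]


def delete_by_distinct_ids(events, person_id, person_distinct_ids):
    """Match on every distinct_id that belongs to the person."""
    ids = person_distinct_ids.get(person_id, set())
    return [e for e in events if e["distinct_id"] not in ids]


def delete_with_overrides(events, person_id, overrides):
    """Resolve each event's person_id through the overrides before matching."""
    return [
        e for e in events
        if overrides.get(e["person_id"], e["person_id"]) != person_id
    ]


def uuids(rows):
    return [e["uuid"] for e in rows]


print(uuids(delete_by_person_id(events, "p-new")))                          # ['e1', 'e3']
print(uuids(delete_by_distinct_ids(events, "p-new", person_distinct_ids)))  # ['e3']
print(uuids(delete_with_overrides(events, "p-new", overrides)))             # ['e3']
```

In this toy data, matching on person_id alone leaves `e1` behind, while both alternatives remove it.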