Re-using distinct_ids after person was deleted is racy

macobo commented 2 years ago

Bug description

Slack thread: https://posthog.slack.com/archives/C0374DA782U/p1662025370798299

Summary:

If we delete a person, we insert a 0-version row into the person_distinct_id2 table in ClickHouse and delete the row in Postgres.
- Bug 1: The person_distinct_id2 table replaces rows by version, so there's less than a 50% chance this deletion sticks, and a 0% chance if the user was identified after sending some events.
If the same distinct_id is seen again in ingestion:
- Bug 2: We cache seen distinct_ids for up to 4 hours, meaning we won't create persons without properties for a while.
- Bug 3: When re-creating the person, we insert the person_distinct_id2 row and the Postgres distinct_id row with version=0. This means the newly inserted distinct_id might never be visible in that table.
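
To illustrate bugs 1 and 3, here is a minimal Python sketch of a table that collapses rows by version. It is a simplified model and not the actual ClickHouse behaviour: the `Row` shape, the `is_deleted` flag and the `collapse` helper are assumptions made for the example, and the tie rule (later row wins) is a simplification that the example does not rely on.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    # Hypothetical, simplified shape of a person_distinct_id2 row.
    distinct_id: str
    person_id: str
    version: int
    is_deleted: bool = False


def collapse(rows: list[Row]) -> dict[str, Row]:
    """Keep the highest-version row per distinct_id.

    Ties go to the later row here; real tie-breaking depends on merge
    order and should not be relied on.
    """
    winners: dict[str, Row] = {}
    for row in rows:
        current = winners.get(row.distinct_id)
        if current is None or row.version >= current.version:
            winners[row.distinct_id] = row
    return winners


rows = [
    Row("user-1", "person-A", version=0),                   # first seen
    Row("user-1", "person-B", version=1),                   # identified: version bumped
    Row("user-1", "person-B", version=0, is_deleted=True),  # deletion marker (bug 1)
]
after_delete = collapse(rows)["user-1"]

rows.append(Row("user-1", "person-C", version=0))           # re-created (bug 3)
after_reuse = collapse(rows)["user-1"]

print(after_delete)
print(after_reuse)
```

In this model both the deletion marker and the re-created mapping lose to the older version=1 row, so the table keeps pointing at the deleted person.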

Environment

[x] PostHog Cloud
[ ] self-hosted PostHog, version/commit: please provide

Additional context

cc @yakkomajuri and @tiina303 for team ingestion/data consistency context.

Example of a person deletion causing issues: https://posthog.slack.com/archives/C02LR7352SG/p1661424741783579

I think there are only two possible fixes:

- Don't allow re-using distinct_ids after deletion (probably the safest long-term, but it would also likely surprise users).
- Add an is_deleted column to the person_distinct_id table instead of deleting rows.

tiina303 commented 1 year ago

Another instance of this: https://posthog.slack.com/archives/C0460J93NBU/p1666785896597509

- Bug 1: we'll just increment the id on deletion.
- Bug 2: we can clear the cache.
- Bug 3: @macobo or @yakkomajuri, any ideas for how we can do this, given that for Postgres we have cascading deletes and in CH we have eventual deletes?
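
A rough sketch of what "clear the cache" for bug 2 could look like, assuming a simple in-process TTL cache. The `SeenDistinctIds` class and its methods are hypothetical names for illustration, not the ingestion code; if the real cache is shared across workers, invalidation would need to reach all of them.

```python
import time
from typing import Callable


class SeenDistinctIds:
    """Hypothetical TTL cache of distinct_ids already seen by ingestion."""

    def __init__(
        self,
        ttl_seconds: float = 4 * 60 * 60,  # "up to 4 hours" from the summary
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._expires_at: dict[str, float] = {}

    def mark_seen(self, distinct_id: str) -> None:
        self._expires_at[distinct_id] = self._clock() + self._ttl

    def is_seen(self, distinct_id: str) -> bool:
        expires = self._expires_at.get(distinct_id)
        if expires is None:
            return False
        if self._clock() >= expires:
            # Entry expired: drop it so the person can be re-created.
            del self._expires_at[distinct_id]
            return False
        return True

    def invalidate(self, distinct_id: str) -> None:
        # Called on person deletion so re-use is not hidden by the cache.
        self._expires_at.pop(distinct_id, None)


cache = SeenDistinctIds()
cache.mark_seen("user-1")
seen_before_delete = cache.is_seen("user-1")

cache.invalidate("user-1")  # person deleted
seen_after_delete = cache.is_seen("user-1")
```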

tiina303 commented 1 year ago

Idea for bug 3: instead of using version we could use <update-timestamp>-<version>, but then we could run into races during normal operations (though that's not common, because we send each distinct ID to the same Kafka partition). But maybe there's a way to have something concatenated with the version. An overkill but working solution is keeping a separate table in Postgres of is_deleted counts per distinct_id, and collapsing by <is_deleted_cnt>-<version>.

So for now I propose that we add a warning to the "delete person" button saying that if you plan to re-use any of the IDs then split IDs instead (which works as we have a new person and a version incremented), otherwise you'll have a bad time.

tiina303 commented 1 year ago

Proposal for bug 3: we add a postgres_id column to the CH person_distinct_id2 table and collapse by postgres_id first and then version, instead of just version. In Postgres, when we update the distinct ID to person ID mapping, we always increment the version and the ID stays the same, so that works. If we delete a person and its distinct IDs, then upon re-use we create a new entry with a new, bigger Postgres ID, and we can start the version from 0 again.
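
A small Python sketch of how collapsing by (postgres_id, version) could behave. This is a simplified model, not ClickHouse itself: the `Row` shape and `collapse` helper are made up for the example, and it assumes the deletion marker is written with a bumped version (the bug 1 suggestion above).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    # Hypothetical, simplified row with the proposed postgres_id column.
    distinct_id: str
    person_id: str
    postgres_id: int
    version: int
    is_deleted: bool = False


def collapse(rows: list[Row]) -> dict[str, Row]:
    """Keep the row with the largest (postgres_id, version) per distinct_id."""
    winners: dict[str, Row] = {}
    for row in rows:
        current = winners.get(row.distinct_id)
        if current is None or (row.postgres_id, row.version) >= (
            current.postgres_id,
            current.version,
        ):
            winners[row.distinct_id] = row
    return winners


rows = [
    Row("user-1", "person-A", postgres_id=1, version=0),
    Row("user-1", "person-B", postgres_id=1, version=1),                   # remapped
    Row("user-1", "person-B", postgres_id=1, version=2, is_deleted=True),  # deleted
    Row("user-1", "person-C", postgres_id=2, version=0),                   # re-used
]
winner = collapse(rows)["user-1"]
print(winner)
```

Because tuples compare element by element, the re-created row with the bigger Postgres ID wins even though its version restarted at 0.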

tiina303 commented 1 year ago

Bug 3: we discussed this during the pipeline sync on Friday. @macobo proposed soft deletes in Postgres (a deletion date, or null if not deleted) as an alternative.
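
For comparison, a rough sketch of the soft-delete route, using an in-memory SQLite database as a stand-in for Postgres. The schema, the `deleted_at` column name and the `upsert`/`soft_delete` helpers are assumptions for illustration only; the point is that the row survives deletion, so the version keeps increasing across delete and re-use.

```python
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE person_distinct_id (
        distinct_id TEXT PRIMARY KEY,
        person_id   TEXT NOT NULL,
        version     INTEGER NOT NULL DEFAULT 0,
        deleted_at  TEXT  -- NULL means not deleted
    )
    """
)


def soft_delete(distinct_id: str) -> None:
    # Mark the row deleted instead of removing it, and bump the version.
    conn.execute(
        "UPDATE person_distinct_id SET deleted_at = ?, version = version + 1 "
        "WHERE distinct_id = ? AND deleted_at IS NULL",
        (datetime.now(timezone.utc).isoformat(), distinct_id),
    )


def upsert(distinct_id: str, person_id: str) -> int:
    """Create or re-point the mapping; returns the resulting version."""
    row = conn.execute(
        "SELECT version FROM person_distinct_id WHERE distinct_id = ?",
        (distinct_id,),
    ).fetchone()
    if row is None:
        conn.execute(
            "INSERT INTO person_distinct_id (distinct_id, person_id) VALUES (?, ?)",
            (distinct_id, person_id),
        )
        return 0
    new_version = row[0] + 1
    conn.execute(
        "UPDATE person_distinct_id "
        "SET person_id = ?, version = ?, deleted_at = NULL WHERE distinct_id = ?",
        (person_id, new_version, distinct_id),
    )
    return new_version


v_created = upsert("user-1", "person-A")
soft_delete("user-1")
v_reused = upsert("user-1", "person-B")
final = conn.execute(
    "SELECT person_id, version, deleted_at FROM person_distinct_id "
    "WHERE distinct_id = ?",
    ("user-1",),
).fetchone()
```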

The benefit of the CH route is that existing bad data would not influence anything; with the soft-deletes route we might later get bad entries from re-use of persons deleted before the fix.

Async deletes for person events: we delete by the person_id column, and if the distinct_id is re-used we'll create a new person, so the deletes will be fine and delete exactly the right events. Importantly, there are also no infinite deletion runs.
