PostHog / posthog

🦔 PostHog provides open-source product analytics, session recording, feature flagging and A/B testing that you can self-host.
https://posthog.com

Add deduplication on query level #17136

Open MarconLP opened 10 months ago

MarconLP commented 10 months ago

We've received several reports of duplicated events. We will tackle this problem in three places: duplicate removal at ingestion, duplicate removal in ClickHouse (CH), and duplicate removal at query time.

Add duplicate removal at query time:

Related: https://posthog.slack.com/archives/C0374DA782U/p1692700659345339?thread_ts=1692696455.147909&cid=C0374DA782U
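
For illustration only, here is a minimal sketch of what query-time duplicate removal could look like over rows that have already been fetched. It is not PostHog's implementation. The function name `dedupe_events` and the choice of `(event, distinct_id, timestamp)` as the duplicate key are assumptions made for this example; the issue does not specify which fields define a duplicate.

```python
from typing import Any, Dict, Iterable, Iterator, Sequence, Set, Tuple

# Hypothetical duplicate key: the issue does not say which fields identify
# a duplicate, so these field names are an assumption for this sketch.
DEFAULT_KEY_FIELDS = ("event", "distinct_id", "timestamp")


def dedupe_events(
    events: Iterable[Dict[str, Any]],
    key_fields: Sequence[str] = DEFAULT_KEY_FIELDS,
) -> Iterator[Dict[str, Any]]:
    """Yield each event once, keeping the first row seen for every key."""
    seen: Set[Tuple[Any, ...]] = set()
    for event in events:
        # Build the key from the chosen fields; missing fields become None.
        key = tuple(event.get(field) for field in key_fields)
        if key in seen:
            continue  # same key already emitted, so skip this row
        seen.add(key)
        yield event


if __name__ == "__main__":
    rows = [
        {"event": "pageview", "distinct_id": "u1", "timestamp": "2023-08-22T10:00:00Z"},
        {"event": "pageview", "distinct_id": "u1", "timestamp": "2023-08-22T10:00:00Z"},
        {"event": "signup", "distinct_id": "u2", "timestamp": "2023-08-22T10:05:00Z"},
    ]
    for row in dedupe_events(rows):
        print(row["event"], row["distinct_id"])
```

The `seen` set holds one entry per distinct key, so memory use grows with the number of unique events in the result rather than with the number of duplicates.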

MarconLP commented 10 months ago

How many duplicates can we process before performance takes a big hit?

jetaggart commented 10 months ago

Any updates on this? We have some fairly serious data issues due to duplication, which make many of our reports unusable. We can write custom HogQL to dedupe at the report level, but obviously that has downsides. Our team that creates self-service reports is unable to trust the numbers right now.

MarconLP commented 10 months ago

Hey @jetaggart, could you open a ticket through the PostHog app? https://app.posthog.com/home#supportModal=support%3Adata_integrity

jetaggart commented 10 months ago

I have an open ticket (5270) that is being worked on. However, I'm wondering what the underlying issue is and what the resolution will be.