PostHog / posthog

🦔 PostHog provides open-source web & product analytics, session recording, feature flagging and A/B testing that you can self-host. Get started - free.
https://posthog.com

Partition S3 batch export files by event name #26316

Open tomasfarias opened 2 days ago

tomasfarias commented 2 days ago

For users of S3 batch exports, it can be easier to process a separate file for each event name, but currently only partitioning by timestamp and table fields is supported.

The main challenge with this feature is that we do not know which event names to partition by until we query ClickHouse. We may therefore need to delay creating an S3 upload until we start seeing events, and then maintain one S3 upload per event name. The open question is how to recover from a worker crash now that there may be a very large number of simultaneous S3 uploads. Temporal heartbeating may not be enough to support this feature, and we may need to look into new ways of tracking progress.
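
To make the idea concrete, here is a rough sketch of lazily creating one upload per event name as rows arrive. It is not the batch exports code: `PartUpload` is an in-memory stand-in for an S3 multipart upload, and `EventPartitionedWriter`, `flush_bytes`, `checkpoint` and the key layout are all hypothetical names chosen for illustration. The `checkpoint()` snapshot only shows the kind of per-event state that recovery would need to persist somewhere.

```python
from dataclasses import dataclass, field


@dataclass
class PartUpload:
    """Hypothetical in-memory stand-in for one S3 multipart upload."""

    key: str
    parts: list[bytes] = field(default_factory=list)
    completed: bool = False

    def upload_part(self, data: bytes) -> int:
        self.parts.append(data)
        return len(self.parts)  # 1-based number of the part just stored


class EventPartitionedWriter:
    """Keeps one upload per event name, created when the name is first seen."""

    def __init__(self, prefix: str, flush_bytes: int = 16):
        # flush_bytes is tiny so the example flushes; real S3 multipart
        # parts (except the last) must be at least 5 MiB.
        self.prefix = prefix
        self.flush_bytes = flush_bytes
        self.uploads: dict[str, PartUpload] = {}
        self.buffers: dict[str, bytearray] = {}

    def write(self, event_name: str, row: bytes) -> None:
        if event_name not in self.uploads:
            # Event names are unknown before querying, so create lazily.
            # Assumes event_name is already safe to use in an object key.
            key = f"{self.prefix}/{event_name}.jsonl"
            self.uploads[event_name] = PartUpload(key=key)
            self.buffers[event_name] = bytearray()
        buffer = self.buffers[event_name]
        buffer += row + b"\n"
        if len(buffer) >= self.flush_bytes:
            self._flush(event_name)

    def _flush(self, event_name: str) -> None:
        buffer = self.buffers[event_name]
        if buffer:
            self.uploads[event_name].upload_part(bytes(buffer))
            buffer.clear()

    def checkpoint(self) -> dict[str, int]:
        """Parts uploaded so far per event name (state a retry would need)."""
        return {name: len(u.parts) for name, u in self.uploads.items()}

    def close(self) -> None:
        for name, upload in self.uploads.items():
            self._flush(name)
            upload.completed = True


writer = EventPartitionedWriter(prefix="exports/2024-01-01")
rows = [("pageview", b'{"id": 1}'), ("click", b'{"id": 2}'), ("pageview", b'{"id": 3}')]
for event_name, row in rows:
    writer.write(event_name, row)
writer.close()
```

Note that the checkpoint grows with the number of distinct event names, which is exactly the state that would have to survive a worker crash.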