macobo opened 1 year ago
The "live events" table wants to transform itself into a "data exploration events" table (with potential live polling). I expect we'll be making three different types of queries on this table, in the interface:
1. Live events (with potential live polling).
2. Sorting over any time range (e.g. events last 7 days, sort by property `foo` or `bar`)... with pagination support. No "load new", but we can have a "refresh" button.
3. Breakdowns (e.g. `utm_source` properties this month and how many users each one had).

Each of them has slightly different requirements for the backend. However, I'd like to highlight the "sort by any key over any time period" case. We should support this, as it's what would give the data exploration table the power to explore anything. The problem with this use case, other than queries timing out, is pagination.
The solution I think will fix this for good is some combination of the "prepare and export API" and the "cursor-based API". Basically, each initial API call returns 100 rows and a "next url" (or next data node?), and caches the next 9900 rows into a temporary table, which you can query with the cursor.
🤔
Also worth highlighting, this is what one of our competitors does.
@mariusandra That captures the requirements for the live events/data exploration table well, but as noted in the issue, this workflow isn't a problem since it doesn't (currently) impose a lot of load.
People using this API to export their data is.
We should treat data exploration and exporting data as two separate problems.
Context: this API is used for 4 things:
There might be a common solution to 2 and 3.
Summary
For every gigabyte exported via this API we read 10-40TB from disk, a read amplification of 10,000-40,000 times.
The load generated from this one API is excessive and slows down other workloads (e.g. client queries) running on the cluster.
Background
The API is used for 3 distinct purposes:
Use-case (1) is fine: it's under our control, and only ~2% of the load from this API comes from it. It also makes sense - we only need to show hundreds to thousands of rows for users of the table to gain confidence.
Use-cases (2) and (3) duplicate the behavior of apps in a really inefficient manner. The API also doesn't really work for that purpose, given that queries against it are slow and we limit the amount of data allowed to be exported.
For context, I've argued before for making the API publicly inaccessible on Cloud, but it was vetoed without an alternative solution.
Note also that (luckily) the API isn't really widely used today but together with historical exports accounts for 87% of all the load on the ClickHouse cluster today (source).
Why does this API generate so much overhead
The ClickHouse queries this API generates are simple, for example:
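The example query has been lost from this copy, but based on the discussion below, the generated queries look roughly like the following sketch (the team ID, time range, and column list are hypothetical):

```sql
-- Hypothetical sketch of a query this API generates: fetch the most
-- recent page of events for one team within a requested time range.
SELECT uuid, event, properties, timestamp, distinct_id, elements_chain
FROM events
WHERE team_id = 42
  AND timestamp >= '2022-01-01 00:00:00'
  AND timestamp <= '2022-12-31 23:59:59'
ORDER BY timestamp DESC
LIMIT 101
```

The `LIMIT 101` (one more than the page size) is a common way to detect whether another page exists.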
There are multiple reasons why this is slow. Please read the ClickHouse manual first as it goes much deeper into this context.
Query is unsupported by the table ORDER BY
The `sharded_events` table ORDER BY looks like the following: `(team_id, toDate(timestamp), event, cityHash64(distinct_id), cityHash64(uuid))`. An ORDER BY key that would support the above query would look like `ORDER BY (team_id, timestamp)`, as it would allow skipping reading most blocks.
As-is, to figure out the last 101 events in a time range you need to read all the data in the range, sort it, and return the last rows. This gets really expensive as soon as you're reading hundreds of millions of rows to return the last 100.
An implication of this is that making the timestamp filter more precise than date-level granularity will not speed up queries, since the table is only ordered by `toDate(timestamp)`.
Wide time ranges
When the API is used by the events table, we forcibly limit the time range to a very small time segment. Clients using the API can't be expected to know to do the same and frequently request data for the whole year to get the last 100 events.
Note a proposal to ban wide time ranges was also vetoed: https://github.com/PostHog/posthog/pull/13430
LIMIT/OFFSET pagination
We use LIMIT (and OFFSET in historical exports) to control the data being exported.
The problem with ClickHouse is that there's no magic to make LIMIT/OFFSET faster than reading all the relevant data and then sorting it in memory.
LIMIT 100
We return up to 100 events at a time. This means users need to make more queries to export their data, which in turn results in more load for us.
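To make the OFFSET cost concrete, a deep page in an export looks something like this sketch (team ID and offset are illustrative); ClickHouse cannot jump to an offset, so the work is the same as reading everything up to and including the page:

```sql
-- Page 100 of an export: ClickHouse still reads every matching row,
-- sorts the whole result by timestamp, and only then discards the
-- first 9900 rows -- deep pages cost as much as shallow ones, plus
-- all the rows thrown away.
SELECT uuid, event, properties, timestamp
FROM events
WHERE team_id = 42
  AND timestamp >= '2022-01-01'
ORDER BY timestamp DESC
LIMIT 100 OFFSET 9900
```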
ORDER BY timestamp DESC
Returning data `ORDER BY timestamp DESC` means this API can't be pre-computed, cached, or made to easily match the table ORDER BY key.
It reads all the columns
Reading data from ClickHouse is expensive. Given that most of the storage we have is in the `properties` and `elements_chain` columns, reading these on every query is expensive.
How can we improve?
I think this API is fundamentally flawed and instead we should strongly direct users towards apps for data exports.
However, if we're not capable of pulling the plug, here are some ideas for improving this API for data exports. Note that all of these are still breaking changes for the API, and we'll likely end up with two incompatible APIs for showing events.
(Cloud) Separate ClickHouse cluster
As mentioned before, this API together with historical exports generates 87% of the load the PostHog ClickHouse cluster is seeing today, slowing down everything else.
Separating ClickHouse nodes serving traffic for ingestion and this kind of traffic would go a long way towards improving general query performance on Cloud.
Projections
We could store a projection of the events table that is cheap to query against. This is pretty much a drop-in solution that will work with the existing API and speed it up.
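For illustration, a projection along these lines (the projection name and column set are assumptions, not a tested migration) would store a second copy of the data ordered by `timestamp`, which the existing queries could then use:

```sql
-- Illustrative only: add a second physical sort order to the table.
-- This roughly doubles storage for the projected columns and adds
-- overhead to every insert and merge.
ALTER TABLE sharded_events
    ADD PROJECTION events_by_timestamp (
        SELECT * ORDER BY team_id, timestamp
    );

-- Build the projection for already-written parts:
ALTER TABLE sharded_events MATERIALIZE PROJECTION events_by_timestamp;
```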
However:
As such I think this is a no-go.
Prepare-and-export API
In https://github.com/PostHog/posthog/issues/13346, I suggested moving the data to a (temporary) table before starting the export.
We could do the same for this API - before you can export anything, you should 'prepare' a time range to be exported. This temporary table then can be set up in a way to make exporting much cheaper.
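A minimal sketch of what the 'prepare' step could create, assuming hypothetical table names and an export-friendly sort key:

```sql
-- Hypothetical 'prepare' step: copy the requested range once into a
-- temporary export table whose ORDER BY matches how pages will be
-- read out, so every subsequent page fetch is a cheap range scan.
CREATE TABLE export_team42_2022_01
ENGINE = MergeTree()
ORDER BY (timestamp, uuid)
AS SELECT uuid, event, properties, timestamp, distinct_id
FROM events
WHERE team_id = 42
  AND timestamp >= '2022-01-01'
  AND timestamp < '2022-02-01'
```

Paging then becomes a keyset scan over `(timestamp, uuid)`, and the table can be dropped once the export finishes.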
Note this would break compatibility for 'live events' export, which is always going to be inefficient and a worse experience compared to doing it via apps.
Cursor-based API
We could make the API function better by making it cursor-based.
Some key ideas:
- Instead of `ORDER BY timestamp DESC`, we could do `ORDER BY toDate(timestamp), event, cityHash64(distinct_id), cityHash64(uuid)`, which would match the table ORDER BY.
- Return a `next_url` to query instead of asking users to construct subsequent queries themselves.
- Encode the date and event name being scanned in the cursor, to be used as lower and upper bound filters in the query. This would significantly reduce the overhead by limiting the amount of parts and granules read by each query.

This approach needs proper benchmarking though, and likely still wouldn't be suitable for 'live events' type exports.
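Putting these ideas together, a cursor encoding the last row seen could translate into a keyset query like this sketch (the cursor values are illustrative):

```sql
-- Keyset pagination sketch: the previous page's last row becomes a
-- lower bound aligned with the table ORDER BY, so ClickHouse can skip
-- the parts and granules that precede it instead of re-reading them.
SELECT uuid, event, properties, timestamp
FROM events
WHERE team_id = 42
  AND toDate(timestamp) >= toDate('2022-06-01')  -- cursor: date being scanned
  AND (toDate(timestamp), event, cityHash64(distinct_id), cityHash64(uuid))
      > (toDate('2022-06-01'), '$pageview', 1234567890, 9876543210)
ORDER BY toDate(timestamp), event, cityHash64(distinct_id), cityHash64(uuid)
LIMIT 100
```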
Additional context
Related: https://github.com/PostHog/posthog/issues/13346
cc @timgl @xvello @lharries @hazzadous @jamesefhawkins @pauldambra @mariusandra @fuziontech
I'd love for us to take the sane route of removing support for data exports via this API here and rebuilding it under #team-pipeline when they have resources, but I understand this is a hard call to make.
If we do decide to go all-in on developing a proper API for exports, we should also do product research into this.