
/api/event is inefficient and expensive #13485

Open macobo opened 1 year ago

macobo commented 1 year ago

Summary

For every gigabyte exported via this API we read 10-40TB from disk, a read amplification of 10,000-40,000x.

The load generated from this one API is excessive and slows down other workloads (e.g. client queries) running on the cluster.

Background

The API is used for 3 distinct purposes:

  1. To power live events table
  2. (By clients) to export historical data
  3. (By clients) to export newly incoming data

Use-case (1) is fine as it's under our control and only ~2% of the load from this API comes from it. It also makes sense: we only need to show hundreds to thousands of rows for users of the table to gain confidence.

Use-cases (2) and (3) duplicate the behavior of apps in a really inefficient manner. The API also doesn't really work for that purpose, given that queries against it are slow and we limit the amount of data that can be exported.

For context, I've argued before for making the API publicly inaccessible on Cloud, but that was vetoed without an alternative solution.

Note also that (luckily) the API isn't widely used today, yet together with historical exports it accounts for 87% of all the load on the ClickHouse cluster (source).

Why does this API generate so much overhead?

The ClickHouse queries this API generates are simple, for example:

SELECT uuid, event, properties, timestamp, team_id, distinct_id, elements_chain, created_at 
FROM events 
WHERE team_id = 9632 AND timestamp > '2022-01-01 00:00:00.000000' AND timestamp < '2022-06-29 17:55:13.971000' 
ORDER BY timestamp DESC 
LIMIT 101

There are multiple reasons why this is slow. Please read the ClickHouse manual first as it goes much deeper into this context.

Query is unsupported by table ORDER BY

The sharded_events table's ORDER BY looks like the following:

PARTITION BY toYYYYMM(timestamp)
ORDER BY (team_id, toDate(timestamp), event, cityHash64(distinct_id), cityHash64(uuid))

An ORDER BY key that would support the above query would look like ORDER BY (team_id, timestamp) as it would allow skipping reading most blocks.

As-is, to figure out the last 101 events in a time range you need to read the data, sort it, and return the last rows. This gets really expensive once you're reading hundreds of millions of rows to return the last 100.

An implication of this is that narrowing the timestamp filter below date-level granularity will not speed up queries.
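
For illustration only (not a proposal to change the main table, whose current key serves other query patterns), a sorting key that would serve this access pattern directly might look like:

PARTITION BY toYYYYMM(timestamp)
ORDER BY (team_id, timestamp, cityHash64(uuid))

With a key like that, ClickHouse could read parts in reverse key order for WHERE team_id = ? ORDER BY timestamp DESC LIMIT 101 and stop after a handful of granules, instead of reading and sorting the whole time range.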

Wide time ranges

When the API is used by the events table, we forcibly limit the time range to a very small segment. Clients using the API can't be expected to know to do the same and frequently request a whole year of data to get the last 100 events.

Note a proposal to ban wide time ranges was also vetoed: https://github.com/PostHog/posthog/pull/13430
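
For comparison, the live events table effectively issues something like the hypothetical query below, touching roughly a day's worth of one team's data rather than half a year (per the previous section, the win comes from covering fewer partitions and days, not from sub-day precision):

SELECT uuid, event, properties, timestamp, team_id, distinct_id, elements_chain, created_at
FROM events
WHERE team_id = 9632 AND timestamp > '2022-06-28 17:55:13.971000' AND timestamp < '2022-06-29 17:55:13.971000'
ORDER BY timestamp DESC
LIMIT 101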

LIMIT/OFFSET pagination

We use LIMIT (and OFFSET in historical exports) to control the data being exported.

The problem with ClickHouse is that there's no magic that makes LIMIT/OFFSET faster than reading all the relevant data and then sorting it in memory.
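
For example, fetching page 100 of an export might translate into a query like the hypothetical one below; ClickHouse still reads and sorts every matching row in the range before discarding the first 9,900:

SELECT uuid, event, properties, timestamp, team_id, distinct_id, elements_chain, created_at
FROM events
WHERE team_id = 9632 AND timestamp > '2022-01-01 00:00:00.000000' AND timestamp < '2022-06-29 17:55:13.971000'
ORDER BY timestamp DESC
LIMIT 100 OFFSET 9900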

LIMIT 100

We return up to 100 events at a time. This means users need to issue more queries to export their data, which in turn results in more load for us (for example, exporting 10 million events takes at least 100,000 requests).

ORDER BY timestamp DESC

Returning data ORDER BY timestamp DESC means this API can't be pre-computed, cached, or made to easily match the table's ORDER BY key.

It reads all the columns

Reading data from ClickHouse is expensive. Given that most of our storage is taken up by the properties and elements_chain columns, reading these on every query is expensive.

How can we improve?

I think this API is fundamentally flawed and instead we should strongly direct users towards apps for data exports.

However, if we're not able to pull the plug, here are some ideas for improving this API for data exports. Note that these are all still breaking changes for the API, and we'll likely end up with two incompatible APIs for showing events.

(Cloud) Separate ClickHouse cluster

As mentioned before, this API together with historical exports generates 87% of the load the PostHog ClickHouse cluster is seeing today, slowing down everything else.

Separating ClickHouse nodes serving traffic for ingestion and this kind of traffic would go a long way towards improving general query performance on Cloud.

Projections

We could store a projection of the events table that is cheap to query against. This is pretty much a drop-in solution that will work with the existing API and speed it up.
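
A minimal sketch of what this could look like (the projection name and column list are illustrative):

ALTER TABLE sharded_events
    ADD PROJECTION events_by_timestamp (
        SELECT uuid, event, timestamp, distinct_id, created_at
        ORDER BY team_id, timestamp
    );
ALTER TABLE sharded_events MATERIALIZE PROJECTION events_by_timestamp;

The column list above deliberately leaves out properties and elements_chain: projecting those too would roughly double storage for our heaviest columns, while leaving them out means queries that select those columns cannot use the projection at all.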

However:

As such I think this is a no-go.

Prepare-and-export API

In https://github.com/PostHog/posthog/issues/13346, I suggested moving the data to a (temporary) table before starting the export.

We could do the same for this API: before you can export anything, you 'prepare' a time range to be exported. This temporary table can then be set up in a way that makes exporting much cheaper.
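
A rough sketch of the 'prepare' step, with a hypothetical per-export table name and a sorting key chosen for paging:

CREATE TABLE export_9632_20220629
ENGINE = MergeTree()
ORDER BY (timestamp, uuid)
AS SELECT uuid, event, properties, timestamp, team_id, distinct_id, elements_chain, created_at
FROM events
WHERE team_id = 9632 AND timestamp > '2022-01-01 00:00:00.000000' AND timestamp < '2022-06-29 17:55:13.971000'

The expensive read against the main events table happens once; subsequent requests page through the small, single-team, timestamp-ordered table, and the table can be dropped when the export completes.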

Note this would break compatibility for 'live events' export, which is always going to be inefficient and a worse experience compared to doing it via apps.

Cursor-based API

We could make the API function better by making it cursor-based.

Some key ideas:

This approach needs proper benchmarking though and likely still wouldn't be suitable for 'live events' type exports.
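
One possible shape, assuming the cursor encodes the timestamp and uuid of the last row returned (the literal values below are placeholders):

SELECT uuid, event, properties, timestamp, team_id, distinct_id, elements_chain, created_at
FROM events
WHERE team_id = 9632
  AND (timestamp < '2022-06-28 09:12:45'
       OR (timestamp = '2022-06-28 09:12:45' AND uuid < toUUID('9b2f4f7a-0000-0000-0000-000000000000')))
ORDER BY timestamp DESC, uuid DESC
LIMIT 100

Compared to OFFSET, the work no longer grows with how far into the export you are, and the shrinking timestamp bound can prune whole days; but as noted above, the table's sorting key still limits how much reading is skipped within a day, hence the benchmarking caveat.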

Additional context

Related: https://github.com/PostHog/posthog/issues/13346

cc @timgl @xvello @lharries @hazzadous @jamesefhawkins @pauldambra @mariusandra @fuziontech

I'd love for us to take the sane route of removing support for data exports via this API here and rebuilding it under #team-pipeline when they have resources, but I understand this is a hard call to make.

If we do decide to go all in on developing a proper API for exports, we should also do product research into this.

mariusandra commented 1 year ago

The "live events" table wants to transform itself into a "data exploration events" table (with potential live polling). I expect we'll be making three different types of queries on this table, in the interface:

Each of them has slightly different requirements for the backend. However, I'd like to highlight the "sort by any key over any time period" case. We should support this, as it's what gives the data exploration table the power to explore anything. The problem with this use case, other than queries timing out, is pagination.

The solution I think will solve this in a way that it won't come back is some combination of the "prepare and export API" and the "cursor based API". Basically, each initial API call returns 100 rows and a "next url" (or next data node?), and caches the next 9900 rows into a temporary table, which you can query with the cursor.
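
A very rough sketch of that flow (the table name and batch size are hypothetical):

-- first request: run the expensive query once and cache a batch
CREATE TABLE events_cursor_cache_abc123
ENGINE = MergeTree()
ORDER BY (timestamp, uuid)
AS SELECT uuid, event, properties, timestamp, team_id, distinct_id, elements_chain, created_at
FROM events
WHERE team_id = 9632 AND timestamp < '2022-06-29 17:55:13.971000'
ORDER BY timestamp DESC
LIMIT 10000

-- follow-up requests ('next url'): cheap pagination against the small cache
SELECT *
FROM events_cursor_cache_abc123
ORDER BY timestamp DESC
LIMIT 100 OFFSET 100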

🤔

mariusandra commented 1 year ago

Also worth highlighting, this is what one of our competitors does.

macobo commented 1 year ago

@mariusandra That highlights the requirements for the live events/data exploration table well, but as noted in the issue, that workflow isn't the problem as it doesn't (currently) impose a lot of load.

People using this API to export their data are the problem.

We should treat data exploration and exporting data as two separate problems.

mariusandra commented 1 year ago

Context: this API is used for 4 things:

  1. Live events
  2. Data exploration events
  3. Historic exports
  4. Live exports (periodic polling)

There might be a common solution to 2 and 3.