Closed yakkomajuri closed 3 years ago
Here's what a built query might look like:
This depends on https://github.com/PostHog/posthog/pull/6428
Here are some very quick benchmarks for n properties updated. An important piece of information here is that the average person on Cloud has 13 properties set on them.
Results
_______| Old | New | Diff (magnitude) | Diff (ms) |
10 | 0.5434 | 2.4877 | 4.58x | 1.9443 |
20 | 0.7232 | 4.1982 | 5.80x | 3.4750 |
50 | 0.8411 | 13.7144 | 16.30x | 12.8733 |
100 | 0.8822 | 29.2367 | 33.14x | 28.3545 |
Still, it's quite a bit of a hit. I do think I can optimize the query quite a bit if I spend some time on it. Might be worth it. Probably can pass on an array to the function and avoid selecting the person row each time.
The tradeoff is that that function would get even more convoluted with Postgres loops and stuff, but could be worth a shot. Will look into it.
I wonder if making the function do the update rather answer the question would be better perf (& actually makes the code simpler)? i.e. sth like (likely incorrect SQL)
CREATE OR REPLACE FUNCTION update_person_prop(
in person_id int,
in prop_key text,
in _timestamp text,
in operation text,
in value ?,
out updated bool // not sure if possible and easy but would be nice if we returned if we did the update or not
)
AS $$
UPDATE person_props SET properties->prop_key = value
WHERE NOT property_exists OR
(operation='set' AND _timestamp > stored_timestamp) OR
(last_operation='set_once' AND _timestamp < stored_timestamp)
FROM (
SELECT
properties->prop_key IS NOT NULL as property_exists,
COALESCE(properties_last_updated_at ->> prop_key, '0') as stored_timestamp,
COALESCE(properties_last_operation ->> prop_key, 'set') as last_operation
FROM posthog_person
WHERE id=person_id
) as person_props
$$
LANGUAGE SQL;
I wonder if making the function do the update rather answer the question would be better perf (& actually makes the code simpler)?
I'm leaning towards this as well. Will spend some time today playing with Postgres
Pushed a big WIP change that moves to the new approach. Not posting the benchmark code yet because it's a mess and would be distracting. However, early results look super promising + this is much cleaner. Optimistic about this one.
Initial tests show that query time diff was tiny for 20 properties and only grew to be up to 7ms in total when doing a complicated update of 100 props.
could you land the increment change and rebase on top of it to make reviewing easier 😅
Benchmark:
@tiina303 is taking this work over the line.
And splitting it into multiple PRs.
Changes
⚠️ This is a WIP and still needs to be covered from head-to-toe with tests
Just putting it out so @tiina303 can take a look before we do a sync.
Is it a bit complex? Yes. But it does all property updates at once and doesn't rely on memory at all, doing everything in Postgres.
Checklist