PostHog / plugin-server

Service to process and save PostHog events, supporting JS/TS plugins at that
https://posthog.com
8 stars 5 forks source link

WIP safe person prop updates #599

Closed yakkomajuri closed 3 years ago

yakkomajuri commented 3 years ago

Changes

⚠️ This is a WIP and still needs to be covered from head-to-toe with tests

Just putting it out so @tiina303 can take a look before we do a sync.

Is it a bit complex? Yes. But it does all property updates at once and doesn't rely on memory at all, doing everything in Postgres.

Checklist

yakkomajuri commented 3 years ago

Here's what a built query might look like:

View outdated query
```sql UPDATE posthog_person SET properties = final.properties, properties_last_updated_at = final.properties_last_updated_at, properties_last_operation = final.properties_last_operation FROM ( SELECT ( properties || CASE WHEN "$os" = ANY(should_update) THEN jsonb_build_object("$os", "Mac OS X"::text) ELSE '{}' END || CASE WHEN "$initial_browser" = ANY(should_update) THEN jsonb_build_object("$initial_browser", "Chrome"::text) ELSE '{}' END || CASE WHEN ( COALESCE(properties->>"increment_prop", '0') ~ E '^([-+])?[0-9.]+$' ) THEN jsonb_build_object( "increment_prop", ( COALESCE(properties->>"increment_prop", '0')::numeric + 5::numeric ) ) ELSE '{}' END ) as properties, ( COALESCE(properties_last_updated_at, '{}') || CASE WHEN "$os" = ANY(should_update) THEN jsonb_build_object("$os", "2021-10-14T09:30:05.803Z"::text) ELSE '{}' END || CASE WHEN "$initial_browser" = ANY(should_update) THEN jsonb_build_object( "$initial_browser", "2021-10-14T09:30:05.803Z"::text ) ELSE '{}' END || CASE WHEN "increment_prop" = ANY(should_update) THEN jsonb_build_object("increment_prop", "2021-10-14T09:30:05.803Z"::text) ELSE '{}' END ) as properties_last_updated_at, ( COALESCE(properties_last_operation, '{}') || CASE WHEN "$os" = ANY(should_update) THEN jsonb_build_object("$os", "set"::text) ELSE '{}' END || CASE WHEN "$initial_browser" = ANY(should_update) THEN jsonb_build_object("$initial_browser", "set_once"::text) ELSE '{}' END || CASE WHEN "increment_prop" = ANY(should_update) THEN jsonb_build_object("increment_prop", "set"::text) ELSE '{}' END ) as properties_last_operation FROM ( SELECT properties, properties_last_updated_at, properties_last_operation, ARRAY [ CASE WHEN should_update_person_prop(35068, "$os", "2021-10-14T09:30:05.803Z", "set") THEN "$os" ELSE NULL END, CASE WHEN should_update_person_prop(35068, "$initial_browser", "2021-10-14T09:30:05.803Z", "set_once") THEN "$initial_browser" ELSE NULL END, "increment_prop" ] as should_update FROM posthog_person WHERE id = 35068 ) as person_props ) as final WHERE id = 1 RETURNING *; ``` Certainly not the simplest thing in the world. Happy to get feedback and improve this.
yakkomajuri commented 3 years ago

This depends on https://github.com/PostHog/posthog/pull/6428

yakkomajuri commented 3 years ago

Here are some very quick benchmarks for n properties updated. An important piece of information here is that the average person on Cloud has 13 properties set on them.

Results
_______| Old        | New       | Diff (magnitude)  | Diff (ms)     |
10     | 0.5434     | 2.4877    | 4.58x             | 1.9443        |
20     | 0.7232     | 4.1982    | 5.80x             | 3.4750        |
50     | 0.8411     | 13.7144   | 16.30x            | 12.8733       |
100    | 0.8822     | 29.2367   | 33.14x            | 28.3545       |

Still, it's quite a bit of a hit. I do think I can optimize the query quite a bit if I spend some time on it. Might be worth it. Probably can pass on an array to the function and avoid selecting the person row each time.

The tradeoff is that that function would get even more convoluted with Postgres loops and stuff, but could be worth a shot. Will look into it.

tiina303 commented 3 years ago

I wonder if making the function do the update rather answer the question would be better perf (& actually makes the code simpler)? i.e. sth like (likely incorrect SQL)

CREATE OR REPLACE FUNCTION update_person_prop(
            in person_id int, 
            in prop_key text, 
            in _timestamp text, 
            in operation text, 
            in value ?,
            out updated bool   // not sure if possible and easy but would be nice if we returned if we did the update or not
        )
        AS $$ 
            UPDATE person_props SET properties->prop_key = value
            WHERE NOT property_exists OR
            (operation='set' AND _timestamp > stored_timestamp) OR
            (last_operation='set_once' AND _timestamp < stored_timestamp)
            FROM (
                SELECT 
                    properties->prop_key IS NOT NULL as property_exists, 
                    COALESCE(properties_last_updated_at ->> prop_key, '0') as stored_timestamp,
                    COALESCE(properties_last_operation ->> prop_key, 'set') as last_operation 
                    FROM posthog_person
                    WHERE id=person_id
            ) as person_props
        $$
        LANGUAGE SQL;
yakkomajuri commented 3 years ago

I wonder if making the function do the update rather answer the question would be better perf (& actually makes the code simpler)?

I'm leaning towards this as well. Will spend some time today playing with Postgres

yakkomajuri commented 3 years ago

Pushed a big WIP change that moves to the new approach. Not posting the benchmark code yet because it's a mess and would be distracting. However, early results look super promising + this is much cleaner. Optimistic about this one.

Initial tests show that query time diff was tiny for 20 properties and only grew to be up to 7ms in total when doing a complicated update of 100 props.

tiina303 commented 3 years ago

could you land the increment change and rebase on top of it to make reviewing easier 😅

yakkomajuri commented 3 years ago

Benchmark:

Screenshot 2021-10-19 at 14 30 55
yakkomajuri commented 3 years ago

@tiina303 is taking this work over the line.

And splitting it into multiple PRs.