PostHog / posthog

šŸ¦” PostHog provides open-source web & product analytics, session recording, feature flagging and A/B testing that you can self-host. Get started - free.
https://posthog.com
Other
21.93k stars 1.32k forks source link

EventQuery generating impossible to run SQL #18253

Open pauldambra opened 1 year ago

pauldambra commented 1 year ago

When viewing a session recording we show users the associated events.

To do that we use a hog event query

{
  "kind": "EventsQuery",
  "select": [
    "uuid",
    "event",
    "timestamp",
    "elements_chain",
    "properties.$window_id",
    "properties.$current_url",
    "properties.$event_type"
  ],
  "orderBy": [
    "timestamp ASC"
  ],
  "limit": 1000000,
  "personId": "226404232",
  "after": "2023-10-27T17:18:11+01:00",
  "before": "2023-10-27T17:20:41+01:00",
  "properties": [
    {
      "key": "$session_id",
      "value": [
        "018b71ee-b20a-7a1f-a8c6-72caf7925bb3"
      ],
      "operator": "exact",
      "type": "event"
    }
  ]
}

A user complained that this was failing for them intermittently. And I noticed the generated SQL is passing a list of event UUIDs that is (sometimes) too large to be passed to the query

This has been reported again

/* user_id:12345 request:_api_projects_6789_query_ */ SELECT events.uuid, events.event, toTimeZone(events.timestamp, 'UTC'), events.elements_chain, nullIf(nullIf(events.`$window_id`, ''), 'null'), nullIf(nullIf(events.`mat_$current_url`, ''), 'null'), nullIf(nullIf(events.`mat_$event_type`, ''), 'null') FROM events WHERE and(equals(events.team_id, 6789), ifNull(equals(nullIf(nullIf(events.`$session_id`, ''), 'null'), '018b71ee-ad5c-7aff-ab6e-fd5f185fe578'), 0), in(events.distinct_id, ['8d0f7620-7ad6-4c95-b243-38aaaf43bde4', '1862bd5d0025ce-09048354e893e5-26021051-1fa400-1862bd5d0034fc', <snip.>

When filtering by a person we lookup the person in django and then add their distinct ids to the query. But the list of distinct ids can be way too long to fit into a query. So CH errors and we can't load events for the recording

neilkakkar commented 1 year ago

oh, interesting! I think it's not event uuids, but person distinct IDs. There's some optimisation converting the 'personID' to a list of distinct IDs (so no joins in the SQL), but this doesn't work when the person has two many distinct IDs

pauldambra commented 1 year ago

oh, interesting! I think it's not event uuids, but person distinct IDs.

Doh, late Friday reading skills šŸ™ˆ

pauldambra commented 1 year ago

here's where a query for a person id is converted to a list of distinct ids to query directly...

https://github.com/PostHog/posthog/blob/f8b5af1768648f00f8b67bc9921aa3904b40e8a1/posthog/hogql_queries/events_query_runner.py#L116-L127

but that list can be too large to fit in a query, and should probably be a subquery šŸ¤·

mariusandra commented 1 year ago

Historic context: this behaviour was copied over from the pre-hogql "live events" page.

We need the properties and distinct id-s of a person to show the nice "(-_-) user@host.com" person display. Getting them with a subquery made the page much slower. The timings have probably changed by now, but we're talking about something like 3sec -> 13sec for loading the event explorer default view.

I'd love it if we didn't have to do this fumbling though... šŸ¤”