PostHog / charts-clickhouse

Helm chart for deploying PostHog with ClickHouse on your K8s infrastructure
MIT License

Telemetry improvements #278

Open guidoiaquinti opened 2 years ago

guidoiaquinti commented 2 years ago

Proposed change

We currently generate some telemetry from Helm install/update operations to help us better understand trends and app/k8s/chart release versions in use.

I’m opening this issue to improve the current setup and raise the quality and quantity of the signals we get from those events.

Here is a list of some improvements I have in mind:

  1. leverage Helm pre- and post- hooks to track

    1. when an installation/update starts but doesn’t end successfully (KR: update/install success/failure rate)
    2. how long the operation takes (KR: p50/p75/p99 duration)
    3. ...
  2. make this event telemetry collection optional (and add a note to README.md)

  3. ….
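
As a rough sketch of how the two KRs above could be computed once pre- and post-hook events exist: the event names `helm_operation_started` and `helm_operation_finished`, and the `operation_id` and `timestamp` fields, are hypothetical and not part of the current telemetry. The sketch also assumes that an operation with a start event but no finish event counts as a failure, so operations still in progress would be counted as failed too.

```python
import math
from typing import Dict, List, Optional


def percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty list (0 < pct <= 100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize_operations(events: List[dict]) -> Dict[str, Optional[float]]:
    """Pair start/finish events by operation_id and derive the KRs.

    Assumption: each event is a dict with hypothetical keys
    'event', 'operation_id' and 'timestamp' (seconds).
    """
    started: Dict[str, float] = {}
    finished: Dict[str, float] = {}
    for e in events:
        if e["event"] == "helm_operation_started":
            started[e["operation_id"]] = e["timestamp"]
        elif e["event"] == "helm_operation_finished":
            finished[e["operation_id"]] = e["timestamp"]

    # Only operations that both started and finished have a duration.
    durations = [finished[op] - started[op] for op in started if op in finished]

    summary: Dict[str, Optional[float]] = {
        "started": len(started),
        "succeeded": len(durations),
        # A start without a finish is treated as a failure (assumption).
        "success_rate": len(durations) / len(started) if started else None,
    }
    for p in (50, 75, 99):
        summary[f"p{p}"] = percentile(durations, p) if durations else None
    return summary
```

Nearest-rank percentiles are used here only to keep the sketch dependency-free; in practice the aggregation would more likely happen in a PostHog insight than in a script.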

Alternative options

Do nothing

Additional context

See helm_install events in PostHog & HELM_INSTALL_INFO in this repo

marcushyett-ph commented 2 years ago

@tiina303 you shared an idea for an interim metric (using existing telemetry) around success rate of installs...wonder if you could share it here too?

tiina303 commented 2 years ago

It was during the product exercise "How many people fail to deploy a self-hosted instance?": https://app.posthog.com/insights/rqjfOxEj

Idea: look at a funnel from helm install tied to a hostname -> organization status report for that hostname. That way we can aggregate by unique instance, yay for group analytics.

Problems:

marcushyett-ph commented 2 years ago

Thanks @tiina303.

I was wondering if this retention view is potentially another good way to measure it (in the interim), given that it enforces the time between helm install and org status report.
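
To make the interim metric concrete, here is a minimal sketch of the hostname-keyed funnel with an enforced time window. The `hostname` and `timestamp` fields and the `org_status_report` event name are assumptions for illustration; only `helm_install` is named in this issue. Hosts that send a status report without a prior helm install (for example cloud organizations) are ignored.

```python
from typing import Dict, List


def install_conversion(events: List[dict], window_seconds: float) -> dict:
    """Share of helm installs followed by a status report within a window.

    Assumption: each event is a dict with hypothetical keys
    'event', 'hostname' and 'timestamp' (seconds).
    """
    first_install: Dict[str, float] = {}
    reports: Dict[str, List[float]] = {}
    for e in events:
        host = e["hostname"]
        if e["event"] == "helm_install":
            # Keep the earliest install seen for each hostname.
            first_install[host] = min(e["timestamp"], first_install.get(host, e["timestamp"]))
        elif e["event"] == "org_status_report":
            reports.setdefault(host, []).append(e["timestamp"])

    # A host converts if any report lands inside [install, install + window].
    converted = sorted(
        host
        for host, t0 in first_install.items()
        if any(t0 <= t <= t0 + window_seconds for t in reports.get(host, []))
    )
    total = len(first_install)
    return {
        "installed": total,
        "converted": converted,
        "conversion_rate": len(converted) / total if total else None,
    }
```

Keying on hostname is what makes this a per-instance measure rather than a per-user one, which is the same effect group analytics gives inside PostHog.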

tiina303 commented 2 years ago

Maybe we should just start from the org status report and check the retention (https://app.posthog.com/insights/wct4ybhr) and keep an eye on it. It looks pretty good at the moment, but if we see bigger drops in the first weeks or anything else odd, we might want to jump in and address it. It looks like "relative to previous period" might be broken (https://github.com/PostHog/posthog/issues/8366); that would potentially be a better one to use.

marcushyett-ph commented 2 years ago

One small tweak is to aggregate by instance (since the original query will also include cloud organizations). Yep, I agree it looks pretty good (especially on a monthly basis).