PostHog / posthog

🦔 PostHog provides open-source product analytics, session recording, feature flagging and A/B testing that you can self-host.
https://posthog.com
Other
19.43k stars 1.14k forks source link

Integrate with production databases #1244

Open jamesefhawkins opened 3 years ago

jamesefhawkins commented 3 years ago

Context

This is a feature raised based on what I've heard @timgl , @EDsCODE and @mariusandra discussing anecdotally. I got excited when I heard the idea so thought I'd start a discussion at least.

We have a strong sense that the majority of the world's apps aren't instrumented properly, and this could enable "automatic" set up for back end event tracking.

Is your feature request related to a problem? Please describe

Many apps contain a ton of relevant user data, hidden away in databases.

If this isn't set up to push into PostHog proactively, then it is difficult to extract much value from it.

By getting this into the PostHog interface, it'd make understanding user behavior better

This data is sensitive. Since PostHog can be self-deployed, we could enable this functionality even for very large companies.

I think the advantage of solving this would be:

Describe the solution you'd like

We should let PostHog connect to the production database of the apps we help track.

This would involve:

Describe alternatives you've considered

Events need to be passed manually through sending calls to our API or libraries.

Additional context

A very very rough idea of how this could work:

Desktop - 1 Desktop - 2 Desktop - 3

mariusandra commented 3 years ago

Hi @jamesefhawkins! I originally proposed this idea to @timgl . Unsurprisingly, I'm somewhat influenced/biased by my previous work on Insights (MIT license). If we could somehow integrate the work that's done there into PostHog, it would make PostHog a much compelling product IMO!

The pitch I gave to Tim is basically this:

Product Analytics tools can be roughly split into two groups: those that operate on a stream of events (segment, mixpanel, heap, etc) and those that operate on existing databases (looker, google data studio, chartio, metabase, superset, etc).

So far I don't know of one solution that does both sides of product analytics, much less one that's open source. PostHog is in a unique position to claim that mantle, especially if we do the GitLab strategy that we seem to like.


From a technical standpoint, we could reuse a lot of the work I did for Insights. It's built with a very similar stack (TypeScript, React, Kea, Ant Design) and a lot of the code can be easily adapted. The backend is however also written in TypeScript. Yet except for the SQL query generator, the backend is just a very light wrapper around a mongodb'esque database (nedb)... so there's not much to port.

Regarding the required features, there are 3 parts to such a system:

  1. Connecting to the database and creating a simplified data model to explore.
  2. Exploring this data model to find insights and connections in the data.
  3. Visualising and saving those insights.

Note: Go to https://demo.insights.sh/ to play with whatever is in the screencasts below:

Part 1. Data model

I think Looker got a lot of things right with LookML. In a nutshell, they generate a simple data layer on top of your database. You can then modify this layer, add composite fields (e.g. full_name = coalesce(first_name, ' ', last_name)), hide fields or tables from prying eyes (e.g. the password field or the payment_history table).

Insights does the same thing. First you set up a connection:

2020-07-27 12 09 31

It then detects your DB schema, including links between different fields through foreign keys.

Then you have the chance to remove/edit/add fields as needed:

2020-07-27 12 09 55

This is used to hide sensitive data (e.g. password fields) and to hide noise (e.g. only ~30 tables out of ~180 might be relevant as was the case in my last company).

For this to work well, we will need to have a two-level access control, so that teams are split into administrators (who can edit the data model) and users (who can just explore).

In any case, I spent a lot of time tweaking this part of Insights and I think it looks and works really well. This is also the newest part of insights and can probably be copied almost directly.

Part 2. Exploring the data.

Each tool does this slightly differently and insights is no exception. Having played with a few other tools out there (namely Looker, Holistics), I think insights does some things better than the competition.

I remember for example that in Holistics, if my Booking model had two links to the User model (called "student" and "teacher"), I basically had to pick which one I'd keep... as you could have only one entity per model in any given "subset" of data. Looker was similar IIRC. Chart.io was also rather complicated.

In insights I purposely made it super easy to explore all the different connections in your models.

2020-07-27 12 17 29

Once you get a hang of it, it feels really powerful and intuitive. There's this comment in the (now infamous) HN thread (that got me a job with PostHog), which expresses this feeling better than I can:

I totally do not get the negative comments. Having seen a fair share of BI tools (I created and maintain an "awesome" list at 1 which I created for the sole purpose of getting an overview of what's out there) and assessing the stage you're at developing solo, you're getting quite some important UI basics very right imo.

The "Next Steps" give a great impression of what's possible and how to navigate the app.

From an E-Commerce viewpoint, most people want to flexibly play around with Products, Orders and Order Line Items and ratios between the three. Configuring this by hand in something like Google Datastudio etc. is cumbersome, even or more so in Looker (if broken out into separate Views, one needs to needs to create Merged Results Looks; if not the different cardinality objects are always connected). I feel like Insights is handling this really well. If you would add support for calculated fields, you'd offer a compelling E-Commerce controlling solution.

One issue I noticed though is that your nesting seems to be unlimited. I could open an infinite "table" tree if I start from the order_lines or product view --> order_lines --> order --> order_lines --> product --> seller --> orders/ products... ad infinitum.

You want to make sure that this drilldown into FK relationships does not lead to loops. Also I think some visual support where in the nest tree one is at (orders --> order_lines --> product --> sellers) would be great, otherwise it's quite easy to get lost.

Other than that honestly congrats on doing a lot of things right in terms of data viz, responsiveness, sharing capabilities. Really like your work!

The code for the "explorer" view is many years old and could use a good scrub, yet we can definitely build on this!

Part 3. Visualising data.

This part is not yet done for insights. There is a graph with a lot of features that you can see if your table has a time column:

2020-07-27 12 27 09

... but that's it. There are no graphs without a date/time field, you can't add items to dashboards, etc:

2020-07-27 12 26 13

Here we could reuse a lot of what is already in PostHog. The dashboards should look and feel the same anyway, no matter where your data is coming from.


There are still many open issues with insights that need to be solved, with query cancellation and support for huge tables being probably the biggest one (e.g. hitting "refresh" cancels the pending query). The support for facets/splits is still lacking, etc, etc. However all of these can surely be overcome. :)

In summary, I think we could see if and what can be adapted from insights into posthog. Because it's written with a very similar tech stack, is MIT licensed and written by me, we could take a lot of inspiration from there, saving weeks if not months of development time.

Is this a direction worth exploring?

jamesefhawkins commented 3 years ago

User feedback - from the lead data analyst at an enterprise client:

"Can you enrich the data? We already use Adobe Analytics / a bunch of internal databases with valuable stuff in. I want to get our user properties into PostHog so PMs can just easily add cohorts without having to work out how to post it all through the apis/library"