Import Plugins Discussion

PostHog / posthog

🦔 PostHog provides open-source product analytics, session recording, feature flagging and A/B testing that you can self-host.

https://posthog.com

Other

19.85k stars 1.17k forks source link

Import Plugins Discussion #6847

Closed yakkomajuri closed 1 year ago

yakkomajuri commented 2 years ago

📖 Context

I recently moved teams with one of the goals of the move being validating approaches to interacting with data warehouses. Working with @fuziontech and @tiina303 we could dig more deeply into the topics of how to best build solutions to import and export data from PostHog.

For a while we've been talking about import plugins, and I remember @jameshawkins even mentioning somewhere that import plugins would be a pre-requisite for making an external plugins release (on HN etc.).

However, on our first meeting, one of my first questions to James and Tiina was: do import plugins actually make sense?

It's possible the generalization and simplicity we're seeking with these plugins might actually bring a degree of complexity. As such, I'm opening this issue so we can discuss the validity of import plugins from various perspectives: Sales/Growth (@jameshawkins), Product (@marcushyett-ph), and Engineering (@fuziontech / @tiina303 / @timgl).

To give us some context, I decided to build a prototype of what an import plugin would look like, so we can truly see if it makes sense.

📥 Redshift Import Plugin

Note: This is not https://github.com/PostHog/plugin-server/pull/504! It's https://github.com/PostHog/posthog-redshift-import-plugin. Although largely based on the same concepts.

Here's how this plugin works - configure the following config and you're done, events will be coming in:

This will handle for you:

Connection to Redshift
Querying in a safe and optimized way (no duplicates) across multiple threads
Plugin server restarts
Ingesting the events
Logs
Metrics
Input validation

And to transform your data, you can either use an already-available transformation (like JSON Map), or contribute a transformation to the repo.

🎯 Click here if you care to know why this was built this way

The discussion [here](https://github.com/PostHog/plugin-server/pull/504) got me convinced a "plugin templates" concept was the way to go for this. But the more I started to think about how this would work today, the more I got discouraged by the idea. From a PostHog Cloud perspective, providing users with a template means: - They need to create their own plugin (from a fork, but still) - We need to review that **entire plugin** and add it to the official repository - The repository ends up bloated with bespoke import plugins that appear for everyone So instead of telling users to spin up a whole new repo which we'll have to fully review anyway, why not just let them contribute transformations to one specific repo? Add in your transformation function, we check that it's safe, and the only place they will crowd is the select dropdown in the plugin config, but that's mitigated by https://github.com/PostHog/posthog/pull/5578 too. Plus this also allows for someone's transformations to help others, and "standard" transformations to appear, like Amplitude -> PostHog for example. Non-Cloud users on the other hand can indeed fork the repo and do as they wish. I recognize this isn't an amazing solution, but we're limited by https://github.com/PostHog/plugin-server/issues/454. So this is indeed better than the alternative, which is one plugin per import. It allows us to get started quickly with import plugins and iterate as we make progress towards https://github.com/PostHog/plugin-server/issues/454.

🤔 Does this make sense?

So then the question: does this make sense?

Do we want to have an import plugin for each warehouse where users just need to write a transformation OR would we rather provide people with guides (much like what @leggetter is doing) on how to import data and let people fend for themselves?

At the end of the day, a lot of how I've approached plugins is seeking to provide an experience where even developers just starting out are able to make a difference. If you know a bit of JS, you can do a ton. So I'm constantly looking for approaches to simplify the life of PostHog users. However, it could well be that this actually makes things more complicated.

So, thoughts?

mariusandra commented 2 years ago

I think this is amazing work and makes a ton of sense! 💪 🎉

I'm wondering now:

Would love to see what kind of throughput this can get.
I'd love to see a progress bar with how far the import is somewhere in the interface, together with "pause", "restart", etc buttons while the plugin is activated.
There's stuff that can be abstracted away from this (function exportEvents style) to make creating custom import plugins even easier
Can we use the same system with an external PostHog API to make a posthog import plugin? Mainly to ease people migrating from postgres to clickhouse.
Is there a "keep checking for new data every now and then" mode? So that we could keep a postgres and clickhouse posthog in sync this way?

Aside: The plugin config screenshot threw me off a bit cause it says "the plugin will create a table with this name". Probably left over from the redshift export plugin?

marcushyett-ph commented 2 years ago

To answer this question effectively we should speak with our customers and get an understanding of which solves their needs better.

My gut feeling is that the needs of self-hosted vs cloud hosted will be quite different, I would expect cloud users to want to rely more on something quick to configure vs self-hosed users being comfortable creating a bespoke integration that meets their specific requirements.

Given our focus is more aligned with self-hosting users today it might mean that the bespoke integration approach is likely the area we should invest more in. But would be great to validate - I have a few customers in mind we should ask.

tiina303 commented 2 years ago

This is cool =) I like the idea of transforms instead of a plugin template, seems easier for us and users.

There could be privacy concerns about revealing their underlaying data schema (which writing a transformation to an open source repo could do).

Just a bit more context from our Platform meeting yesterday. If I understood jams correctly it might be tricky to pull data (that import plugin would do) because of permissions (requiring Security review etc), but sending data out is usually easy and folks usually know their own data source & how to do that (i.e. the guides alternative: we'd provide example python script, Airflow job, ...).

So if I've understood things correctly they could take our .ts script and run it on their own infra to get around the problem of needing to have security review the data access. So I wonder if we can combine the idea of import plugin and guides by adding a bit of an abstraction layer for job queue and both ends (data input and ingestion to Posthog) and providing a way/guide to run the .ts standalone. So we can take advantage of the shared code in the middle that does the transformation, batching, keeping track of things, retries etc. Just an idea, not sure if this is needed or reasonable and not too complex, potentially better to keep looking at the import plugin and guides separately.

Can we use the same system with an external PostHog API to make a posthog import plugin? Mainly to ease people migrating from postgres to clickhouse.

+1 and for migrations between cloud and self hosted too in the future.

So that we could keep a postgres and clickhouse posthog in sync this way?

Why would one want to do that? I would assume the path is 1x Postres -> ClickHouse & then just continue using ClickHouse only

yakkomajuri commented 2 years ago

@mariusandra

Would love to see what kind of throughput this can get.

I did have some metrics from when I did PostHog/plugin-server#504. Guess I don't have them anymore, will do this.

I'd love to see a progress bar with how far the import is somewhere in the interface, together with "pause", "restart", etc buttons while the plugin is activated.

Yup, definitely. PostHog/plugin-server#406 is coming soon so I've already been brainstorming on this, which falls under PostHog/plugin-server#414.

There's stuff that can be abstracted away from this (function exportEvents style) to make creating custom import plugins even easier

Agreed, like you mentioned before, a good call here is probably to build another one of these and see what has the most overlap.

Can we use the same system with an external PostHog API to make a posthog import plugin? Mainly to ease people migrating from postgres to clickhouse.

Don't see why not. A few things like cursors would work differently but should have quite a bit of overlap.

Is there a "keep checking for new data every now and then" mode? So that we could keep a postgres and clickhouse posthog in sync this way?

Currently this is actually what it does. Should add a config option to import continuously or only up to the last event as per a "snapshot" (count) taken at the beginning of the import.

@marcushyett-ph could you let me know who they are in private? Would like to move on this so can determine my sprint priorities and move fast with this.

@tiina303

There could be privacy concerns about revealing their underlaying data schema (which writing a transformation to an open source repo could do).

This is super easy to mitigate. You could:

a) Fork if self-hosting b) Create a temp table for the import c) But the easiest thing would probably be to just leverage the config inputs. You could take a JSON file and transform from that without revealing too much. See the JSON Map transformation for example.

leggetter commented 2 years ago

When writing the data ingestion docs I thought of two high-level data ingestion scenarios. I can't remember who, but somebody correctly flagged that there's a third scenario. Altogether we have:

Real-time/live events - data as it's generated
Historical data - data in a data warehouse or another analytics provider
Pulling data from a third-party source as part of a regular batch import. Something like Redshift or S3.

So, for importing data we need to address 2 and 3.

# 2. could be very process-intensive as large amounts of data are pulled and transformed from a third party system. Depending on the volume of data created in the third party and the frequency of the checks 3. could also require a lot of resources.

To me @tiina303 raises a few questions/thoughts in my head:

Is the plugin server the right place for this to happen? Is it built to run long-running processes for the flagged scenarios?
I can see a lot of value for customers being able to run this standalone, outside of PostHog. The value is more obvious if they're using PostHog Cloud. But even when self-hosting, running a separate service to manage a specific import mechanism - that may even just be a one-off action - seems to make sense to me.

These questions aside, I think the transformation approach is 👍

fuziontech commented 2 years ago

Imports make sense for pulling from other SaaS services

I think import plugins make a ton of sense for ingesting any data from a 3rd party service where the schema is set and there are no arbitrary or custom transformations. We should manage ingesting data where we know the schema. Things like ingesting from:

Salesforce
Hubspot
Zeta Global
Zoho

Imports do not make sense for pulling from data warehouses

I don't think import plugins make sense for importing data from warehouses. I don't think our plugin server is the right place to be doing the heavy lifting on what is effectively ETL.

It lives outside of these companies normal ETL pipelines (something every company that has a data warehouse/lake has)
It adds another tool to their pipelines that is not managed only adding complexity
There is a ton of configuration need for scheduling, or triggering jobs. Most jobs are triggered by some upstream table getting (re)calculated. Would we create an api for each of these import jobs that they would have to build against?
Security wise this is a vulnerability. You have to cut a username and password for your data warehouse for us. You need to open up your data warehouse to our CIDR IP block for ingestion.
The architecture is not setup for it.
It moves cost onto us for executing the transformations.
How do we manage backfills? Will we expose an interface for selectively loading and reprocessing arbitrary swaths of data?
If we consume data before it is ready we run the risk of reporting on data that is incorrect, partial, and potentially wrong

Because of that, if I was evaluating PostHog I would absolutely decline using import plugins mainly because of the complexity and moving something that is squarely an ETL out of my normal set of tools to build and manage an ETL.

Pushing from your warehouse is the best way to get data out of your warehouse

Using something that is more push based using our existing libraries but adding examples for how you can add pushing to PostHog using things like Airflow, Azkaban, or Luigi. This is beneficial because:

ETL or ELT jobs are not new to anyone operating a data warehouse. This makes it super easy for a data team to add us to their toolkit as a data sink from their warehouse.
They control the processing and the transformation of their data which may require a very extensive DAG of steps before getting loaded into PostHog
The processing all happens on their infrastructure which is already being managed by them and setup for this
Scheduling can be done how it makes sense to them. Most of these ETL jobs are triggered by upstream tables becoming ready in the DAG that is being executed. Us implementing this would be reinventing the wheel.
The customer controls the flow. Can shut it off very easily. Has monitoring already setup for job and upstream jobs.
From a security point of view no credentials are required to be cut from the warehouse. You grab and transform what you need and push it to us
Management of backfills and reprocessing data is automatic and consistent along with how users would do any backfill or reprocessing of data within the organization.

Next steps

The first integration I would pursue would be building out an operator for Airflow or just having an example of how to build a DAG with PostHog on Airflow

Other examples that would be a priority:

Exporting data using a Spark job
Example script for exporting a CSV or JSON blob to PostHog
Documentation on correct schemas for events

Edit: We should also consider partnering with Airbyte as a destination

yakkomajuri commented 2 years ago

Thanks a lot @leggetter and @fuziontech for the thoughts.

Following our meeting and some discussions with clients, I'm also convinced that most warehouse users prefer the push model instead of the pull model.

As such, here are some practical next steps:

Import plugins are bumped off of my priorities list. My sprint focus will be on export from the beginning and its supporting tools.
I'll nevertheless clean up the plugin a bit and put it out with a Beta "use at your own risk" disclaimer. This will at least allow us to get a sense of if there's a group of users we haven't consider and just help us validate these decisions further. There's no harm in having a plugin in the repository that nobody uses, but we won't be investing more time on these at this stage.
Not sure how your priorities are @leggetter, but maybe we should work together to get docs out on data imports? I disagree with @fuziontech that an Airflow job should be the first step - I say let's start with a Python script and go from there. Python is the simplest and the go-to "tool" for so many people (including a client I spoke to yesterday). @leggetter has already done an amazing job here but we can probably go a bit deeper and more warehouse-specific with a guide.

How do the above steps sound? @fuziontech @tiina303 @marcushyett-ph

leggetter commented 2 years ago

My prioritisation process is based on:

Will it help us get to 5 reference customers?
Does it align with Nailing funnels -> Nailing Diagnosis?
Does it improve activation/get customers through to "discovery"?

If we feel that enabling existing, in-progress, and future customers to import from data warehouses will help us achieve one or more of the above then I'll prioritise. I believe it will.

The other considerations for me are:

yakkomajuri commented 2 years ago

Interestingly, it just came to mind that down the line this might actually be one way to build the foundation of the "pluggable backends" we talked about in our call @fuziontech @tiina303 @marcushyett-ph.

Instead of actually sitting on top of your DB Metabase-style and doing all the work to translate queries (+ all the maintenance work), we can rather just have an open bridge and constantly bring data into PostHog in a format we understand and just query it "in our way". In fact, if you use the plugin as is, as of today you could effectively plug in Redshift as your backend for PostHog. 👀

Plus all of the issues regarding security that are relevant when talking about historical imports are irrelevant when it comes to pluggable backends, since you gotta give up the credentials anyway.

jamesefhawkins commented 2 years ago

I'd love some context on pluggable backends @yakkomajuri if you concluded anything for / against. Would help with the story with customers / investors etc.

yakkomajuri commented 2 years ago

@jamesefhawkins in our call we talked about these being something we may or may not do in the longer-term future. I believe @marcushyett-ph mentioned how if we wanted to do this we'd probably need a team supporting each database available, and this would all be a lot of work.

However, would love to get people's thoughts on this. Didn't think of it yesterday but essentially plugins like the Redshift import one could be a way to get closer to something like "pluggable backends" much much faster, although of course with its own limitations.

tiina303 commented 2 years ago

Regarding priorities, I would love if we started with something for people to move from Postgres -> Clickhouse. Python script for that sounds great (so agree with Yakko's proposed priority). Perhaps we can create a generic python script template & then a specific example (i.e. the Postgres -> Clickhouse move one).

pluggable backends for / against

I like this idea in general for longer term future. Jams in the call mentioned that we could potentially get more users with data warehouse this way as one of the problems with running self-hosted posthog is managing it. Specifically managing both the compute and storage, if we can use a pluggable backend, then they have no storage maintenance overhead (their data warehouse might get a bit more data, but that's way easier than having two separate data stores).

import pluggin for pluggable backends

Just to make sure I understand correctly how it would work: we don't store any data (maybe caching some) & for every query do the transformation and pull data? I can see how this could work fine for simple things (e.g. similar schemas & queries), but I'm not sure how well it would work if the transformations are more complex for more complex queries. If we don't control the data layout sorting and filtering could get really expensive (something we have an index on in ClickHouse their data store might not have an index on making us look through all the values). Us querying inefficiently could slow down (or lock rows in?) their DB, I'm not sure if there's a way and how well it works to limit users to certain amount of resources only?

leggetter commented 2 years ago

I would love if we started with something for people to move from Postgres -> Clickhouse. Python script for that sounds great (so agree with Yakko's proposed priority). Perhaps we can create a generic python script template & then a specific example (i.e. the Postgres -> Clickhouse move one).

This is a good one and potentially very important for the existing community who may feel left behind otherwise. Thanks for flagging, @tiina303 🙌

mariusandra commented 2 years ago

Very insightful discussion above!

I'm completely on board with the notion that very often it's easier to push data into posthog, instead of going through the potential security/compliance nightmare described above. It's also rather likely that our current focus customers would prefer to push their own data, and based on that we should deprioritize this task (and work on guides on how to posthog.capture your data form a huge text log file with a small python script).

However, two cases discussed above still make me believe import plugins (and especially abstractions around them with corresponding UI features) will be very useful to a lot of users:

SaaS and other services where there is no alternative option. There are some very interesting APIs we could import from. Think all competitors or things like the "github star sync" plugin... or a "twitter mentions" plugin.
Local services where the cost preparing data for exports is miniscule. If I'm a small startup and I have some local data that I want to get in to posthog, I might not care if there's a faster way to get data (amounts are small) or compliance (we're in the same datacenter and I have all the keys; I'm only importing during a migration and don't care if it takes 30min or 2h to import 10gb; I'm a sysadmin and I'm just testing, etc).
"Pluggable backends" as described by Yakko. I see these as "import from a local append-only database and keep the data coming". This should work well for append-only or time-series datasets, but will get complicated if something changes in the initial data (when past rows get changed). I'm not sure what are some real examples when I'd use this though.

@leggetter Is the plugin server the right place for this to happen? Is it built to run long-running processes for the flagged scenarios?

In short, yes. The plugin server is built for handling huge amounts of data/events flowing through it, and long running processes that both import and export data.

Even if it's often better to push data, just saying "we have a plugin for redshift/bigquery/snowflake imports (thought you might want to push out yourself)" sounds enabling.

leggetter commented 2 years ago

Request to migrate from self-hosted (most likely Postgres) to Cloud in the community Slack https://posthogusers.slack.com/archives/CT7HXDEG3/p1629628505141700

fuziontech commented 2 years ago

FWIW I don't think we should be thinking about "Pluggable backends" right now. I think that is a distraction. That may or may not even live in the plugin server at all. As things are setup currently that seems like something that would be built on the backend of app, but again. I don't think we should be designing for that currently. If anything we should discuss the merits of the product outside of implementation in posthog/product-internal#150

Maybe this (export plugins) is something we should jam on at the offsite?

yakkomajuri commented 2 years ago

A few updates:

A paying customer is using the Redshift Import plugin
I just had a call with a company providing "data consultancy" services who would love to have a Snowflake import. They're working with some big clients and there was a comment that they see the space moving a bit more towards "pull" models

Just some things to consider

posthog-contributions-bot[bot] commented 2 years ago

This issue has 3251 words. Issues this long are hard to read or contribute to, and tend to take very long to reach a conclusion. Instead, why not:

Write some code and submit a pull request! Code wins arguments
Have a sync meeting to reach a conclusion
Create a Request for Comments and submit a PR with it to the meta repo or product internal repo