Closed yakkomajuri closed 3 years ago
Good stuff! :D
I think we ought to move towards the perfect world solution, and I think we can do it without having support for arbitrary code execution on cloud. The reason being: this problem is a lot more constrained than arbitrary code execution.
We could, for example, as part of the config, just expose a function that operates on `row` and should return a single event. Then the plugin would "carefully" `eval` this function, with additional checks that prevent any kind of imports in this code, plus infinite-loop checks or timeouts, etc. (basically, a small-scope version of the checklist in #454, which I think is easy to implement here). And on the bright side, this bit could probably be reused across all kinds of import plugins.
We could get rid of these checks once a broader version of these happens thanks to #454 (or maybe this small-scope version can become the MVP for the big-scope version)
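To make the idea concrete, here's a hedged sketch of what that constrained `eval` could look like using Node's built-in `vm` module. This is illustrative only, not a real sandbox (that's what #454 is for); the function name `runTransform` is made up:

```typescript
import * as vm from 'vm'

// Hedged sketch: run a user-supplied transform in a fresh context that
// exposes only `row`. Nothing like `require` or `process` exists there,
// so imports are impossible; the timeout is a crude infinite-loop guard.
function runTransform(code: string, row: Record<string, any>): any {
    return vm.runInNewContext(`(${code})(row)`, { row }, { timeout: 100 })
}

runTransform('(row) => ({ event: row.event_name })', { event_name: 'pageview' })
// → { event: 'pageview' }
```

Note that `vm` alone is explicitly not a security boundary per Node's own docs, which is exactly the "VM inside the VM" concern discussed below.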
Thoughts?
@neilkakkar I think this sounds good on paper, but I worry it would end up like when I tried to make `fetch` safer and we just kept finding more and more things to do. Ultimately, I feel like this is a slippery slope where we'd end up with a VM inside the VM, which I'm not sure is worth it.
I could be (and would hope I am) wrong, so I need to do some more research. But ultimately, when we do #454, we should be at a point where it doesn't matter what you do inside the VM: it should be safe.
Hmmm, ok. For what it's worth, I think `fetch` is a completely different ballgame, since it's a connection to the rest of the world. Whereas here, we're pretty much only allowing internal transformations in the code, without enabling any imports. This implies we can dumbly cut off a lot of things simply by not allowing importing anything at all: all you can work with are the language primitives.
I'd encourage research in this direction!
Side note: I might be 100% wrong, but comparing with the `fetch` case feels like overfitting to our past experience, vs honing in on why exactly it didn't work out. (A partial whitelist/blacklist to the rest of the world is a slippery slope, but a complete lockdown has no slope / no "salami tactics" possible.)
Well, indeed `fetch` is different, but my whole point is that someone built vm2 to do things like execute arbitrary code, although certainly with a lot more power. There are certainly smaller-scale things that can be built for purpose-specific use cases (rather than general purpose), but I still think this might not be as easy as it sounds. Could certainly be wrong.
Import sounds very interesting and potentially valuable, but I don't (and maybe I just fail at this) see how having any of this built into the plugin server is better than a "Redshift Import Plugin" that'd have this as a job.
That's a fair point @Twixes - and one of the suggestions in the discussion was that this might just be better off as a template. The cons talk quite a bit about potential issues with supporting this natively. The PR here is more for the discussion, and there's no time lost as this can be made into its own plugin very quickly (it's essentially just a native plugin).
That said, I do see a benefit to having it this way (not saying it's better). It just makes everything 10x easier for the plugin dev, as they don't have to handle (copy-paste from above):
Now, a template solves this too, but you have to have all this code in your plugin, which is a bit off-putting. You'd probably want to trace the steps and find out how it works.
So the benefit here is: write a transformation function and you're done. Watch events flow in. If you're on self-hosted, it's even nicer. Pop open the plugin editor, write 5 lines of JS, and bam.
So yeah, your point is extremely valid and I'm not at all set on providing this natively (as explained above), but it'd be a stretch in my view to say this doesn't have its benefit. It comes down to developer experience and barriers to write a plugin.
I see what you mean. Maybe this could be a plugin contrib thing then?
@Twixes also mentioned here:
The second best thing would be if this were part of plugin-contrib somehow.
I'll explore what this would look like. It would need a bit of glue to work, so it wouldn't be as simple/nice as a native method, but could be the approach here. Maybe passing `meta` and the transform function to a `createBuffer`-like thing, and we do the rest. You'd still need to do a job export and a metrics export, though, for example.
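For illustration, a `plugin-contrib` helper along those lines might be sketched as follows. Every name here (`createImporter`, `fetchRows`, `captureEvent`) is an assumption for the sketch, not an existing API:

```typescript
type Row = Record<string, any>
type PluginEvent = { event: string; properties?: Row }

// Hedged sketch of the contrib glue: the plugin supplies a row fetcher and a
// transform, and the helper owns the cursor and batching.
function createImporter(
    fetchRows: (cursor: number, limit: number) => Promise<Row[]>, // plugin-specific query
    transform: (row: Row) => PluginEvent, // the user's few lines of JS
    captureEvent: (event: PluginEvent) => void, // hands events to ingestion
    batchSize = 10
) {
    let cursor = 0 // the helper stores and increments the cursor, not the plugin dev
    return {
        async importNextBatch(): Promise<number> {
            const rows = await fetchRows(cursor, batchSize)
            rows.map(transform).forEach(captureEvent)
            cursor += rows.length
            return rows.length // 0 means the import is done
        },
    }
}
```

As noted above, the plugin would still need its own job export and metrics export around this.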
I'd be wary of adding `importEventsFromRedshift` as a core plugin-server feature. We would lock ourselves out of the exact thing we built the plugin server for in the first place: plugins. Import plugins, in this case.
Add a few months of user requests and we will end up with `importEventsFromPostgres`, `importEventsFromS3`, `importEventsFromSnowflakeVariantA`, `importEventsFromSnowflakeVariantB`, `importPersonsFromCustomerIO`, etc., etc.
Hardcoded `importFromX` functions inside plugin-contrib are not a solution either. That's just kicking the can to a different repo.
So I'd re-evaluate how to do this. Similar to how I built the `exportEvents` system out of the BigQuery export, I'd suggest just coding it all into a plugin, then perhaps coding another `importFromBigQuery` plugin... and then abstracting away the commonalities.
I think one of the first things that can be abstracted away is a system to store and increment an arbitrary `cursor`... and fetch events for it. It could look something like this:
```js
export function setupPlugin() { /* init the connection, etc */ }

function toPluginEvent(postgresRow) { return ... }

export async function importNextEvents(cursor: any): Promise<[events: PluginEvent[], nextCursor: any]> {
    const order = 'id asc'
    const limit = 10
    const results = await db.query(`select * from some_table order by ${order} offset ${cursor} limit ${limit}`)
    const nextCursor = cursor + limit
    return [results.map(toPluginEvent), nextCursor]
}
```
If everything related to cursor and result storage is done by the plugin, it'd be super easy to build import scripts for any API source. Even our own.
Regarding `toPluginEvent` in the code above, that's definitely something that should be configurable by the user. A textarea with JSON `{ "fieldInDB": "fieldInPluginEvent" }` will be good enough to start, but we could provide a nicer-looking `pluginConfig` widget for it and add features as requested. We probably don't want to `eval` JS.
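Applying such a JSON mapping needs no `eval` at all; a minimal sketch (the function name and shapes are assumptions for illustration):

```typescript
// Copy fields from a DB row into event properties according to a
// user-supplied { "fieldInDB": "fieldInPluginEvent" } mapping.
// Unknown or unmapped fields are simply dropped.
function applyMapping(
    row: Record<string, any>,
    mapping: Record<string, string>
): Record<string, any> {
    const out: Record<string, any> = {}
    for (const [dbField, eventField] of Object.entries(mapping)) {
        if (dbField in row) out[eventField] = row[dbField]
    }
    return out
}

applyMapping({ user_email: 'a@b.com', internal_id: 7 }, { user_email: 'email' })
// → { email: 'a@b.com' }
```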
(Or... I could just click "fork" next to a plugin and edit the code to suit my needs :D)
Closing this discussion in favor of https://github.com/PostHog/plugin-server/issues/529
This is just a prototype and a discussion thread!
Motivation
I think the next 2 big things for plugins are #406 and import plugins. I'd like to tackle those once I'm back.
Nevertheless, following a talk with @jamesefhawkins about a potential user looking to import their Redshift data to us, I decided to give the process a try and use this draft PR as a discussion thread.
What this PR does
It introduces a VM method (wait! discussion coming below) `importEventsFromRedshift` that allows devs to write just a function to transform their table data into a PostHog event, and we'll query their table and ingest events from it. Very Segment-like. `importEventsFromRedshift` takes the rows resulting from a `SELECT *` on a specified table and allows devs to transform them however they like, such that events come out the other end. It significantly reduces the barrier to writing an import plugin, as all of this is handled for you:
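For a rough sense of the developer experience, a plugin using this method might look like the following. The column names are made up for the sketch; only the method name `importEventsFromRedshift` comes from this PR:

```typescript
type PluginEvent = { event: string; properties?: Record<string, any> }

// The only thing the plugin dev writes: a transform from a Redshift row
// (hypothetical columns) to a PostHog event. Querying, batching, and
// ingestion are handled by the plugin server.
export function importEventsFromRedshift(row: Record<string, any>): PluginEvent {
    return {
        event: row.event_name,
        properties: { ...row, source: 'redshift_import' },
    }
}
```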
Considerations
Current Approach
Pros
Cons
Alternatives
Perfect World Solutions
The best solution I can envision is #454 done to perfection so that we can write one single import plugin per warehouse that takes code as an input for the transformation and does all the rest. This would be like a "plugin-in-plugin" scenario, and would look very much like what Segment does. If we ever get to where we want to get in terms of security when running arbitrary code on our servers, there's no reason a plugin couldn't itself take in arbitrary code and run it with some eval-like method.
The second best thing would be if this were part of `plugin-contrib` somehow. It'd work essentially the same way, except we'd move the code from one repo to another, which conceptually makes more sense. This would need a bit of glue to work and wouldn't be as nice, but could be a good approach.
Down to Earth
If we decide that providing a method like this natively is not a great idea, we can just provide this as a template. After all, this is just an import plugin with a function missing. This is worse because users have to see all this complicated code instead of it just "magically working", but might make sense to do conceptually.
Other relevant points
The query needs an `ORDER BY` clause so offsets can be used reliably. The problem is, if the table is to be ordered by an arbitrary column at query time, this simply won't work. An `ORDER BY timestamp` query on PostHog events I had in Redshift took several minutes to process around 200k rows. There are many ways to go with this, from things like sort keys (keeping your Redshift table sorted) to using another mechanism like greater-than/smaller-than `WHERE` clauses. They all come with their own considerations though, particularly as it pertains to reliable ingestion across multiple threads, so this needs a bit more thought. Could also consider using platform-specific functions like `UNLOAD`.
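The greater-than `WHERE` approach mentioned above is keyset pagination: instead of `OFFSET`, remember the last key seen and filter on it, which stays fast no matter how deep into the table the import is. A hedged sketch, with assumed table and column names:

```typescript
// Build the next batch's query from the last id seen. With a sort key on `id`
// this avoids the deep-OFFSET scans described above. In real code the values
// should be bound as query parameters rather than interpolated.
function nextBatchQuery(table: string, lastId: number, limit: number): string {
    return `SELECT * FROM ${table} WHERE id > ${lastId} ORDER BY id ASC LIMIT ${limit}`
}

nextBatchQuery('posthog_events', 200000, 10)
// → 'SELECT * FROM posthog_events WHERE id > 200000 ORDER BY id ASC LIMIT 10'
```

The multi-thread concern still applies: two workers sharing one cursor need the cursor update and the batch fetch to be atomic.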