Closed yakkomajuri closed 3 years ago
Good stuff! :D
I think we ought to move towards the perfect world solution, and I think we can do it without having support for arbitrary code execution on cloud. The reason being: this problem is a lot more constrained than arbitrary code execution.
We could, for example, as part of the config, just expose a function that operates on `row` and should return a single event. Then the plugin would "carefully" `eval` this function, with additional checks that prevent any kind of imports in this code, plus infinite-loop checks or timeouts, etc. (basically, a small-scope version of the checklist in #454, which I think is easy to implement here). And on the bright side, this bit could probably be reused across all kinds of import plugins.
We could get rid of these checks once a broader version of these happens thanks to #454 (or maybe this small-scope version can become the MVP for the big-scope version)
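To make the idea concrete, here's a hedged sketch of what that constrained `eval` could look like using Node's built-in `vm` module. This is illustrative only, not a real sandbox (that's what #454 is for); the function name `runTransform` is made up:

```typescript
import * as vm from 'vm'

// Hedged sketch: run a user-supplied transform in a fresh context that
// exposes only `row`. Nothing like `require` or `process` exists there,
// so imports are impossible; the timeout is a crude infinite-loop guard.
function runTransform(code: string, row: Record<string, any>): any {
    return vm.runInNewContext(`(${code})(row)`, { row }, { timeout: 100 })
}

runTransform('(row) => ({ event: row.event_name })', { event_name: 'pageview' })
// → { event: 'pageview' }
```

Note that `vm` alone is explicitly not a security boundary per Node's own docs, which is exactly the "VM inside the VM" concern discussed below.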
Thoughts?
@neilkakkar I think this sounds good on paper, but I worry it would end up like when I tried to make `fetch` safer and we just kept finding more and more things to do. Ultimately, I feel like this is a slippery slope where we'd end up with a VM inside the VM, which I'm not sure is worth it.
I could be (and would hope I am) wrong, so I need to do some more research. But ultimately, when we do #454, we should be at a point where it doesn't matter what you do inside the VM: it should be safe.
Hmmm, ok. For what it's worth, I think `fetch` is a completely different ballgame, since it's a connection to the rest of the world. Whereas here, we're pretty much only allowing internal transformations in the code, without enabling any imports. This implies we can dumbly cut off a lot of things simply by not allowing importing anything at all: all you can work with are the language primitives.
I'd encourage research in this direction!
Side note: I might be 100% wrong, but comparing with the `fetch` case feels like overfitting to our past experience, vs honing in on why exactly it didn't work out. (A partial whitelist/blacklist to the rest of the world is a slippery slope, but a complete lockdown has no slope / no "salami tactics" possible.)
Well, indeed `fetch` is different, but my whole point is that someone built vm2 to do things like execute arbitrary code, although certainly with a lot more power. There are certainly smaller-scale things that can be built for purpose-specific use cases (rather than general purpose), but I still think this might not be as easy as it sounds. Could certainly be wrong.
Import sounds very interesting and potentially valuable, but I don't (and maybe I just fail at this) see how having any of this built into the plugin server is better than a "Redshift Import Plugin" that'd have this as a job.
That's a fair point @Twixes - and one of the suggestions in the discussion was that this might just be better off as a template. The cons talk quite a bit about potential issues with supporting this natively. The PR here is more for the discussion, and there's no time lost as this can be made into its own plugin very quickly (it's essentially just a native plugin).
That said, I do see a benefit to having it this way (not saying it's better). It just makes everything 10x easier for the plugin dev, as they don't have to handle (copy-paste from above):
Now, a template solves this too, but you have to have all this code in your plugin, which is a bit off-putting. You'd probably want to trace the steps and find out how it works.
So the benefit here is: write a transformation function and you're done. Watch events flow in. If you're on self-hosted, it's even nicer. Pop open the plugin editor, write 5 lines of JS, and bam.
So yeah, your point is extremely valid and I'm not at all set on providing this natively (as explained above), but it'd be a stretch in my view to say this doesn't have its benefit. It comes down to developer experience and barriers to write a plugin.
I see what you mean. Maybe this could be a plugin contrib thing then?
@Twixes also mentioned here:
The second best thing would be if this were part of plugin-contrib somehow.
I'll explore what this would look like. It would need a bit of glue to work, so it wouldn't be as simple/nice as a native method, but could be the approach here. Maybe passing `meta` and the transform function to a `createBuffer`-like thing, and we do the rest. You'd still need to do a job export and a metrics export, though, for example.
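For illustration, a `plugin-contrib` helper along those lines might be sketched as follows. Every name here (`createImporter`, `fetchRows`, `captureEvent`) is an assumption for the sketch, not an existing API:

```typescript
type Row = Record<string, any>
type PluginEvent = { event: string; properties?: Row }

// Hedged sketch of the contrib glue: the plugin supplies a row fetcher and a
// transform, and the helper owns the cursor and batching.
function createImporter(
    fetchRows: (cursor: number, limit: number) => Promise<Row[]>, // plugin-specific query
    transform: (row: Row) => PluginEvent, // the user's few lines of JS
    captureEvent: (event: PluginEvent) => void, // hands events to ingestion
    batchSize = 10
) {
    let cursor = 0 // the helper stores and increments the cursor, not the plugin dev
    return {
        async importNextBatch(): Promise<number> {
            const rows = await fetchRows(cursor, batchSize)
            rows.map(transform).forEach(captureEvent)
            cursor += rows.length
            return rows.length // 0 means the import is done
        },
    }
}
```

As noted above, the plugin would still need its own job export and metrics export around this.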
I'd be wary of adding `importEventsFromRedshift` as a core plugin-server feature. We would lock ourselves out of the exact thing we built the plugin server for in the first place: plugins. Import plugins, in this case.
Add a few months of user requests and we will end up with `importEventsFromPostgres`, `importEventsFromS3`, `importEventsFromSnowflakeVariantA`, `importEventsFromSnowflakeVariantB`, `importPersonsFromCustomerIO`, etc., etc.
Hardcoded `importFromX` functions inside plugin-contrib are not a solution either. That's just kicking the can to a different repo.
So I'd re-evaluate how to do this. Similar to how I built the `exportEvents` system out of the BigQuery export, I'd suggest just coding it all into a plugin, then perhaps coding another `importFromBigQuery` plugin... and then abstracting away the commonalities.
I think one of the first things that can be abstracted away is a system to store and increment an arbitrary `cursor`... and fetch events for it. It could look something like this:
```js
export function setupPlugin() { /* init the connection, etc */ }

function toPluginEvent(postgresRow) { return ... }

export async function importNextEvents(cursor: any): Promise<[events: PluginEvent[], nextCursor: any]> {
    const order = 'id asc'
    const limit = 10
    const results = await db.query(`select * from some_table order by ${order} offset ${cursor} limit ${limit}`)
    const nextCursor = cursor + limit
    return [results.map(toPluginEvent), nextCursor]
}
```
If everything related to cursor and result storage is done by the plugin, it'd be super easy to build import scripts for any API source. Even our own.
Regarding `toPluginEvent` in the code above, that's definitely something that should be configurable by the user. A textarea with JSON `{ "fieldInDB": "fieldInPluginEvent" }` will be good enough to start, but we could provide a nicer-looking `pluginConfig` widget for it and add features as requested. We probably don't want to `eval` JS.
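Applying such a JSON mapping needs no `eval` at all; a minimal sketch (the function name and shapes are assumptions for illustration):

```typescript
// Copy fields from a DB row into event properties according to a
// user-supplied { "fieldInDB": "fieldInPluginEvent" } mapping.
// Unknown or unmapped fields are simply dropped.
function applyMapping(
    row: Record<string, any>,
    mapping: Record<string, string>
): Record<string, any> {
    const out: Record<string, any> = {}
    for (const [dbField, eventField] of Object.entries(mapping)) {
        if (dbField in row) out[eventField] = row[dbField]
    }
    return out
}

applyMapping({ user_email: 'a@b.com', internal_id: 7 }, { user_email: 'email' })
// → { email: 'a@b.com' }
```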
(Or... I could just click "fork" next to a plugin and edit the code to suit my needs :D)
Closing this discussion in favor of https://github.com/PostHog/plugin-server/issues/529
This is just a prototype and a discussion thread!
Motivation
I think the next 2 big things for plugins are #406 and import plugins. I'd like to tackle those once I'm back.
Nevertheless, following a talk with @jamesefhawkins about a potential user looking to import their Redshift data to us, I decided to give the process a try and use this draft PR as a discussion thread.
What this PR does
It introduces a VM method (wait! discussion coming below) `importEventsFromRedshift` that allows devs to write just a function to transform their table data into a PostHog event, and we'll query their table and ingest events from it. Very Segment-like. `importEventsFromRedshift` takes the rows resulting from a `SELECT *` on a specified table and allows devs to transform them however they like, such that events come out the other end. It significantly reduces the barrier to writing an import plugin, as all of this is handled for you:
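For a rough sense of the developer experience, a plugin using this method might look like the following. The column names are made up for the sketch; only the method name `importEventsFromRedshift` comes from this PR:

```typescript
type PluginEvent = { event: string; properties?: Record<string, any> }

// The only thing the plugin dev writes: a transform from a Redshift row
// (hypothetical columns) to a PostHog event. Querying, batching, and
// ingestion are handled by the plugin server.
export function importEventsFromRedshift(row: Record<string, any>): PluginEvent {
    return {
        event: row.event_name,
        properties: { ...row, source: 'redshift_import' },
    }
}
```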
Considerations
Current Approach
Pros
Cons
Alternatives
Perfect World Solutions
The best solution I can envision is #454 done to perfection so that we can write one single import plugin per warehouse that takes code as an input for the transformation and does all the rest. This would be like a "plugin-in-plugin" scenario, and would look very much like what Segment does. If we ever get to where we want to get in terms of security when running arbitrary code on our servers, there's no reason a plugin couldn't itself take in arbitrary code and run it with some eval-like method.
The second best thing would be if this were part of `plugin-contrib` somehow. It'd work essentially the same way, except we'd move the code from one repo to another, which conceptually makes more sense. This would need a bit of glue to work and wouldn't be as nice, but could be a good approach.
Down to Earth
If we decide that providing a method like this natively is not a great idea, we can just provide this as a template. After all, this is just an import plugin with a function missing. This is worse because users have to see all this complicated code instead of it just "magically working", but might make sense to do conceptually.
Other relevant points
The query needs an `ORDER BY` clause so offsets can be used reliably. The problem is, if the table is to be ordered by an arbitrary column at query time, this simply won't work. An `ORDER BY timestamp` query on PostHog events I had in Redshift took several minutes to process around 200k rows. There are many ways to go with this, from things like sort keys (keeping your Redshift table sorted) to using another mechanism like greater-than/smaller-than `WHERE` clauses. They all come with their own considerations though, particularly as it pertains to reliable ingestion across multiple threads, so this needs a bit more thought. Could also consider using platform-specific functions like `UNLOAD`.
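The greater-than `WHERE` approach mentioned above is keyset pagination: instead of `OFFSET`, remember the last key seen and filter on it, which stays fast no matter how deep into the table the import is. A hedged sketch, with assumed table and column names:

```typescript
// Build the next batch's query from the last id seen. With a sort key on `id`
// this avoids the deep-OFFSET scans described above. In real code the values
// should be bound as query parameters rather than interpolated.
function nextBatchQuery(table: string, lastId: number, limit: number): string {
    return `SELECT * FROM ${table} WHERE id > ${lastId} ORDER BY id ASC LIMIT ${limit}`
}

nextBatchQuery('posthog_events', 200000, 10)
// → 'SELECT * FROM posthog_events WHERE id > 200000 ORDER BY id ASC LIMIT 10'
```

The multi-thread concern still applies: two workers sharing one cursor need the cursor update and the batch fetch to be atomic.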