PostHog / posthog

🦔 PostHog provides open-source product analytics, session recording, feature flagging and A/B testing that you can self-host.
https://posthog.com

Plugins Epic #1896

Closed mariusandra closed 3 years ago

mariusandra commented 4 years ago

In order not to pollute the PR with discussion that will be hidden by a thousand commits, I'll describe here what is currently implemented and where we go from here.

Plugins in PostHog

One of the coolest ideas that came from the PostHog Hackathon was the idea of Plugins: small pieces of code that can be installed inside posthog, providing additional or custom features not found in the main repo.

Two examples of plugins that are already built:

Currently plugins can only modify events as they pass through posthog. Support for scheduled tasks, API access, etc is coming. More on this later.

Installing plugins via the interface

Assuming the following settings are set:

INSTALL_PLUGINS_FROM_WEB = get_bool_from_env("INSTALL_PLUGINS_FROM_WEB", True)
CONFIGURE_PLUGINS_FROM_WEB = INSTALL_PLUGINS_FROM_WEB or get_bool_from_env("CONFIGURE_PLUGINS_FROM_WEB", True)

... the following page will show up:

[screenshot: 2020-10-15 16:33:36]

Plugins are installed per installation and configured per team. There is currently no fine-grained access control: either every user on every team can install/configure plugins, or no one can.

When installing plugins or saving the configuration, plugins are automatically reloaded in every copy of the app that's currently running. This is orchestrated with a redis pubsub listener.

Installing plugins via the CLI

Alternatively, you may set the INSTALL_PLUGINS_FROM_WEB setting to False and use the posthog-cli to install plugins:

[screenshot: 2020-10-15 16:39:05]

Plugins can be installed from a git repository or from a local folder:

[screenshot: 2020-10-15 16:41:04]

Plugins installed via the CLI will be loaded when you restart your posthog instance. They are then saved in the database just like plugins installed via the web interface. Removing a plugin from posthog.json uninstalls it the next time the server is restarted.
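For illustration, a posthog.json with one plugin from a git repository and one from a local folder could look roughly like this (the exact key names are an assumption, not a documented format):

// posthog.json (sketch; key names are illustrative)
{
    "plugins": [
        { "name": "helloworldplugin", "url": "https://github.com/PostHog/helloworldplugin" },
        { "name": "my-local-plugin", "path": "./plugins/my-local-plugin" }
    ]
}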

If you use both web and CLI plugins, the settings in posthog.json take precedence and it will not be possible to uninstall these plugins from the web interface.

As it stands now, it's not possible to configure installed plugins via the CLI. The configuration is still done per team in the web interface.

Creating plugins

It's pretty simple. Just fork helloworldplugin or use the CLI:

[screenshot: 2020-10-15 16:48:07]

Todo for this iteration

Future ideas

Feedback

All feedback for the present and the future of plugin support in posthog is extremely welcome!

macobo commented 4 years ago

I really like it!

One early 'internal' consumer for this could be session recording, but this would require:

  1. scheduled tasks - for setting up retention pruning of data.
  2. access to models - For reading events within the scheduled tasks as well as saving sessions moved to S3 links to a table

However, now that I'm thinking about this more deeply, I think point 2 would cause issues: on cloud, we would need to do a lot of work to limit access to only your own organization's data.

I think instead of relying on plugins, let's roll it into the main repo and extract it as a plugin as it evolves. Thoughts?

mariusandra commented 4 years ago

I believe scheduled tasks and access to models will be easy to add. I just wanted to get all the rest solid first.

Regarding API access, inside a plugin you can now literally do:

from posthog.models import Event
events = Event.objects.filter(team_id=config.team)

... and then do whatever you need.

It's just not wise to do these queries inside the process_event block though. So we need scheduled tasks. This was basically already handled during the hackathon by just passing the celery object to the plugin, giving the plugin the opportunity to register tasks, but I removed that for now.

Obviously plugin authors will need to be careful to scope their queries to a team, like we do now in the main codebase. This will be up to plugin authors to handle though... :/

And we won't run unknown plugins on app, so this shouldn't really be an issue.

macobo commented 4 years ago

No custom plugins on cloud right?

If we're exposing our models, then I think we should do another refactoring first as well: rename our root module in this repo to avoid the conflict with https://github.com/PostHog/posthog-python. It's more than conceivable that users would love access to both without needing to hack around it.

mariusandra commented 4 years ago

I think this could be a great use case for a plugin and a nice example for others to follow when making their own retention style plugins. That said, feel free to start coding this inside app and we can extract later.

jamesefhawkins commented 4 years ago

I'm so excited by this, but I think we need to think about ensuring adoption.

Broadening the plugins' appeal

The range of what you can do is severely limited at the moment. Opening up all models would make plugins far more versatile.

Improving development process

"It's pretty simple"

Whilst making a plugin is simple, for someone outside our core team who isn't already doing local development, I don't think it's trivial: they would need to deploy PostHog locally and manually, which is roughly 12 commands.

The advantage of making this entire process trivial end to end is that we'll get more people in the community building plugins. This would be a strategic benefit, as it'll help us achieve platform status.

A few thoughts on improving this - although I am very open to alternative ideas, as I'm not really the target audience:

Security

Could we automatically filter all queries by team for any plugin, somehow? It feels like relying on people to add their own appropriate team filters is unrealistic.

mariusandra commented 4 years ago

Here's another thing to consider.

Plugins are currently exported as a class with the following signature:

# exampleplugin/__init__.py
from posthog.plugins import PluginBaseClass, PosthogEvent, TeamPlugin

class ExamplePlugin(PluginBaseClass):
    def __init__(self, team_plugin_config: TeamPlugin):
        super().__init__(team_plugin_config)
        # other per-team init code

    def process_event(self, event: PosthogEvent):
        event.properties["hello"] = "world"
        return event

    def process_identify(self, event: PosthogEvent):
        pass

The classes for these plugins are loaded into python when the app starts (or a reload is triggered). These classes are initialized (plugin = ExamplePlugin(config)) also on app start (or reload), but per team and only if there's a team-specific config stored for this plugin.

This means that in a large app with multiple teams, we can have thousands if not more copies of the same object loaded in memory. For example, if we load a 62MB IP database with every initialization of the maxmind plugin for each team, with a thousand teams we'll need 62GB of RAM.

Thus it must be possible for plugins to share state per app instance, which means they need some per_instance and per_team init hooks.

Here are two ideas to solve this.

Option 1. Functional shared-nothing style:

# maxmindplugin/__init__.py
import geoip2.database
from posthog.plugins import PosthogEvent, TeamPluginConfig
from typing import Any, Dict

def instance_init(global_config: Dict[str, Any]):
    geoip_path = global_config.get("geoip_path", None)
    reader = None

    if geoip_path:
        reader = geoip2.database.Reader(geoip_path)
    else:
        print("🔻 Running posthog-maxmind-plugin without the 'geoip_path' config variable")
        print("🔺 No GeoIP data will be ingested!")

    return {
        "config": global_config,
        "reader": reader
    }

# # Not used for this plugin
# def team_init(team_config: TeamPluginConfig, instance: Dict[str, Any]):
#     return {
#         "config": team_config.config,
#         "cache": team_config.cache,
#         "team": team_config.team,
#     }

def process_event(event: PosthogEvent, team_config: TeamPluginConfig, instance_config: Dict[str, Any]):
    if instance_config.get('reader', None) and event.ip:
        try:
            response = instance_config['reader'].city(event.ip)
            event.properties['$country_name'] = response.country.name
        except Exception:
            # IP not in the database
            pass

    return event

def process_identify(event: PosthogEvent, team_config: TeamPluginConfig, instance_config: Dict[str, Any]):
    pass

I'm not set on the naming of things, nor on the exact shape of the dicts/objects returned from each function, so please ignore that (and share feedback if you have it). The point is that this is a "serverless" or "functional" shared-nothing style approach. We would call the instance_init or team_init functions as needed and pass the objects they return to each process_* method.

Option 2. Class globals:


import geoip2.database
from posthog.plugins import PluginBaseClass, PosthogEvent
from typing import Any, Dict

class MaxmindPlugin(PluginBaseClass):
    @staticmethod
    def init_instance(global_config: Dict[str, Any]):
        geoip_path = global_config.get("geoip_path", None)

        if geoip_path:
            MaxmindPlugin.reader = geoip2.database.Reader(geoip_path)
        else:
            print("🔻 Running posthog-maxmind-plugin without the 'geoip_path' config variable")
            print("🔺 No GeoIP data will be ingested!")
            MaxmindPlugin.reader = None

    def init_team(self, team_config):
        pass

    def process_event(self, event: PosthogEvent):
        if MaxmindPlugin.reader and event.ip:
            try:
                response = MaxmindPlugin.reader.city(event.ip)
                event.properties['$country_name'] = response.country.name
            except Exception:
                # IP not in the database
                pass

        return event

Here the same class would have two methods: a static init_instance that sets properties on the class itself, and a regular init_team method that is called from __init__ when the class is instantiated.

In this scenario, we would still init a new class per team per plugin, but with a much smaller payload.

Which option do you prefer? 1 or 2?

mariusandra commented 4 years ago

I went with option 2 for now.

Also, I made a small TODO list.

Bigger features:

Dev Experience:

UX:

Safety:

Docs:

Sample plugins:

mariusandra commented 4 years ago

For those following along, experimenting with plugins on Heroku, I have run across a new and unexpected issue!

[screenshot]

The PUBSUB worker reload code creates too many connections to Redis, making the app unusable on Heroku with the free redis instance. Celery is consistently running into "redis.exceptions.ConnectionError: max number of clients reached" errors and won't process tasks.

Unrelated, the worker is also constantly running out of memory and starts using swap:

[screenshot]

The explanation is that celery forks a new worker for each CPU core it finds. In the $7/mo heroku hobby dynos, 8 CPUs are reported:

[screenshot]

... thus taking up (1+8) * 70MB ≈ 630MB of RAM and an additional 1+8 = 9 redis connections for the plugin reload PUBSUB.

On another branch preview, without the plugin reload pubsub, 12-19 redis connections are already used, making the extra 9 clearly exceed the limit:

[screenshot]

Bumping the redis addon to one with 40 connections, I see that 28 are used.

In addition to all of this, there seems to be some issue reloading plugins in the web dynos:

[screenshot]

I'll keep investigating. It seems it might be smart to ditch the pubsub for plugin reloads and just use a regular polling mechanism, though I need to test this.

Alternatively, it might be wiser to hoist the reload up from per-fork to per-worker, putting it basically into ./bin/start-worker and reloading the entire process once a reload takes place.

mariusandra commented 4 years ago

Hello!

Gallery of failed attempts

Since I last posted, the following has happened:

Plugins via Node-Celery

Since we're already using celery, it just made a lot of sense to use the existing infrastructure and pipe all events through celery. It works beautifully! 🤩

To enable, set PLUGINS_ENABLED=1 and run the app. That's all you need. This might be enabled by default in the next version?

You might also need to run bin/plugins-server, depending on your setup. The scripts bin/start-worker and bin/docker-worker now call bin/plugins-server automatically. The command runs a nodejs package called posthog-plugins, which starts a nodejs celery process that listens for tasks named process_event_with_plugins, runs plugins on the event and then dispatches another process_event task that django picks up to continue the work.

In case the plugins server is down, events will just queue up and hopefully nothing is lost. Plugin reloads are done via a redis pubsub system, triggered by the app.
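For a rough idea of the reload mechanism on the node side, a pub/sub listener could look something like this (a minimal sketch using ioredis; the channel name and reload function are assumptions, not the actual posthog-plugins implementation):

// reload-listener.js (illustrative sketch, not the actual posthog-plugins code)
const Redis = require('ioredis')

const subscriber = new Redis(process.env.REDIS_URL)

// 'reload-plugins' is a hypothetical channel name
subscriber.subscribe('reload-plugins', (error) => {
    if (error) {
        console.error('Could not subscribe to plugin reloads', error)
    }
})

subscriber.on('message', (channel) => {
    if (channel === 'reload-plugins') {
        // re-read plugin config from the database and rebuild the plugin VMs
        reloadPlugins()
    }
})

function reloadPlugins() {
    // placeholder for the actual reload logic
    console.log('Reloading plugins...')
}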

Plugin format

To install a plugin all you need is a github repo with an index.js file. Ideally though you'd also have a plugin.json file that contains some metadata. Here's the example for the helloworldplugin (updated for JS):

// plugin.json
{
  "name": "helloworldplugin",
  "url": "https://github.com/PosthHog/helloworldplugin",
  "description": "Greet the World and Foo a Bar, JS edition!",
  "main": "index.js",
  "lib": "lib.js",
  "config": {
    "bar": {
      "name": "What's in the bar?",
      "type": "string",
      "default": "baz",
      "required": false
    }
  }
}

The index.js file contains the main plugin code. The lib.js file contains other library code; this could even be a bunch of stuff rolled up with rollup or another bundler, kept away from the main plugin code. The config part specifies the config parameters that the interface will ask for.

The lib.js file can be as extensive as you want. Here's the helloworldplugin example:

// lib.js
function lib_function (number) {
    return number * 2;
}

This function is now available in index.js for the app code to use. The currency normalization plugin makes better use of this by putting functions like fetchRates in there.
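As a sketch of that pattern (the endpoint and response shape here are made up for illustration and are not the currency plugin's actual code), such a lib.js helper might look like:

// lib.js (illustrative sketch only)
async function fetchRates(baseCurrency) {
    // hypothetical exchange-rate endpoint; the real plugin's source of rates may differ
    const response = await fetch(`https://api.example.com/rates?base=${baseCurrency}`)
    const data = await response.json()
    return data.rates // e.g. { "USD": 1.18, "GBP": 0.91 }
}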

Here's what you can do in the plugin's index.js:

// index.js
async function setupTeam({ config }) {
    console.log("Setting up the team!")
    console.log(config)
}

async function processEvent(event, { config }) {
    const counter = await cache.get('counter', 0)
    cache.set('counter', counter + 1)

    if (event.properties) {
        event.properties['hello'] = 'world'
        event.properties['bar'] = config.bar
        event.properties['$counter'] = counter
        event.properties['lib_number'] = lib_function(3)
    }

    return event
}

The setupTeam function is run when plugins are reloaded and the team config is read from the db. The only thing you can really do there is fetch things and use cache to store data.
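For example, a setupTeam that prefetches data once per reload and stashes it in the cache could look roughly like this (a sketch reusing the hypothetical fetchRates helper from above; the baseCurrency config option and the 'rates' cache key are likewise illustrative):

// setupTeam sketch: prefetch once per reload, store in the cache
async function setupTeam({ config }) {
    const rates = await fetchRates(config.baseCurrency) // hypothetical lib.js helper
    await cache.set('rates', rates)
}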

The processEvent function runs for each event. Since everything now goes through celery directly before hitting the django app, I removed the previous processIdentify and other calls. You should therefore make sure that event.properties exists before changing anything; it doesn't exist for $identify calls and some others, for example.

Inside these JS files you can run the following:

There's still a lot of work to do to clean this up even further. What is now in the plugin-v8 branch works, and unless you enable the PLUGINS_ENABLED key, the only thing that will happen is that we still start the node plugin server in the bin/*-worker scripts, but it just won't do anything. That will take up 2 redis connections though: one for the cache, one for pubsub.

Next steps

Here are some todo items:

Even further steps:

macobo commented 4 years ago

Noting down another plugin idea: tracking how many times a library has been installed. This should again help make product decisions (e.g. which to add autocapture to: flutter vs react-native vs ios).

mariusandra commented 3 years ago

New stuff!

On all self-hosted installations (no feature flag needed & multi tenancy excluded), when you load plugins from "project -> plugins", you're greeted with this page:

[screenshot]

It has two features:

Once enabled per team, in api/process_event, we just change the name and the queue of the dispatched celery task from process_event to process_event_with_plugins.

This task will be picked up by the node worker via celery. After running the event through all relevant plugins for the team, it sends a new process_event task with the modified payload. This is then picked up by the regular python celery task, which never knew its payload had been tampered with! Sneaky.

There's also a much much much nicer interface to install and configure the plugins (thank you @paolodamico !!):

[screenshot: 2020-11-02 02:01:38]

There are a few rough edges (no upgrades, only string fields), but as a first beta it gets the job done.

If there's an error in any plugin, either during initialisation or when processing an event, you can also see the error together with the event that broke it:

[screenshot: 2020-11-02 02:14:16]

And when you decide you have had enough, just disable the plugin system and all events pass through celery as normal:

[screenshot: 2020-11-02 02:17:26]

paolodamico commented 3 years ago

Jotting down some recommendations for the next iteration. The error thing is pretty cool, some suggestions to improve this:

mariusandra commented 3 years ago

Hey @paolodamico , totally agree with the suggestions and we should make this much better. For now, there's at least something. The error message itself (your third point) is due to the currency plugin. The API actually replies that the key is incorrect, but that's swallowed by the plugin.

mariusandra commented 3 years ago

Master plan with plugins:

  1. We are here: Release support for plugins that modify inflight events and that run in a vm sandbox
    • done: npm plugins (enables compiled plugins)
    • done: plugin attachments (enables maxmind plugin)
  2. Test, scale and eventually run this system on app (only with preinstalled plugins)
    • blocker: how to run scalability tests at scale on any PR? need peak req/s per PR.
    • goals: measure how much time is spent in JS, what is the impact of each additional plugin, etc
    • status: the current very inefficient implementation completes a task in 0.0002s-0.0003s for me locally, meaning ~3k req/sec (singlethreaded), not sure how that changes in the cloud
    • todo: solve the blocker and measure how many events the app can ingest with and without plugins
    • todo: add a nodejs threadpool to the server, respect WEB_CONCURRENCY
    • todo: investigate whether running via pm2 (and pm2-runtime) makes sense
    • todo: optimise vm2 (nodejs vm) performance and security (compiled sandboxes?)
    • todo: run loads of scalability and stress tests on plugins (e.g. make a plugin that just waits for 30sec and then sends on the event, make one that blocks the running thread, etc)
  3. Add scheduled plugins
    • why: this makes the plugins bidirectional: data in and out.
    • blocker: it's basically a distributed locking problem, the trick is to avoid missing precisely timed events (main scheduler down 3:59 -> 4:01 due to rolling restart, event was scheduled for 4am; plugin reloads should be done well)
    • example: github stars plugin
    • example: other random plugins that pull for updates from somewhere and sync them as events
    • how: simple functions in index.js that might be initialised by registerScheduledTask('*/4 * * * *', pollForEvents)
  4. Add frontend plugins
    • why: so people can build cool stuff inside posthog itself
    • blocker: good plugin build and testing tools
    • todo: serve .js files over http from posthog-plugin-server
    • todo: add various callbacks and hooks into the frontend as needed...
      • registerSidepanelItem()
      • registerScene()
      • registerGraphType()
    • example: surveys - nicer view on incoming survey result events
    • example: show map for countries/cities
    • example: custom longer config pages for plugins
  5. Add API plugins
    • todo: route HTTP requests to similar functions as for scheduled events
    • why: mainly for use with frontend plugins
    • why: this unlocks webhooks into plugins, querying for stats/metrics, etc
    • question: do we give access to the database for running any SQL query or just the external posthog API?
    • question: should the redis cache be used as persistent storage for plugins or should we make a simpler mongodb-style interface (à la nedb)

I'm sure I forgot some things, but this is basically what we're looking at.

This is turning out to be a long hackathon 😅

mariusandra commented 3 years ago

Tasks regarding plugins are now tracked in this project

jamesefhawkins commented 3 years ago

A few thoughts on stuff that would help these launch successfully:

Depending on your reaction to the above, perhaps we should clarify on the project what is a blocker to launching?

mariusandra commented 3 years ago

Over the last few days plugins have gotten decidedly more exciting.

When PR #2743 lands (and https://github.com/PostHog/posthog-plugin-server/pull/67), we will support:

Both features have their gotchas and are still very much beta, yet, excitingly, they work well enough for a lot of use cases.

Check it out while it lasts. The Heroku Review App for this branch contains a few fun plugins.

1. The "github metric sync" plugin.

Not yet the full stargazers sync, just syncing the number of stars/issues/forks/watchers as properties every minute:

Screenshot:

[screenshot]

Code:

async function runEveryMinute({ config }) {
    const url = `https://api.github.com/repos/PostHog/posthog`
    const response = await fetch(url)
    const metrics = await response.json()

    posthog.capture('github metrics', {
        stars: metrics.stargazers_count,
        open_issues: metrics.open_issues_count,
        forks: metrics.forks_count,
        subscribers: metrics.subscribers_count
    })
}

All events captured in a plugin via posthog.capture are sent directly into celery (bypassing the Django HTTP API overhead) and come from a unique user.

[screenshot]

We can graph this. Our star count is steady!

[screenshot]

2. The "Queue Latency Plugin"

This is a pretty quirky use case.

// scheduled task that is called once per minute
function runEveryMinute() {
    posthog.capture('latency test', { 
        emit_time: new Date().getTime()
    })
}

// run on every incoming event
function processEvent(event) {
    if (event.event === 'latency test') {
        event.properties.latency_ms = new Date().getTime() - event.properties.emit_time
    }
    return event
}

Since the event is lpushed into a list in redis (sent to celery) and only later read back from the other end of the queue, we can use this to measure queue latency:

[screenshot]

Using PostHog to measure PostHog. 🤯

Github star sync plugin

I started making a true github star sync plugin, but still have two blockers that need to be solved separately.

Even with these blockers, the plugin is currently possible.

Snowflake/BigQuery plugin

Segment, in their Functions product, exposes a bunch of node packages to the user:

[screenshot]

With the maxmind package, I already had a bit of trouble including the .mmdb reader inside the final compiled plugin index.js file. I'm now afraid that compiling all of @google-cloud/bigquery, which probably includes some protobuf files read from the filesystem via compiled C code, into one index.js will prove hard. We'll probably need to expose some of these APIs directly to the user as well.

Other things to improve

There are so many things that can be improved. Browse the Heroku app and write down the first 5 you find. Here are some random ones:

This is BETA

Plugins, while legitimately powerful, are still legitimately beta.

The next step is to get this running on cloud and get the snowflake and bigquery plugins out.

mariusandra commented 3 years ago

Here it is 🥁 🥁 🥁 the github star sync plugin:

async function runEveryMinute({ cache }) {
    // if github gave use a rate limit error, wait a few minutes
    const rateLimitWait = await cache.get('rateLimitWait', false)
    if (rateLimitWait) {
        return
    }

    const perPage = 100
    const page = await cache.get('page', 1)
    // I had to specify the URL like this, since I couldn't read the headers of the original request to get
    // the "next" link, in which `posthog/posthog` is replaced with a numeric `id`.
    const url = `https://api.github.com/repositories/235901813/stargazers?page=${page}&per_page=${perPage}`

    const response = await fetch(url, {
        headers: {'Accept': 'application/vnd.github.v3.star+json'}
    })

    const results = await response.json();
    if (results?.message?.includes("rate limit")) {
        await cache.set('rateLimitWait', true, 600) // timeout for 10min
        return
    }

    const lastCapturedTime = await cache.get('lastCapturedTime', null)
    const dateValue = (dateString) => new Date(dateString).valueOf()
    const validResults = lastCapturedTime 
        ? results.filter(r => dateValue(r.starred_at) > dateValue(lastCapturedTime))
        : results
    const sortedResults = validResults.map(r => r.starred_at).sort()
    const newLastCaptureTime = sortedResults[sortedResults.length - 1]

    for (const star of validResults) {
        posthog.capture('github star!', {
            starred_at: star.starred_at,
            ...star.user,
        })
    }

    if (newLastCaptureTime) {
        await cache.set('lastCapturedTime', newLastCaptureTime)
    }

    if (results.length === perPage) {
        await cache.set('page', page + 1)
    }
}

I would like an option to specify a custom timestamp for my event. Other than that, it works! What's more, it makes only one request per minute (60 per hour), staying within Github's free API rate limits :).

paolodamico commented 3 years ago

Pretty exciting updates @mariusandra, thanks for sharing it in such detail! Would like to start writing out a plugin really soon. In the meantime let me know if I can help with the UI/UX to better communicate the new functionality/workflow.

weyert commented 3 years ago

Cool, looks nice! Which external node modules are supported? I assume you need to preinstall and/or whitelist them?

mariusandra commented 3 years ago

There are two ways to include external modules.

  1. Compile them into one index.js, like the posthog-maxmind-plugin does (see the bundler sketch below). This should be the preferred option, as in general the fewer external dependencies the better.
  2. Make a PR to include them in posthog-plugin-server directly.
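For the first option, a minimal bundler setup could look something like this (a sketch assuming rollup with its standard node-resolve and commonjs plugins; the file layout and actual build setup of real plugins may differ):

// rollup.config.js (illustrative sketch)
import { nodeResolve } from '@rollup/plugin-node-resolve'
import commonjs from '@rollup/plugin-commonjs'

export default {
    // bundle the plugin source and its node_modules dependencies...
    input: 'src/index.js',
    plugins: [nodeResolve(), commonjs()],
    // ...into a single file that posthog can load
    output: {
        file: 'dist/index.js',
        format: 'cjs',
    },
}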

Right now only fetch (maps to node-fetch) is available, though I think we'll already have a few other things for the next release.

For reference, segment does something similar as well.

mariusandra commented 3 years ago

Memory benchmarks!

As it is built now, posthog-plugin-server isolates each plugin inside a VM (via vm2). All plugins are also isolated per team (per project). This means 100 projects using the "helloworldplugin" spin up 100 VMs, even if it's exactly the same code they're all running.
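To make the isolation model concrete, here's a rough sketch of the idea (not the actual plugin-server code; the option values and the module.exports convention are assumptions):

// vm-isolation sketch: one vm2 NodeVM per (plugin, team) pair
const { NodeVM } = require('vm2')

function createPluginVm(pluginSource, teamConfig) {
    const vm = new NodeVM({
        console: 'inherit',   // let the plugin log to the host console
        sandbox: {},          // nothing from the host is shared by default
        require: { external: false }, // the real server exposes a controlled set of modules instead
    })
    // assumes pluginSource assigns its functions to module.exports
    const exports = vm.run(pluginSource, 'plugin.js')
    return {
        processEvent: (event) => exports.processEvent(event, { config: teamConfig }),
    }
}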

So how heavy is a VM? (Un?)Surprisingly, not at all! A simple plugin VM takes about 200KB of memory. A more complicated plugin (100kb of posthog-maxmind-plugin/dist/index.js) takes about 250KB. Thus running 1000 VMs in parallel consumes an extra 250MB of RAM. Said differently, if 1000 customers on cloud enable one plugin, the server's memory footprint will grow 250MB per worker.

Obviously a VM that loads a 70MB database and keeps it in memory throughout its lifetime will consume more memory, but for all intents and purposes VMs are very light.

Originally I had imagined a "shared plugin" system for "multi tenancy" (cloud), where we spin up a bunch of shared VMs that can just be enabled/disabled per team. However, I could never get over the danger of leaking data, for example when one processEvent stores something about the event on the global object and reads it again the next time it's run, but for an event from a different team. I thought the best way around this was to just whitelist a bunch of trusted plugins that cloud users can run, largely eliminating this threat.

Now I'm thinking differently. With such a light footprint, we can spin up a new VM for each team that wants to use a plugin, thus completely separating the data in memory. If the number grows beyond what one CPU core can handle (so over 10k plugins in use?), we can split the work and scale horizontally as needed.

For enterprise customers using PostHog Cloud, we could provide additional worker-level or process-level isolation. This is what cloudflare does: they split the free workers and the paid clients' workers into separate clusters. In our case, with thread-level isolation on cloud, each paying customer could get their own worker (aka CPU thread) that runs all their plugins. These workers could be automatically spun up and down by the plugin server as the load changes, protecting paying customers from broken and runaway plugins made by other customers. With something like this, we could even enable the plugin editor for all paying customers.

We're really making a lambda here :).

mariusandra commented 3 years ago

It's been 1.5+ months (including the Christmas break) since the last update, so time for a refresher!

The big big change that has happened since then is that event ingestion is now handled by the plugin server! This is still beta and disabled by default, but when enabled, events, after being processed by the plugins, are ingested (stored in postgres or clickhouse) directly inside the plugin server. For Postgres/Celery (OSS) installations, this avoids one extra step. For ClickHouse/Kafka (EE) installations, this makes using plugins possible at all, as with that setup we had nowhere to send the event after the plugins had finished their work.

The work in the next weeks will be to stabilise this ingestion pipeline and enable it for all customers on PostHog Cloud. Currently we're bottlenecked at ~100 events/sec per server instance (even less with plugins that wait a long time) and this needs to be bumped significantly. Only after that can we enable plugin support for all cloud users. Hopefully next week :).

Other notable changes in the last month or so:

All that said, with the launch of plugins on cloud (already enabled for some teams to test), we're entering a new era for the plugin server. From now on we must be really careful not to trip anything up with any change and religiously test all new code!

We also introduced quite a bit of technical debt with the 4000 changed lines of the ingestion PR (all the magic SQL functions, database & type mapping, etc). This needs to be cleaned up eventually.

While we've gotten very far already, there are many exciting changes and challenges still to come. For example:

And then we'll get to the big stuff:

paolodamico commented 3 years ago

I think this has evolved in a bunch of different places and can now be closed? @mariusandra

mariusandra commented 3 years ago

@paolodamico I think this can indeed be closed, but not before one last update!

One last Plugin Server update

It's been 3.5 months since the last update. Let's check in on our contestants.

What we have been building with plugins is something unique... something that, in its importance and its value to the bottom line, has a legit opportunity to overtake all other parts of PostHog (though it won't be our defining feature since it's already built).

The Plugin Server has turned PostHog into a self-hosted and seriously scalable IFTTT / Zapier / Lambda hybrid, with RDS, ElastiCache, SQS and other higher abstractions baked right in.

It has become a serious application platform in its own right.

(Seriously, it has. Check out this 45min talk LTAPSI - Let's Talk About Plugin Server Internals for more)

Combine this with a scalable event pipeline, and you can build some really cool shit. Web and product analytics? So 2020. Here are some more exotic ideas:

Oh and PostHog can still do web and app analytics, session recording, heatmaps, feature flags, data lakes, etc, etc ad nauseam :).

Current state

Plugins now power the entire ingestion pipeline. On PostHog cloud, one plugin server can ingest at most a thousand events per second.

Plugins are now used by many self-hosted and cloud customers to augment their data and to export it to various data lakes. We have had several high-quality community plugins come in, such as sendgrid and salesforce (should be added to repo?). We've had enterprise customers write their own 700-line plugins to streamline data ingestion.

You just need to write the following to get an automatically batched data export plugin with retry support:

import { RetryError } from '@posthog/plugin-scaffold'

export async function exportEvents (events, { global, config }) {
    try {
        await fetch(`https://${config.host}/e`, {
            method: 'POST',
            body: JSON.stringify(events),
            headers: { 'Content-Type': 'application/json' },
        })
    } catch (error) {
         throw new RetryError() // ask to retry
    }
}

If you throw the error, we will try running this function again (with exponential backoff) for around 48h before giving up.

We now have a bunch of functions you can write: onEvent, onSnapshot, processEvent, exportEvents, runEveryMinute, runEveryHour, runEveryDay.
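For example, onEvent is a good fit for read-only side effects; a minimal sketch (the 'user signed up' event name and the webhookUrl config option are made up for illustration):

export async function onEvent(event, { config }) {
    // forward sign-ups to an external webhook; onEvent is meant for side effects rather than modifying the event
    if (event.event === 'user signed up') {
        await fetch(config.webhookUrl, {
            method: 'POST',
            body: JSON.stringify({ distinct_id: event.distinct_id, properties: event.properties }),
            headers: { 'Content-Type': 'application/json' },
        })
    }
}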

You can export jobs to have background tasks, which you can soon even run from the UI https://github.com/PostHog/plugin-server/pull/414

export const jobs = {
    performMiracleWithEvent (event, meta) {
        console.log('running async! possibly on a different server, YOLO')
    }
}
export function processEvent (event, { jobs }) {
    jobs.performMiracleWithEvent(event).runIn(3, 'minutes')

    event.properties.hello = "world"
    return event
}

Here's real feedback from a customer that we received (name omitted just in case):

"The power of being able to write a little plugin in 100 lines of JS is just amazing. Can't wait to break out of all our Amplitude/GA/FullStory stack"

Since the last update 3.5 months ago we have: injected loop timeouts via babel, polished CD, implemented a bump-patch GH action releasing system, put ingestion live, done a lot of debugging that revealed the need for redis connection pools, implemented lazy VMs, implemented plugin access control for cloud, added built-in geoip support, added the snowflake SDK, the AWS SDK, console logging, job queues, onEvent & onSnapshot, and plugin capabilities, plus the 185 other PRs that go under "keep the lights on" work.

We're only getting started :).

Next challenges

Look at the team extensibility project board for what we're working on now.

There's a lot of ongoing "keep the lights on" work, which will continue to take up most of the time going forward. This work is not exciting enough to mention here (90% of closed PRs the last 4 months for example), but absolutely important to get through.

From the big things, there are a few directions we should tackle in parallel:

Only when that's done could we also look at UI plugins. Let's hold back here for now, as the frontend is changing so rapidly. Instead, let's take an Apple-ish approach where we only expose certain elements that are ready, starting with buttons to trigger jobs and displaying their output in the console logs.

Big thing to watch out for

I believe the biggest challenge for the plugin server will come in the form of flow control. The job queue next steps issue briefly talks about it.

The plugin server has just a limited number of workers (40 parallel tasks on cloud). Imagine Kafka sending us a huge batch of events while we're also receiving a lot of background jobs and running a few long-running runEveryHour tasks. If in this scenario we ask piscina to run another 200 tasks, and keep adding more faster than old ones complete, we're going to run out of memory and crash with a lot of in-flight tasks.

To prevent this, there's a "pause/drain" system in place for most queues. We periodically check whether piscina is busy and, if so, stop accepting incoming events/jobs/etc.
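As a rough sketch of that idea (not the actual plugin-server code; the threshold and the pause/resume consumer interface are assumptions):

// flow-control sketch: pause consumers while the worker pool is saturated
const BUSY_THRESHOLD = 10 // hypothetical: max tasks we allow to queue up inside piscina

function checkFlowControl(piscina, consumers) {
    // piscina.queueSize is the number of tasks waiting for a free worker thread
    const busy = piscina.queueSize > BUSY_THRESHOLD

    for (const consumer of consumers) {
        // pause() / resume() stand in for whatever each queue client offers
        if (busy) {
            consumer.pause()
        } else {
            consumer.resume()
        }
    }
}

// run the check periodically, e.g. once per second:
// setInterval(() => checkFlowControl(piscina, [kafkaConsumer, jobQueueConsumer]), 1000)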

If we add more features and are not careful regarding flow control, we can run into all sorts of bottlenecks, deadlocks, and lost data. We must be terrified of issues with flow control if we're to build a project for the ages.

Relatedly, the redlocked services (schedule, job queue consumer) are currently bound to running on just one server. This will not scale either. There must be an intelligent way to share load and service maps between plugin server instances... without re-implementing zookeeper in TypeScript.
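For context, "redlocked" means guarded by a redis-based distributed lock: only the instance that currently holds the lock runs the service. A sketch using the redlock npm package (the resource name and TTL are illustrative):

// redlock sketch: only the lock holder runs the scheduler
const Redis = require('ioredis')
const Redlock = require('redlock')

const redlock = new Redlock([new Redis(process.env.REDIS_URL)])

async function runSchedulerIfLeader() {
    try {
        // 'plugin-scheduler' and the 60s TTL are illustrative values
        const lock = await redlock.lock('plugin-scheduler', 60000)
        try {
            // ... run the scheduled plugin tasks ...
        } finally {
            await lock.unlock()
        }
    } catch (error) {
        // another instance holds the lock; do nothing this round
    }
}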

Last words

I'll close this issue now, as work on the plugin server is too varied to continue keeping track of in just one place.

I sincerely believe that what we have built with the PostHog Plugin Server is something unique, with limitless usecases, for personal and business needs alike. It's especially unique given it's an open source project.

Somehow it feels like giving everyone a new car for free.

I'm super excited to see what road trips the community will take with it :).