PostHog / posthog.com

Official docs, website, and handbook for PostHog.
https://posthog.com

Document Plugins for release v1.19.0 #757

Closed - mariusandra closed this issue 3 years ago

mariusandra commented 3 years ago

A few things will change in 1.19 that need to be documented:

1) The addition of processEventBatch(events: PluginEvent[], meta: PluginMeta) https://github.com/PostHog/posthog-plugin-server/pull/39

You can define processEvent, processEventBatch, or both in your plugin - whichever one is missing is created automatically. Currently processEventBatch is not really in use - it only gets batches of one event, since that's how we talk to Celery. In the future, and especially on EE or cloud where we use Kafka to receive events in batches, we will pass the received events to this function as an actual batch. We might also add some kind of batching for Celery.

The idea is that if you're sending events to S3, you don't want to make 100 requests (e.g. per second), one for every event. It would be better to make one request and send 100 events at once (see the sketch at the end of this item).

To prevent any data leakage, the raw Kafka events are further split into batches per team before reaching processEventBatch. This makes sense since plugins are also enabled on a team-by-team basis now.

We haven't defined a limit for the batch size yet. Currently node-rdkafka is configured to get events in batches of 100, but this might change as we perform more benchmarks. I don't expect the batch size to go over 1000 though. It should remain within the realm of "can submit in a POST request", even if it'll be a ~500kb request (1000 events of 500 bytes?).
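
For illustration, here's a minimal sketch of a plugin using the batch interface. Only the processEventBatch signature comes from the PR above; uploadBatch is a hypothetical stand-in for your own S3 (or other) upload code:

```js
// Hypothetical helper - replace with your actual S3 (or other) client code.
async function uploadBatch(payload) {
    // e.g. one PUT/POST to your storage endpoint
}

async function processEventBatch(events, meta) {
    if (events.length > 0) {
        // one request for the whole batch instead of one request per event
        await uploadBatch(JSON.stringify(events))
    }
    // return the events unchanged so ingestion continues as normal
    return events
}

module.exports = { processEventBatch }
```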

2) meta.cache.incr and meta.cache.expire.

Additionally, meta.cache.set now accepts a ttlSeconds argument and returns a promise.

https://github.com/PostHog/posthog-plugin-server/pull/42/files#diff-375753e4853c3395064f0dd9469cd7995477be5f2f20f881ef930c7594fb674e
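
As a sketch of what these enable, here's a hypothetical processEvent that counts events with an expiring counter (the key names and TTLs below are made up for illustration):

```js
async function processEvent(event, meta) {
    // atomically increment a counter kept in the plugin's cache
    const count = await meta.cache.incr('events_seen')
    if (count === 1) {
        // let the counter expire 60 seconds after it is first set
        await meta.cache.expire('events_seen', 60)
    }
    // set() now accepts a ttlSeconds argument and returns a promise
    await meta.cache.set('last_event_name', event.event, 60)
    return event
}

module.exports = { processEvent }
```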

3) The plugin server can now be configured with ENV variables. If you run posthog-plugin-server --help (or cd plugins && yarn start --help in the posthog app), you'll see this:

$ posthog-plugin-server --help
Options:
      --version                        Show version number                                         [boolean]
  -c, --config                         Config options JSON.                                         [string]
      --celery-default-queue           celery outgoing queue [celery]                               [string]
      --database-url                   url for postgres [postgres://localhost:5432/posthog]         [string]
      --plugins-celery-queue           celery incoming queue [posthog-plugins]                      [string]
      --redis-url                      url for redis [redis://localhost/]                           [string]
      --base-dir                       base path for resolving local plugins [.]                    [string]
      --plugins-reload-pubsub-channel  redis channel for reload events [reload-plugins]             [string]
      --disable-web                    do not start the web service [false]                        [boolean]
      --web-port                       port for web server [3008]                                   [number]
      --web-hostname                   hostname for web server [0.0.0.0]                            [string]
      --worker-concurrency             number of concurrent worker threads [0]                      [number]
      --tasks-per-worker               number of parallel tasks per worker thread [10]              [number]
      --log-level                      minimum log level [info]                                     [string]
      --sentry-dsn                     sentry ingestion url [null]
      --help                           Show help                                                   [boolean]

All of these config options can also be set with ENV variables: just convert the config key to uppercase and replace "-" with "_". For example, --database-url becomes DATABASE_URL.
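
For example, the following (with placeholder values) is equivalent to passing --database-url and --redis-url on the command line:

```
DATABASE_URL=postgres://localhost:5432/posthog REDIS_URL=redis://localhost/ posthog-plugin-server
```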

When running the plugin server via bin/plugin-server (as most scripts do), we fetch these keys from Django and pass them along:

KEYS="DATABASE_URL REDIS_URL PLUGINS_CELERY_QUEUE CELERY_DEFAULT_QUEUE BASE_DIR PLUGINS_RELOAD_PUBSUB_CHANNEL"

The others could be set via env variables in your cloud of choice.

The important ones you might want to tweak are WORKER_CONCURRENCY and TASKS_PER_WORKER. While worker concurrency is taken from the number of CPUs available, you might want to fine-tune it. TASKS_PER_WORKER specifies how many "async" tasks each worker thread runs in parallel. I'm not yet sure what the best value is here. 10 seems safe; 100 seems fine too, though it might not be if every async processEvent makes an HTTP request to the same server. This all needs to be tested, so the param is here to be tuned.
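
For example, to run four worker threads with 20 parallel tasks each (values picked arbitrarily, not a recommendation):

```
WORKER_CONCURRENCY=4 TASKS_PER_WORKER=20 posthog-plugin-server
```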

yakkomajuri commented 3 years ago

A few questions:

mariusandra commented 3 years ago
  1. The idea is that for simple or sync plugins, you just create a processEvent function and that's that. If you need more control over async operations, change this to processEventBatch. If you don't supply your own batch function, we use our own, which just asynchronously calls processEvent for each event (or synchronously if no promise is returned) - see the sketch after this list.

If you have both defined, only one of them will be called - for now just processEvent, but this will change once we get Kafka running.

  2. On Heroku, the worker dyno starts both the pluginworker and celeryworker processes. However, if you need to scale, you should add more of either one or the other: together in one instance they breach the per-dyno memory limits, so if you have a bigger app with many plugins, it's wise to launch separate dynos and shut down the default worker.
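
To make the fallback concrete, the generated batch function behaves roughly like this sketch (an approximation of the behavior described above, not the plugin server's actual code):

```js
// When a plugin defines only processEvent, the generated batch function
// passes every event in the batch through processEvent individually.
function makeDefaultBatchFn(processEvent) {
    return async function processEventBatch(events, meta) {
        const processed = []
        for (const event of events) {
            // `await` handles both sync and promise-returning processEvent
            const result = await processEvent(event, meta)
            if (result) {
                processed.push(result)
            }
        }
        return processed
    }
}
```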
yakkomajuri commented 3 years ago

Great, thanks! This saves me some time.

mariusandra commented 3 years ago

Some last-minute changes made on Friday and merged on Monday added three very cool features to plugins in 1.19. It would be awesome to document them.