PostHog / posthog.com

Official docs, website, and handbook for PostHog.
https://posthog.com

Document Plugins for release v1.19.0 #757

Closed - mariusandra closed this issue 3 years ago

mariusandra commented 3 years ago

A few things will change in 1.19 that need to be documented:

1) The addition of processEventBatch(events: PluginEvent[], meta: PluginMeta) https://github.com/PostHog/posthog-plugin-server/pull/39

You can define processEvent, processEventBatch, or both in your plugin - whichever one is missing is created automatically. Currently processEventBatch is not really in use - it only gets batches of one event, since that's how we talk to Celery. In the future, and especially on EE or cloud where we use Kafka to receive events in batches, we will pass the received events to this function as an actual batch. We might also add some kind of batching for Celery.

The idea is that if you're sending events to S3, you don't want to make 100 requests (e.g. per second), one for every event. It would be better to make one request and send 100 events at once (see the sketch at the end of this item).

To prevent any data leakage, the raw Kafka events are further split into batches per team before reaching processEventBatch. This makes sense since plugins are also enabled on a team-by-team basis now.

We haven't defined a limit for the batch size yet. Currently node-rdkafka is configured to get events in batches of 100, but this might change as we perform more benchmarks. I don't expect the batch size to go over 1000 though. It should remain within the realm of "can submit in a POST request", even if it'll be a ~500kb request (1000 events of 500 bytes?).
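
For illustration, here's a minimal sketch of a plugin using the batch interface. Only the processEventBatch signature comes from the PR above; uploadBatch is a hypothetical stand-in for your own S3 (or other) upload code:

```js
// Hypothetical helper - replace with your actual S3 (or other) client code.
async function uploadBatch(payload) {
    // e.g. one PUT/POST to your storage endpoint
}

async function processEventBatch(events, meta) {
    if (events.length > 0) {
        // one request for the whole batch instead of one request per event
        await uploadBatch(JSON.stringify(events))
    }
    // return the events unchanged so ingestion continues as normal
    return events
}

module.exports = { processEventBatch }
```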

2) meta.cache.incr and meta.cache.expire.

Additionally, meta.cache.set now accepts a ttlSeconds argument and returns a promise.

https://github.com/PostHog/posthog-plugin-server/pull/42/files#diff-375753e4853c3395064f0dd9469cd7995477be5f2f20f881ef930c7594fb674e
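
As a sketch of what these enable, here's a hypothetical processEvent that counts events with an expiring counter (the key names and TTLs below are made up for illustration):

```js
async function processEvent(event, meta) {
    // atomically increment a counter kept in the plugin's cache
    const count = await meta.cache.incr('events_seen')
    if (count === 1) {
        // let the counter expire 60 seconds after it is first set
        await meta.cache.expire('events_seen', 60)
    }
    // set() now accepts a ttlSeconds argument and returns a promise
    await meta.cache.set('last_event_name', event.event, 60)
    return event
}

module.exports = { processEvent }
```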

3) The plugin server can now be configured with ENV variables. If you run posthog-plugin-server --help (or cd plugins && yarn start --help in the posthog app), you'll see this:

$ posthog-plugin-server --help
Options:
      --version                        Show version number                                         [boolean]
  -c, --config                         Config options JSON.                                         [string]
      --celery-default-queue           celery outgoing queue [celery]                               [string]
      --database-url                   url for postgres [postgres://localhost:5432/posthog]         [string]
      --plugins-celery-queue           celery incoming queue [posthog-plugins]                      [string]
      --redis-url                      url for redis [redis://localhost/]                           [string]
      --base-dir                       base path for resolving local plugins [.]                    [string]
      --plugins-reload-pubsub-channel  redis channel for reload events [reload-plugins]             [string]
      --disable-web                    do not start the web service [false]                        [boolean]
      --web-port                       port for web server [3008]                                   [number]
      --web-hostname                   hostname for web server [0.0.0.0]                            [string]
      --worker-concurrency             number of concurrent worker threads [0]                      [number]
      --tasks-per-worker               number of parallel tasks per worker thread [10]              [number]
      --log-level                      minimum log level [info]                                     [string]
      --sentry-dsn                     sentry ingestion url [null]
      --help                           Show help                                                   [boolean]

All of these config options can also be set with ENV variables: just convert the config key to uppercase and replace "-" with "_". For example, --database-url becomes DATABASE_URL.
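
For example, the following (with placeholder values) is equivalent to passing --database-url and --redis-url on the command line:

```
DATABASE_URL=postgres://localhost:5432/posthog REDIS_URL=redis://localhost/ posthog-plugin-server
```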

When running the plugin server via bin/plugin-server (as most scripts do), we fetch these keys from Django and pass them along:

KEYS="DATABASE_URL REDIS_URL PLUGINS_CELERY_QUEUE CELERY_DEFAULT_QUEUE BASE_DIR PLUGINS_RELOAD_PUBSUB_CHANNEL"

The others could be set via env variables in your cloud of choice.

The important ones you might want to tweak are WORKER_CONCURRENCY and TASKS_PER_WORKER. While worker concurrency is taken from the number of CPUs available, you might want to fine-tune it. TASKS_PER_WORKER specifies how many "async" tasks each worker thread runs in parallel. I'm not yet sure what the best value is here. 10 seems safe; 100 seems fine too, though it might not be if every async processEvent makes an HTTP request to the same server. This all needs to be tested, so the param is here to be tuned.
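
For example, to run four worker threads with 20 parallel tasks each (values picked arbitrarily, not a recommendation):

```
WORKER_CONCURRENCY=4 TASKS_PER_WORKER=20 posthog-plugin-server
```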

yakkomajuri commented 3 years ago

A few questions:

mariusandra commented 3 years ago
  1. The idea is that for simple or sync plugins, you just create a processEvent function and that's that. If you need more control over async operations, change this to processEventBatch. If you don't supply your own batch function, we use our own, which just asynchronously calls processEvent for each event (or synchronously if no promise is returned) - see the sketch after this list.

If you have both defined, only one of them will be called - for now just processEvent, but this will change once we get Kafka running.

  2. On Heroku, the worker dyno starts both the pluginworker and celeryworker processes. However, if you need to scale, you should add more of either one or the other: together in one instance they breach the per-dyno memory limits, so if you have a bigger app with many plugins, it's wise to launch separate dynos and shut down the default worker.
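
To make the fallback concrete, the generated batch function behaves roughly like this sketch (an approximation of the behavior described above, not the plugin server's actual code):

```js
// When a plugin defines only processEvent, the generated batch function
// passes every event in the batch through processEvent individually.
function makeDefaultBatchFn(processEvent) {
    return async function processEventBatch(events, meta) {
        const processed = []
        for (const event of events) {
            // `await` handles both sync and promise-returning processEvent
            const result = await processEvent(event, meta)
            if (result) {
                processed.push(result)
            }
        }
        return processed
    }
}
```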
yakkomajuri commented 3 years ago

Great, thanks! This saves me some time.

mariusandra commented 3 years ago

Some last-minute changes made on Friday and merged on Monday added three very cool features to plugins in 1.19. It would be awesome to document them.