FoxxMD / context-mod

an event-based, reddit moderation bot built on top of snoowrap and written in typescript
https://contextmod.dev
MIT License

Hosting questions #120

Open CryptoMaximalist opened 1 year ago

CryptoMaximalist commented 1 year ago

Hello, per our discussion here is the github Issue for the things that I was unsure about after going through the hosting/operator docs

- minimum/recommended specs for hosting (CPU, RAM, Storage, bandwidth) (I set up on an Ubuntu 22.04 LTS VM with Docker)
- I'm not sure what the operator reddit account is, or if it should be different than the bot account.
- On the setup screen, I'm not sure what Instances are, but there was just 1 option so I proceeded
- Caching
  - “It is therefore important to re-use window criteria wherever possible to take advantage of this caching.”
    - My takeaway from this and caching docs in general is that your first run should request the largest amount of history that will be needed anywhere in the config and then subsequent runs should match it. If you request 200 activities on the first run, it doesn't save you any API hits later to only look at 100. You don't want your runs to request their last 10 activities, then the last 30 on the next run, then the last 100.
  - Redis must not be in the docker, web interface doesn't start if config file tries to use that as cache
  - Are mod notes cached? userNotes are in the cache docs and I know ModNotes are said to be api intensive
  - TTL
    - Why does data expire?
    - What are the implications or problems with setting these values too low or high?
    - I'm guessing a week is way too long if the default is 1 minute, but I'm not sure why other than maybe storage space or the data is expected to change (but maybe that doesn't matter for the use case)
- Database
  - typo “retention: '3 months' # each subreddit will retain 3 more of recorded events”
  - Migrations, is this something I have to worry about using the docker default db? It says it will pause startup and that could mean some troubleshooting on my headless server
- Other
  - Is there a log file to troubleshoot CM, in case the web server doesn't successfully start?
  - How do CM updates happen with docker?

FoxxMD commented 1 year ago

minimum/recommended specs for hosting

This depends on what you want to do with CM, here's my ballpark:

System Specs

1 bot, 1 subreddit, no image processing

For each additional subreddit (regardless of # of bots) "add" 5MB free memory, UP TO 500MB total (base + additional)

CM can work with more or less memory. The docker image targets 512MB but this can be modified. Generally, less memory with "more" subreddits than the above recommendation results in a slower bot, as node has to free up memory more often, but it will still work!

With image processing

Image processing requires holding uncompressed image data in memory while it is manipulated. Add an additional 100-200MB+ of free memory on top of the base memory depending on usage of image processing.

Database Specs

Database should be determined largely by event volume.

Using sqlite is fine if total volume is very low. You could be running 20 subreddits as long as the total aggregate volume for all subs is < 50 events/hour. Note: sqljs should ONLY be used for testing. better-sqlite3 (the default with the docker image) should always be used for production.

If volume is higher than this, a dedicated database like mysql or postgres may be better. better-sqlite3 should be fine for all but the highest volumes, but I think performance is generally better with a dedicated database.
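If you do move to a dedicated database, that switch happens in the operator config. A hedged sketch assuming mysql — the `connection` keys below are my assumptions modeled on typical TypeORM-style options, so verify them against the operator docs:

```yaml
databaseConfig:
  connection:
    type: mysql            # key names assumed; check the operator docs
    host: localhost
    port: 3306
    username: contextmod
    password: changeme
    database: contextmod
```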

FoxxMD commented 1 year ago

I'm not sure what the operator reddit account is

The operator is whoever is running the actual CM instance. Side note: this can be different for the client and server CM instances but by default it is the same.

Specifying an operator reddit account is necessary in order for CM to know who (reddit account) is authorized to create new bots in the instance. It also gives the operator a more permissive dashboard -- operators can see all subreddits, logs, and configs in the dashboard. (They cannot change configs without guest access though).
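For reference, the operator account is declared in the operator config. A minimal sketch — the exact key (`operator.name`) is my assumption, so confirm it in the operator docs:

```yaml
operator:
  name: MyRedditUsername   # reddit account authorized as operator (key name assumed)
```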

On the setup screen, I'm not sure what Instances are

An instance, in this context, is the server component of CM that will run the actual bot. A CM client (web interface) can connect to multiple, independent CM server instances. In the default configuration there is only one client, one instance.

I should probably hide that field in the setup screen when there is only one instance available.

FoxxMD commented 1 year ago

Cache

My takeaway from this and caching docs in general is that your first run should request the largest amount of history that will be needed anywhere in the config and then subsequent runs should match it. If you request 200 activities on the first run, it doesn't save you any API hits later to only look at 100.

This is basically correct! I have plans to make "subset" requests fetchable from cache eventually.
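As an illustration of re-using window criteria: two checks that ask for the same window can share one fetch. A hypothetical subreddit config sketch (check names and rule kinds here are illustrative, not taken from a real config):

```yaml
runs:
  - checks:
      - name: repeatCheck
        kind: submission
        rules:
          - kind: repeatActivity
            window: 100   # first rule to run -> fetches last 100 activities from the api
      - name: historyCheck
        kind: submission
        rules:
          - kind: history
            window: 100   # same window criteria -> served from cache, no extra api calls
```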

Redis must not be in the docker

Yes, redis is not included in the CM docker image. It can be wired together with docker-compose or some other external service.
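For example, redis can run as a sibling container wired to CM with docker-compose. A minimal sketch — the image name and mount path are assumptions to illustrate the shape, so verify them against the docker docs:

```yaml
version: '3'
services:
  context-mod:
    image: foxxmd/context-mod:latest   # image name assumed; check the repo's docker docs
    depends_on:
      - redis
    volumes:
      - ./cm-data:/config              # host DATA_DIR mount; container path assumed
  redis:
    image: redis:7-alpine
```

The operator config's cache provider would then point at the `redis` service hostname.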

Are mod notes cached?

They are cached for one minute, by default.

TTL Why does data expire?

Data expires to avoid stale cache.

Here are all the things CM tries to cache from reddit and their TTLs. Cache expires because user history, submission/comment state (stickied, reports, etc...), and other data from reddit changes over time.

What are the implications or problems with setting these values too low or high?

Setting cache TTLs too low would prevent CM from being able to reuse data and previous processing results which would force CM to make additional API calls to reddit.

For example, if you set authorTTL to 1 second and more than one second passed between instances where CM needed the same set of author history, it would require 2 API calls instead of 1 API call plus a re-used cache result (as it would with a longer TTL). This holds even if the second request for author history happened while processing the same activity, just in a later check in the config.

Setting cache TTLs too high increases the probability of stale data in the cache which could cause CM to incorrectly process an activity.

For example, if you set authorTTL to 1 week and had a rule that reported/removed a comment if it had been repeated > 3 times in a user's history: CM could be working from week-old cached history, still counting repeats the user has since deleted, and incorrectly action a new comment.

The caching defaults are not storage related but for preventing stale data usage. The defaults are intentional -- they provide CM a reasonable amount of time to reuse cache with a low probability of stale data.

They can, however, be changed per subreddit :) For instance if you have rules that only ever look at a user's initial history, but may have to do it repeatedly, you can safely increase authorTTL to reduce api calls. They can be set in the subreddit config at the top level:

caching:
  authorTTL: 600 # cache user history for 10 minutes
  modNotesTTL: 300 # cache mod notes for a user for 5 minutes

runs:
  # ...
FoxxMD commented 1 year ago

Database

Migrations, is this something I have to worry about using the docker default db

The docker image uses sqlite by default and will do automatic backups and migrations for you.

If you switch to mysql/postgres you can force migrations to run automatically like this (in the operator config):

databaseConfig:
  migrations:
    force: true # always run migrations

However CM also has a migration UI! If you start CM and it detects it requires a migration that cannot be done automatically you can visit the "dashboard" to get a migration confirmation page that lets you execute the migration.

Other

Is there a log file to troubleshoot CM, in case the web server doesn't successfully start?

I think CM should be logging to file by default. Check your cm data folder for a logs folder. If not, you can add this to the operator config to enable logging to file for warnings/errors:

logging:
  # default level for all logging
  level: debug
  file:
    # override default level
    level: warn
    # true -> log folder at DATA_DIR/logs
    # /home/myUser/logs -> absolute location of folder (remember this is in the container if using docker!)
    # ./myLogs -> relative location from DATA_DIR
    dirname: true

How do CM updates happen with docker?

Code used by a docker image

The CM code in a docker image is pinned via docker image tags that mirror release versions, the master branch (latest tag), and the edge branch (edge tag).

If you use a release tag, EX 0.13.2, you will always get CM code pinned at that release, including the migrations present at that commit, etc.

If you use a branch tag (latest or edge) the image is updated every time I push code to the repo branch. The master branch (latest tag) is always the same as the latest release. The edge branch is "bleeding edge" IE nightly builds.

Upgrade process

CM is designed to not need the database or cache to operate. The database is basically for keeping track of statistics and Actioned Events for viewing prior bot history. If you removed the database and cache (or provided brand new instances of both) CM will happily use them. So migrating a database is only necessary to make sure CM history is preserved.

The config, which is persisted in the DATA_DIR host folder, is not affected by upgrades or a new CM instance image.
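Putting that together, a docker upgrade amounts to bumping the image tag while keeping the same DATA_DIR mount. A sketch (image name and container path are assumptions):

```yaml
services:
  context-mod:
    image: foxxmd/context-mod:0.13.2   # bump this tag (or use 'latest') to upgrade
    volumes:
      - ./cm-data:/config              # config + sqlite db persist here across upgrades
```

Then `docker-compose pull` and `docker-compose up -d` recreate the container on the new image.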

When the CM container is started (regardless of whether it is an upgrade):