delvtech / agent0

Analysis & simulation repo for Delv
https://agent0.readthedocs.io/en/latest/
Apache License 2.0

Productizing bot deployment #1541

Open dpaiton opened 3 months ago

dpaiton commented 3 months ago

Tasks

responsibility

All people listed should

  1. know how to (& have credentials to) restart and/or deploy bots
  2. monitor bot-related rollbar notifications; check that any critical bugs are being addressed
  3. understand error prioritization and know the failure playbook

importance (priority)

  1. invariant fails (page @jalextowle @jrhea @mcclurejt)
  2. checkpoint bot tries to checkpoint & fails
  3. checkpoint bot goes down
  4. invariant goes down
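
As a rough illustration of the ordering above, here is a minimal Python sketch that sorts observed failures by priority and attaches who to page. The `Failure` enum, the `PAGE_ON` mapping, and `triage` are hypothetical names made up for this example; they are not part of agent0 or any existing playbook.

```python
from enum import IntEnum
from typing import Iterable, List, Tuple


class Failure(IntEnum):
    """Failure types from the list above; a lower value means more urgent."""

    INVARIANT_FAILS = 1
    CHECKPOINT_ATTEMPT_FAILS = 2
    CHECKPOINT_BOT_DOWN = 3
    INVARIANT_CHECK_DOWN = 4


# Only the top-priority failure pages people, as noted in the list above.
PAGE_ON = {
    Failure.INVARIANT_FAILS: ("@jalextowle", "@jrhea", "@mcclurejt"),
}


def triage(failures: Iterable[Failure]) -> List[Tuple[Failure, Tuple[str, ...]]]:
    """Return unique failures, most urgent first, with the people to page."""
    return [(f, PAGE_ON.get(f, ())) for f in sorted(set(failures))]
```

For example, `triage([Failure.CHECKPOINT_BOT_DOWN, Failure.INVARIANT_FAILS])` puts the invariant failure first and is the only entry with anyone to page.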

top priorities for mainnet

bots to consider

documentation

uptime monitoring

error reporting & notifications

easy start & restart

containerized deployment

invariant checks

credentials storage

continuous deployment

current status -- checkpoint bot:

slundqui commented 3 months ago

Readme on deploying bots within https://github.com/delvtech/hyperdrive-infra/pull/119

slundqui commented 3 months ago

Something to note is that rollbar doesn't have a great way to log "this process is dead". We may need a separate "monitoring" container that logs errors if the service bot containers are stopped, or we let docker always restart them. Even then, if the AWS machine goes down, there's no way of logging a "this is down" error.
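
One possible shape for that "monitoring" piece is a heartbeat check: each bot records a timestamp when it is alive, and a separate watchdog reports any bot whose timestamp is too old. The sketch below is only an assumption about how this could look; `find_stale_bots`, `report_stale`, the bot names, and the 300 second threshold are all hypothetical. The `report` callback is left pluggable (it could wrap something like rollbar's `report_message`, but that wiring is not shown here). To catch the "machine goes down" case, the watchdog would have to run somewhere other than the machine it is watching.

```python
import time
from typing import Callable, Dict, List, Optional


def find_stale_bots(
    last_heartbeat: Dict[str, float], now: float, max_age_s: float
) -> List[str]:
    """Return the names of bots whose last heartbeat is older than max_age_s."""
    return sorted(
        name for name, ts in last_heartbeat.items() if now - ts > max_age_s
    )


def report_stale(
    last_heartbeat: Dict[str, float],
    report: Callable[[str, str], None],
    now: Optional[float] = None,
    max_age_s: float = 300.0,  # hypothetical threshold
) -> List[str]:
    """Call report(message, level) once per stale bot and return their names."""
    now = time.time() if now is None else now
    stale = find_stale_bots(last_heartbeat, now, max_age_s)
    for name in stale:
        age = now - last_heartbeat[name]
        report(f"{name}: no heartbeat for {age:.0f}s", "critical")
    return stale
```

Passing `now` explicitly keeps the check deterministic in tests; in a real loop it would default to the current time.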

wakamex commented 3 months ago

I changed the second-last bullet from "document machine details (ip, port) and make sure everyone has ssh access" to "make sure everyone has access". Originally we envisioned using AWS, but @mcclurejt convinced me fly.io is way easier, so we won't need individual ssh keys. I'll still go through and make sure everyone has access, so I tagged myself on the bullet.