StackStorm / st2ci

New and improved continuous integration actions and workflows
Apache License 2.0

Migrate internal monitoring to free 3rd party service #208

Open arm4b opened 2 years ago

arm4b commented 2 years ago

Internal infrastructure includes a st2monitoring server with the dashboard and client checks (services, memory, processes, ports) for each internal infra node including st2cicd server, as well as external checks (APIs, SSL cert expiry, Domains, ST2 websites availability health checks).

In order to reduce the amount of infra, costs, and moving pieces, and to rely less on AWS resources (see https://github.com/orgs/StackStorm/projects/27), remove the st2monitoring server and start migrating to a free 3rd party service for monitoring and alerting.

For example, we could use Scalyr (where @Kami works).

There are several sub-tasks here:

Example with external checks: [screenshot: monitoring]

Example for st2cicd server: [screenshot]

Finishing the first part, migrating the external checks, would already be great. We can remove the monitoring server at that point, which would save us $60/mo in AWS.

Kami commented 2 years ago

I created a repo with JSON definitions for (remote) monitors and alerts which are automatically deployed to the DataSet account on push / merge - https://github.com/StackStorm/dataset-scalyr-resources.

To begin with, I started with a private repo, but if the repo won't contain any secrets, we can also make it public.

For the other, non-HTTP-based monitors, we will need to define agent-based monitors (this also covers HTTP cert and domain expiration, since the remote monitors don't directly support that functionality).
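As a rough illustration of what an agent-side cert expiration check could look like, here is a minimal sketch using only the Python standard library. The function names, the 14-day warning threshold, and the output format are all assumptions made for this example; they are not taken from the actual monitors in dataset-scalyr-resources, and how the output gets fed into the agent is left out.

```python
import socket
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after: str, now: datetime) -> int:
    """Whole days between `now` and a certificate's notAfter string.

    `not_after` uses the format returned by getpeercert(), e.g.
    "Jan  5 09:34:43 2018 GMT". The result is negative once the cert has expired.
    """
    expires_at = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    return (expires_at - now).days


def fetch_not_after(host: str, port: int = 443, timeout: float = 10.0) -> str:
    """Open a TLS connection and return the peer certificate's notAfter field.

    Note: with the default context, an already expired or otherwise invalid
    certificate makes the handshake raise ssl.SSLCertVerificationError.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def check_cert(host: str, warn_days: int = 14) -> str:
    """Return one status line for `host` (warn_days is a hypothetical threshold)."""
    remaining = days_until_expiry(fetch_not_after(host), datetime.now(timezone.utc))
    status = "WARN" if remaining < warn_days else "OK"
    return f"{status} host={host} days_remaining={remaining}"
```

Calling `check_cert("stackstorm.com")` on the monitored host would produce a single line that an alert could match on. The date arithmetic is kept in its own function so the threshold logic can be verified without network access.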

Having said that - we need to decide how to install and manage the agent and on which hosts (just cicd or also some other host?).

Ideally we would use an infra as code approach for installing the agent and managing the agent config. One option would be to store the agent config in the same repo (dataset-scalyr-resources) and then pull the config down during the agent install / deploy job. Another would be to store it in the repo which contains the code (chef cookbook or whatever) to install the agent - although I would prefer the first approach, to keep all the config files in a single location.

Another thing - to which email address should alerts go? redacted@ or do we have a dedicated address for alerts?

arm4b commented 2 years ago

For the alerts, the #opstown Slack monitoring channel would work best, as other alerts already go there. In the past you already played with that in the same channel: [screenshot]

Kami commented 2 years ago

OK, so far all the host (agent) based checks for st2cicd have been ported - http://monitoring001:3000/#/client/sensu/st2cicd042.uswest2.stackstorm.net.

Which other clients / hosts do we want to port the checks for? Aka on which hosts the agent also needs to be installed + monitors + alerts set up.

arm4b commented 2 years ago

The st2cicd host is sufficient, we'll likely get rid of everything else.

Kami commented 2 years ago

@armab I believe I migrated all the checks for st2cicd now. This includes the "remote" SSL cert and domain expiry checks, but those use an agent monitor since DataSet doesn't support those remote checks natively.

Would be good if you double checked nothing is missing when you get a chance - https://app.scalyr.com/alerts?teamToken=BLSvhkqnK81b_wD2KhjsoQ--.

I also still need to adjust some thresholds and verify that indeed all alerts are set up correctly - aka that they trigger when they should.

I also set up log ingestion for StackStorm service logs in case they help us with troubleshooting. They seem to be low volume, so they shouldn't cause any log volume related issues. If they do, we can always disable them. The only exception is MongoDB: that log seems to grow like crazy, so I removed that file.