NixOS / infra

NixOS configurations for nixos.org and its servers
MIT License

Alert on nixos.org Matrix federation issues #410

Closed: delroth closed this 2 months ago

delroth commented 2 months ago

We recently had an outage caused by the removal of .well-known on nixos.org, which broke our Matrix federation. We did not detect the outage until users of the service found it manually.

I know @robryk has a snippet of NixOS configuration somewhere that configures Prometheus to poll the Matrix federation checker's JSON endpoint to detect successes/failures. This sounds like something we should be doing on pluto and getting alerts for :)

Erethon commented 2 months ago

I'm new to NixOS, so this might not be the optimal way to achieve this, but it seems to do the job. It exports a metric called federationok that is 1 if everything is OK and 0 otherwise:

  services.prometheus.exporters.json = {
    enable = true;
    configFile =
      (pkgs.formats.yaml { }).generate "prometheus-json-exporter-config" {
        modules = {
          matrixfederation = {
            metrics = [{
              name = "federationok";
              help = "FederationOK status";
              path = "{ .FederationOK }";
            }];
          };
        };
      };
  };

and

  services.prometheus = {
    enable = true;
    extraFlags = [ "--storage.tsdb.retention.time=60d" ];
    scrapeConfigs = [{
      job_name = "federationtester";
      scrape_interval = "300s";
      # Scrape the local JSON exporter; it fetches `target` on each probe.
      static_configs = [{
        targets = [ "127.0.0.1:7979" ];
        labels = { instance = "nixos.org"; };
      }];
      metrics_path = "/probe";
      params = {
        # Must match the module name in the exporter config.
        module = [ "matrixfederation" ];
        target = [
          "https://federationtester.matrix.org/api/report?server_name=nixos.org"
        ];
      };
    }];
  };
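
To actually get alerted, the federationok metric from the federationtester job above still needs an alerting rule. The following is only a sketch of what that could look like, not a tested setup: the alert name, group name, severity label, and 15-minute window are all made up for illustration, and routing the alert anywhere (e.g. through Alertmanager) is not shown. It uses max_over_time rather than a plain comparison with a "for" clause because the 300s scrape interval sits right at Prometheus's default 5-minute lookback, so an instant query might briefly see no sample between scrapes.

```nix
  services.prometheus.rules = [
    # Rule files are YAML; JSON is valid YAML, so builtins.toJSON is enough here.
    (builtins.toJSON {
      groups = [{
        name = "matrix-federation"; # hypothetical group name
        rules = [{
          alert = "MatrixFederationFailing"; # hypothetical alert name
          # Fires when every sample in the last 15 minutes reported a failure (0).
          expr = ''max_over_time(federationok{job="federationtester"}[15m]) == 0'';
          labels = { severity = "warning"; }; # assumed label, adjust to local routing
          annotations = {
            summary = "Matrix federation check is failing for {{ $labels.instance }}";
          };
        }];
      }];
    })
  ];
```

With three or so samples per 15-minute window, a single failed report from the federation tester would not be enough to fire the alert, which seems like a reasonable trade-off for a check that depends on a third-party service.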

delroth commented 2 months ago

This is extremely similar to the config fragment I was going to ~~steal~~ get inspired by, thank you! :)

https://git.sr.ht/~robryk/nixos-config/tree/master/item/modules/monitoring-matrix.nix