Closed by delroth 2 months ago
I'm new to NixOS, so this might not be the optimal way to achieve this, but it seems to do the job. It exports a metric called federationok that is 1 if everything is OK and 0 otherwise:
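services.prometheus.exporters.json = {
  enable = true;
  # Generate the json_exporter YAML config from a Nix attribute set.
  configFile =
    (pkgs.formats.yaml { }).generate "prometheus-json-exporter-config" {
      modules = {
        matrixfederation = {
          metrics = [{
            name = "federationok";
            help = "FederationOK status";
            # Path to the FederationOK field in the federation tester's JSON report.
            path = "{ .FederationOK }";
          }];
        };
      };
    };
};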
and
services.prometheus = {
  enable = true;
  extraFlags = [ "--storage.tsdb.retention.time 60d" ];
  scrapeConfigs = [{
    job_name = "federationtester";
    scrape_interval = "300s";
    static_configs = [{
      # Scrape the local json_exporter, not the federation tester itself.
      targets = [ "127.0.0.1:7979" ];
      labels = { instance = "nixos.org"; };
    }];
    metrics_path = "/probe";
    params = {
      # Must match the module name in the json_exporter config.
      module = [ "matrixfederation" ];
      # URL the json_exporter fetches and extracts the metric from.
      target = [
        "https://federationtester.matrix.org/api/report?server_name=nixos.org"
      ];
    };
  }];
};
This is extremely similar to the config fragment I was going to steal (I mean, get inspired by), thank you! :)
https://git.sr.ht/~robryk/nixos-config/tree/master/item/modules/monitoring-matrix.nix
We recently had an outage caused by the removal of .well-known on nixos.org, which broke our Matrix federation. We did not detect the outage until users of the service found it manually.
I know @robryk has a snippet of NixOS configuration somewhere that configures Prometheus to poll the Matrix federation checker's JSON endpoint to detect successes and failures. This sounds like something we should be doing on pluto and getting alerts for :)
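For the alerting part, here is a minimal sketch of what a rule could look like. It assumes the metric is named federationok and the scrape job federationtester, as in the configuration quoted in this thread. The group name, alert names, time windows and severity label are made up for illustration, and an Alertmanager still has to be configured separately for notifications to go anywhere.

```nix
services.prometheus.rules = [
  ''
    groups:
      - name: matrix-federation   # hypothetical group name
        rules:
          # Fires when every sample in the last 15 minutes reported failure.
          # max_over_time is used instead of a plain comparison because a
          # 300s scrape interval sits right at Prometheus' default staleness
          # window, so instant lookups may intermittently return nothing.
          - alert: MatrixFederationFailing   # hypothetical alert name
            expr: max_over_time(federationok{job="federationtester"}[15m]) == 0
            labels:
              severity: critical   # assumed label, adjust to local routing
            annotations:
              summary: "Matrix federation check is failing for {{ $labels.instance }}"
          # Fires when the metric has not been seen at all, e.g. because the
          # exporter or the federation tester itself is unreachable.
          - alert: MatrixFederationCheckMissing   # hypothetical alert name
            expr: absent_over_time(federationok{job="federationtester"}[30m])
            labels:
              severity: warning
            annotations:
              summary: "No federationok samples received in the last 30 minutes"
  ''
];
```

The second rule is there because a broken probe would otherwise look the same as a healthy one: no federationok samples means the first expression never matches.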