mapl opened 3 years ago
Hello @mapl.
This is quite an edge case, so honestly, I'm not sure I'll be personally working on this unless more people show interest in this feature (based on the number of 👍 on the issue).
The easiest way to implement this that I can think of is by having a way to allow pushing data into Gatus, as opposed to the current behavior, which only allows retrieving data from Gatus. By leveraging this, along with a configuration option that specifies whether a Gatus instance is the primary instance or a secondary instance (the latter being required to specify the endpoint of the primary instance), it would be possible to have a "global" dashboard and multiple Gatus instances configured independently.
Fortunately, the easiest way is also the most convenient one, because the other ones would likely involve persistence.
There's an even easier solution, but it assumes that the users accessing the dashboard have access to all "security subnets/zones", which I'm not sure is the case based on your explanation. The only work required would be to send a request from Gatus' dashboard frontend to each backend and merge the statuses.
I made a quick diagram of a simple distributed Gatus deployment where the main Gatus instance just pulls data from remote Gatus instances.
The data from each remote Gatus instance is simply embedded into the main Gatus dashboard.
I think this is a good idea. I would like it if there was
That looks good.
@mapl What do you think would be the appropriate behavior when there are overlapping service names?
Also, how about we do the opposite: the remote Gatus instances push their data to the main Gatus instance? I think that would allow a lot more flexibility, especially if, for instance, one of the remote Gatus instances is running in an environment completely inaccessible from the main Gatus instance (i.e. locally).
Of course, this would require a layer of security, but I built something oddly relevant to this specific use case: https://github.com/TwiN/g8
We'd also need to add something that periodically cleans up services that haven't been refreshed in a long time (i.e. in case a remote instance is taken offline, we don't want to keep the outdated service health checks on the dashboard forever).
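The periodic cleanup mentioned above could be sketched roughly as follows. This is a hypothetical illustration, not Gatus code (Gatus itself is written in Go); the function name and the staleness threshold are made up for the example:

```python
from datetime import datetime, timedelta

# Hypothetical threshold: statuses not refreshed within this window are dropped,
# so a decommissioned remote instance doesn't linger on the dashboard forever.
STALE_AFTER = timedelta(hours=1)

def prune_stale_statuses(statuses, now=None):
    """statuses: dict mapping service key -> datetime of last refresh.

    Returns a new dict containing only services refreshed recently enough.
    """
    now = now or datetime.utcnow()
    return {key: ts for key, ts in statuses.items() if now - ts <= STALE_AFTER}

now = datetime(2024, 1, 1, 12, 0, 0)
statuses = {
    "core_api": now - timedelta(minutes=5),  # fresh, kept
    "old_batch": now - timedelta(hours=3),   # stale, pruned
}
print(sorted(prune_stale_statuses(statuses, now)))  # ['core_api']
```

A real implementation would run this on a timer and delete from the store rather than filtering an in-memory dict.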
Hmm... not sure. Is this actually supposed to be a problem? A UUID would solve most of the issues, I think.
A steady health status for the remote Gatus instance would be cool, like the last time it was reached, just to know if it's still alive.
There is, and always has been, an ongoing debate about pulling vs pushing when it comes to monitoring.
For example, if you have 100 remote Gatus instances and each of those instances would push data to the main Gatus instance, the main instance is quickly subject to an overload of metric data.
I am wondering if it is enough to just proxy through data from the remote instances to the main instance, or is caching needed?
The goal should always be a design that is as simple as possible.
https://dave.cheney.net/2019/07/09/clear-is-better-than-clever
An API Token is commonly used to restrict permissions. What's your opinion on this?
Anyway, a very good reference is Prometheus and why its developers decided to go with pulling rather than pushing.
However, it's not a one-way road, so it depends on the scenario. For the most part, pulling is the better choice.
https://prometheus.io/blog/2016/07/23/pull-does-not-scale-or-does-it/
https://prometheus.io/docs/introduction/faq/#why-do-you-pull-rather-than-push
Fair enough, though if we're pulling, then there's no need for a token since the endpoint needs to be public for the dashboard to be shown.
Sounds good!
If you can ensure that the main Gatus instance can reach its remote instances via HTTP(S) to fetch the JSON data, it should be fine.
What do you think of this config snippet to be deployed on the main instance? I thought an API key could be handy, but not required, as you mentioned.
remote-services:
  - name: Gatus Remote Instance 1
    url: http://10.10.10.10/api/v1/statuses
    api-key: XXXXXXXXXXXXXX
    interval: 10s
  - name: Gatus Remote Instance 2
    url: http://10.10.10.20/api/v1/statuses
    api-key: XXXXXXXXXXXXXX
    interval: 5s
  - name: Gatus Remote Instance 3
    url: http://10.10.10.30/api/v1/statuses
    api-key: XXXXXXXXXXXXXX
    interval: 10s
I was thinking something along the lines of
remote:
  instances:
    - name: "gatus-internal"
      url: "http://10.10.10.10/api/v1/statuses"
      interval: 30s
because in the future, we may have to add other parameters specific to the remote configuration, such as:
strategy: merge
  Merge all statuses fetched from the remote instances into existing statuses when service names overlap, or create new statuses if they don't already exist.
strategy: prefix-service
  Prepend each retrieved service name with the configured name of the remote instance.
strategy: prefix-group
  Prepend each retrieved group name with the configured name of the remote instance.
Anyways, this isn't for right now, but I still think it's good to make the configuration as extensible as possible to prevent future breaking changes.
Absolutely, the config should be as extensible and future-proof as possible. Good point. Your concern about possible service name clashes when the main and remote instances have the same names could be an actual issue if you put them all in one flat view. I think if every remote instance is moved into its own, let's say, "container" box, then you could clearly derive where it originates. So duplicate names wouldn't matter in this case, as they'd be displayed in their own containers; in a way, you'd have added a nice grouping feature. Additionally, you could also check the health state of the entire remote instance: last time it was reached, latency, etc.
I've been thinking a bit more about the implementation for this, and one issue I can think of is how to handle alerting.
Assuming that the purpose is to monitor internal applications within a network that isn't publicly accessible by other users, but is accessible by a "remote" Gatus instance, how would alerting for a "remote" Gatus instance be handled, given that the alerting configuration as well as the individual alerts for each service are not exposed through Gatus' main endpoint (/api/v1/statuses)?
I think it's safe to assume that each remote instance is expected to deal with its own alerts, and that the only difference between the remote instance(s) and the main instance is that the main instance's dashboard must display the statuses from the remote instances as well.
All in all, I don't think this is a problem, but I felt like specifying this in a comment would be worthwhile for the sake of traceability.
Maybe it would be a good idea to have different systems? One implemented with a check-in system, and another pulling data from intranet networks.
Check-in system: this could give developers control to implement some logic into the check. For example, in our systems we use healthchecks to run certain tests, and if everything goes well, we send the OK to the healthchecks API endpoint. That way we know not only that the server responds and is alive, but we can also add complementary logic to the checks, such as verifying that our databases are receiving data or that our auth service is working properly.
Regarding alerting in that check-in mode, it is necessary to set an expectation of when the check-in must arrive, and if it is delayed, start alerting.
Haven't really had the time to work on this, unfortunately, but #124 may make implementing this much easier if we were to leverage a global database.
Each individual Gatus would be in charge of alerting for their respective services, but the data would be retrieved from a global database that they all share (though this could be made configurable, in that the instance could choose to retrieve only the services it's monitoring, or all services present in the database)
There are obviously a few things that would need some thinking, like how to detect when one of the Gatus instances' configurations no longer has a given service (because it was deleted) so that we can automatically delete it.
Currently, since there's only one instance, there's no problem, but in a distributed setting, that won't work without a consensus of some sort. Or maybe a table with a column to differentiate each individual Gatus instance, as well as all the services registered under that instance, could suffice? e.g.
gatus-1 has service-a and service-b, and gatus-2 has service-c and service-d. If gatus-2's configuration is modified to remove service-d, gatus-2 would update the table to remove service-d, because according to the table, gatus-2 was previously assigned to service-d, but service-d is no longer in the configuration.
There are good ideas in here. I hope this feature is released as soon as possible. I'm looking forward to it 🙂
FYI: with #136 merged, #124 is not that far off.
Note to self: Will probably need to add a parameter to control https://github.com/TwinProduction/gatus/blob/acb6757dc800b43b5a24e1fbe0ebf9f64b42df4f/storage/store/store.go#L25-L28
Just to add on https://github.com/TwinProduction/gatus/issues/64#issuecomment-861882360 and https://github.com/TwinProduction/gatus/issues/64#issuecomment-896402326:
After giving it some thought, this is much easier than I initially anticipated.
The easiest, most barebones implementation I can think of is the following: add a distributed.enabled configuration parameter. When distributed.enabled is set to true, a few things happen:
- storage.type must be set to postgres, or the application fails to start
- DeleteAllServiceStatusesNotInKeys is never called; if a service is removed from one of the Gatus instances, it must be deleted from the database manually (keep in mind that this is a barebones implementation/MVP, and support for automatically cleaning up could be added later)
And that's it! Even by a conservative estimate, this is less than a week of work. I just don't know when I'll find the time to work on it.
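As a config fragment, the barebones mode described above might look something like this. Note that distributed.enabled is only a proposed flag, and the connection string is a placeholder:

```yaml
distributed:
  enabled: true  # proposed flag: refuse to start unless storage.type is postgres
storage:
  type: postgres
  path: "postgres://user:password@127.0.0.1:5432/gatus?sslmode=disable"  # placeholder
```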
Just thinking about the possibilities that this could bring hypes me up quite a bit.
Consider the following: you're provisioning a fleet of clusters across different cloud providers/accounts, and each of them has its own Gatus instance, all of which are configured to monitor their respective "private" cluster while pushing the data into a global Postgres database. To wrap everything up, you have a single "global" Gatus instance which doesn't monitor anything, but is publicly available and exposes the data from each individual cluster.
Anyways, I digress.
I am looking forward to the implementation of this feature
I propose one solution: a server and agent model.
So I just had a very simple idea for a temporary solution, and I just had to give it a shot (see #307).
I still think that the best approach to this problem will be by leveraging a shared database, but for now, this might do the trick for some of you.
Basically, all it does is retrieve the endpoint statuses from another remote host before returning the endpoint statuses. You may specify an endpoint prefix to prepend a string to the names of all endpoints coming from a given remote instance. It's an extremely shallow/lazy implementation, but to group endpoint statuses or bypass firewalls, this should do the trick.
remote:
  instances:
    - endpoint-prefix: "myremoteinstance-"
      url: "https://status.example.org/api/v1/endpoints/statuses"
Note that I haven't documented the feature yet, because it's experimental and it may be removed and/or updated.
Anyways, I'd love it if some of you could give it a try and let me know how it works.
One of the issues with #307 is that clicking on an individual endpoint in the UI does not work. In other words, the page for viewing individual endpoints does not work if said endpoint comes from a remote Gatus instance.
I'm going to release this with v4.1.0, but I'm strongly considering getting rid of that implementation, unless somebody is actually using it and finds it helpful.
Thanks @TwiN for your efforts here, even if you're not happy with the implementation, it's a great start!
I would use this to monitor the same group of services from multiple locations, rather than just to bypass security. I get that throwing an instance behind a NAT firewall is definitely great, but even better is being able to see metrics from different perspectives on the Internet. My old smokeping installs highlight issues that are only evident from certain locations, perhaps caused by a particular ISP.
Monitoring the same set of targets from all offices of an organisation could also be an advantage.
The original problem this commit is trying to solve was secured subnets, and this implementation would require port forwarding or pinholes, as you mentioned in a very early comment. On the whole push-vs-pull argument, my vote would definitely be push from the agents/remote clients: an API listening on 80/443 on the main instance with a simple static token, I reckon. It gets around inbound firewalls.
I have been meaning to try configuring remote gatus instances to use the database on the main instance. I would carefully configure the groups and services on each node. It wouldn't be ideal but it might work for me.
Anyway, I'll try this commit and report back.
Cheers!
Sam
I also have the same or a very similar use case as @nzsambo and I also agree on this being a push mechanism.
I'd also recommend a static token per agent. The configuration should look something like this:
# Doesn't matter whether array with name of agent or dictionary with name as key ;)
agents:
  - name: internet-pov
    token: abc123
    # A list of endpoints the agent is allowed to receive the configuration for (may also be defined as a file on the agent) and update the status of from its point of view. If empty, all endpoints are allowed:
    endpoints:
      - name: some-endpoint-defined-in-endpoints
        critical: true # If the agent is unavailable or reports the status for this endpoint as down/degraded, this will mark the endpoint on the host as down instead of degraded
This would also require adding a new dimension to an endpoint check result, so it also contains the point of view (agent or host). Maybe an option could be added to define the distribution behaviour with a scope (host: only the Gatus host is allowed; agent: only the Gatus agents are allowed, which may also mean only one agent depending on the multiplier; all) and a multiplier/point-of-view mode (single: only one instance, which it will default to if the scope is host; all: all instances in scope)?
But please be aware of a version/API dependency from agent to host. The agent could be the regular Gatus container in an agent mode where certain features (like the web UI) are disabled. Also, the metrics sent to the host could be removed from the local storage.
This is super helpful, @TwiN. It helps with separation of concerns and provides a single pane of glass for observability. I would suggest enhancing the feature instead of removing it.
I don't think this is an edge-case if you think bigger: I am considering a HA setup, but separate from the kubernetes cluster gatus should monitor to prevent entanglement. The proposed server-agent solutions also don't solve that, unless you can distribute the server, at which point you arrive at a design similar to kubernetes: server instance as control plane and agents as nodes.
I want an instance of gatus running inside our servers in Frankfurt, on the same machine as the services to monitor, and another one on the server in our office (similar to the previous comment https://github.com/TwiN/gatus/issues/64#issuecomment-1227640031). Both should be visible in a single dashboard, but this dashboard needs to survive the failure of any of the (two) nodes! (I can add a third one if that's needed, as is typical in proper HA)
With HA like this, gatus would become a true enterprise-ready tool.
I think HA and a distributed deployment should be seen as two different things. Both have their challenges. How do you imagine a HA setup to work internally? Same database in the background? Syncing between the instances?
Now that I revisit this, yes, both these concerns can be handled separately (but then also combined), and there is already quite an elaborate issue about HA: https://github.com/TwiN/gatus/issues/176
However, I think the ideal solution would combine both approaches: you might need multiple instances active simultaneously to query all services, each of which should be highly available, so both issues should be considered together.
Some orientation from another project, which also contains some of the ideas I already mentioned: https://oss.oetiker.ch/smokeping/doc/smokeping_master_slave.en.html
We have multiple locations; a distributed testing agent is essential for us to ensure our services are available from all intranet locations and to see how they're performing. It's the inverse of most status pages (where one agent tests various locations), and we have been looking for the opposite forever! Would love to see this!
FYI, I implemented external endpoints not long ago and it can serve as a way to bypass connectivity challenges.
Long story short, rather than Gatus making the requests, you can now make the requests from within your own environments using whatever tool/application you have internally, and then push the results to Gatus.
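For anyone scripting this from a private network, a push to the external endpoints feature is just an authenticated HTTP call. The following Python sketch only builds the request URL and headers; the /api/v1/endpoints/{key}/external path, success/duration query parameters, and Bearer token scheme are my understanding of the external endpoints API, and the host, group, endpoint name, and token values are placeholders:

```python
import urllib.parse

def build_push_request(base_url, group, name, token, success, duration="0s"):
    """Build the URL and headers for pushing an external endpoint result.

    Assumes the endpoint key is the group and endpoint name joined by an
    underscore, as used in Gatus' API routes.
    """
    key = urllib.parse.quote(f"{group}_{name}")
    query = urllib.parse.urlencode(
        {"success": str(success).lower(), "duration": duration}
    )
    url = f"{base_url}/api/v1/endpoints/{key}/external?{query}"
    headers = {"Authorization": f"Bearer {token}"}
    return url, headers

url, headers = build_push_request(
    "https://status.example.org", "core", "ext-backend", "potato", True, "10ms"
)
print(url)
# https://status.example.org/api/v1/endpoints/core_ext-backend/external?success=true&duration=10ms
```

The actual push would then be a POST of that URL with those headers (e.g. via curl or any HTTP client) from inside the private network, which only requires outbound connectivity.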
WARNING: This feature is a candidate for removal in future versions. Please comment on the issue above if you need this feature.
This feature seems useful to me, although the broken click on endpoint is unfortunate
So I'm currently using external endpoints. I have one primary, externally available instance, as well as two local instances. It's polling those instances, and I'm wondering if it would be better to have them push instead.
Additionally, I wonder if pushing to the primary Gatus would store the endpoint data in the DB, as the primary is using psql, but the other two are storing in memory, so I lose all the data on container restart.
Not sure if I should be posting this here or in a new issue. I have 3 remote instances right now, and everything seems to work fine. When a service on a remote instance fails a check, it reports it; all of that works well.
I added an endpoint on the master node to do an HTTP check against the API so that if it fails, I get alerted; I figured that's the easiest way to see if a remote instance is alive and well. In my testing, if that endpoint fails OR ANY of the remote instances go offline, they all disappear from the GUI and only the local endpoints appear. I would expect that if one remote instance is offline, the others would still continue to work, as nothing is dependent on that one.
I looked at disable-monitoring-lock, but that does not change the behavior. The log shows "silently failed to retrieve endpoint statuses" and never tries any others; it just stops going through anything else in the config.
Is this the intended behavior that can't be changed, or does it look like a bug? I don't see how it would be useful, since you really have no idea what is down when every single endpoint disappears. Running the latest GitHub download as of June 3rd, 2024.
Thanks!
I have the same use case: there's one public instance, but I have two secured networks whose service statuses I want to display on the public instance.
A pull as per the remote configuration doesn't work because the private instances are not reachable from outside the network, and external-endpoints isn't helpful here either; or at least I wasn't able to figure out how to configure a private Gatus instance to push results to the external one.
I agree that some kind of server-agent model would be great, where the agents just push their results to the main/server instance.
FYI, I implemented external endpoints not long ago and it can serve as a way to bypass connectivity challenges.
Long story short, rather than Gatus making the requests, you can now make the requests from within your own environments using whatever tool/application you have internally, and then push the results to Gatus.
Can Gatus itself act as that internal tool? Basically, just have two instances running and one pushes its results to the other.
I was trying to just push some statuses from one Gatus to another via external-endpoint and custom alerts. Unfortunately, it looks like it is not possible to trigger a request for every status check (since currently alerts only trigger on failures after some threshold). But I think it is worth checking whether we could modify the alerting mechanism, since it might involve less extra implementation work.
@zubieta that strategy never crossed my mind, but that's a very clever way to simulate remote instances without the pull mechanism
Assume you have Gatus deployed in several security subnets (or zones) to monitor individual services because one single Gatus instance is not able to reach those services (administratively prohibited due to firewall rules, etc.)
But you want one main Gatus instance which is capable of retrieving health information of services from all those other Gatus instances to display them in one unified Gatus Dashboard.
Can we start a short discussion on this?
PS. Thank you for this great project!