Closed byrnedo closed 5 years ago
Sorry for the slow reply and thanks for getting in touch!
Right now the container bundles Prometheus, the Swarm integration binary, and container and host monitoring.
The combination might not be robust, e.g. if the binary I made dies, it will not be restarted.
In the long run I'd like to separate these two concerns and present a Consul-compatible API so stock Prometheus images work with this out of the box.
I'm using this in dev and it works, but I'm not yet running this in production and would recommend against it if your monitoring availability is critical.
On Apr 28, 2017 10:10 AM, "Donal Byrne" notifications@github.com wrote:
Hi, nice project, I've been looking around for something like this until Prometheus integrates support itself. Just wondering: is this being actively used? Is it reliable?
Sorry for the slow reply @joonas-fi, a Consul API would be the business, since other projects could then use it too, for instance anything that uses Consul as a source for templating load balancers.
Would you want a hand doing any of this? I see docker swarm mode now supports swarm wide events, so that should remove the need for the polling at least.
I made a bit of a stub attempt at handling events: https://github.com/byrnedo/prometheus-docker-swarm/tree/feature/events.
I'm slow sometimes as well - in many different meanings.. :) I am currently swamped with so many things that this project is neglected, even though I think it has promise.
Oh yes, I'd very much like to have some help. Glad you like the idea of this project impersonating Consul Catalog API! I think it would rock because there'd be no changes to Prometheus, and we wouldn't have to bake this bridge binary in the Prometheus image (so the containers' lifecycles can be managed separately => more reliable). And like you said, other projects could feed off of the API as well.
I took a look at the Consul API implementation before, and it seemed like there are some tricks like long-polling etc., so I didn't start implementation right away because it seemed like a large task, and of course this is such an exotic idea that there probably isn't a library for it (API client libraries yes, API server libraries? probably not..?)
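To make the long-polling part a bit more concrete, here is a minimal Go sketch of how a Consul-style blocking query could work: the client passes the last index it saw as `?index=N`, and the server holds the request until its own index moves past that value or a timeout expires, then answers with the new index in the `X-Consul-Index` header. The `catalog` type, its `set`/`wait` methods and the fixed 5 second timeout are all hypothetical names and values picked for illustration. This is not how Consul implements it, and which endpoints Prometheus actually requests would still need checking.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strconv"
	"sync"
	"time"
)

// catalog (hypothetical) holds the service list plus a change counter
// that plays the role of Consul's index.
type catalog struct {
	mu       sync.Mutex
	index    uint64
	services map[string][]string // service name -> tags
	changed  chan struct{}       // closed and replaced on every update
}

func newCatalog() *catalog {
	return &catalog{index: 1, services: map[string][]string{}, changed: make(chan struct{})}
}

// set registers or replaces a service and wakes every blocked reader.
func (c *catalog) set(name string, tags []string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.services[name] = tags
	c.index++
	close(c.changed)
	c.changed = make(chan struct{})
}

// snapshot returns the current index and a copy of the service map.
func (c *catalog) snapshot() (uint64, map[string][]string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	snap := make(map[string][]string, len(c.services))
	for k, v := range c.services {
		snap[k] = v
	}
	return c.index, snap
}

// wait blocks until the index moves past seen or the timeout elapses,
// then returns the current state either way.
func (c *catalog) wait(seen uint64, timeout time.Duration) (uint64, map[string][]string) {
	deadline := time.After(timeout)
	for {
		c.mu.Lock()
		idx, ch := c.index, c.changed
		c.mu.Unlock()
		if idx > seen {
			return c.snapshot()
		}
		select {
		case <-ch: // something changed, re-check the index
		case <-deadline:
			return c.snapshot()
		}
	}
}

// ServeHTTP answers with the service map; a missing or invalid index
// parses as 0, so such requests return immediately.
func (c *catalog) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	seen, _ := strconv.ParseUint(r.URL.Query().Get("index"), 10, 64)
	idx, services := c.wait(seen, 5*time.Second) // timeout value is an assumption
	w.Header().Set("X-Consul-Index", strconv.FormatUint(idx, 10))
	w.Header().Set("Content-Type", "application/json")
	json.NewEncoder(w).Encode(services)
}

func main() {
	c := newCatalog()
	c.set("web", []string{"prod"})

	mux := http.NewServeMux()
	mux.Handle("/v1/catalog/services", c)
	srv := httptest.NewServer(mux)
	defer srv.Close()

	// Simulate a Swarm change arriving while the client is blocked.
	go func() {
		time.Sleep(100 * time.Millisecond)
		c.set("api", []string{})
	}()

	resp, err := http.Get(srv.URL + "/v1/catalog/services?index=2")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	var got map[string][]string
	if err := json.NewDecoder(resp.Body).Decode(&got); err != nil {
		panic(err)
	}
	fmt.Println("index:", resp.Header.Get("X-Consul-Index"), "services:", len(got))
}
```

Closing and replacing a channel on each update is just one simple way to wake all waiting requests at once; a real implementation would also want to honour the client's own `wait` parameter and request cancellation.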
As to your contribution, thanks and that's awesome! It would be cool to have it hook up to the events, because polling is always sucky. Do the events concern the whole cluster, or only that one Swarm node? I don't recall the exact issue, but I remember having some issue where some cluster-wide API didn't return the IP for that container..
I took a look at your "v2" branch, good to see you added Glide (version pinning is always awesome). I'm a bit bummed that you removed container/host metrics but I can understand why (and you explained that we can use cAdvisor for that). I myself felt dirty enabling host/container metrics gathering by default, and I thought that perhaps it should be made an opt-in feature by ENV variable or something. I investigated cAdvisor before and was a bit turned off by how feature-packed it is and how many metrics it spits out. Basically, since I was integrating with the Docker API anyway, I liked the fact that I could export only the basic metrics I liked and not more, and the fact that I don't have to operate additional infrastructure for metrics, since all these moving parts make up the metrics and we already have:
Anyways, I understand your wanting to remove that feature. Perhaps we could have a lightweight plugin model (or a feature switch) that could enable container/host metrics gathering, so it is not forced on by default. Or I could just do that part as an entirely different project, since it's really not a core feature of what this thing is supposed to do (though most people want to have host/container metrics and I'd like to make that as easy as possible).
Overall, if we can agree on a higher-level vision for this project, I would gladly make you a maintainer, since I fear I'm not able to nurture this project enough alone anyways.
No worries at all, I'm the same, just this would kill two birds for me if it could do the consul api (I was using consul to template an lb but since I moved to swarm mode I've done away with it and am just relying on static dns entries :/ )
Yeah, that long polling scares me a bit. Perhaps it's too ambitious but I think I'll have a stab at it. I'll do some research and see what endpoints Prometheus wants.
The events are now swarm-wide, but I end up getting the IP by doing an inspect on the task, and it seems legit so far (testing on the command line, scaling up and down).
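For reference, a minimal Go sketch of the "inspect the task to get the IP" step, assuming the JSON of a single task object as the Docker Engine API returns it, with a `NetworksAttachments` list whose `Addresses` are in CIDR form. The `task` struct, the `taskIPs` helper and the sample payload are hypothetical and only mirror the fields this needs; it is not the code from the events branch.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net"
)

// task mirrors only the fields of a Swarm task object that this sketch
// needs (field names assumed from the Docker Engine API).
type task struct {
	ID                  string
	NetworksAttachments []struct {
		Addresses []string // CIDR notation, e.g. "10.0.0.5/24"
	}
}

// taskIPs decodes one inspected task and returns its addresses with
// the prefix length stripped.
func taskIPs(raw []byte) ([]string, error) {
	var t task
	if err := json.Unmarshal(raw, &t); err != nil {
		return nil, fmt.Errorf("decode task: %w", err)
	}
	var ips []string
	for _, na := range t.NetworksAttachments {
		for _, addr := range na.Addresses {
			ip, _, err := net.ParseCIDR(addr)
			if err != nil {
				return nil, fmt.Errorf("parse address %q: %w", addr, err)
			}
			ips = append(ips, ip.String())
		}
	}
	return ips, nil
}

func main() {
	// Made-up sample payload, trimmed to the relevant fields.
	sample := []byte(`{"ID":"abc123","NetworksAttachments":[{"Addresses":["10.0.0.5/24"]}]}`)
	ips, err := taskIPs(sample)
	if err != nil {
		panic(err)
	}
	fmt.Println(ips)
}
```

A task attached to several networks yields several addresses, so whatever consumes this would still have to pick the one Prometheus can actually reach.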
Yeah, the metrics could totally come back, I mostly removed them to make the context more manageable for myself, bit selfish haha. I get what you mean by cAdvisor being heavy, I just thought that it was so widely used that there was no point in trying to be a poor man's version.
I'd happily take over maintainer role for now!
You wanna make another ticket where we can discuss some goals? I feel I've hijacked my own issue!
Sure, I opened #2
I am currently running this in production with great results. Closing this issue because #2 is open.
Hi, nice project, I've been looking around for something like this until Prometheus integrates support itself. Just wondering: is this being actively used? Is it reliable?
Also, is the config file that is written templateable? I'd like to check that the correct number of instances in a service are healthy for example, not necessarily scrape the instances.