Enable Proxy Health Monitoring

t-lin commented 4 years ago

Discussed in the May 21st meeting, we want to figure out a way to scrape metrics from proxies so we can potentially monitor its health and connection status.

Recap of current issue: Proxies are containerized.

Currently, our containers are sharing the same network as the underlying host, so we can expose the proxy's HTTP endpoint via a separate port. However, this requires some type of discovery service to automatically find these ports and update Prometheus' configuration for each container.
If we ever move to using container-only neetwork namespaces in the future, the proxies will be behind a NAT. We could still expose the endpoint via Docker's port mapping capabilities, but we run into the same issue of needing a discovery service.

t-lin commented 4 years ago

One possible solution (discussed during the meeting)

Use a node-local PushGateway instance that all proxies on the node will push to. Prometheus then scrapes from it.
However, there are pros/cons to this solution (discussed here: https://prometheus.io/docs/practices/pushing/)
- PushGateway becomes potential bottleneck and central point-of-failure
- PushGateway keeps metrics and data for all time, so deleting metrics require deleting them in the PushGateway (easy to do, but increases complexity of system logic)
- Prometheus won't know if proxy went down or if its metrics just aren't changing

t-lin commented 4 years ago

@hivanco @michaelweiyuzhao Would appreciate any thoughts, input, or alternative ideas you guys may have.

A possible alternative:

Find proxies via a p2p hash (nodes can advertise more than 1 hash) and let Prometheus scrape them over a p2p protocol. This method is similar to libp2p's HTTP proxy example where they tunnel HTTP requests over p2p.
Would likely have to create a separate "monitoring" protocol ID and handler.
Must be able to keep track of and query all proxy instances that exists in the system.

t-lin commented 4 years ago

May be able to base on: https://github.com/libp2p/go-libp2p-examples/tree/master/http-proxy

t-lin commented 4 years ago

@hivanco @andrew-cc-chen Following up from yesterday's discussion. If a push-based model was used (i.e. using Pushgateways with each one wrapped in a p2p node, and proxies push to the nearest), then Prometheus still needs to first find the gateways via their hash and then pull from them. So it's just an extra level of indirection between Prometheus' client proxy and the service proxies? A 3-level hierarchy vs 2-level?

An obvious pro of this approach could be to achieve scalability (e.g. each region, or perhaps each set of N servers, has a local Pushgateway, and a top-level Prom collects it all), but I think that can be achieved by Prometheus' built-in federation feature (see: https://prometheus.io/docs/prometheus/latest/federation/). What are other possible pros/cons to be considered?

andrew-cc-chen commented 4 years ago

Right now the 3 level hierarchy looks like this,

proxy --- p2p stream (push) ---> central node --- http (pull) ---> prometheus

For the 2 level hierarchy, are you saying that prometheus should just pull straight from the proxies?

t-lin commented 4 years ago

For the 2 level hierarchy, are you saying that prometheus should just pull straight from the proxies?

Yeah... isn't that what the whole discussion the other day was about? Whether or not to do that, and what the trade-offs between the two models are?

proxy --- p2p stream (push) ---> central node --- http (pull) ---> prometheus

Would the middle component truly be a centralized node? Or just one of many possible nodes? If it's truly central, then doesn't that implicate a scalability issue?

andrew-cc-chen commented 4 years ago

Would the middle component truly be a centralized node? Or just one of many possible nodes? If it's truly central, then doesn't that implicate a scalability issue?

Central node as in one of many possible nodes

t-lin commented 4 years ago

Central node as in one of many possible nodes

So this goes back to a similar problem, no? How would Prometheus be aware of (discover) these nodes in order to pull from them? Unless you manually update the Prometheus config file for each node created?

andrew-cc-chen commented 4 years ago

I was thinking for now we could just manually update the Prometheus config file for each newly created node.

andrew-cc-chen commented 4 years ago

@t-lin I was wondering if there's a way to check if a connection between two nodes are active? Right now every 30 sec, the collector node finds peers and attempts to create a new stream for each peer. It's fine for the first cycle, but during the second cycle I have no way to check if a connection is active unless I map each peer to a flag.

t-lin commented 4 years ago

@andrew-cc-chen See https://github.com/Multi-Tier-Cloud/common/blob/cfbf72af7985e179c389a59ca772ef06f8c2ed2a/p2pnode/p2pnode.go#L278 for an example

andrew-cc-chen commented 4 years ago

Sorry I meant to say if there's a way to check if a stream between two nodes is active. Connectedness(...) seemes to check if a connection is possible between two nodes.

ConnsToPeer(...) seems to be the way to check, I'll give that a try

t-lin commented 4 years ago

Connectedness() can tell you if you are connected or not. See: https://godoc.org/github.com/libp2p/go-libp2p-core/network#Connectedness

Note a single Conn object represents an actual connection. You can have one or more Stream objects, that represents virtual connections, on top and hence, the cost of opening a new Stream is quite low.

You should know if a given stream is still open or not, since you control both the dialing side and the receiver side (in the handler function). If you don't close both ends, then it should still be open so long as the underlying connection is still open.

t-lin commented 4 years ago

ConnsToPeer(...) seems to be the way to check, I'll give that a try

ConnsToPeer() returns a slice of Conns open to a given peer, it doesn't tell you whether a given stream is open. See above for the distinction.

All you need is a single connection to be open, you generally don't need more. Multiple connections simply enable nodes that speak multiple protocols (e.g. TCP, QUIC, WebRTC, etc.) or nodes that have multiple interfaces (e.g. wired interface and a wired interface) to have redundancy.

andrew-cc-chen commented 4 years ago

Right now my working solution to check for existing stream with a specific protocol is something like:

for connection in node.Host.Network.ConnsToPeer(peer.ID) {
    for stream in connection.GetStreams() {
        if stream.Protocol() == ProxyMonitorProtocolID
            skip this node
    }
}

Please let me know if there's a better solution

t-lin commented 4 years ago

I think I'm lacking some context here. Do you want to hop on a zoom chat and you can show me what you're trying to do? I'll open up our regular meeting room.

andrew-cc-chen commented 4 years ago

@t-lin Should I be adding all the metrics exposed by the promhttp handler as a GaugeVec (e.g. go_memstats_alloc_bytes, go_threads...etc)? Or just go_goroutines?

t-lin commented 4 years ago

Yes, ideally you should be passing all metrics. You never know what new metrics will be exposed by proxies later on, so you don't want to be modifying your code each time.

andrew-cc-chen commented 4 years ago

I'm currently exposing the metrics with promhttp in proxy, grabbing the metrics using http.DefaultClient.Get (i.e. curl), then sending the metrics as a string over the stream. But I'm lost as to how to efficiently process the metrics. I can't find much info online, but instead of processing the metrics and doing promauto.NewGaugeVec() for each metric with its distinct labels, is there a way to take the http response from promhttp and auto create the gauges?

t-lin commented 4 years ago

Perhaps an alternative would be to pre-process the results in the proxies, and then let the proxies return a simple text response through the stream to the collector? Then when Prometheus scrapes the collector, it just has to provide the set of the text responses.

It's possible to get the metric objects themself in the code within the proxy. See the DefaultGatherer which contains all the default metrics exposed by Prometheus, and its Gather() function.

mwyzhao commented 4 years ago

Quick question, would allocating a new proxy require adding the prom metrics ip and port to the central prometheus.yml in order for it to be scraped?

If that's the case, prometheus directly scraping the proxies seem very impractical to me.

t-lin commented 4 years ago

Quick question, would allocating a new proxy require adding the prom metrics ip and port to the central prometheus.yml in order for it to be scraped?

If that's the case, prometheus directly scraping the proxies seem very impractical to me.

The idea is that Prom should scrape a local P2P proxy (similar to the service proxy in client mode) that discovers and scrapes services proxies through P2P. That way the Prom config file only needs to be configured with a single IP:port.

mwyzhao commented 4 years ago

Is that local proxy already implemented somewhere? I guess that's what Andrew was working on before.

If it's not already implemented I think I'll just get the Allocator to collect the stats for now and update it to use Prometheus in the future. I don't think I have the time today to figure out all of this just today.

t-lin commented 4 years ago

is that local proxy already implemented somewhere?

Unfortunately no. I think Andrew was working on that.

An alternative, similar to what we discussed in our last meeting, is to use the existing allocators on each node as an aggregator. Aggregators would collect info on the containers within the host, and create new (or update existing) metrics for each running container (see the ping monitor for how to create metrics and delete them). The Prom local proxy would only have to discover and scrape the allocators, which would probably help with the issue of scale when trying to discover a lot of service proxies.

mwyzhao commented 4 years ago

I'll just get it working without Prometheus for now, that should be enough to get the entire life cycle working. Once we need the extra metrics I'll look into adding the auto discover mechanism.

mwyzhao commented 4 years ago

Whoops misclicked

mwyzhao commented 4 years ago

@t-lin Funny thing just happened, I was just about to push my changes when I pressed Ctrl-D and logged out of the demo-server-0 vm and now I can't log back in. It says permission denied (publickey) so maybe it got deleted at some point? Would you be able to help me out so I can go on there and push my work.

On another note, the entire life-cycle works now, just one bug left where it doesn't seem to be updating the counter properly so the newly booted containers always die after 1 minute (hard coded threshold). I can look into it later this week or if you get to it before I do please feel free to take a look.

Edit: I seemed to have lost access to demo-server-1 as well, when I try to ssh in it just hangs. nova list says it's still up though.

t-lin commented 4 years ago

@michaelweiyuzhao I'll take a look. I'll have to shut down the VMs to investigate and fix it. Did you happen to change the mod/permissions on SSH-related files in the VMs?

mwyzhao commented 4 years ago

I don't remember messing with them.

t-lin commented 4 years ago

@michaelweiyuzhao It's weird... it looks like some files in your home directory have been deleted. Common files that I'd expect to be there (e.g. .bashrc, .profile, .ssh, .gitconfig) are all gone. Did you maybe run a badly formatted 'rm' command?

So I tried manually re-populating the .ssh files in your home directory so you can log in, and .profile + .bashrc for basic terminal stuff. However, when I start the VM, it seems to be having a lot of issues booting up (lots of errors about certain processes failing to come up), and then it just seems to get stuck in the boot process. I think possibly more files (other than the ones in your home directory) may have been somehow affected or deleted.

I can try and recover your current home directory files and put them into a new VM?

Also, about demo-server-1: Hadi was doing some stuff in the underlying physical server that affected the networking for that VM. I've restored the connectivity to it. I don't think he's responsible for the corruption seen in demo-server-0 though.

mwyzhao commented 4 years ago

Weird lol, I don't remember running any out of the ordinary rm commands. If I did, it's weird that part of it is gone and part of it is still left.

Yea if you can get the service-manager and service-registry files out that would be great. I just need to apply the changes to a commit and push them up.

t-lin commented 4 years ago

@michaelweiyuzhao I've created a new VM. Try accessing 10.11.17.40 and see if it has all the files you need. You'll need to re-install go again (I've copied the installation script into the new VM for you).

mwyzhao commented 4 years ago

Thanks, I'll try it out tomorrow morning.

mwyzhao commented 4 years ago

Pushed the changes and looked into the issue I mentioned yesterday for a bit, looks like the mutex I'm using called tolsrMux is blocking on the first time tolsrMux.Lock() is called, which makes no sense because in the docs it says initialization value is unlocked and in go playground it works fine. I'm also locked out of docker hub on the vm now because apparently I entered my password in wrong twice so I'm going to take a break from looking into it for now.

Did a quick Google search and nothing came up so if you have any idea why it might be blocking please let me know. If you want to take a look at the environment I have a tmux session that you can get into with tmux a.

t-lin commented 4 years ago

Thanks for the quick work!

the mutex I'm using called tolsrMux is blocking on the first time tolsrMux.Lock() is called

Yeah that sounds odd... mutexes should always be unlocked at start. How did you verify it's blocked on the first call? Is it possible another goroutine has the lock? I can possibly help debug it sometime.

mwyzhao commented 4 years ago

There are two places tolsrMux is used, once in the p2p request handler to update a counter and once in an http handler to read from the counter. I have print statements before acquiring and after releasing the lock in both handlers and it only prints the acquire message in the p2p request handler which leads me to think it blocks. The http handler is never run so that shouldn't have any affect on it.

t-lin commented 4 years ago

Hey @michaelweiyuzhao, so I debugged the lifecycle thing a bit tonight. I haven't committed anything, in case you're actively working on it. Here're some suggestions/questions for improvement:

Bug: The containers are currently being killed prematurely (<= 1 min), due to the fact that the ParseInt() returns an error, so containers are always culled whenever the culling function is called. There's a new-line character at the end of the body that needs to be stripped. Replace string(body) with strings.TrimSpace(string(body)).
When services are killed, they're not removed from the lca.services map. So future calls to the cull function always wastes time trying to query metrics from old/dead services.
Why pass node.services and node.servicesMutex separately to functions? Can node simply be passed as a single object?

With the above bug fix, the containers should live at least 1 minute. Actually, it'll be anywhere between [1:01, 2:00] mins depending on when the cull function is called and when the container was spawned. I think the overall logic to cull services can be improved for efficiency and accuracy (I have a few ideas if you're interested).

mwyzhao commented 4 years ago

Gotcha if you haven't already been working on a fix then I'll get started and push a patch. Sure, I'd be interested in hearing your ideas for cull logic.

mwyzhao commented 4 years ago

@t-lin I fixed the issue you mentioned and added a few more log statements for debugging purposes. Unfortunately the issue still exists as the tolsr variable still doesn't seem to update when a request is made to the proxy, as the allocator doesn't see an updated tolsr sent to it from the proxy. I only looked into it briefly so I could be wrong but I don't have time until later to double check.

PhysarumSM / service-manager

Enable Proxy Health Monitoring #25