t-lin opened this issue 4 years ago
One possible solution (discussed during the meeting)
@hivanco @michaelweiyuzhao Would appreciate any thoughts, input, or alternative ideas you guys may have.
A possible alternative: it may be possible to base it on https://github.com/libp2p/go-libp2p-examples/tree/master/http-proxy
@hivanco @andrew-cc-chen Following up from yesterday's discussion. If a push-based model was used (i.e. using Pushgateways, each wrapped in a p2p node, with proxies pushing to the nearest one), then Prometheus still needs to first find the gateways via their hash and then pull from them. So it's just an extra level of indirection between Prometheus' client proxy and the service proxies? A 3-level hierarchy vs a 2-level one?
An obvious pro of this approach could be to achieve scalability (e.g. each region, or perhaps each set of N servers, has a local Pushgateway, and a top-level Prom collects it all), but I think that can be achieved by Prometheus' built-in federation feature (see: https://prometheus.io/docs/prometheus/latest/federation/). What are other possible pros/cons to be considered?
Right now the 3-level hierarchy looks like this:
proxy --- p2p stream (push) ---> central node --- http (pull) ---> prometheus
For the 2 level hierarchy, are you saying that prometheus should just pull straight from the proxies?
> For the 2 level hierarchy, are you saying that prometheus should just pull straight from the proxies?
Yeah... isn't that what the whole discussion the other day was about? Whether or not to do that, and what the trade-offs between the two models are?
> proxy --- p2p stream (push) ---> central node --- http (pull) ---> prometheus
Would the middle component truly be a centralized node? Or just one of many possible nodes? If it's truly central, then doesn't that imply a scalability issue?
> Would the middle component truly be a centralized node? Or just one of many possible nodes? If it's truly central, then doesn't that imply a scalability issue?
Central node as in one of many possible nodes
> Central node as in one of many possible nodes
So this goes back to a similar problem, no? How would Prometheus be aware of (discover) these nodes in order to pull from them? Unless you manually update the Prometheus config file for each node created?
I was thinking for now we could just manually update the Prometheus config file for each newly created node.
@t-lin I was wondering if there's a way to check if a connection between two nodes is active? Right now, every 30 sec, the collector node finds peers and attempts to create a new stream for each peer. It's fine for the first cycle, but during the second cycle I have no way to check if a connection is active unless I map each peer to a flag.
@andrew-cc-chen See https://github.com/Multi-Tier-Cloud/common/blob/cfbf72af7985e179c389a59ca772ef06f8c2ed2a/p2pnode/p2pnode.go#L278 for an example
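For reference, that check boils down to something like this (a minimal sketch against the go-libp2p-core API; isConnected is an illustrative helper, not something in the repo):

```go
package main

import (
	"github.com/libp2p/go-libp2p-core/host"
	"github.com/libp2p/go-libp2p-core/network"
	"github.com/libp2p/go-libp2p-core/peer"
)

// isConnected reports whether we currently have an open connection to the peer.
func isConnected(h host.Host, p peer.ID) bool {
	return h.Network().Connectedness(p) == network.Connected
}
```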
Sorry, I meant to say: is there a way to check if a stream between two nodes is active? Connectedness(...) seems to check if a connection is possible between two nodes.
ConnsToPeer(...) seems to be the way to check, I'll give that a try
Connectedness() can tell you if you are connected or not. See: https://godoc.org/github.com/libp2p/go-libp2p-core/network#Connectedness

Note that a single Conn object represents an actual connection. You can have one or more Stream objects, which represent virtual connections, on top of it, and hence the cost of opening a new Stream is quite low.

You should know if a given stream is still open or not, since you control both the dialing side and the receiver side (in the handler function). If you don't close both ends, then it should still be open so long as the underlying connection is still open.
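To illustrate the Conn vs Stream distinction, here's a minimal sketch (openMonitorStream is a hypothetical helper, not existing code): opening a stream reuses an already-open connection, and closing the stream leaves that connection intact.

```go
package main

import (
	"context"

	"github.com/libp2p/go-libp2p-core/host"
	"github.com/libp2p/go-libp2p-core/peer"
	"github.com/libp2p/go-libp2p-core/protocol"
)

// openMonitorStream opens a stream for the given protocol. If a connection to
// the peer already exists, libp2p reuses it, so this is cheap. Closing the
// stream afterwards does not tear down the underlying Conn.
func openMonitorStream(ctx context.Context, h host.Host, p peer.ID, proto protocol.ID) error {
	s, err := h.NewStream(ctx, p, proto)
	if err != nil {
		return err
	}
	defer s.Close()
	// ... exchange data over s here ...
	return nil
}
```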
> ConnsToPeer(...) seems to be the way to check, I'll give that a try
ConnsToPeer() returns a slice of Conns open to a given peer; it doesn't tell you whether a given stream is open. See above for the distinction.
All you need is a single connection to be open; you generally don't need more. Multiple connections simply enable nodes that speak multiple protocols (e.g. TCP, QUIC, WebRTC, etc.) or nodes that have multiple interfaces (e.g. a wired interface and a wireless interface) to have redundancy.
Right now my working solution to check for an existing stream with a specific protocol is something like:

    hasMonitorStream := false
    for _, conn := range node.Host.Network().ConnsToPeer(peerID) {
        for _, stream := range conn.GetStreams() {
            if stream.Protocol() == ProxyMonitorProtocolID {
                hasMonitorStream = true // an existing monitor stream means we can skip this peer
            }
        }
    }
Please let me know if there's a better solution
I think I'm lacking some context here. Do you want to hop on a zoom chat and you can show me what you're trying to do? I'll open up our regular meeting room.
@t-lin Should I be adding all the metrics exposed by the promhttp handler as a GaugeVec (e.g. go_memstats_alloc_bytes, go_threads, etc.)? Or just go_goroutines?
Yes, ideally you should be passing all metrics. You never know what new metrics will be exposed by proxies later on, so you don't want to be modifying your code each time.
I'm currently exposing the metrics with promhttp in the proxy, grabbing the metrics using http.DefaultClient.Get (i.e. curl), then sending the metrics as a string over the stream. But I'm lost as to how to efficiently process the metrics. I can't find much info online, but instead of processing the metrics and doing promauto.NewGaugeVec() for each metric with its distinct labels, is there a way to take the http response from promhttp and auto-create the gauges?
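One possible way to process that string without hand-writing a GaugeVec per metric would be to parse the text exposition format directly. A minimal sketch, assuming the github.com/prometheus/common/expfmt package (parseScrape is an illustrative name):

```go
package main

import (
	"fmt"
	"strings"

	"github.com/prometheus/common/expfmt"
)

// parseScrape parses the text exposition format that promhttp serves (i.e. the
// string the proxy sends over the stream) into metric families, so gauges could
// be (re)created programmatically instead of one hand-written GaugeVec per metric.
func parseScrape(payload string) error {
	var parser expfmt.TextParser
	families, err := parser.TextToMetricFamilies(strings.NewReader(payload))
	if err != nil {
		return err
	}
	for name, mf := range families {
		// Each family carries its type, help text, and one sample per label set.
		fmt.Printf("%s (%s): %d series\n", name, mf.GetType(), len(mf.GetMetric()))
	}
	return nil
}
```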
Perhaps an alternative would be to pre-process the results in the proxies, and then let the proxies return a simple text response through the stream to the collector? Then when Prometheus scrapes the collector, it just has to provide the set of the text responses.
It's possible to get the metric objects themselves in the code within the proxy. See the DefaultGatherer, which contains all the default metrics exposed by Prometheus, and its Gather() function.
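For example, something along these lines could gather and serialize the default metrics entirely in code, without an HTTP round-trip to our own handler (a minimal sketch; gatherLocalMetrics is an illustrative name):

```go
package main

import (
	"bytes"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/common/expfmt"
)

// gatherLocalMetrics collects every metric registered in the default registry
// (the same set promhttp serves) and encodes it in the text exposition format.
func gatherLocalMetrics() (string, error) {
	families, err := prometheus.DefaultGatherer.Gather()
	if err != nil {
		return "", err
	}
	var buf bytes.Buffer
	enc := expfmt.NewEncoder(&buf, expfmt.FmtText)
	for _, mf := range families {
		if err := enc.Encode(mf); err != nil {
			return "", err
		}
	}
	return buf.String(), nil
}
```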
Quick question, would allocating a new proxy require adding the prom metrics IP and port to the central prometheus.yml in order for it to be scraped?
If that's the case, Prometheus directly scraping the proxies seems very impractical to me.
> Quick question, would allocating a new proxy require adding the prom metrics IP and port to the central prometheus.yml in order for it to be scraped? If that's the case, Prometheus directly scraping the proxies seems very impractical to me.
The idea is that Prom should scrape a local P2P proxy (similar to the service proxy in client mode) that discovers and scrapes service proxies through P2P. That way the Prom config file only needs to be configured with a single IP:port.
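Roughly, that local collector could look something like the sketch below. This is an assumption-laden outline, not an existing implementation: metricsHandler and the discover callback are hypothetical placeholders for the real P2P discovery logic.

```go
package main

import (
	"io"
	"net/http"

	"github.com/libp2p/go-libp2p-core/host"
	"github.com/libp2p/go-libp2p-core/peer"
	"github.com/libp2p/go-libp2p-core/protocol"
)

// metricsHandler sketches the local collector proxy: on each Prometheus scrape
// it dials every discovered service proxy over a p2p stream and relays whatever
// text-format metrics that proxy returns.
func metricsHandler(h host.Host, proto protocol.ID, discover func() []peer.ID) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		for _, p := range discover() {
			s, err := h.NewStream(r.Context(), p, proto)
			if err != nil {
				continue // skip unreachable proxies rather than failing the whole scrape
			}
			io.Copy(w, s) // the proxy writes its metrics in text format, then closes its end
			s.Close()
		}
	}
}
```

In practice the collector would also need to add a per-proxy label (or otherwise aggregate the samples) so that identically named metrics from different proxies don't collide when Prometheus ingests the combined output.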
Is that local proxy already implemented somewhere? I guess that's what Andrew was working on before.
If it's not already implemented, I think I'll just get the Allocator to collect the stats for now and update it to use Prometheus in the future. I don't think I have the time to figure all of this out just today.
> Is that local proxy already implemented somewhere?
Unfortunately no. I think Andrew was working on that.
An alternative, similar to what we discussed in our last meeting, is to use the existing allocators on each node as an aggregator. Aggregators would collect info on the containers within the host, and create new (or update existing) metrics for each running container (see the ping monitor for how to create metrics and delete them). The Prom local proxy would only have to discover and scrape the allocators, which would probably help with the issue of scale when trying to discover a lot of service proxies.
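Not the ping monitor's actual code, but the create/update/delete pattern it refers to looks roughly like this with client_golang (the metric name and label here are illustrative only):

```go
package main

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// cpuUsage is a sketch of a per-container metric the aggregator could maintain.
var cpuUsage = promauto.NewGaugeVec(prometheus.GaugeOpts{
	Name: "container_cpu_usage",
	Help: "CPU usage reported by the allocator for each running container.",
}, []string{"container"})

// updateContainer sets the latest sample for a running container.
func updateContainer(id string, value float64) {
	cpuUsage.WithLabelValues(id).Set(value)
}

// removeContainer drops the series once the container goes away, so the
// collector doesn't keep exporting stale metrics.
func removeContainer(id string) {
	cpuUsage.DeleteLabelValues(id)
}
```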
I'll just get it working without Prometheus for now, that should be enough to get the entire life cycle working. Once we need the extra metrics I'll look into adding the auto discover mechanism.
Whoops misclicked
@t-lin Funny thing just happened: I was just about to push my changes when I pressed Ctrl-D and logged out of the demo-server-0 VM, and now I can't log back in. It says permission denied (publickey), so maybe my key got deleted at some point? Would you be able to help me out so I can get on there and push my work?
On another note, the entire life-cycle works now. There's just one bug left where it doesn't seem to be updating the counter properly, so the newly booted containers always die after 1 minute (hard-coded threshold). I can look into it later this week, or if you get to it before I do please feel free to take a look.
Edit: I seem to have lost access to demo-server-1 as well; when I try to ssh in it just hangs. nova list says it's still up though.
@michaelweiyuzhao I'll take a look. I'll have to shut down the VMs to investigate and fix it. Did you happen to change the mod/permissions on SSH-related files in the VMs?
I don't remember messing with them.
@michaelweiyuzhao It's weird... it looks like some files in your home directory have been deleted. Common files that I'd expect to be there (e.g. .bashrc, .profile, .ssh, .gitconfig) are all gone. Did you maybe run a badly formatted 'rm' command?
So I tried manually re-populating the .ssh files in your home directory so you can log in, and .profile + .bashrc for basic terminal stuff. However, when I start the VM, it seems to be having a lot of issues booting up (lots of errors about certain processes failing to come up), and then it just seems to get stuck in the boot process. I think possibly more files (other than the ones in your home directory) may have been somehow affected or deleted.
I can try and recover your current home directory files and put them into a new VM?
Also, about demo-server-1: Hadi was doing some stuff in the underlying physical server that affected the networking for that VM. I've restored the connectivity to it. I don't think he's responsible for the corruption seen in demo-server-0 though.
Weird lol, I don't remember running any out of the ordinary rm commands. If I did, it's weird that part of it is gone and part of it is still left.
Yea if you can get the service-manager and service-registry files out that would be great. I just need to apply the changes to a commit and push them up.
@michaelweiyuzhao I've created a new VM. Try accessing 10.11.17.40 and see if it has all the files you need. You'll need to re-install go again (I've copied the installation script into the new VM for you).
Thanks, I'll try it out tomorrow morning.
Pushed the changes and looked into the issue I mentioned yesterday for a bit. It looks like the mutex I'm using, called tolsrMux, is blocking the first time tolsrMux.Lock() is called, which makes no sense because the docs say the zero value is an unlocked mutex, and in the Go playground it works fine. I'm also locked out of Docker Hub on the VM now because apparently I entered my password wrong twice, so I'm going to take a break from looking into it for now.
Did a quick Google search and nothing came up, so if you have any idea why it might be blocking please let me know. If you want to take a look at the environment, I have a tmux session that you can get into with tmux a.
Thanks for the quick work!
> the mutex I'm using called tolsrMux is blocking on the first time tolsrMux.Lock() is called
Yeah that sounds odd... mutexes should always be unlocked at start. How did you verify it's blocked on the first call? Is it possible another goroutine has the lock? I can possibly help debug it sometime.
There are two places tolsrMux is used: once in the p2p request handler to update a counter, and once in an HTTP handler to read from the counter. I have print statements before acquiring and after releasing the lock in both handlers, and it only prints the acquire message in the p2p request handler, which leads me to think it blocks. The HTTP handler is never run, so that shouldn't have any effect on it.
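For reference, the usual pattern for a shared counter looks like the sketch below (the tolsr/tolsrMux names mirror the discussion, but this is not the actual code):

```go
package main

import "sync"

var (
	tolsrMux sync.Mutex
	tolsr    int64
)

// Called from the p2p request handler.
func incrementTolsr() {
	tolsrMux.Lock()
	defer tolsrMux.Unlock() // defer guarantees release even on early return or panic
	tolsr++
}

// Called from the HTTP handler.
func readTolsr() int64 {
	tolsrMux.Lock()
	defer tolsrMux.Unlock()
	return tolsr
}
```

If Lock() really does block on the very first call, the usual suspects are another goroutine already holding the lock, or the mutex having been copied by value at some point (go vet's copylocks check flags the latter); a SIGQUIT goroutine dump will show exactly where everything is blocked.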
Hey @michaelweiyuzhao, so I debugged the lifecycle thing a bit tonight. I haven't committed anything, in case you're actively working on it. Here are some suggestions/questions for improvement (a sketch of the first fix follows after the list):

1. ParseInt() returns an error, so containers are always culled whenever the culling function is called. There's a new-line character at the end of the body that needs to be stripped. Replace string(body) with strings.TrimSpace(string(body)).
2. Old/dead services never seem to be removed from the lca.services map, so future calls to the cull function always waste time trying to query metrics from old/dead services.
3. Why pass node.services and node.servicesMutex separately to functions? Can node simply be passed as a single object?

With the above bug fix, the containers should live at least 1 minute. Actually, it'll be anywhere between [1:01, 2:00] mins depending on when the cull function is called and when the container was spawned. I think the overall logic to cull services can be improved for efficiency and accuracy (I have a few ideas if you're interested).
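A minimal sketch of the fix in item 1 (the base and bit-size arguments to ParseInt are assumptions, not taken from the actual call site):

```go
package main

import (
	"strconv"
	"strings"
)

// parseCounter strips the trailing newline before parsing; without TrimSpace,
// ParseInt returns an error and the container always gets culled.
func parseCounter(body []byte) (int64, error) {
	return strconv.ParseInt(strings.TrimSpace(string(body)), 10, 64)
}
```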
Gotcha, if you haven't already been working on a fix then I'll get started and push a patch. Sure, I'd be interested in hearing your ideas for cull logic.
@t-lin I fixed the issue you mentioned and added a few more log statements for debugging purposes. Unfortunately the issue still exists: the tolsr variable still doesn't seem to update when a request is made to the proxy, as the allocator doesn't see an updated tolsr sent to it from the proxy. I only looked into it briefly so I could be wrong, but I don't have time until later to double-check.
Discussed in the May 21st meeting: we want to figure out a way to scrape metrics from proxies so we can potentially monitor their health and connection status.
Recap of the current issue: Proxies are containerized.