flux-framework / flux-core

core services for the Flux resource management framework
GNU Lesser General Public License v3.0

Discussion for Prometheus plugin (example use case autoscaling) #5214

Closed vsoch closed 1 year ago

vsoch commented 1 year ago

We want to develop a flux plugin that is able to deliver metrics to a server (likely Prometheus, using https://github.com/jupp0r/prometheus-cpp) so they can then be consumed by the horizontal pod autoscaler adapter. In layman's terms, when the Flux queue gets too big and needs more resources, it can tell the autoscaling plugin and get them, and shrink back down the same way. Likely we'd want the plugin built (outside of or alongside flux?) and then loaded in an rc file, like modload 0 prometheus. Also note that Prometheus is interesting to use for other cases outside of autoscaling. The original discussion started here: https://github.com/flux-framework/flux-core/pull/5184#issuecomment-1565737130

Questions I have:

From @garlick

It depends on what kind of plugin is needed. I would think resource utilization or queue length or something like that would be the sort of metric you'd want for autoscaling? (Could we move this to the autoscaling discussion?)

I think to try we would just want to get the current queue stats - jobs that are running (using what resources) and jobs in the queue (and what resources are needed). I think I'd probably start with the most basic of things - number of jobs in the queue, maybe in different states, and then add details to that about resources needed vs. being used.

If I can get enough inkling of how to start, this would be fun for me to try.

Update: now using https://github.com/digitalocean/prometheus-client-c/ instead. Doh, I can't use a C++ library!

vsoch commented 1 year ago

okay I think I see how this works. Let me take a first shot (don't give me any hints) and I'll follow up with questions!

garlick commented 1 year ago

A suggestion to keep things simple and moving forward would be to write a python script that can be run periodically with flux cron to poll for the stats you want via the python api and then if there are python bindings for the cloud scaling stuff, push data there?

vsoch commented 1 year ago

A suggestion to keep things simple and moving forward would be to write a python script that can be run periodically with flux cron to poll for the stats you want via the python api and then if there are python bindings for the cloud scaling stuff, push data there?

I totally get that simplicity is good, but Python is really slow compared to C, and I want to do this in a way where it will actually work optimally (and I'll learn something). I'm doing fairly well - I've added the module (conditional on a flag) to the automake setup, and I have a skeleton module that builds and that I can now add to. It just fully compiled, including my dependency for prometheus! I'm also writing up some "how to plugin" docs that can eventually go into our main docs:

https://gist.github.com/vsoch/756f10b52f7889e1b781ccdc599fa8cc

That's just the start of the guide, and I can push the actual code to a branch and open a PR to show you, but I think I want to actually try adding some of the prometheus logic first (this is the part where I actually have to write code in C, uhoh)!
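For reference, the bare-bones skeleton is roughly the following (a simplified sketch, not the exact code in my branch; the automake glue and the actual prometheus bits are left out, and the module name is just illustrative):

```c
/* prometheus.c - minimal broker module skeleton (simplified sketch) */
#include <syslog.h>
#include <flux/core.h>

/* The broker dlopens the .so and calls mod_main() in its own thread;
 * returning from mod_main() ends the module.
 */
int mod_main (flux_t *h, int argc, char **argv)
{
    flux_log (h, LOG_INFO, "prometheus module loaded");

    /* Run the reactor until the broker asks the module to unload. */
    if (flux_reactor_run (flux_get_reactor (h), 0) < 0) {
        flux_log_error (h, "flux_reactor_run");
        return -1;
    }
    return 0;
}
```

Once installed, it can be loaded with flux module load prometheus, or with the modload line in an rc file as mentioned above.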

vsoch commented 1 year ago

See https://github.com/flux-framework/flux-core/pull/5215 for all my terrible efforts and a few questions I have :laughing: I'm going to try building this into a container with a custom rc1 file next to see if (when the plugin is built with flux core) I can see it run with those module load commands. My guess is that main is just called when it's loaded, and for interaction with Flux I need to sprinkle things elsewhere. And I think ideally I want it to kick on when the broker starts, so probably will need to dig in a bit more to figure out how that works.

And given that last thing is true, you are probably panicking because you don't want feature bloat added to Flux. To that, first, I agree, and don't worry about hurting my feelings if this doesn't ever make it in. Sometimes the important part is just the learning and proposing an idea to think more about. But on the other hand, I'd argue that if we can do this right, Prometheus is a really useful thing to have. It's what the lab should be trying to use for monitoring instead of rolling their own solutions, but I won't go down that path because I know about the stepping-on-toes thing. But for our scoped use cases, it would be hugely useful for Flux in a more modern setup (e.g., Kubernetes), not just for autoscaling but also for general monitoring.

So TLDR: I really want to get this fully working, and explore adding to Flux as much as is feasible. If it's a hard no / run away / do not pass Go and collect $200, we can fall back to other ideas. I'm not sure cron is best suited because it does require a server sort of thing. Anyway - still a TON more to learn so we aren't ready for that part of the discussion yet! I'm having fun so I am pretty pumped to even get this idea started.

vsoch commented 1 year ago

okay going to leave some notes here for learning - I'm trying to understand how modules work.

okay - so working theory is that we are registering some kind of service, and these reactor things might be involved, and there is some kind of communication between flux and that socket and the module code I write (that Flux knows as a shared library *.so) that (would hopefully?) allow me to regularly get stats. Need to test more stuff out to figure out more.
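If that theory is right, I think the concrete pattern is: define callbacks for request topics (something like "prometheus.stats" - name made up), register them on the handle, and let the reactor dispatch incoming RPCs to them. A rough, untested sketch:

```c
/* Sketch of a broker module that answers an RPC with some stats.
 * The topic "prometheus.stats" and struct prom_ctx are hypothetical.
 */
#include <flux/core.h>

struct prom_ctx {
    int njobs;   /* placeholder stat to report */
};

static void stats_cb (flux_t *h, flux_msg_handler_t *mh,
                      const flux_msg_t *msg, void *arg)
{
    struct prom_ctx *ctx = arg;   /* same pointer passed to addvec() below */

    if (flux_respond_pack (h, msg, "{s:i}", "njobs", ctx->njobs) < 0)
        flux_log_error (h, "flux_respond_pack");
}

static const struct flux_msg_handler_spec htab[] = {
    { FLUX_MSGTYPE_REQUEST, "prometheus.stats", stats_cb, 0 },
    FLUX_MSGHANDLER_TABLE_END,
};

int mod_main (flux_t *h, int argc, char **argv)
{
    struct prom_ctx ctx = { .njobs = 0 };
    flux_msg_handler_t **handlers = NULL;
    int rc;

    if (flux_msg_handler_addvec (h, htab, &ctx, &handlers) < 0)
        return -1;
    rc = flux_reactor_run (flux_get_reactor (h), 0);
    flux_msg_handler_delvec (handlers);
    return rc;
}
```

Then (I think) a client could hit that topic with a regular Flux RPC on whatever rank the module is loaded on.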

garlick commented 1 year ago

There's a summary of different component/plugin types here. Broker modules can be done in external projects (fluxion is an example).

A broker module is useful for implementing Flux services (e.g. a Flux RPC target) that run for the life of the Flux instance as the instance owner and are doing enough work to warrant their own thread of control. Since there can be an independent copy optionally loaded in every broker, broker modules can implement distributed services that leverage the TBON structure of the overlay network. Because they run in the broker address space, we have to be very sure they don't leak memory or segfault. For that reason alone, they are not the path we usually suggest for extending Flux if there are other ways to get the job done.

I think you're right that we have a documentation gap and it would be good to work through that with you, although it's not clear to me that this particular problem warrants a broker module, nor if it's appropriate for flux core.

Some misc info on broker modules

I apologize in advance for the state of the docs, and also for other dodgyness you may encounter as the broker module design and code are some of the oldest parts of flux.

Edit: s/broker plugin/broker module/

grondo commented 1 year ago

I think it could help if we stick to the terminology from that document for clarity. For this particular case, it sounds like we're talking about a broker module, not a plugin.

Also, I read through this issue and there is no description of what prometheus is or what it needs to do. Sorry for my ignorance, but in order to discuss the design of a prometheus module, we'd need to know exactly what the module would be doing. If it is generating "metrics", then as @garlick pointed out, a broker module might not be the right solution, since a module has its own thread of control so it won't be able to easily generate any metrics except about itself. (I'm not sure how we're thinking of gathering these metrics from a broker module, but I'm sorry if I missed that part of the discussion!)

For low-level metrics, we already have prototype support for StatsD collection (see https://github.com/flux-framework/flux-core/blob/master/src/common/libflux/stats.h). There's an example of intended usage in the content-cache here:

https://github.com/flux-framework/flux-core/blob/c2964c9a832559139cc56b5c541b6b92dfaafefe/src/broker/content-cache.c#L919-L931
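Roughly, the idea is that code just tags named counters and gauges on the flux handle and the broker forwards them to the configured StatsD endpoint. A simplified sketch of the usage (not the actual content-cache code; the exact function names and signatures should be checked against stats.h):

```c
/* Simplified sketch of emitting stats via the fripp/stats.h interface.
 * Check src/common/libflux/stats.h for the exact functions and signatures;
 * the metric names below are made up.
 */
#include <flux/core.h>

void report_queue_depth (flux_t *h, int pending, int running)
{
    /* gauges: current values sampled at this moment */
    flux_stats_gauge_set (h, "job-manager.pending", pending);
    flux_stats_gauge_set (h, "job-manager.running", running);

    /* counter: monotonically increasing event count */
    flux_stats_count (h, "job-manager.sampled", 1);
}
```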

A quick google search shows that there already exist some StatsD->Prometheus gateways, e.g. https://github.com/prometheus/statsd_exporter. We could add some test stats exports in the job-manager, possibly as a jobtap plugin at first, and you could potentially run the Prometheus exporter when running in Kubernetes.

If that idea definitely won't work, then rather than have a parallel implementation of a similar stats gathering service in flux-core, perhaps we can look at turning the "fripp" backend for stats collection into a plugin itself. Then the statsd support could easily be swapped out for the stats collection service du jour, allowing new stats implementations to take advantage of any instrumentation we've already added to Flux.

Finally, I'd like to explore further the idea of running a python service alongside Flux that collects stats and sends them to Prometheus. This could work if we're just sampling values (like the queue stats you propose above) as opposed to generating a metric on every event. If you keep the python service running, then the abysmally slow python startup cost doesn't matter, and sending queries to several services every second or even a smaller interval should work fine. The nice thing about this approach is that it would be much quicker to implement and wouldn't require core changes like the other solutions discussed above.

However, I'm not sure if it would actually work for Prometheus since I don't know its requirements, but thought I'd throw the idea out there.

garlick commented 1 year ago

@vsoch can fill in details but I think the idea is to implement a mechanism to scale a flux instance running in the cloud up or down based on its demand for resources. I would guess prometheus is a way to abstract the collection of metrics from a particular cloud service scaling implementation.

It seems like a good first step would be to identify the metrics of interest and the specific mechanisms available for collecting them in one place? Is this really a high-throughput data collection situation? It seems like more coarse-grained, slowly changing numbers are what we're probably looking at, since scaling up or down will likely be a heavyweight operation that we're not going to do at a high frequency.

grondo commented 1 year ago

Ok. But I'd still hate to have multiple stats collection mechanisms for each "use case" popping up in flux-core.

However, I'll just leave it at that and step back.

garlick commented 1 year ago

Ok. But I'd still hate to have multiple stats collection mechanisms for each "use case" popping up in flux-core.

I agree. More details on requirements are needed is my main point.

vsoch commented 1 year ago

Great discussion! Let me try to catch up.

I apologize in advance for the state of the docs, and also for other dodgyness you may encounter as the broker module design and code are some of the oldest parts of flux.

Totally OK - I thought for part of this I could at least write a guide for making a broker module (and I'll try to use that terminology).

A broker module is useful for implementing Flux services (e.g. a Flux RPC target) that run for the life of the Flux instance as the instance owner and are doing enough work to warrant their own thread of control. Since there can be an independent copy optionally loaded in every broker, broker modules can implement distributed services that leverage the TBON structure of the overlay network. Because they run in the broker address space, we have to be very sure they don't leak memory or segfault. For that reason alone, they are not the path we usually suggest for extending Flux if there are other ways to get the job done.

Indeed this broker module would be running a service on a port, and actually I added a timer watcher last night, and thought I would be closer to getting it working but it segfaulted (and the flux start instance exited). For the autoscaler use case, we would only want it running on the top level broker to send back metrics to the autoscaler about all the jobs in all nested instances.
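(For the leader-only part, I'm assuming the module can just check its broker rank when it loads and stay idle on non-zero ranks - a tiny sketch of what I mean:)

```c
/* Sketch: only export metrics from the leader (rank 0) broker;
 * non-zero ranks could simply skip starting the exporter.
 */
#include <stdbool.h>
#include <flux/core.h>

static bool on_leader_broker (flux_t *h)
{
    uint32_t rank;
    if (flux_get_rank (h, &rank) < 0)
        return false;
    return rank == 0;
}
```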

Also, I read through this issue and there is no description of what prometheus is or what it needs to do. Sorry for my ignorance, but in order to discuss the design of a promethus module, we'd need to know exactly what the module would be doing.

Sorry I had put that in the pull request itself. "Prometheus is a metrics server and time series database that makes it easy to monitor applications." It's the bread and butter of monitoring for cloud native. If I had to choose one monitoring tool from the entire universe, this would be it. It's open source and used by many companies and academic institutions, more detail here.

If it is generating "metrics", then as @garlick pointed out, a broker module might not be the right solution, since a module has its own thread of control so it won't be able to easily generate any metrics except about itself. (I'm not sure how we're thinking of gathering these metrics from a broker module, but I'm sorry if I missed that part of the discussion!)

I saw that, and was assuming it would be running on the leader broker, at the top of the entire tree. But the flexibility of running under a broker would allow someone to, for example, do a study that measures different needs for different parts of a workflow.

For low-level metrics, we already have prototype support for StatsD collection (see https://github.com/flux-framework/flux-core/blob/master/src/common/libflux/stats.h), There's an example of intended usage in the content-cache here

I did find statsd early on (before doing any work on this) but I didn't want to extend/use it because of the complexity. For any use case (and especially the operator) I don't want to require running several external services that have to then work nicely with Flux. I want something that is built into Flux and works with a module load or broker flag, and that's it. The implementation I'm working on runs the service from Flux and provides the endpoint directly, without the added complexity of external services. Prometheus as a metrics exporter is also much more popular than statsd. In summary (and if you Google to compare these two there are lots of articles), my preference is for Prometheus, and used directly (a layer of Prometheus on top of statsd is an even worse idea!), for these reasons:

There is some FAQ here and a comparison table. Also if you search around, autoscaling + Prometheus is a well-trodden path. There are examples and guides up the whazoo. I can't find much of anything for graphite / statsd. That alone says something.

A quick google search shows that there already exist some StatsD->Prometheus gateways, e.g. https://github.com/prometheus/statsd_exporter. We could add some test stats exports in the job-manager, possibly as a jobtap plugin at first, and you could potentially run the Promethus exporter when running in Kubernetes.

This would require the already complex dependencies for statsd (that graphite service) PLUS an additional sidecar? I really think that's too much.

If that idea definitely won't work, then rather than have a parallel implementation of a similar stats gathering service in flux-core, perhaps we can look at turning the "fripp" backend for stats collection into a plugin itself. Then the statsd support could easily be swapped out for the stats collection service du jour, allowing new stats implementations to take advantage of any instrumentation we've already added to Flux.

I don't know what fripp is, but if we could use fripp + (some stats exporter like prometheus that doesn't require umpteen extra services) that would work for me!

Finally, I'd like to explore further the idea of running a python service alongside Flux that collects stats and sends them to Prometheus.

It's the other way around - a Prometheus exporter uses a pull model, so it's the metrics server that would request stats from it. This is why some kind of cron job wouldn't work, because I'd want a direct request that gets the current stats at the time, not stats from 15 seconds ago.

This could work if we're just sampling values (like the queue stats you propose above) as opposed to generating a metric on every event.

We really only need an endpoint that, when pinged, provides the current stats. Watching the autoscaler/v1, it does a check every 15-30 seconds (eyeballing it).

If you keep the python service running, then the abysmally slow python startup cost doesn't matter, and sending queries to several services every second or even smaller interval should work fine. This nice thing about this approach is that it would be much quicker to implement and wouldn't require core changes like the other solutions discussed above.

Yeah, I was just hoping to have something be a part of Flux. As an example, the Python RESTful API works OK, but it's not as fast or reliable as I would want, and I also need to clone and install a separate repository to use it. I really like the ability to just add a broker flag and get it out of the box.

However, I'm not sure if it would actually work for Promethus since I don't know its requirements, but thought I'd throw the idea out there.

I guess it could, but it's a bit janky. But I'm not a flux core dev so I'll work with whatever you suggest. We should probably also get input from others on my converged computing team for what they think, because they have experience / insights to add, so OK with me if we wait for them to get back mid-week after the holiday.

But in the meantime - I'm kind of having fun working on this! :laughing: I'm debugging a segfault with my current design (I added a timer reactor) and might follow up with some questions if I can't figure it out. I'm learning a ton so even if we don't follow this path I appreciate the help!

vsoch commented 1 year ago

Also look over there... is that a pig flying? :pig: :pig2: :pig_nose: I never thought this could be possible, but I wrote a ton of C yesterday and I liked it!! mic drop :microphone: The errors somehow weren't so terrible to debug, and the language is starting to look more familiar and comfortable, as opposed to like gobbledygook. TLDR I think I might finally be picking it up a bit.

My one complaint with most of the C code I've looked at (well yesterday) is that it's not always well documented. I think I tend to fall on the other side of the documentation spectrum, adding comments where they really aren't needed, but (at least for learning) I would definitely appreciate more comments that describe what lines / chunks of code are doing.

vsoch commented 1 year ago

Maybe you can give some hints - my current design is here https://gist.github.com/vsoch/f2a65acba2cde9e0063c6577ff81f5f1 and it is segfaulting when I try to map the arg back into the context (this part):

struct prom_ctx *ctx = arg;

I thought this might be related to trying to put the prom structs / objects on it so I tried removing them and it still segfaults, so I think I'm probably just doing something stupid. It didn't seem like such a crazy idea because the heartbeat module does it too (what is that used for btw?)
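My current understanding of the pattern that keeps that cast valid is: create the context in mod_main and hand the same pointer to the watcher as its arg, and it has to stay alive (and be the right type) for as long as the watcher runs. A sketch of what I think should work (not my actual code):

```c
/* Timer watcher with a context argument: whatever pointer is passed at
 * create time is what comes back as arg in the callback, so it must be
 * the same type and outlive the watcher.  Sketch only.
 */
#include <syslog.h>
#include <flux/core.h>

struct prom_ctx {
    flux_t *h;
    int ticks;
};

static void timer_cb (flux_reactor_t *r, flux_watcher_t *w,
                      int revents, void *arg)
{
    struct prom_ctx *ctx = arg;   /* valid: &ctx from mod_main was passed below */
    ctx->ticks++;
    flux_log (ctx->h, LOG_INFO, "tick %d", ctx->ticks);
}

int mod_main (flux_t *h, int argc, char **argv)
{
    struct prom_ctx ctx = { .h = h, .ticks = 0 };
    flux_watcher_t *timer;
    int rc;

    /* fire after 5s, then every 5s */
    if (!(timer = flux_timer_watcher_create (flux_get_reactor (h),
                                             5., 5., timer_cb, &ctx)))
        return -1;
    flux_watcher_start (timer);
    rc = flux_reactor_run (flux_get_reactor (h), 0);
    flux_watcher_destroy (timer);
    return rc;
}
```

So if this is right, my segfault probably means either the arg handed to flux_timer_watcher_create() wasn't actually the ctx pointer, or the ctx was freed (or went out of scope) before the callback fired.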

vsoch commented 1 year ago

I'm going to try a different way - subscribing to the heartbeat event https://flux-framework.readthedocs.io/projects/flux-core/en/latest/man1/flux-event.html instead of trying to make another timer.
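Roughly what I have in mind (a sketch, untested; I still need to double check the exact event topic, which I believe is heartbeat.pulse):

```c
/* Sketch: collect stats on the broker heartbeat event instead of a
 * private timer.  The topic string should be verified against the
 * heartbeat module docs.
 */
#include <flux/core.h>

struct prom_ctx {
    int beats;
};

static void heartbeat_cb (flux_t *h, flux_msg_handler_t *mh,
                          const flux_msg_t *msg, void *arg)
{
    struct prom_ctx *ctx = arg;
    ctx->beats++;
    /* refresh / collect stats here on every heartbeat */
}

static const struct flux_msg_handler_spec htab[] = {
    { FLUX_MSGTYPE_EVENT, "heartbeat.pulse", heartbeat_cb, 0 },
    FLUX_MSGHANDLER_TABLE_END,
};

int mod_main (flux_t *h, int argc, char **argv)
{
    struct prom_ctx ctx = { .beats = 0 };
    flux_msg_handler_t **handlers = NULL;
    int rc;

    if (flux_event_subscribe (h, "heartbeat.pulse") < 0)
        return -1;
    if (flux_msg_handler_addvec (h, htab, &ctx, &handlers) < 0)
        return -1;
    rc = flux_reactor_run (flux_get_reactor (h), 0);
    flux_msg_handler_delvec (handlers);
    return rc;
}
```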

vsoch commented 1 year ago

okay following your suggestions, I've created a Python client for Prometheus + Flux: https://github.com/converged-computing/prometheus-flux that exports the metrics in the same way using starlette + uvicorn. It would need to be run alongside the flux instance in some kind of subprocess (and since this isn't the restful API, if I use it as the start command I can't access my cluster, so likely I need to manually shell in to the broker and start it). This is not ideal, but probably OK for testing. I'm hoping we can figure out a more production-ready setup, whenever folks are available to flesh this discussion out!

But in the meantime I'm going to test this out to see how it works (and how to enable autoscaler/v2).

grondo commented 1 year ago

It would need to be run alongside the flux instance in some kind of subprocess (and since this isn't the restful API, if I use it as the start command I can't access my cluster, so likely I need to manually shell in to the broker and start it).

Could you extend the restful server to run prometheus-flux as a subprocess and terminate it when exiting?

We don't have a general solution for running processes "in the background" in Flux. Some tangentially related discussion here.

The other thing you could do is run it as a job with the alloc-bypass plugin so it doesn't take any resources, but that's a bit kludgy too at the moment. Running your script in the background of the initial program (argument to flux start) is probably the easiest solution for now.

vsoch commented 1 year ago

I'm still chewing on this design a bit - I think after 2 days of work on this (and now adding to the operator and seeing how it works) I want to try actually removing the need for Prometheus entirely. Now that I can see the calls it's making, I'm seeing that at the end of the day, the autoscaler has custom metrics defined, and it's looking for a custom metrics API to get them from. That is normally some kind of service alongside a cluster, and the reason folks like Prometheus is that it separates the application logic from that service, and it would give you a view across the cluster. But I don't really need that, and in fact we can get all of our metrics about the entire cluster from one pod (the leader broker). So I'm going to step back and see if I can design an api-service (that would be run alongside the broker) that provides these same metrics. I have no idea if I can do that / how to do that, but if I can remove the need for Prometheus (it gets deployed as another operator) that would be really cool.

vsoch commented 1 year ago

This would actually be a super cool use case for the ability to easily connect to multiple flux leader brokers. E.g., the kind of queries for the API that I'm putting together look like:

/apis/custom.metrics.k8s.io/v1beta2/namespaces/<namespace>/metrics/<metric_name>

That design is intended to have a cluster-wide metrics API where you can request a metric from a specific namespace. If I'm running this inside a namespace (which is what I'm starting out with, since I'll be running it from inside the cluster) I'd just verify the requested namespace matches the actual namespace. But you can imagine this API server running in the default namespace, with permission to interact with pods in the other namespaces (each with their own Flux MiniCluster), and then it could make some call to any requested remote broker (e.g., a flux.Flux() handle for it), calculate the metric, and go away.

The use cases I think are subtly different. The hack in one namespace will work for most of what we want to do, but if we ever have multiple users (and multiple namespaces and MiniClusters) we probably want some kind of separation logic.

Anyhoo just thinking... back to starlette! :star:

vsoch commented 1 year ago

I got everything working with my custom metrics API, and without Prometheus. Thanks for the discussion, and please ping me if anyone comes around wanting Flux + Prometheus because we now have two ways to do that.