cholcombe973 / rusix

Distributed filesystem in Rust

Management interface #2

Open cholcombe973 opened 6 years ago

cholcombe973 commented 6 years ago

How should the site reliability engineer (SRE) be interfacing with this filesystem? SREs generally like metrics to be exported so that external systems can track the health of the cluster. A CLI is usually also pretty high on the list, as well as a REST interface for people who are more DIY oriented.

garypen commented 6 years ago

SRE (general use case)

In general, an SRE is going to use whichever management facilities are provided by his/her platform and almost certainly doesn't want to learn new solutions (unless they provide massive benefits). For instance, Prometheus is a popular choice for Kubernetes. Many other "log consuming" solutions exist (Splunk, etc.), which all work in more or less the same way.

I think the best way to interact with these systems is to provide logging facilities which follow the "standards" in this area:

and then rely on tools such as Prometheus, Splunk, etc. to provide health monitoring, alerts, and so on, based on the log contents.

Logging

We could use the "log" crate (https://github.com/rust-lang-nursery/log) as a facade over our chosen logging facility. My current preference would be to use "log4rs", but with the protection of the facade we could change that decision later if required as the logging space evolves.

An alternative is "slog", which has widespread use.
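To make the facade argument concrete, here is a minimal std-only sketch of the pattern. It is not the `log` crate's API: the `Level`, `LogBackend`, `StderrBackend`, `MemoryBackend`, and `Node` names are all hypothetical and exist only to show that the calling code stays the same when the back-end is swapped.

```rust
use std::sync::Mutex;

/// Severity levels, declared from most to least severe (hypothetical).
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum Level {
    Error,
    Warn,
    Info,
    Debug,
}

/// The facade: filesystem code depends only on this trait.
pub trait LogBackend: Send + Sync {
    fn log(&self, level: Level, message: &str);
}

/// One possible back-end: write records to stderr.
pub struct StderrBackend;

impl LogBackend for StderrBackend {
    fn log(&self, level: Level, message: &str) {
        eprintln!("[{:?}] {}", level, message);
    }
}

/// Another back-end: collect records in memory (handy in tests).
#[derive(Default)]
pub struct MemoryBackend {
    pub records: Mutex<Vec<String>>,
}

impl LogBackend for MemoryBackend {
    fn log(&self, level: Level, message: &str) {
        self.records
            .lock()
            .unwrap()
            .push(format!("[{:?}] {}", level, message));
    }
}

/// Stand-in for filesystem code that logs through the facade.
pub struct Node<'a> {
    pub backend: &'a dyn LogBackend,
    pub max_level: Level,
}

impl<'a> Node<'a> {
    pub fn emit(&self, level: Level, message: &str) {
        // Derived Ord follows declaration order: Error < Warn < Info < Debug.
        if level <= self.max_level {
            self.backend.log(level, message);
        }
    }
}
```

Switching from `StderrBackend` to `MemoryBackend` (or, in the real project, from log4rs to another logger behind `log`) only changes the value passed as `backend`; every `emit` call site stays untouched.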

CLI/REST

I like the approach adopted by many systems nowadays of writing an API (RESTful, ...) that supports management/configuration of a system and then writing a client that exercises that API and exposes most (all?) of the features.

Good examples include kubectl (Kubernetes), openstack (OpenStack), etc.

We should be doing something like this: rusixctl... ?
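As a rough sketch of what "a client that exercises the API" could mean, the function below translates command-line arguments into the request the client would send. The subcommands and the `/v1/...` paths are assumptions made up for illustration; nothing here reflects an agreed rusix API, and no HTTP library is involved.

```rust
/// HTTP methods used by the hypothetical management API.
#[derive(Debug, PartialEq, Eq)]
pub enum Method {
    Get,
    Post,
    Delete,
}

/// The request a `rusixctl` invocation would turn into.
#[derive(Debug, PartialEq, Eq)]
pub struct ApiRequest {
    pub method: Method,
    pub path: String,
}

/// Translate CLI arguments (without the program name) into an API request.
/// Every command and path below is a placeholder.
pub fn to_request(args: &[&str]) -> Result<ApiRequest, String> {
    match args {
        ["volume", "list"] => Ok(ApiRequest {
            method: Method::Get,
            path: "/v1/volumes".to_string(),
        }),
        ["volume", "create", name] => Ok(ApiRequest {
            method: Method::Post,
            path: format!("/v1/volumes/{}", name),
        }),
        ["volume", "delete", name] => Ok(ApiRequest {
            method: Method::Delete,
            path: format!("/v1/volumes/{}", name),
        }),
        ["cluster", "status"] => Ok(ApiRequest {
            method: Method::Get,
            path: "/v1/cluster/status".to_string(),
        }),
        _ => Err(format!("unknown command: {}", args.join(" "))),
    }
}
```

With this shape, `rusixctl volume delete vol0` would map to a `DELETE` on `/v1/volumes/vol0`, and the CLI stays a thin layer over whatever the REST interface exposes.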

Deciding which web framework (if any) to adopt to implement the RESTful interface will be tricky. There are many candidates (warp, tower-web (soon), actix-web, conduit, gotham, etc...) and this is a space that is evolving rapidly. I like the look of warp, but it's very new. I also like gotham and actix-web. Any preferences?

cholcombe973 commented 6 years ago

Wow yeah I agree setting up a REST API and then letting people do their thing would be best. I'm actually not familiar with warp or tower-web. I've created a few things with rocket though and that was nice. I'm generally agreeing with everything you're saying here and I don't have a preference for a web framework. Rocket works well but it requires nightly which is kind of a pain sometimes. Anything that stays on stable would be really nice. I've been kinda leaning away from json lately just because it's ambiguous but I don't have a problem if you want to use it.

So far I've only really used the log crate. I gave slog a try awhile back and it was alright. I didn't feel like I gained enough from it to prefer it though. I agree making logging configurable in terms of destination and level is best. I have a few examples of doing that with the clap crate and it's super easy.

jcgruenhage commented 6 years ago

I don't necessarily agree on generating metrics from logs; having a built-in metrics endpoint to be scraped externally is a lot better IMO.

For logging, the log crate is definitely the way to go, with a backend like env_logger or fern.

For choosing a web framework, I'd currently go with actix-web. It has been really nice to use when I tried it, works on stable and is still progressing. Warp also looks nice, but a bit basic, and tower-web is not there yet.

jcgruenhage commented 6 years ago

For metrics, the tic crate looks like it would be a good fit.

garypen commented 6 years ago

Looks like we are in agreement about logging: use the log crate and then choose an appropriate back-end. I don't have strong feelings about that so happy to go with suggestions. For the web framework: actix-web is a good choice and I'm happy to go with that.

For metrics: I still think it seems like an unnecessary thing to be considering, mainly because we are going to do logging anyway and various ETL stacks do a good job of handling log data. For example, Grafana provides great visualisations from log data. What additional functionality would something like "tic" be providing?

jcgruenhage commented 6 years ago

For metrics data from logging to be useable, we'd need to log every event relevant to the metrics we want to have. If we use a different thing, dedicated to metrics, that means we don't have to spam the logs with events that we probably don't care about and that make the logs less human readable.

Tic presents those metrics on an http endpoint, where external scrapers (something like Prometheus for example) can get them. Those scrapers could then be used as a data source for Grafana and the like.
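To show what a scraper would actually fetch, here is a small std-only sketch that keeps counters in memory and renders them in the Prometheus text exposition format. It does not use tic's API; the `Metrics` struct and the `rusix_*` metric names are hypothetical, and serving the string over HTTP is left out.

```rust
use std::fmt::Write;
use std::sync::atomic::{AtomicU64, Ordering};

/// In-memory counters, updated by the filesystem without touching the logs.
#[derive(Default)]
pub struct Metrics {
    pub reads_total: AtomicU64,
    pub writes_total: AtomicU64,
}

/// Append one counter in Prometheus text format: HELP, TYPE, then the sample.
fn write_counter(out: &mut String, name: &str, help: &str, value: u64) {
    // Writing into a String cannot fail, so unwrap is safe here.
    writeln!(out, "# HELP {} {}", name, help).unwrap();
    writeln!(out, "# TYPE {} counter", name).unwrap();
    writeln!(out, "{} {}", name, value).unwrap();
}

impl Metrics {
    pub fn record_read(&self) {
        self.reads_total.fetch_add(1, Ordering::Relaxed);
    }

    pub fn record_write(&self) {
        self.writes_total.fetch_add(1, Ordering::Relaxed);
    }

    /// Body a metrics endpoint would return to a scraper.
    pub fn render(&self) -> String {
        let mut out = String::new();
        write_counter(
            &mut out,
            "rusix_reads_total",
            "Read operations served.",
            self.reads_total.load(Ordering::Relaxed),
        );
        write_counter(
            &mut out,
            "rusix_writes_total",
            "Write operations served.",
            self.writes_total.load(Ordering::Relaxed),
        );
        out
    }
}
```

Each event costs one atomic increment instead of a log line, which is the point about not spamming the logs: the per-event detail never reaches the log files, and only the aggregated values are exposed.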

garypen commented 6 years ago

Lots of applications do log many metrics nowadays because the consumers of the data don't want to deal with multiple log sources. However, I agree with you that "spamming" the logs is a problem.

Ok, I'm happy to decide to keep metric data separate from other logged data. I'm happy with the choice of tic.

Can we close this and note that:

- web framework: actix-web
- metrics: REST endpoint/tic
- logging: log crate + back-end logger (to be decided later)

?

jcgruenhage commented 6 years ago

I'd be fine with those three, yes.

Maybe having an admin interface outside of that that shows a few basic things without having a monitoring/metrics stack (Prometheus + AlertManager + Grafana) would still be useful? List of storage devices with usage/health, replication/erasure coding settings/health, cluster capacity and usage, things like that.
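A minimal sketch of the data such a basic view might summarise, assuming each storage device reports a capacity, a usage figure, and a health flag. The `Device` and `ClusterSummary` types and their fields are hypothetical; replication and erasure coding settings are left out.

```rust
/// One storage device as the admin view might see it (hypothetical fields).
#[derive(Debug, Clone)]
pub struct Device {
    pub name: String,
    pub capacity_bytes: u64,
    pub used_bytes: u64,
    pub healthy: bool,
}

/// Cluster-wide totals plus the names of devices needing attention.
#[derive(Debug, PartialEq)]
pub struct ClusterSummary {
    pub capacity_bytes: u64,
    pub used_bytes: u64,
    pub unhealthy: Vec<String>,
}

impl ClusterSummary {
    /// Usage as a percentage of capacity; 0.0 for an empty cluster.
    pub fn used_percent(&self) -> f64 {
        if self.capacity_bytes == 0 {
            0.0
        } else {
            self.used_bytes as f64 * 100.0 / self.capacity_bytes as f64
        }
    }
}

/// Fold the per-device reports into one summary.
pub fn summarize(devices: &[Device]) -> ClusterSummary {
    ClusterSummary {
        capacity_bytes: devices.iter().map(|d| d.capacity_bytes).sum(),
        used_bytes: devices.iter().map(|d| d.used_bytes).sum(),
        unhealthy: devices
            .iter()
            .filter(|d| !d.healthy)
            .map(|d| d.name.clone())
            .collect(),
    }
}
```

A summary like this could be served by the same REST API and printed by the CLI, so it would work without Prometheus, AlertManager, or Grafana being deployed.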

cholcombe973 commented 6 years ago

Yup I'm fine with those 3 as well
