elastic / elastic-agent-system-metrics

Apache License 2.0
0 stars 22 forks source link

Investigate merging this repository into go-sysinfo #51

Open cmacknz opened 2 years ago

cmacknz commented 2 years ago

This repository currently contains the CGO dependent code responsible for obtaining system level metrics for the Metricbeat system module and Agent system integration. It was forked out of https://github.com/elastic/elastic-agent-libs because of the CGO dependency. This has left it as an isolated repository with a single maintainer.

Since the system metrics code overlaps functionally with much of the system dependent code in https://github.com/elastic/go-sysinfo it seems appealing to try to merge them together. The go-sysinfo code is much more widely known, merging them will consolidate our OS/system dependent packages, and the system metrics packages should be able to benefit from additional code reviewers that are familiar with system level details.

Let's use this issue to determine what steps we would need to take to move the system metrics code here into the go-sysinfo repository. The biggest barrier will be that the two codebases have totally different APIs, so we will have to identify ways to make the system metrics code conform to the conventions in go-sysinfo or decide where we will allow the two APIs to diverge.

I think finding a common home for our system level code is the best path for long term maintenance of both code bases considering how much they overlap.

CC @andrewkroh @fearful-symmetry

fearful-symmetry commented 2 years ago

Agreed that we need a common home for all the system metrics, the mix of repos we have now is a bit of a mess; I spent a lot of time trying to clean it up, then V2 happened and we got a whole new repo (this one).

I think the best strategy, considering how time-constrained we all are, is to merge everything to one repo now, and then over time slowly try to merge/refactor the libraries to make the code more homogeneous. There's also gosigar, which should be deprecated, but might still have some dangling imports.

Some context might help explain the API differences here: The code in this repo is mostly from beats/libbeat, which was written/refactored with the intention of treating the the linux and system modules as first-class use cases; this is why many of the refactored metrics here, like memory, cpu, process, contain a great deal of logic, while the corresponding system metricsets are largely boilerplate code. The intention here was to cut down on the expensive and messy MapStr formatting functions that transformed system metrics into a structure that elasticsearch expected. At this point, it was decided to treat the code in go-sysinfo as servicing non-reported metrics, like self-monitoring in libbeat, fetching process info, etc.

They're built around two different use cases, and if we want to "standardize" them somewhat, I suspect it'll be easier to convert the go-sysinfo users to the more libbeat-style of metrics, simply due to the exhausting amount of hashmap manipulation that's sometimes required to format system metrics for reporting upstream.

cmacknz commented 2 years ago

It feels like as a generalized system level package it would be more useful if we didn't assume it was being used for libbeat metrics (and with the V2 input project, we may change what this needs to be).

I think there is only a single user of the code in this repository (the metricbeat system module, and maybe some use in agent itself) where as go-sysinfo is used in a few more places. That unfortunately means that if some code is going to see breaking API changes it probably has to be the system metrics code, or at least it may need to be restructured to have a metrics API and something else underneath that better fits the go-sysinfo API.

andrewkroh commented 2 years ago

I looked over the data provided by elastic-agent-system-metrics (I'll call it EASM). I think it would be good address the gaps in go-sysinfo by incorporating features from EASM. My recommendation would to implement small pieces at a time in go-sysinfo, update EASM to use the new code, and repeat until there's very little logic left in EASM.

From an API perspective I think go-sysinfo should continue to provide only raw values. If Agent or Beats need to compute deltas between samples (i.e. datapoint = current - last) then that logic should live outside of go-sysinfo. The same for other derivative values like CPU percentages.

go-sysinfo will also need to implement https://github.com/elastic/go-sysinfo/issues/12.

fearful-symmetry commented 2 years ago

Jumping off of what @andrewkroh said, the biggest issue here is the amount of code duplication, and ever-present temptation to blur the line of what public APIs should go where.

I'm still kind of leaning towards merging these two libraries somehow. The two libraries are always going to be deeply interlinked, and keeping them in separate repos will add a lot of development headaches.

cmacknz commented 2 years ago

Agreed, thanks for the detailed notes Andrew. We should have one system level Go repository. Let's commit to a slow migration of this code into go-sysinfo on a piece by piece basis as we have time.