elastic / elasticsearch

Free and Open, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

Collect more system information for diagnostics #88795

Open DaveCTurner opened 2 years ago

DaveCTurner commented 2 years ago

When a customer reaches out to Elastic Support, we need to gather useful information to triage and solve their problems. This process involves getting data from the target cluster.

The [support diagnostics utility](https://github.com/elastic/support-diagnostics) is often the starting point of this process. Our recent discussions about supportability identified that exposing more system information (performance metrics emulating vmstat, iostat, and other tools) would help speed up the support flow and reduce the MTTR. The proposed approach is to improve the existing tool to collect more raw files from the disk, and then analyze them offline.

This is easier than reimplementing all the analysis logic in Elasticsearch, and it would allow us to collect more information that Elasticsearch doesn't have direct access to because of security constraints that we don't want to relax.

elasticsearchmachine commented 2 years ago

Pinging @elastic/es-core-infra (Team:Core/Infra)

grcevski commented 2 years ago

I'm adding here some of the information I've collected on implementing what the tools mentioned above capture (a rough capture sketch follows the list):

  1. `uptime`: We can look at the 1, 5, and 15 minute load averages to see whether the CPU load trend is flat, increasing, or decreasing. We can reimplement this by looking at /proc/uptime and /proc/loadavg. Preferably, collect this information from /proc/pressure/{cpu,memory,io} if they are available on the system.

  2. `vmstat -SM`: Capture vmstat counters, with units in MB instead of KB. This gives us information about run queue length, disk utilization (block storage), swapping, context switches, and CPU utilization. We can reimplement this by looking at /proc/stat (CPU), /proc/vmstat (run queue, context switches, swaps), and /proc/meminfo (memory consumption).

  3. `iostat -xz`: Capture I/O statistics for all disk I/O devices with extended statistics. With -z we minimize the output by omitting devices that show no activity. Can be reimplemented by reading /proc/diskstats.

  4. `free -m`: Capture memory usage, including file system cache statistics. Can be reimplemented by reading /proc/meminfo.

  5. `sar -n DEV`: Capture network statistics for each network device. Can be reimplemented by reading /proc/net/dev.

  6. `sar -n TCP,ETCP`: Capture TCP connection statistics and retransmit rates. Can be reimplemented by reading /proc/net/snmp and /proc/net/snmp6.

  7. `netstat -i`: Capture overall network statistics to spot historical network instability. Can be reimplemented by reading /proc/net/dev. Per-connection network statistics are tracked in https://github.com/elastic/elasticsearch/pull/84653.

  8. `pidstat` (or `top -b -n 1`): Capture per-process CPU utilization. top is more verbose; pidstat only shows processes that have used CPU since it last sampled the CPU stats. Can be reimplemented by looking at the individual /proc/<pid>/stat files.

  9. `mpstat -P ALL`: Capture per-CPU utilization information. With this we can get an idea of whether certain parts of Elasticsearch are single-threaded, leaving us CPU-bound without effectively using all CPUs. Can be reimplemented by looking at /proc/stat.
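As a rough illustration of the capture side (hypothetical code, not an existing Elasticsearch class; it assumes plain reads of these /proc files are permitted by the security policy), something like the following could gather the raw files listed above for a diagnostics payload:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/**
 * Minimal sketch: read a fixed set of raw /proc files verbatim so they can be
 * shipped in a diagnostics payload and analysed offline.
 * Hypothetical class, not part of Elasticsearch.
 */
public class ProcSnapshot {

    // Files backing the tools listed above (availability varies by kernel/distro).
    private static final List<String> PROC_FILES = List.of(
        "/proc/uptime",        // uptime
        "/proc/loadavg",       // load averages
        "/proc/pressure/cpu",  // PSI, where available
        "/proc/stat",          // CPU and scheduler counters
        "/proc/vmstat",        // virtual memory counters
        "/proc/meminfo",       // memory and cache usage
        "/proc/diskstats",     // per-device I/O counters
        "/proc/net/dev",       // per-interface network counters
        "/proc/net/snmp"       // TCP counters, including retransmits
    );

    /** Returns path -> raw file contents; unreadable files are skipped (best effort). */
    public static Map<String, String> capture() {
        Map<String, String> snapshot = new LinkedHashMap<>();
        for (String file : PROC_FILES) {
            Path path = Path.of(file);
            if (Files.isReadable(path)) {
                try {
                    snapshot.put(file, Files.readString(path));
                } catch (IOException e) {
                    // ignore: collecting the remaining files is more useful than failing
                }
            }
        }
        return snapshot;
    }
}
```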

BobBlank12 commented 2 years ago

From a support point of view, I would think we should focus on Metricbeat capturing all of the OS/system-level metrics instead of the stack. For the stack, if node_stats captured all of the node/stack information it could and exposed it through monitoring, that would be useful.

DaveCTurner commented 2 years ago

That's true @BobBlank12, but it relies on customers running Metricbeat, which they don't all do. Also, we don't include Metricbeat-captured stats in diagnostic bundles, so they are a pain to correlate with stack-side metrics, and they aren't subject to automatic analysis like diagnostics are. It'd be great to fix all that of course, but this seems like a quicker win.

Moreover, by capturing system-level stats in ES we can express opinions about (and react to) poor system performance in ways that an external agent like Metricbeat cannot.

Note also that this is not an either/or thing, we can do both.

grcevski commented 1 year ago

After a few discussions we have decided to take the following approach:

  1. We'll create an API that will collect the raw metric files from the target systems and supply them via the new chunked streaming API.
  2. The diagnostic tooling will be modified to collect all of these per-node stats/metrics files.
  3. We'll write scripts/tooling (which will be maintained separately) that will digest the per-node files and produce reports similar to the traditional Linux tooling; a toy example of such a digest appears at the end of this comment.

This approach is preferable for the following reasons:

  1. Minimal changes to Elasticsearch: we won't have to keep extending the Elasticsearch code base to add new metric insights.
  2. The reporting tooling can be iterated on separately and independently from Elasticsearch.
  3. Sometimes a different in-depth analysis will be required that is only possible with access to the raw files.

The con of the above approach is that, until the tooling to digest the collected files is written, the additional functionality will probably not be directly useful to our support organization. However, the time to write that tooling would not be any shorter if we implemented it in Elasticsearch.
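To make point 3 above concrete, here is a toy offline digest (hypothetical, written in Java purely for illustration; the real tooling would live outside Elasticsearch and cover many more files) that turns a collected /proc/meminfo snapshot into a `free -m`-style summary:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

/**
 * Toy offline digest: summarize a collected /proc/meminfo snapshot in the
 * spirit of "free -m". Hypothetical tooling, maintained outside Elasticsearch.
 */
public class MeminfoDigest {

    public static void main(String[] args) throws IOException {
        // Usage: java MeminfoDigest <path-to-collected-meminfo-file>
        Map<String, Long> kb = new HashMap<>();
        for (String line : Files.readAllLines(Path.of(args[0]))) {
            // Lines look like: "MemTotal:       32658896 kB"
            String[] parts = line.trim().split("\\s+");
            if (parts.length >= 2) {
                kb.put(parts[0].replace(":", ""), Long.parseLong(parts[1]));
            }
        }
        long totalMb = kb.getOrDefault("MemTotal", 0L) / 1024;
        long freeMb = kb.getOrDefault("MemFree", 0L) / 1024;
        long buffCacheMb = (kb.getOrDefault("Buffers", 0L) + kb.getOrDefault("Cached", 0L)) / 1024;
        long availableMb = kb.getOrDefault("MemAvailable", 0L) / 1024;
        System.out.printf("total %d MB, free %d MB, buff/cache %d MB, available %d MB%n",
            totalMb, freeMb, buffCacheMb, availableMb);
    }
}
```

The same pattern would apply to the other collected files: parse the raw counters and render them in a format support engineers already know.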

VimCommando commented 1 year ago

> The con of the above approach is that, until the tooling to digest the collected files is written, the additional functionality will probably not be directly useful to our support organization.

As a member of the support organization, I respectfully disagree. Having standardized metrics available in our diagnostics gives us a huge advantage when it comes to training and knowledge sharing. It cuts down the number of times we have to preface guidance with "Well, if you're lucky enough to have [fill in the blank] you can..."

Getting raw output from Linux-native commands, especially something like sar -A, allows us to immediately start using preexisting Linux community tools like the venerable ksar.

grcevski commented 1 year ago

I think we'll be able to get somewhere with various existing community tools, but we won't be able to fully replicate the output the Linux tools produce. In the example above, we won't be able to run sar to get its output, because we don't allow Elasticsearch to launch external processes. Instead, we'll be able to get the file that sar uses to produce that output (which is then fed to ksar), i.e. we'll get the raw /proc/net/dev instead of what sar does with it.

Most of these Linux tools use hardcoded paths, so while we'll have all the /proc/net/dev files from every node, we won't be able to feed them directly to sar; it will always look for the local /proc/net/dev.

DaveCTurner commented 1 year ago

To add: in many cases the raw data comprise cumulative statistics, and the tools mentioned above work by collecting these data repeatedly over a period of time and computing deltas. I don't think it's appropriate for Elasticsearch's stats APIs to do that kind of work: today we prefer to expose raw cumulative statistics, and I would prefer that we continue to do so. Simulating the tools mentioned above would then require a sequence of API outputs taken over a period of time. That could mean a sequence of diag bundles, or a tool that gets live values from the APIs.
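For example (a hypothetical helper, not an existing Elasticsearch API, assuming two raw /proc/stat captures and the interval between them are available), deriving a vmstat-style context-switch rate is just a subtraction over two samples, which is exactly the post-processing that belongs in offline tooling rather than in the stats APIs:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Hypothetical helper: derive a context-switch rate from two raw /proc/stat
 * captures taken some time apart (for example, from two diagnostic bundles).
 */
public class CounterDelta {

    // /proc/stat contains a line of the form "ctxt 123456789" (cumulative since boot).
    private static final Pattern CTXT = Pattern.compile("^ctxt\\s+(\\d+)", Pattern.MULTILINE);

    /** Context switches per second over the interval between the two captures. */
    public static double contextSwitchRate(String earlierProcStat, String laterProcStat, long intervalSeconds) {
        long earlier = extractCtxt(earlierProcStat);
        long later = extractCtxt(laterProcStat);
        return (later - earlier) / (double) intervalSeconds;
    }

    private static long extractCtxt(String procStat) {
        Matcher m = CTXT.matcher(procStat);
        if (!m.find()) {
            throw new IllegalArgumentException("no ctxt line in /proc/stat capture");
        }
        return Long.parseLong(m.group(1));
    }
}
```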

bytebilly commented 1 year ago

> Most of these Linux tools use hardcoded paths, so while we'll have all the /proc/net/dev files from every node, we won't be able to feed them directly to sar; it will always look for the local /proc/net/dev.

@grcevski I'm wondering if we could just reproduce the expected file structure in a folder and chroot the tools into it, so we don't need to alter anything and can reuse them as-is, provided they read text files rather than making syscalls.

DaveCTurner commented 1 year ago

Re-upping my previous comment:

> in many cases the raw data comprise cumulative statistics, and the tools mentioned above work by collecting these data repeatedly over a period of time and computing deltas.

I think this would be tricky to achieve just by reproducing the structure of /dev, /sys, and friends: I would not expect to have data at a fine enough temporal granularity. Typically we would have just two diag bundles, likely taken many minutes apart.