server: include kernel/IO/networking info in `debug.zip` and metrics

cockroachdb / cockroach

server: include kernel/IO/networking info in `debug.zip` and metrics #102125

Open erikgrinaker opened 1 year ago

erikgrinaker commented 1 year ago

It would be useful to include in debug.zip and/or metrics various OS/kernel info from every node, to inspect e.g. kernel params, TCP settings, and other relevant metrics when debugging kernel issues. For example:

- `sysctl -a`
- `netstat -s`
- `netstat -an`
- `ps aux`
- `mount`
- `/proc/meminfo`
- `ss --tcp -n -e`

And probably lots of other stuff. We'll need to consider what we can and can't include wrt. redaction.

Jira issue: CRDB-27295 Epic: CRDB-32134

tbg commented 1 year ago

Or better, add some of this to metrics so we have history and can work with the data more easily. Lots of these numbers are available in procfs:

ubuntu@grinaker-231-0001:~$ awk '$1 ~ "Tcp:" { print $13 }' /proc/net/snmp
RetransSegs
906466

tbg commented 1 year ago

I wouldn't be surprised if gosigar and similar libraries already pick most of these up.

erikgrinaker commented 1 year ago

Yes, even better, but there's a ton of OS metrics and I don't know if we want them clogging up the time series database. Maybe we can pick out a few particularly important ones.

tbg commented 1 year ago

Right, I didn't mean to scrape everything under the sun, just that we mostly have to find a library that has what we want and hook up the metrics we need. Not much reinventing the wheel should be needed here.

irfansharif commented 1 year ago

Some internal discussion here. I'm going to re-title this issue to include more networking metrics+diagnostics info.