galexrt / dellhw_exporter

Prometheus exporter for Dell Hardware components using Dell OMSA.
https://dellhw-exporter.galexrt.moe
Apache License 2.0

dellhw-exporter causes a large amount of zombie processes #85

Open deepankersharmaa opened 1 year ago

deepankersharmaa commented 1 year ago

Hi,

I have observed a large number of omreport and omcliproxy processes that are spawned but never exit or get terminated.

Sep 26 16:36:59 devenv01-worker003.devenv01.nfvi.localdomain kernel: omreport invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), order=0, oom_score_adj=999
Sep 26 16:36:59 devenv01-worker003.devenv01.nfvi.localdomain kernel: CPU: 110 PID: 1319442 Comm: omreport Kdump: loaded Not tainted 4.18.0-372.36.1.el8_6.mr3789_221121_2132.x86_64 #1
Sep 26 16:36:59 devenv01-worker003.devenv01.nfvi.localdomain kernel: Tasks state (memory values in pages):
Sep 26 16:36:59 devenv01-worker003.devenv01.nfvi.localdomain kernel: [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
Sep 26 16:36:59 devenv01-worker003.devenv01.nfvi.localdomain kernel: [ 12750] 0 12750 35965 615 167936 0 -1000 conmon
Sep 26 16:36:59 devenv01-worker003.devenv01.nfvi.localdomain kernel: [ 12777] 0 12777 179450 4466 196608 0 999 dellhw_exporter
Sep 26 16:36:59 devenv01-worker003.devenv01.nfvi.localdomain kernel: [ 12844] 0 12844 49366 1786 180224 0 999 dsm_sa_eventmgr
Sep 26 16:36:59 devenv01-worker003.devenv01.nfvi.localdomain kernel: [ 12845] 0 12845 84507 2358 217088 0 999 dsm_sa_snmpd
Sep 26 16:36:59 devenv01-worker003.devenv01.nfvi.localdomain kernel: [ 12851] 0 12851 587311 10325 581632 0 999 dsm_sa_datamgrd
Sep 26 16:36:59 devenv01-worker003.devenv01.nfvi.localdomain kernel: [ 14970] 0 14970 152261 5917 393216 0 999 dsm_sa_datamgrd
Sep 26 16:36:59 devenv01-worker003.devenv01.nfvi.localdomain kernel: [1314763] 0 1314763 2926 650 69632 0 999 omreport
Sep 26 16:36:59 devenv01-worker003.devenv01.nfvi.localdomain kernel: [1314764] 0 1314764 2926 637 69632 0 999 omreport
Sep 26 16:36:59 devenv01-worker003.devenv01.nfvi.localdomain kernel: [1314765] 0 1314765 2926 650 69632 0 999 omreport
Sep 26 16:36:59 devenv01-worker003.devenv01.nfvi.localdomain kernel: [1314766] 0 1314766 2926 663 69632 0 999 omreport
Sep 26 16:36:59 devenv01-worker003.devenv01.nfvi.localdomain kernel: [1314767] 0 1314767 2926 638 65536 0 999 omreport
Sep 26 16:36:59 devenv01-worker003.devenv01.nfvi.localdomain kernel: [1314768] 0 1314768 2926 637 61440 0 999 omreport
Sep 26 16:36:59 devenv01-worker003.devenv01.nfvi.localdomain kernel: [1314769] 0 1314769 2926 664 65536 0 999 omreport
Sep 26 16:36:59 devenv01-worker003.devenv01.nfvi.localdomain kernel: [1314771] 0 1314771 2926 644 69632 0 999 omreport
Sep 26 16:36:59 devenv01-worker003.devenv01.nfvi.localdomain kernel: [1314774] 0 1314774 2926 637 69632 0 999 omreport
Sep 26 16:36:59 devenv01-worker003.devenv01.nfvi.localdomain kernel: [1314776] 0 1314776 2926 627 69632 0 999 omreport
Sep 26 16:36:59 devenv01-worker003.devenv01.nfvi.localdomain kernel: [1314778] 0 1314778 2926 653 65536 0 999 omreport
Sep 26 16:36:59 devenv01-worker003.devenv01.nfvi.localdomain kernel: [1314780] 0 1314780 9239 1179 114688 0 999 omcliproxy
Sep 26 16:36:59 devenv01-worker003.devenv01.nfvi.localdomain kernel: [1314781] 0 1314781 2926 627 73728 0 999 omreport
Sep 26 16:36:59 devenv01-worker003.devenv01.nfvi.localdomain kernel: [1314783] 0 1314783 2926 650 61440 0 999 omreport
Sep 26 16:36:59 devenv01-worker003.devenv01.nfvi.localdomain kernel: [1314784] 0 1314784 9239 1199 114688 0 999 omcliproxy
Sep 26 16:36:59 devenv01-worker003.devenv01.nfvi.localdomain kernel: [1314785] 0 1314785 9239 1190 118784 0 999 omcliproxy
Sep 26 16:36:59 devenv01-worker003.devenv01.nfvi.localdomain kernel: [1314787] 0 1314787 2926 663 65536 0 999 omreport
Sep 26 16:36:59 devenv01-worker003.devenv01.nfvi.localdomain kernel: [1314788] 0 1314788 9239 1205 122880 0 999 omcliproxy
Sep 26 16:36:59 devenv01-worker003.devenv01.nfvi.localdomain kernel: [1314790] 0 1314790 9239 1145 114688 0 999 omcliproxy

I have not posted all the results here as it would be redundant, but roughly 850 similar lines followed this output. It is likely that these processes were started in the dellhw_exporter Pod. From the name of this Pod, I speculate that it is an agent-like application for monitoring Dell hardware, since dellhw_exporter wraps the omreport command to get data from the machine.

Regarding omreport and omcliproxy, I would like to confirm the following:

Do they do anything such as reading all files under /proc, including information about the OS, that would sharply increase the load on the system? More generally, do these processes perform any work that could load the system when their number increases rapidly?

galexrt commented 1 year ago

It seems abnormal for over 800 of these processes to be running, is that correct?

It depends on the hardware, the resources given to the exporter, and other factors, though 800 seems high. The processes that dellhw_exporter starts are expected to exit either when they complete or when the command times out.

Is there any report of the dellhw_exporter Pod being in an abnormal state due to the oom-killer (for example, process proliferation like this time)?

There are no known issues with the dellhw_exporter with regard to not closing processes or OOM-ing if it is given the right amount of resources.

For example, with some monitoring agent applications, there is a scenario where processes proliferate

The processes are not meant to stick around, but how often the exporter calls the commands to get the latest info for the metrics depends on the exporter config, among other things.

Can you provide the logs of the dellhw_exporter?

deepankersharmaa commented 1 year ago

Hi,

Thanks for your quick response and support.

Please find attached below the dellhw-exporter container log from the time the problem occurred.

Regards, Deepankar

deepankersharmaa commented 1 year ago

dellhw-exporter_container-log_20231026.log

deepankersharmaa commented 1 year ago

Hi,

Thanks for your quick response and support. Is there any update on this?

Regards, Deepankar

galexrt commented 11 months ago

The logs show that some omreport command processes are being terminated/taking too long.

deepankersharmaa commented 11 months ago

Hi @galexrt, thanks for your reply.

I am running the exporter with the basic command below. No specific config or flags are used, and the scrape interval is 60 seconds.

podman run --name pf-dell-exporter -d --privileged -p 9137:9137 {{exporter_image}}

Regards, Deepankar

deepankersharmaa commented 11 months ago

Hi @galexrt

Any idea about this?

galexrt commented 11 months ago

@deepankersharmaa The logs indicate that omreport is taking a long time to respond. Did you check whether the Dell OMSA services on the machine have anything in their logs? Is the issue happening on a single server or on multiple servers?

adidiborg commented 10 months ago

@galexrt, we are also facing similar issues. It looks like it happens randomly on multiple servers.

galexrt commented 9 months ago

As written before, this is hard to diagnose without logs from the system's OMSA services that give some hints.

I don't have access to a Dell server at the moment, so I would appreciate any logs or outputs from OMSA for me to dive in.