Inventory socket timeout Nagios alerts - update jvm memory

terrywbrady commented 1 month ago

Ashley's analysis suggests that increasing the JVM memory should reduce the swap spikes we see. We will increase the JVM memory on prod Inventory from 1GB to 1.5GB on Monday; the reboot after patching on Tuesday will allow the change to take effect.

elopatin-uc3 commented 4 weeks ago

Suggest testing on Stage by taking one inv host out of the load balancer, requesting to alter the other inv host to a larger instance type, and testing with the largest manifest we have to see where incremental increases in memory no longer help. Then we can determine what an optimal jvm memory configuration is.

mreyescdl commented 4 weeks ago

Increased Inventory JVM for prod to 1.5GB. Will take effect after Tuesday 6/4 patching.

dloy commented 4 weeks ago

A brief analysis of the logs

Some things to note with these exceptions:

- No access log entry with an invalid response exists at the time Nagios reports the problem.
- In all of the reported cases the request has no qualified ?t= value, so the default (xhtml) applies.
- An access log entry from the same service, without the t= parameter, is recorded at the reported time.
- The inventory server is getting hit with ~5 state requests a minute.

Comments:

- The error never occurs on the server; it's a timeout issue (if anything).
- In 2 of the 3 occurrences the state request occurred within a second of some other request.
- In all cases there was no t=xml. Is a response parse attempted that then fails on a non-XML response?
```
***** Nagios  *****
```

Notification Type: PROBLEM

Service: uc3-mrt-inventory-prd_state_7x16 Host: uc3-mrtinv-prd01 Address: uc3-mrtinv-prd01.cdlib.org State: CRITICAL

Date/Time: Tue Jun 4 10:40:21 PDT 2024

Additional Info: HTTP CRITICAL: Status line output matched HTTP/1.1 200 - 986 bytes in 5.684 second response time

Nagios

Notification Type: PROBLEM

Service: uc3-mrt-inventory-prd_status-running_7x16 Host: uc3-mrtinv-prd01 Address: uc3-mrtinv-prd01.cdlib.org State: CRITICAL

Date/Time: Tue Jun 4 10:41:07 PDT 2024

Additional Info: HTTP CRITICAL: HTTP/1.1 200 - 986 bytes in 9.534 second response time

Nagios

Notification Type: RECOVERY

Service: uc3-mrt-inventory-prd_state_7x16 Host: uc3-mrtinv-prd01 Address: uc3-mrtinv-prd01.cdlib.org State: OK

Date/Time: Tue Jun 4 11:00:21 PDT 2024
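The "Additional Info" lines in the PROBLEM notifications above report an HTTP 200 together with a multi-second response time, which fits the timeout reading. A minimal sketch for pulling those numbers out of a notification follows. The helper names and the 5-second threshold are assumptions for illustration only; the actual Nagios check configuration is not shown in this thread.

```python
import re

# Matches the tail of a Nagios check_http "Additional Info" line, e.g.
# "HTTP CRITICAL: HTTP/1.1 200 - 986 bytes in 9.534 second response time"
INFO_RE = re.compile(
    r"HTTP/1\.1 (?P<status>\d{3}) - (?P<size>\d+) bytes "
    r"in (?P<secs>\d+(?:\.\d+)?) second response time"
)

# Hypothetical threshold: the real critical value is not given in this thread.
ASSUMED_CRITICAL_SECONDS = 5.0


def parse_additional_info(line):
    """Return status, byte count and response time, or None if no match."""
    m = INFO_RE.search(line)
    if m is None:
        return None
    return {
        "status": int(m.group("status")),
        "bytes": int(m.group("size")),
        "seconds": float(m.group("secs")),
    }


def is_slow_success(info, threshold=ASSUMED_CRITICAL_SECONDS):
    """True when the server answered 200 but took at least `threshold` seconds."""
    return info["status"] == 200 and info["seconds"] >= threshold


if __name__ == "__main__":
    samples = [
        "Additional Info: HTTP CRITICAL: Status line output matched HTTP/1.1 200 - 986 bytes in 5.684 second response time",
        "Additional Info: HTTP CRITICAL: HTTP/1.1 200 - 986 bytes in 9.534 second response time",
    ]
    for s in samples:
        info = parse_additional_info(s)
        print(info, "slow success:", is_slow_success(info))
```

Both sample alerts parse as successful responses that were simply slow, so the check is flagging latency rather than a server-side error.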

less localhost_access_log.2024-06-04.txt

172.30.32.237 - - [04/Jun/2024:10:40:21 -0700] "GET /state?t=xml HTTP/1.1" 200 747
172.31.14.167 - - [04/Jun/2024:10:40:21 -0700] "GET /state HTTP/1.1" 200 855

172.31.14.167 - - [04/Jun/2024:10:41:07 -0700] "GET /state HTTP/1.1" 200 855

172.31.14.167 - - [04/Jun/2024:11:00:21 -0700] "GET /state HTTP/1.1" 200 855
172.30.32.237 - - [04/Jun/2024:11:00:21 -0700] "GET /state?t=xml HTTP/1.1" 200 747
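The manual correlation between alert times and access log entries can be repeated with a short script. This is a rough sketch, assuming the access log keeps the format shown in the excerpt above; `parse_access_log` and `requests_near` are hypothetical helper names, not part of any existing tooling.

```python
import re
from datetime import datetime
from urllib.parse import urlsplit, parse_qs

# Matches entries such as:
# 172.31.14.167 - - [04/Jun/2024:10:41:07 -0700] "GET /state HTTP/1.1" 200 855
ENTRY_RE = re.compile(
    r'(?P<ip>\d+\.\d+\.\d+\.\d+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>[A-Z]+) (?P<target>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\d+|-)'
)
TS_FORMAT = "%d/%b/%Y:%H:%M:%S %z"

SAMPLE = '''172.30.32.237 - - [04/Jun/2024:10:40:21 -0700] "GET /state?t=xml HTTP/1.1" 200 747
172.31.14.167 - - [04/Jun/2024:10:40:21 -0700] "GET /state HTTP/1.1" 200 855
172.31.14.167 - - [04/Jun/2024:10:41:07 -0700] "GET /state HTTP/1.1" 200 855
'''
# Time of the first Nagios PROBLEM notification (10:40:21 PDT).
ALERT = datetime.strptime("04/Jun/2024:10:40:21 -0700", TS_FORMAT)


def parse_access_log(text):
    """Parse every access log entry found in `text`, in order of appearance."""
    entries = []
    for m in ENTRY_RE.finditer(text):
        parts = urlsplit(m.group("target"))
        t_values = parse_qs(parts.query).get("t")
        entries.append({
            "ip": m.group("ip"),
            "ts": datetime.strptime(m.group("ts"), TS_FORMAT),
            "path": parts.path,
            "format": t_values[0] if t_values else None,  # None: no t= given
            "status": int(m.group("status")),
        })
    return entries


def requests_near(entries, when, window_seconds=1):
    """Return entries logged within `window_seconds` of the time `when`."""
    return [
        e for e in entries
        if abs((e["ts"] - when).total_seconds()) <= window_seconds
    ]


if __name__ == "__main__":
    for e in requests_near(parse_access_log(SAMPLE), ALERT):
        print(e["ip"], e["path"], "t=" + str(e["format"]), e["status"])
```

Using `finditer` over the whole text rather than splitting on newlines means entries are still found if several end up on one physical line.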

CDLUC3 / mrt-doc

Inventory socket timeout Nagios alerts - update jvm memory #1930
