dmwm / CMSRucio


Bug: Access to https://cms-rucio-webui.cern.ch/ seems to fail. #827

Closed eachristgr closed 1 month ago

eachristgr commented 4 months ago

Bug Description

Trying to access https://cms-rucio-webui.cern.ch/ results in a timeout error.

Checking the logs of the relevant pod, it looks like an Apache issue:

httpd-error-log [Thu Jul 18 08:12:36.487491 2024] [mpm_event:error] [pid 7:tid 7] AH03490: scoreboard is full, not at MaxRequestWorkers.Increase ServerLimit.
httpd-error-log [Thu Jul 18 08:12:37.488555 2024] [mpm_event:error] [pid 7:tid 7] AH03490: scoreboard is full, not at MaxRequestWorkers.Increase ServerLimit.

On the other hand, https://cms-rucio-webui-int.cern.ch/ seems to work fine.
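To see how often the error occurs, the AH03490 lines can be counted directly from the container log. This is only a minimal sketch: the function name `count_scoreboard_full` is hypothetical, and it assumes the log text arrives on standard input.

```shell
# Count Apache "scoreboard is full" (AH03490) errors in log text read from stdin.
# Hypothetical helper, not part of CMSRucio.
count_scoreboard_full() {
    # grep -c prints the number of matching lines; it exits non-zero when
    # there are no matches, so "|| true" keeps the function's status at 0.
    grep -c 'AH03490' || true
}
```

It could be fed from the pod logs, for example `kubectl logs <pod-name> -c httpd-error-log | count_scoreboard_full`, where `<pod-name>` is a placeholder for the current web UI pod.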

haozturk commented 4 months ago

[haozturk@lxplus996 ~]$ k logs webui-rucio-ui-5544c44759-2tnpf
Defaulted container "httpd-error-log" out of: httpd-error-log, rucio-ui
[Thu Jul 18 05:48:33.333140 2024] [mpm_event:error] [pid 7:tid 7] AH03490: scoreboard is full, not at MaxRequestWorkers.Increase ServerLimit.
[Thu Jul 18 05:48:34.334202 2024] [mpm_event:error] [pid 7:tid 7] AH03490: scoreboard is full, not at MaxRequestWorkers.Increase ServerLimit.

dynamic-entropy commented 4 months ago

Thanks, Christos, for reporting this. A simple restart should fix this and reset the connections with busy clients.

Kindly let me know if it happens again, and we can update the server config to accommodate high loads. Can you please check whether it works for you now, too?

Just for my records: an example use case for the monitoring in https://github.com/dmwm/CMSRucio/issues/381

eachristgr commented 4 months ago

Hi @dynamic-entropy, thanks for taking this. The issue seems to be resolved; I can access https://cms-rucio-webui.cern.ch/ without any problem.

haozturk commented 4 months ago

It happened again:

$ k logs webui-rucio-ui-f79f5b6db-fj86l  
Defaulted container "httpd-error-log" out of: httpd-error-log, rucio-ui
[Mon Jul 29 01:15:41.144028 2024] [mpm_event:error] [pid 7:tid 7] AH03490: scoreboard is full, not at MaxRequestWorkers.Increase ServerLimit.
[Mon Jul 29 01:15:42.145535 2024] [mpm_event:error] [pid 7:tid 7] AH03490: scoreboard is full, not at MaxRequestWorkers.Increase ServerLimit.
[Mon Jul 29 01:15:43.145649 2024] [mpm_event:error] [pid 7:tid 7] AH03490: scoreboard is full, not at MaxRequestWorkers.Increase ServerLimit.

https://mattermost.web.cern.ch/cms-o-and-c/pl/d99nw33cwpbwmyodirit7sodsw

We need to revisit the server limits.
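As background for revisiting the limits, here is a rough sketch of the arithmetic, assuming the pod runs Apache's event MPM (as the `mpm_event` tag in the logs suggests). The number of child processes needed to serve `MaxRequestWorkers` threads is `MaxRequestWorkers / ThreadsPerChild`, rounded up, and `ServerLimit` needs headroom above that for processes that are still shutting down gracefully. The helper name `min_server_processes` is hypothetical, and the values in the usage note are Apache's documented defaults, not values read from this deployment.

```shell
# Print the minimum number of child processes needed to provide
# MaxRequestWorkers threads, i.e. ceil(MaxRequestWorkers / ThreadsPerChild).
# ServerLimit should be set higher than this to leave room for processes
# that are gracefully shutting down. Hypothetical helper for illustration.
min_server_processes() {
    local max_request_workers="$1"
    local threads_per_child="$2"
    # Integer ceiling division.
    echo $(( (max_request_workers + threads_per_child - 1) / threads_per_child ))
}
```

With the defaults, `min_server_processes 400 25` gives the process count that `ServerLimit` would have to exceed; the actual settings of the web UI container would need to be checked before changing anything.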

ericvaandering commented 1 month ago

This was also seen by ATLAS. The suggested solution was to either increase an internal value or scale up the pods. Since we were only running one pod, I moved it to four. We should reopen this if we see it again.

It seems load-related anyhow.
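For reference, scaling by hand could look roughly like the sketch below. The deployment name `webui-rucio-ui` is an assumption inferred from the pod names earlier in this thread, the function name `scale_webui` is hypothetical, and this is not necessarily how the change was actually applied. If the deployment is managed by Helm or a GitOps tool, the replica count would need to be changed there instead, or a manual change may be reverted.

```shell
# Scale the web UI deployment to a given number of replicas.
# Arguments: [deployment-name] [replicas]
# The default deployment name is an assumption based on the pod names above.
# KUBECTL can be overridden (e.g. KUBECTL=echo) to preview the command.
scale_webui() {
    local deployment="${1:-webui-rucio-ui}"
    local replicas="${2:-4}"
    ${KUBECTL:-kubectl} scale "deployment/${deployment}" --replicas="${replicas}"
}
```

Running `KUBECTL=echo scale_webui` prints the command that would be executed without touching the cluster.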