Large backlog of workflows in announced state (microservices restarting)

amaltaro commented 1 year ago

Impact of the bug MSOutput and MSRuleCleaner

Describe the bug Given that it's the second time that it happens within ~10 days, I am creating this issue to have this and the actions taken tracked here.

We have been piling up workflows in the announced state and it has reached 11.7k workflows at this very moment. Looking at the "Kube eagle k8s metrics" dashboard [1], we can see that it directly impacts the resources required by MSOutput and MSRuleCleaner services, which eventually run out of hardware resources and start either crashing or getting restarted by one of the k8s mechanisms put in place (k8s resource management, liveness probe, etc).

When I checked the kubernetes pods a few minutes ago, only the ms-output pod was having apparent restarts [2]. But if one inspect their log, it's possible to see that the service is getting restarted multiple times every minute [3]. The same applies to MSRuleCleaner [4]

How to reproduce it Exhaust the resources defined in the service yaml file, e.g.: https://github.com/dmwm/CMSKubernetes/blob/master/kubernetes/cmsweb/services/reqmgr2ms-output.yaml#L98-L104

Expected behavior This ticket is meant to keep track of actions to recover both MSOutput and MSRuleCleaner to a stable state and to get over the backlog of announced workflows.

A possible long-term and permanent solution - perhaps to be addressed in another ticket - to this would be to either stream data from CouchDB, or implement a way to slice data to be retrieved from the database backend, such that memory footprint can have a more predictable and controlled pattern.

Additional context and error message [1] https://monit-grafana.cern.ch/d/M2U6T6uGk/kube-eagle-k8s-metrics?orgId=11&from=1668442898738&to=1668486098738&refresh=30m

[2]

amaltaro@lxplus8s14:~ $ kubepods dmwm
NAME                                       READY   STATUS      RESTARTS       AGE
ms-output-67c6755866-r6xtx                 2/2     Running     22 (42m ago)   6d15h
ms-output-mongo-686dc4b9f-vp8bj            1/1     Running     2 (27d ago)    27d
ms-rulecleaner-8684b896b9-8kxrd            2/2     Running     0              6d15h

[3]

amaltaro@vocms0750:/cephfs/product/dmwm-logs $ grep WATCHDOG ms-output-20221115-ms-output-67c6755866-r6xtx.log | tail -n5
[15/Nov/2022:05:01:31]  WATCHDOG: server exited with exit code signal 9... restarting
[15/Nov/2022:05:02:17]  WATCHDOG: server exited with exit code signal 9... restarting
[15/Nov/2022:05:02:55]  WATCHDOG: server exited with exit code signal 9... restarting
[15/Nov/2022:05:03:27]  WATCHDOG: server exited with exit code signal 9... restarting
[15/Nov/2022:05:03:58]  WATCHDOG: server exited with exit code signal 9... restarting

[4]

amaltaro@vocms0750:/cephfs/product/dmwm-logs $ grep WATCHDOG ms-rulecleaner-20221115-ms-rulecleaner-8684b896b9-8kxrd.log | tail -n5
[15/Nov/2022:05:03:00]  WATCHDOG: server exited with exit code signal 9... restarting
[15/Nov/2022:05:03:22]  WATCHDOG: server exited with exit code signal 9... restarting
[15/Nov/2022:05:03:43]  WATCHDOG: server exited with exit code signal 9... restarting
[15/Nov/2022:05:04:09]  WATCHDOG: server exited with exit code signal 9... restarting
[15/Nov/2022:05:04:38]  WATCHDOG: server exited with exit code signal 9... restarting

amaltaro commented 1 year ago

For the record, I edited both MSOutput and MSRuleCleaner resource requirements, such that the service becomes stable again and start going through the backlog.

Changes have been made locally in the cluster, so whenever a new service configuration is pushed in, it will overwrite my changes. It was just a matter of running the following command and adapting memory requested and limit to 500MiB and 1500MiB (compared to the original 250MiB and 750MiB):

amaltaro@lxplus8s14:~ $ kubectl edit deploy ms-rulecleaner -n dmwm
deployment.apps/ms-rulecleaner edited

and the same for MSOutput:

amaltaro@lxplus8s14:~ $ kubectl edit deploy ms-output -n dmwm
deployment.apps/ms-output edited

both services seem to be stable now and I can start seeing workflows getting archived in MSRuleCleaner.

We might consider updating the CMSKubernetes yaml files to limit these resources at 1GiB (or maybe 1.25GiB), until we can have a better control of memory footprint in these services.

vkuznet commented 1 year ago

@amaltaro I doubt your statement that after editing resources of MSOutput it became stable again is correct. At best, it is misleading. If you increase threshold and we did not fix the actual issue the service is quasi stable until next time it will hit the current threshold. This is similar to a "fix" we had for DBS for many years and yet we experienced that more and more resources are required to keep service stable. I rather want to see concrete set of actions required to fix the issue. For that we need to know the following information:

how service receive the data
how much data it is currently receive and how it reflects into python memory usage, in other words we need to know the growth rate of python memory for every MB of the data we recieve
can data be requested in chunks , if so how
what is baseline for RAM usage of the service itself, in other words how much RAM service need without fetching upstream data
if data streaming is possible can you determine a chunk size to allow service stay within its limit, e.g. 1GB RAM limit

todor-ivanov commented 1 year ago

Hi @vkuznet @amaltaro. My two cents here. They both get restarted with signal 9 as reported by the WATCHDOG (sent by who ever is monitoring the Pod's resource limits), but what I also notice is those signals are always received while fetching workflows in status announced [1]. So we have obviously reached a saturation point. We should also ask P&R if there was a recent change/event in the system's load that may have caused more rapid announcement of workflows. Otherwise we maybe just digesting an old backlog, or we may be facing a totally separate bug.

Something worth mentioning here - I also noticed some troublesome trend through out the month (every month not only the current one). We do tend to build up on memory usage for all micro services, until we get to the monthly upgrade. When the services get redeployed and the cycle repeats [2]. This indeed may be a separate story, but still seems a strange pattern.

[1]

2022-11-14 01:40:51,093:DEBUG:Service: getData: 
        url: request?status=announced&detail=True
        verb: GET
        incoming_headers: {}
        data: {}
[14/Nov/2022:01:41:00] ms-rulecleaner-8684b896b9-8kxrd 127.0.0.1 "GET /ms-rulecleaner/data/status HTTP/1.1" 200 OK [data: 305 in 367 out 972 us ] [auth: OK "" "" ] [ref: "" "Go-http-client/1
.1" ]
[14/Nov/2022:01:41:13]  WATCHDOG: server exited with exit code signal 9... restarting

2022-11-15 01:00:00,788:INFO:MSOutput: Fetching requests in status: announced
2022-11-15 01:00:00,810:DEBUG:Service: getData: 
        url: request?status=announced&detail=True
        verb: GET
        incoming_headers: {}
        data: {}
[15/Nov/2022:01:00:20] ms-output-67c6755866-r6xtx 127.0.0.1 "GET /ms-output/data/status HTTP/1.1" 200 OK [data: 300 in 242 out 1222 us ] [auth: OK "" "" ] [ref: "" "Go-http-client/1.1" ]
2022-11-15 01:00:20,742:INFO:Main: [15/Nov/2022:01:00:20] ms-output-67c6755866-r6xtx 127.0.0.1 "GET /ms-output/data/status HTTP/1.1" 200 OK [data: 300 in 242 out 1222 us ] [auth: OK "" "" ] [ref: "" "Go-http-client/1.1" ]
[15/Nov/2022:01:00:22]  WATCHDOG: server exited with exit code signal 9... restarting

[2] https://monit-grafana.cern.ch/d/M2U6T6uGk/kube-eagle-k8s-metrics?orgId=11&from=1665925548233&to=1668521148233&refresh=30m&viewPanel=46

https://monit-grafana.cern.ch/d/M2U6T6uGk/kube-eagle-k8s-metrics?orgId=11&from=1665925548233&to=1668521148233&refresh=30m&viewPanel=216

https://monit-grafana.cern.ch/d/M2U6T6uGk/kube-eagle-k8s-metrics?orgId=11&from=1665925583737&to=1668521183737&refresh=30m&viewPanel=218

https://monit-grafana.cern.ch/d/M2U6T6uGk/kube-eagle-k8s-metrics?orgId=11&from=1665925583737&to=1668521183737&refresh=30m&viewPanel=220

amaltaro commented 1 year ago

A possible long-term and permanent solution - perhaps to be addressed in another ticket - to this would be to either stream data from CouchDB, or implement a way to slice data to be retrieved from the database backend, such that memory footprint can have a more predictable and controlled pattern. ...

@vkuznet I do not understand what makes you say this. I did not say the underlying problem with the services have been fixed, I just said it's stable again. If you read the original issue description, this is what have been said: """ A possible long-term and permanent solution - perhaps to be addressed in another ticket - to this would be to either stream data from CouchDB, or implement a way to slice data to be retrieved from the database backend, such that memory footprint can have a more predictable and controlled pattern. """

so, yes, I am aware that we will hit this problem again.

vkuznet commented 1 year ago

@amaltaro , when you say stable again, please define stability, e.g. is it stable at 1GB RAM level, if so, does the increase of number of workflows by N percent will keep service stable? At which level it will break again? What I'm saying is that without clear picture about what influence service restart and clear understand conditions which will keep it from crashing saying that service is stable is misleading. What we may say that we do not know how current service will stay, it may be running fine for next 24h or not, we may say that until load in a system is below certain threshold service will run fine but above this level we do not know.

What I'm asking is to provide concrete metrics, limits, thresholds based on either monitoring data or stress tests which we can use to define stability of the service. So far I simply do not know if it is stable or not, that's it.

vkuznet commented 1 year ago

@todor-ivanov , do we know what request?status=announced&detail=True return? If so, can we measure RAM/CPU utilization if the returned data will increase by some factor. For instance, measure RAM/CPU if we have 10, 100, 1K, 10K workflows in announced state? If your observation is correct, i.e. the request to announced status lead to crashes, then we need to know what is a threshold in number of returned documents which cause the crash. This will at least tell us what is this threshold, which we also can reflect back to RAM/CPU usage. And, measuring with different number of docs can give us a clue how to properly adjust system k8s limits to avoid potential crashes in a future.

amaltaro commented 1 year ago

A service being stable means that it's functional according to its design, it does not matter if it uses 1MB of RAM or 1GB, provided that it's within the resource limits.

Yes, as I already mentioned, the service memory footprint is a function of the workflows in announced status. From the grafana dashboard, baseline memory footprint is around 200MB. Each 1k workflow seem to increase the memory footprint by 50-75MB. So, once it went beyond 10k input workflows, it started crashing because 750MiB wasn't enough for that POD.

I don't think we should have more than a few 1000s workflows in announced state at any given time (maybe 5k). Having said that, the 750MiB of memory limit thus look reasonable in normal conditions.

A careful evaluation of the MSRuleCleaner logs need to be done to see why workflows are piling up in this state. But from what I've seen 2 days ago, almost 100% of the requests were still waiting for the TAPE output data transfer to complete.

Maybe we should bump the resource limit to 1MiB and assume that the limit of the service is going to be 10k requests. But again, this will only work while things are behaving. To be really protected from this, we need to apply request slicying and only load into memory a small set of the workflows.

If you want to look deeper into this, let me know and I will unassign myself from this issue.

amaltaro commented 1 year ago

After making further noise this week, I decided to actually re-assess this situation.

It turns out Rucio seems to be doing just fine and the problem reported with tape transfers was a mistake from my side - I am deeply sorry for that - where a bug was inserted by the following PR in the last production deployment: https://github.com/dmwm/WMCore/pull/11357

In short, when I converted the double if statements to an if/elif block, I stopped executing the block of code responsible for evaluating the output tape transfers(!).

A bug-fix has been provided (and already applied to the production service) in this PR: https://github.com/dmwm/WMCore/pull/11384

dmwm / WMCore

Large backlog of workflows in announced state (microservices restarting) #11363