dmwm / WMCore

Core workflow management components for CMS.
Apache License 2.0

ReqMgr2 unable to serve too many (30k?) requests to the client - Nginx limitation(?) #11231

Closed by amaltaro 1 year ago

amaltaro commented 2 years ago

Impact of the bug: ReqMgr2 (but likely affecting any CMSWEB service)

Describe the bug: A couple of weeks ago, Dima reported the inability to retrieve active requests from ReqMgr2 through this API:

https://cmsweb.cern.ch/reqmgr2/data/request?status=ACTIVE

This problem is tightly coupled to the overload of the system, with around 30k active requests at the moment. Investigation with the curl command suggests a problem/limitation in the CMSWEB infrastructure [1], with requests failing with a 502 Bad Gateway.

The acquired status is currently the worst offender, matching around 19k requests, which take more than a minute to be served to the client and transfer a total of 270MB [2].

How to reproduce it: Query the API above.

Expected behavior: Requests and data volumes like these will become more common, and we need to ensure that our applications and infrastructure are capable of dealing with them. We should start discussing this with the HTTP experts and understand what the limitations of Nginx are; AFAIK there is a limit of 80MB or so. We should try to fix and/or increase these limits in the CMSWEB cluster so that they meet the CMS use cases. For the longer term, we need to start serving compressed data to the users, especially for these heavy data APIs.
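
As an illustration of the client side, a minimal sketch (assuming the service eventually honours Accept-Encoding; X509 certificate handling is omitted for brevity) of how a client could request and unpack compressed data:

import gzip
import urllib.request

# hypothetical gzip-aware client call; the service must honour
# Accept-Encoding for the compressed branch to be taken
url = "https://cmsweb.cern.ch/reqmgr2/data/request?status=ACTIVE"
req = urllib.request.Request(url, headers={"Accept-Encoding": "gzip"})
with urllib.request.urlopen(req) as resp:
    payload = resp.read()
    if resp.headers.get("Content-Encoding") == "gzip":
        payload = gzip.decompress(payload)
print("received %d bytes" % len(payload))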

FYI: @drkovalskyi

Additional context and error message: [1]

$ time curl --cert $X509_USER_CERT --key $X509_USER_KEY "https://cmsweb.cern.ch/reqmgr2/data/request?status=ACTIVE" > active_requests.json
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   150    0   150    0     0      0      0 --:--:--  0:02:48 --:--:--    33

real    2m48.514s
user    0m0.086s
sys 0m0.111s
$ cat active_requests.json 
<html>
<head><title>502 Bad Gateway</title></head>
<body>
<center><h1>502 Bad Gateway</h1></center>
<hr><center>nginx</center>
</body>
</html>

[2]

$ time curl --cert $X509_USER_CERT --key $X509_USER_KEY "https://cmsweb.cern.ch/reqmgr2/data/request?status=acquired" > acquired_requests.json
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  270M    0  270M    0     0  3693k      0 --:--:--  0:01:15 --:--:-- 37.5M

real    1m15.023s
user    0m1.230s
sys 0m1.614s
amaltaro commented 2 years ago

@muhammadimranfarooqi Imran, could you please clarify what those limits are at the Nginx layer and try to check any logs that could confirm this suspicion?

vkuznet commented 2 years ago

Alan, I doubt that your conclusion is correct. The 502 gateway error in fact says that the backend, rather than the FE, is down, and therefore the FE can't reach it. I'm confident that nginx is capable of a much higher load, and since nginx stays on par with Apache, it can sustain around 500 req/sec in our setup. I also do not understand what the MB in your example represent: is it a single response size, or the total size across all requests? Finally, your curl example does not say anything about the BE performance. As such, I suggest the following:

Bottom line, I doubt anything is wrong with our FE throughput; rather, the actual issue here is poor performance of the Python back-end servers. I see only two ways to improve it:

amaltaro commented 2 years ago

Valentin, I don't think the problem here is throughput, but rather the ability to serve large data to the end users (it does not matter if it's 1 or 100 requests!). I remember discussing similar issues with Imran in the past, and he mentioned certain limitations with the Nginx service. I have also had similar issues with eoscmsweb in the past, and the problem was with Nginx as well (502 Bad Gateway).

Anyway, I haven't reached any conclusions yet and I am actually trying to identify where the problem is. Let's see what Imran can contribute here.

vkuznet commented 2 years ago

My suggestion remains: please clearly define what "large" data means, and measure the performance of the BE first. And, as a first action, please start using gzipped content, as whatever you refer to as large can be compressed many fold and therefore may not require any action on the infrastructure and/or FEs.

muhammadimranfarooqi commented 2 years ago

Hi @amaltaro

We have a dashboard for nginx. Please have a look.

https://monit-grafana.cern.ch/d/k8s-be-prod-nginx/cmsweb-k8s-be-production-nginx-ingress?orgId=11&from=now-2d&to=now

amaltaro commented 2 years ago

@muhammadimranfarooqi thanks for pointing us to that dashboard. Unfortunately, I fail to see how to map any of that information to the issue being investigated here.

@vkuznet since you have already joined this discussion, can you please follow this up then? Thanks!

vkuznet commented 2 years ago

Alan, I have little knowledge of the system, and in order to even understand it I need more input from you. For instance:

That said, I also mentioned that I have little time in Q3 due to my schedule, and I may not have time to look into this issue until late August.

amaltaro commented 2 years ago

Those are all valid questions, but I think you do not need to know any of these in order to evaluate whether the CMSWEB frontends/Nginx are the limiting factor here (or whether the problem is with the backend, as you previously suggested).

I agree that we need to look into this issue with the longer-term design and implementation in mind, but the main goal of this one issue is to identify why we get the 502 Bad Gateway and to see if this can be fixed in O(days) instead of weeks/months.

muhammadimranfarooqi commented 2 years ago

Hi @vkuznet @amaltaro

In the nginx settings, we have put these restrictions:

config:
    client-body-timeout: "300"
    client-header-timeout: "300"
    proxy-body-size: 200m
    proxy-connect-timeout: "300"
    proxy-read-timeout: "300"
    proxy-send-timeout: "300"
    upstream-keepalive-timeout: "0"

That corresponds to the timeout shown in one of the t0_reqmon plots in the dashboard: https://monit-grafana.cern.ch/d/k8s-be-prod-nginx/cmsweb-k8s-be-production-nginx-ingress?orgId=11&from=now-24h&to=now&viewPanel=92

Should we increase this 200MB upstream limit?

amaltaro commented 2 years ago
proxy-body-size: 200m

Thank you for providing these details, @muhammadimranfarooqi. I'd suggest increasing it to 400MB then. Once it's deployed, we can test this specific call once again.

vkuznet commented 2 years ago

@muhammadimranfarooqi please do not change this setting on the production cluster yet; I'll provide details in a bit.

muhammadimranfarooqi commented 2 years ago

Yes, @vkuznet, I was waiting for your comment. I have not changed anything yet.

vkuznet commented 2 years ago

@amaltaro, I'm sorry but I completely disagree with your request. The FEs are designed for high throughput for ALL CMSWEB services, and we simply can't increase resources because a BE misbehaves. In this case, increasing the body size implies that the FE will be stuck with this request and will not serve other clients. As I wrote, since you did not provide any description of the API patterns, I am against any changes on the FE. Let me clarify:

The solution is simple, and I already pointed it out: the BE and clients should start using gzipped content. Your API, which ships 270MB, can easily fit within the FE requirements if it uses gzipped content, which will reduce this size by a factor of 10-100; no changes to the FE will be required and no blockage will happen on the FE or BE. Please do not address a problem by simply requiring more RAM or resources. We battled this before with DBS, and I'm against changing FE settings without a clear understanding of how the resources will be used. Moreover, by doubling the size we simply postpone the same issue on the FE. Are you confident that ReqMgr2 will not ship a 400MB body size? I doubt it, since it is a matter of time and load on our production systems. As such, any threshold increase should be accompanied by a clear description of how this happens, what the patterns are, who the clients are, what the rate is, etc.

That said, I also understand that we are talking about a production service. As such, we may accept a change on the FE as a temporary solution to allow the current service to provide data to the client. But the increase we will make is only from 200MB to 300MB (temporary), which will allow the ReqMgr2 API to behave now and force us to change the ReqMgr2 behavior ASAP to use gzipped content with the highest priority. Does that sound like an acceptable compromise?

vkuznet commented 2 years ago

@muhammadimranfarooqi, we had a discussion during today's WMCore meeting, and the priority for this issue has been raised. Please proceed and increase the threshold on the nginx service to what Alan has suggested, i.e. 400MB. This will be a temporary fix until gzipped content is put in place.

@amaltaro, please review the memory usage of ReqMgr2, as it has spikes according to this monitoring dashboard. The spikes go up to 3GB of RAM (I don't know if they are related to this specific API, but it may be the case). From previous experience we know that Python memory management is not ideal, and quite often the RAM usage of a server can increase significantly when serving large chunks of data. For example, the DBS spikes went up to 10GB and we were unable to run it reliably on the k8s infrastructure.

muhammadimranfarooqi commented 2 years ago

Hi @vkuznet @amaltaro

I have increased the size from 200MB to 400MB.

amaltaro commented 2 years ago

@muhammadimranfarooqi @vkuznet thanks for updating it.

I tried the same query once again and it was successful:

$ time curl -v --cert $X509_USER_CERT --key $X509_USER_KEY "https://cmsweb.cern.ch/reqmgr2/data/request?status=acquired" > acquired_requests.json
...
< HTTP/1.1 200 OK
< Date: Mon, 25 Jul 2022 16:11:11 GMT
< Server: Apache
< Content-Type: text/plain;charset=utf-8
< Vary: Accept,Accept-Encoding
< Cache-Control: max-age=3600
< X-Rest-Status: 100
< X-Rest-Time: 35687958.241 us
< CMS-Server-Time: D=35697288 t=1658765435763543
< Transfer-Encoding: chunked
< 
  0     0    0     0    0     0      0      0 --:--:--  0:00:35 --:--:--     0{ [data not shown]
100  271M    0  271M    0     0  3609k      0 --:--:--  0:01:16 --:--:-- 24.8M
* Connection #0 to host cmsweb.cern.ch left intact

real    1m17.004s
user    0m1.274s
sys 0m1.742s

which confirms my suspicion of frontend/Nginx limitations.

@drkovalskyi if you want to re-enable the original implementation you have on your monitoring setup, could you please report here how it goes?

vkuznet commented 2 years ago

And, for the record, the existing JSON size is 272MB, while gzip compression provides a 17x reduction, bringing it down to 16MB. As such, the transfer time will be smaller too.
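
A quick local check of that ratio (a sketch, assuming the JSON dump from one of the curl calls above, e.g. acquired_requests.json, is on disk):

import gzip
import os
import shutil

# compress the previously downloaded JSON dump and compare sizes
src = "acquired_requests.json"  # file written by the curl call above
with open(src, "rb") as fin, gzip.open(src + ".gz", "wb") as fout:
    shutil.copyfileobj(fin, fout)

orig = os.path.getsize(src)
comp = os.path.getsize(src + ".gz")
print("original %.0f MB, gzipped %.0f MB, ratio %.1fx"
      % (orig / 1e6, comp / 1e6, orig / comp))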

drkovalskyi commented 2 years ago

Thanks guys for looking into it. Unfortunately it doesn't work for me. I get:

    curl.perform()
error: (18, 'transfer closed with outstanding read data remaining')

I use the https://cmsweb.cern.ch/reqmgr2/data/request?status=ACTIVE query, which returns more than the acquired status alone.

vkuznet commented 2 years ago

@amaltaro, once again, could you please clearly specify where the data comes from and what data size we should expect? It seems to me that doubling the payload size only accounts for acquired workflows, but Dima's query specifies different conditions. Since we do not know the pattern of the query, nor have any idea about the amount of data we should expect, I hope you can provide these numbers so that we can understand the scope and requirements of this API's behavior. Moreover, we need to know (without approximate guesses) what range we should expect from this API, i.e. do we need 400MB, 1GB, or what is the upper limit here?

amaltaro commented 2 years ago

Sorry, I mixed up acquired and ACTIVE in my query.

Let me try to expand a bit on the numbers and what to expect:

Regarding the query Dima runs, it's one of the heaviest ReqMgr2 queries because it's basically fetching all the active workflows in the system and their full documents from CouchDB. I could be wrong, but I think we do not have any other system executing such a call. Dima can clarify as well, but this call shouldn't be executed more than a handful of times per day.

I noticed that we no longer get the 502 Bad Gateway; instead the server responds with HTTP/1.1 200 OK, but then the connection gets closed before the transfer completes. HTH.

[*] https://cmsweb.cern.ch:8443/reqmgr2/data/request?name=cmsunified_task_BPH-RunIISummer20UL16GEN-00127__v1_T_220316_105637_8586

vkuznet commented 2 years ago

Alan, thanks for this info. Now we need to translate it into load. If we have 19K acquired workflows, which translates into roughly 300MB, and according to your previous post it takes 1m17s, then 30K workflows will give almost 500-600MB. But I have no idea how much time the service will take to produce this data, nor how much time it will take to serialize and transfer it. If the service completes the request within the FE interval (5min), then the client will get 200 OK, but the client still needs to fetch the full data, which accounts for network time (reflected in the time from BE to FE and nginx). Since I do not know how the data are produced and shipped to the client, I can't answer this question. Is it possible that the time to acquire this data plus the time to deliver it to the client will take more than 5min? I do not know. All of these questions depend on how the service is implemented, on the latency of the database, on the time spent serializing the data, etc.

In order to answer this question, you should go to your service and perform this call without any FE involvement, e.g. log in to a pod and fetch data from your service with a local curl call, e.g. curl http://reqmgr2.dmwm:port/api. This call goes through the k8s network (i.e. bypasses the FE), and you can measure how much time it takes. Once we know this number we can understand where the issue is and where it should be addressed.

drkovalskyi commented 2 years ago

BTW, I cannot get https://cmsweb.cern.ch/reqmgr2/data/request?status=acquired with pycurl, while it works if I use curl directly. I can switch to curl, but it's somewhat ugly.

vkuznet commented 2 years ago

@amaltaro I think we have a complex problem here (sorry for the long post, but it outlines all the technical details needed to understand the issue).

I finally figured out how to get a proper set of CMS HTTP headers and to make a proper curl call to localhost on the ReqMgr2 host, i.e. bypassing the FE entirely. And I am unable to query the ReqMgr2 request API with the ACTIVE value. Below you can see the full series of steps which you (or anyone with access to the dmwm namespace) can repeat. For security reasons I stripped off the port number and I do not provide the location of the hmac file (but you should know this information).

For transparency here are the steps I did:

time ./rcurl http://localhost:<port>/reqmgr2/data/request?status=ACTIVE > /dev/null
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100    13    0    13    0     0      0      0 --:--:--  0:01:06 --:--:--     3
curl: (18) transfer closed with outstanding read data remaining

real    1m6.421s
user    0m0.011s
sys     0m0.023s

But I can successfully run the following command:

time ./rcurl "http://localhost:8246/reqmgr2/data/request?status=acquired" > /dev/null
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  271M    0  271M    0     0  7055k      0 --:--:--  0:00:39 --:--:-- 62.7M

real    0m39.390s
user    0m0.065s
sys     0m0.123s

It tells me the following:

From the above observations, I still conclude that the best solution is to enable gzip encoding. But since it is not yet in ReqMgr2, in order to move forward we need the following info:

For the record, here is the code to get CMS HTTP headers using hmac file:

#!/usr/bin/env python

import hmac, hashlib

def authz_headers(hmac_key):
  """Create fake authentication and authorisation headers compatible
  with the CMSWEB front-ends. Assumes you have the HMAC signing key
  the back-end will use to validate the headers.

  :arg bytes hmac_key: binary key data for signing headers.
  :returns: list of header name, value tuples to add to a HTTP request."""
  headers = { "cms-auth-status": "OK",
              "cms-authn-method": "ServerMonitor",
              "cms-authn-login": "server-monitor",
              "cms-authn-name": "Server Monitor" }
  prefix = suffix = ""
  for hk in sorted(headers.keys()):
    if hk != "cms-auth-status":
      prefix += "h%xv%x" % (len(hk), len(headers[hk]))
      suffix += "%s%s" % (hk, headers[hk])

  # hmac requires bytes for both the key and the message under Python 3
  msg = (prefix + "#" + suffix).encode("utf-8")
  cksum = hmac.new(hmac_key, msg, hashlib.sha1).hexdigest()
  headers["cms-authn-hmac"] = cksum
  return list(headers.items())

fname = ""  # NOTE: SHOULD BE SET using location of our hmac file
with open(fname, "rb") as keyfile:
  authz = authz_headers(keyfile.read())
print(authz)
vkuznet commented 2 years ago

@drkovalskyi I think an intermediate solution here is to get the data in chunks, i.e. you can make individual calls to the ReqMgr2 service using different states, e.g. status=new, status=acquired, etc. Since the data for individual states are smaller, I would expect that you can get all the info by doing multiple calls with different statuses. Once we figure out the size of the data for the ACTIVE status, we can do something about it on the k8s infrastructure using the recipe I outlined in my previous comment.
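
A minimal sketch of such a per-status loop (assuming pycurl and the usual X509 environment variables; the status list below is illustrative, not exhaustive):

import os
from io import BytesIO

import pycurl

# illustrative subset of active statuses; extend as needed
STATUSES = ["new", "assignment-approved", "assigned", "staging", "staged",
            "acquired", "running-open", "running-closed"]

def fetch(status):
    """Fetch ReqMgr2 requests in a single status and return the raw body."""
    buf = BytesIO()
    curl = pycurl.Curl()
    curl.setopt(pycurl.URL,
                "https://cmsweb.cern.ch/reqmgr2/data/request?status=" + status)
    curl.setopt(pycurl.SSLCERT, os.environ["X509_USER_CERT"])
    curl.setopt(pycurl.SSLKEY, os.environ["X509_USER_KEY"])
    curl.setopt(pycurl.WRITEDATA, buf)
    curl.perform()
    curl.close()
    return buf.getvalue()

data = {status: fetch(status) for status in STATUSES}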

amaltaro commented 2 years ago

Thank you for checking this in the reqmgr2 backends, Valentin. Your comment about a double transfer is valid and indeed a concern.

I had a look at the WMCore code, and it looks like ReqMgr2 is actually triggering the following POST request to CouchDB (converted to curl):

curl -H 'Content-Type: application/json' --cert $X509_USER_CERT --key $X509_USER_KEY "https://cmsweb.cern.ch/couchdb/reqmgr_workload_cache/_design/ReqMgr/_view/bystatus?include_docs=true" -d '{"keys": ["new","assignment-approved","assigned","staging","staged","acquired","running-open","running-closed","failed","force-complete","completed","closed-out","announced","aborted","aborted-completed","rejected"]}' > include_docs.json

where it queries a CouchDB view and filters requests against a large list of keys (in addition to asking for the whole document to be transferred). I just executed this call a few times and it takes on average 25 seconds to finish; the JSON size - on the file system - is around 315MB.

I logged in to vocms0742 - which hosts this database - and executed the same query against localhost. The time to get it served is very similar to the one going through the frontends (25 seconds as well).

I googled for the possibility of requesting compressed data from CouchDB view, but it doesn't seem to be supported.

I also googled for that "transfer closed ..." error, and stack overflow suggests a possible tweak/configuration that can be evaluated for Nginx: https://stackoverflow.com/questions/33522615/curl-transfer-closed-with-outstanding-read-data-remaining

Lastly, concerning the spikes in memory footprint for ReqMgr2, I think this query could easily add an extra GB to the ReqMgr2 memory footprint, since it has to load all of this CouchDB data in memory, and then an extra data structure is created after applying some filters to the CouchDB output. See this method formatting the CouchDB output before it's transferred to the user: https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/Services/RequestDB/RequestDBReader.py#L62

drkovalskyi commented 2 years ago

@drkovalskyi I think an intermediate solution here is to get the data in chunks, i.e. you can make individual calls to the ReqMgr2 service using different states, e.g. status=new, status=acquired, etc. Since the data for individual states are smaller, I would expect that you can get all the info by doing multiple calls with different statuses. Once we figure out the size of the data for the ACTIVE status, we can do something about it on the k8s infrastructure using the recipe I outlined in my previous comment.

I already do that. If I exclude acquired I can get everything else in one call. I cannot get acquired though. It fails with pycurl.

drkovalskyi commented 2 years ago

Actually, I can now get the acquired request to finish successfully with pycurl. I haven't changed anything on my side. So operationally the pressure is reduced, since we have a working solution even if it's not completely reliable.

vkuznet commented 2 years ago

@amaltaro, thanks for giving more feedback about the actual CouchDB query, the transfer time, and the additional links. I think the issue with the large payload is related to the connection being closed within the ReqMgr2 API itself. From your stack overflow link I see two solutions:

I would rather adjust the client's HTTP headers to re-use the HTTP connection instead of changing the nginx settings. Nginx provides access to all services, and any change to it should be carefully evaluated so as not to break other sub-systems or open a door for various attacks from clients, which may happen unintentionally and without the client's knowledge.

I'm not sure where exactly in the WMCore codebase the HTTP headers are set for queries to CouchDB, but it would be useful if you check this first and see whether a keep-alive HTTP header is set in the calls ReqMgr2 makes to CouchDB, e.g. Keep-Alive: timeout=60, max=1000 (here the timeout should cover the time required to fetch the CouchDB data, and max specifies the maximum number of requests allowed over the connection).
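
For illustration only, a sketch of what attaching such a header to the CouchDB view call could look like (this is not the actual WMCore code path; the host/port and the single-key body are hypothetical):

import json
import http.client

headers = {
    "Content-Type": "application/json",
    "Connection": "Keep-Alive",
    # timeout/max values follow the suggestion above
    "Keep-Alive": "timeout=60, max=1000",
}
# hypothetical CouchDB endpoint; the real deployment host/port differ
conn = http.client.HTTPConnection("localhost", 5984)
conn.request("POST",
             "/reqmgr_workload_cache/_design/ReqMgr/_view/bystatus?include_docs=true",
             body=json.dumps({"keys": ["acquired"]}),
             headers=headers)
rows = json.loads(conn.getresponse().read())
conn.close()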

vkuznet commented 2 years ago

@drkovalskyi I also suggest that you try to set a Keep-Alive: timeout=200, max=1000 header with your pycurl client, even though I'm not sure if it will help in this case.
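
A hedged pycurl sketch of setting that header (the certificate handling and header values are assumptions):

import os
from io import BytesIO

import pycurl

buf = BytesIO()
curl = pycurl.Curl()
curl.setopt(pycurl.URL,
            "https://cmsweb.cern.ch/reqmgr2/data/request?status=acquired")
curl.setopt(pycurl.SSLCERT, os.environ["X509_USER_CERT"])
curl.setopt(pycurl.SSLKEY, os.environ["X509_USER_KEY"])
# ask to keep the connection alive while the large payload streams back
curl.setopt(pycurl.HTTPHEADER, ["Connection: Keep-Alive",
                                "Keep-Alive: timeout=200, max=1000"])
curl.setopt(pycurl.WRITEDATA, buf)
curl.perform()
curl.close()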

The problem here is double connection to our FEs:

I do not remember whether our FE (and nginx in this case) propagates the client's options through the entire HTTP request; it most likely depends on whether ReqMgr2 (and WMCore) acknowledges the client's HTTP headers. If so, this option would be applied across the entire request chain, so it is worth trying out.

vkuznet commented 2 years ago

@amaltaro, @drkovalskyi where do we stand on this issue? Did you try to use the Keep-Alive HTTP header in a client?

vkuznet commented 2 years ago

@amaltaro, @drkovalskyi did you try the Keep-Alive patch I provided? If so, does it change anything?

amaltaro commented 2 years ago

@vkuznet not yet, Valentin. I have a few other things in the pipeline before I can get to it. I should be able to look into it by the end of this week (this test isn't that simple, btw).

vkuznet commented 1 year ago

@amaltaro , I got the numbers. All tests passed with the following timing:

real    0m16.655s
user    0m3.042s
sys     0m4.472s

lxplus# ls -alh /tmp/data.json
-rw-r--r--. 1 valya zh 554M Sep 22 16:50 /tmp/data.json


Therefore, from this single test I would say that the `FE+nginx+BE` data flow works just fine. If you want, we can now perform a test with some number of concurrent clients; just specify how many clients we should generate. Please note that there is no Keep-Alive header here, as it is not set by default in curl.
amaltaro commented 1 year ago

Thanks Valentin. I think we can test it with up to 15 concurrent requests (steps of 5).

vkuznet commented 1 year ago

I created this ticket, https://its.cern.ch/jira/browse/CMSKUBERNETES-185, to perform the tests and asked @muhammadimranfarooqi to provide a breakdown.

vkuznet commented 1 year ago

According to the results published in https://its.cern.ch/jira/browse/CMSKUBERNETES-185, you can see the following: for 10 concurrent clients with a 500MB payload, we have 0 failures with keep-alive, while we have 6 failures without keep-alive. This is exactly what I was expecting. When we start hitting it with more concurrent clients, the results (with or without keep-alive) are the same. It means that this PR may help to fix the problem for low concurrent load on our infrastructure.

That said, the actual fix is to reduce the payload size by simply using gzip encoding.

vkuznet commented 1 year ago

@amaltaro, do we need to keep this open? My understanding is that the actual issue is related to the large payload, which we addressed with gzip encoding. Please make a decision on whether we need to keep this issue open.

amaltaro commented 1 year ago

We have been running with more than 30k requests for the past 2 months, and things are looking much better. Next week we deploy support for gzip response content as well, so I would suggest closing it. Thank you for your contribution on this, Valentin.
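
For reference, a minimal sketch of how gzip responses can be enabled in a CherryPy-based REST application (an assumption about the mechanism, not necessarily how the WMCore change was implemented):

import cherrypy

class Root(object):
    @cherrypy.expose
    def index(self):
        return '{"result": []}'

# CherryPy's built-in gzip tool: clients sending Accept-Encoding: gzip
# receive compressed responses for the listed MIME types
cherrypy.config.update({
    "tools.gzip.on": True,
    "tools.gzip.mime_types": ["application/json", "text/plain"],
})
cherrypy.quickstart(Root())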

amaltaro commented 1 year ago

Just an FYI. Given that I needed to use the instructions provided in this comment, https://github.com/dmwm/WMCore/issues/11231#issuecomment-1195620697, to test t0_reqmon, and that it required further changes to the script generating the HTTP headers, I decided to consolidate it in this wiki page: https://github.com/dmwm/WMCore/wiki/REST-localhost-tests

todor-ivanov commented 1 year ago

Thanks @amaltaro and @vkuznet for putting those instructions in one place. They are indeed useful. Alan, we may actually link them from the troubleshooting wiki page as well.

amaltaro commented 1 year ago

@todor-ivanov good idea. Can you take care of it? Thanks