dmwm / DBS

CMS Dataset Bookkeeping Service
Apache License 2.0

file descriptors surge #603

Closed. yuyiguo closed this issue 4 years ago.

yuyiguo commented 5 years ago

@vkuznet @h4d4 @bbockelm @belforte @amaltaro The most recent DBS instability (May 7-8) was different from the past ones. From the monitoring plots, I did not see high memory or CPU usage. In addition, I checked /var/log/messages and did not see the OOM-killer entries we had before. However, the FD numbers were really high, as shown on the monitoring plots, sometimes 15K+. I checked the APIs called while the fds were high; they covered a wide range.

I am going to look at these APIs to check for DB connection leakage.

Any suggestions on how to attack the fds problem?

belforte commented 5 years ago

sorry, no clues from me. Stefano


amaltaro commented 5 years ago

Where are you getting this file descriptor monitoring from? Was it 15k file descriptors for DBS in general on one backend, or 15k for one single process on a backend?

If you look at /proc/PID/fd/ on the backend, you can list all the file descriptors your process/PID has opened. Looking at vocms0136 right now, I see around 40 of them, where a bunch of them are accessing the same tnsnames.ora file (probably unwanted), and most of the rest are network socket descriptors.

I assume each query to Oracle requires one socket. It would be great to correlate queries/DAOs with the number of file descriptors. I don't know these details in depth, but I'm surprised to see 15k file descriptors for a process that is handling 25(?) user requests.
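
For reference, here is a minimal sketch (not part of DBS or the exporters) of how one could classify what a process's descriptors point at by reading /proc/<pid>/fd; the PID and the grouping are illustrative assumptions.

# Sketch: classify the open file descriptors of a process by what they point
# to (socket, pipe, regular file, ...) using /proc. The PID is a placeholder.
import os
from collections import Counter

def classify_fds(pid):
    """Return a Counter of fd target types for the given pid."""
    fd_dir = "/proc/%s/fd" % pid
    counts = Counter()
    for fd in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, fd))
        except OSError:
            continue  # the fd was closed while we were looking
        if target.startswith("socket:"):
            counts["socket"] += 1
        elif target.startswith("pipe:"):
            counts["pipe"] += 1
        else:
            counts[target] += 1  # regular files keep their full path
    return counts

if __name__ == "__main__":
    for target, n in classify_fds(28149).most_common(20):
        print(n, target)

This would, for instance, make the repeated tnsnames.ora descriptors and the socket count visible at a glance.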

vkuznet commented 5 years ago

Alan, the fds we get are from the /proc/pid/fd area; you're right, we can look for them there. But it has to be done in real time, since they may be closed. Once the alarm is gone it is too late to look there, since they will have disappeared.

Therefore, the only way I see to look them up is to add logic to the exporter code, e.g. if the fds go above a threshold, dump the content of /proc/pid/fd somewhere so that later (when we are notified by an alarm) we can look at them.

When I have time I can implement this in the exporter.
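
A minimal sketch of that exporter logic, assuming the exporter already knows the DBS PID; the threshold, dump location and function name are illustrative, not the actual cmsweb-exporters code.

# Sketch of the proposed logic (not the actual cmsweb-exporters code): if the
# fd count goes above a threshold, snapshot /proc/<pid>/fd to a file so it can
# be inspected after the alarm. Threshold and dump directory are assumptions.
import os
import time

FD_THRESHOLD = 1000           # matches the "above 1000" alert level
DUMP_DIR = "/tmp/fd-dumps"    # hypothetical location

def dump_fds_if_high(pid):
    fd_dir = "/proc/%s/fd" % pid
    fds = os.listdir(fd_dir)
    if len(fds) <= FD_THRESHOLD:
        return
    if not os.path.isdir(DUMP_DIR):
        os.makedirs(DUMP_DIR)
    out = os.path.join(DUMP_DIR, "fds-%s-%d.txt" % (pid, int(time.time())))
    with open(out, "w") as dump:
        for fd in fds:
            try:
                dump.write("%s -> %s\n" % (fd, os.readlink(os.path.join(fd_dir, fd))))
            except OSError:
                pass  # fd disappeared between listdir and readlink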


vkuznet commented 5 years ago

Here is relevant ticket: https://github.com/vkuznet/cmsweb-exporters/issues/1

As I wrote, I can add some code for it once I have time.


vkuznet commented 5 years ago

Yuyi, meanwhile, until I add this info to the exporters, you/Lina can simply look up the fd area on a node; here is an example of some process and its fds:

ls -al /proc/28149/fd/
total 0
dr-x------. 2 cmspopdb users  0 May  7 13:57 .
dr-xr-xr-x. 9 cmspopdb users  0 Feb 25 21:03 ..
lrwx------. 1 cmspopdb users 64 May  9 14:31 0 -> /dev/pts/1
lrwx------. 1 cmspopdb users 64 May  9 14:31 1 -> /dev/pts/1
lrwx------. 1 cmspopdb users 64 May  7 13:57 2 -> /dev/pts/1
lrwx------. 1 cmspopdb users 64 May  9 14:31 255 -> /dev/pts/1
lr-x------. 1 cmspopdb users 64 May  9 14:31 3 -> /var/lib/sss/mc/passwd
lrwx------. 1 cmspopdb users 64 May  9 14:31 4 -> socket:[134636807]

So if you look at the DBS pid area at the time when the fd number is high, you should see what those fds are.

At the moment I see from the monitoring plots that DBS experiences a high number of fds every half hour; the number of fds is above 1K.


yuyiguo commented 5 years ago

Thanks Alan, Valentin, I am watching ...

yuyiguo commented 5 years ago

total 0
dr-x------. 2 _dbs _dbs  0 May  9 14:08 .
dr-xr-xr-x. 9 _dbs _dbs  0 May  9 14:08 ..
lr-x------. 1 _dbs _dbs 64 May  9 14:08 0 -> /dev/null
l-wx------. 1 _dbs _dbs 64 May  9 14:08 1 -> pipe:[24875653]
lrwx------. 1 _dbs _dbs 64 May  9 14:21 10 -> socket:[25211083]

yuyiguo commented 5 years ago

@vkuznet @amaltaro The above is from vocms0163. How do I find out what they were?

bbockelm commented 5 years ago

You are going to want to use “lsof -p $PID” instead; that will determine the IP address of all those sockets listed.
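
If lsof is not at hand on a node, a rough alternative sketch using the psutil library (an assumed extra dependency, not part of the DBS stack) can print the state and remote endpoint of each socket a process holds:

# Sketch: print socket states and endpoints for a PID with psutil
# (psutil is an assumed extra dependency, not part of the DBS stack).
import psutil

def socket_summary(pid):
    for conn in psutil.Process(pid).connections(kind="inet"):
        laddr = ("%s:%s" % conn.laddr) if conn.laddr else "-"
        raddr = ("%s:%s" % conn.raddr) if conn.raddr else "-"
        print(conn.status, laddr, "->", raddr)

socket_summary(28149)  # the PID is an illustrative placeholder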

yuyiguo commented 5 years ago

OK, thanks

belforte commented 5 years ago

IIUC socket means network connection, so they had better be from the FrontEnd. Do we have a counter of FE connections to compare with? IIUC nobody but the FE is allowed to talk to the BE servers. Why didn't the FE nodes "die first"?


bbockelm commented 5 years ago

Sockets can also be outgoing - to Oracle!

yuyiguo commented 5 years ago

This is a file from the command “lsof -p $PID” on 0136 when there were about 600+ fds: fd-0848-0136-600.txt

yuyiguo commented 5 years ago

@belforte @amaltaro @vkuznet @bbockelm @h4d4
vocms0158, vocms0760, vocms0162 and vocms0164 are FEs. There are a lot of "CLOSE_WAIT" and "ESTABLISHED" connections to them.

yuyiguo commented 5 years ago

CLOSE_WAIT means that the local end of the connection has received a FIN from the other end, but the OS is waiting for the program at the local end to actually close its connection.

The problem is that the program running on the local machine is not closing the socket. It is not a TCP tuning issue. A connection can (quite correctly) stay in CLOSE_WAIT forever while the program holds the connection open.

Once the local program closes the socket, the OS can send the FIN to the remote end, which transitions it to LAST_ACK while waiting for the ACK of the FIN. Once that is received, the connection is finished and drops from the connection table (if your end is in CLOSE_WAIT, you do not end up in the TIME_WAIT state).
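
Purely as an illustration of that state machine, a small sketch (assuming psutil is installed, and with a placeholder PID) that counts one process's TCP connections per state:

# Sketch: count a process's TCP connections per state (CLOSE_WAIT, ESTABLISHED, ...).
# psutil is an assumed extra dependency; the PID is a placeholder.
from collections import Counter
import psutil

def connection_states(pid):
    return Counter(c.status for c in psutil.Process(pid).connections(kind="tcp"))

print(connection_states(28149))
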
belforte commented 5 years ago

Of course it could be zillions of external requests that the FE happily passes on, but ... thousands of them? Or is this some FE failure? Do we (Lina?) have some monitoring on the FE side?

yuyiguo commented 5 years ago

Got this from Stack Overflow. How do we close the connection to the FE? I cannot remember doing the closing explicitly in the DBS code. Can someone remind me?

vkuznet commented 5 years ago

Yuyi, a simple Google search leads to this open ticket: https://github.com/cherrypy/cherrypy/issues/1304

which provides some recipes to handle this issue, e.g. increasing the number of threads and setting the response.headers.connection config to always return a value of close.

In the short run, I suggest you study this ticket and apply the things people are talking about, while in the long run I really suggest that we seriously consider replacing the CherryPy stack in our Python-based web apps, probably with Flask+wsgi.
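
For illustration, a sketch of what those two remedies look like in plain CherryPy configuration; this is generic CherryPy, not the DBS/WMCore configuration, whose parameter placement is discussed further below.

# Generic CherryPy sketch of the two remedies from cherrypy/cherrypy#1304:
# a larger worker thread pool and a forced "Connection: close" response header.
# This is NOT the DBS/WMCore configuration; it only shows the knobs involved.
import cherrypy

class Root(object):
    @cherrypy.expose
    def index(self):
        return "hello"

if __name__ == "__main__":
    cherrypy.config.update({
        "server.thread_pool": 200,                      # more worker threads
        "tools.response_headers.on": True,              # built-in response_headers tool
        "tools.response_headers.headers": [("Connection", "close")],
    })
    cherrypy.quickstart(Root())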

belforte commented 5 years ago

Yet... why all these problems now? Did we change cherrypy? Did the FE change? Is there some external request storm? I see some signs of FE issues in the CRAB TW to CRAB REST connections. Did we change something relevant in the last update?


yuyiguo commented 5 years ago

This is really odd. We already had fd monitoring before the release, and we did not see fds surge to 10k+ in the monitoring before this release.

h4d4 commented 5 years ago

@belforte @yuyiguo Nothing was changed in the latest deployment for the frontends.

h4d4 commented 5 years ago

@belforte @yuyiguo For FE monitoring, this is what we have:

https://monit-grafana.cern.ch/d/thT2ibCiz/frontend-servers?refresh=1m&orgId=11&from=now%2Fw&to=now%2Fw&var-host=vocms0158&var-host=vocms0162&var-host=vocms0164&var-host=vocms0760&var-port=18443

vkuznet commented 5 years ago

Stefano, I'm not sure we can claim it happens only "now"; we had lemon metric outages a long time ago and they were affecting different services, e.g. DBS.

The point is that now we have better monitoring tools to reveal our problems, and we have identified some "hidden" patterns which lead to DBS instabilities. But we may not know all the possible combinations.

I saw fd alarms in the previous release too, but we "let it go".


yuyiguo commented 5 years ago

[root@vocms0136 tmp]# netstat -ant | awk '{print $6}' | sort | uniq -c | sort -n
      1 established)
      1 Foreign
      4 FIN_WAIT2
     38 LISTEN
    472 ESTABLISHED
    952 CLOSE_WAIT
   1560 TIME_WAIT

belforte commented 5 years ago

Valentin, yes! But since yesterday DBS has been restarted every few hours. This is new!

yuyiguo commented 5 years ago

This is from cherrypy/cherrypy#1304, which is a very long-standing open ticket (opened March 2014): "We experienced nearly the same scenario a while back. The issue is that CherryPy does not have a good way to handle HTTP Keep-Alive connections and so they begin to pile up, with a default timeout time specified by the server.socket_timeout parameter. We resolved this problem by increasing the number of threads to 200 (probably a bit much), and setting the response.headers.connection config to always return a value of 'close'. This asks the browser to open a new TCP connection for each new request and tear it down after getting the response. We are currently experimenting with gunicorn and uwsgi, both of which appear to handle Keep-Alives in a better way than CherryPy."

yuyiguo commented 5 years ago

@h4d4 Lina, can we try increasing the number of threads to 200 and setting the response.headers.connection config to always return a value of 'close'? We can start with one of the instances.

vkuznet commented 5 years ago

Yuyi, I suggest taking a time window when DBS had large fd counts and identifying all the clients during that period (sort them and find the top-N). It may be that we have a client which repeatedly initiates requests and drops them without waiting for the response (possibly impatient clients).

Using this dashboard https://monit-grafana.cern.ch/d/_U6nmxCmk/dbs-global-reader?refresh=1m&orgId=11&from=now-6h&to=now

I see node/periods:

vocms0136: 6:30 - 7:00, 8-8:12, 9:46 - 10:00, 11:17 - 11:30 etc.

We can also use CMSWEB timber dashboard to identify top-N in these periods, e.g. https://monit-grafana.cern.ch/d/QADAkezZk/cmsweb-timber?orgId=11&from=now-12h&to=now&var-system=dbs&var-method=All&var-api=All&var-code=All&var-metadataType=All&var-Filters=data.code%7C>%7C99&var-dnFilter=.*

It shows that DBS had datasetlist, fileArray calls around these spots, e.g. 7:00, 10:00, etc.
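
If the dashboards are not enough, the same top-N idea can be applied directly to a frontend access log; the sketch below assumes an Apache-style log where the client identity is the first field, and the log path and timestamp format are illustrative, not the actual cmsweb layout.

# Rough sketch: top-N clients in a frontend access log during a time window.
# The log path, the timestamp format and the assumption that the client
# identity is the first whitespace-separated field are all illustrative.
from collections import Counter

def top_clients(logfile, window, n=10):
    """window is a substring of the request timestamp, e.g. '09/May/2019:07'."""
    counts = Counter()
    with open(logfile) as log:
        for line in log:
            if window in line:
                counts[line.split()[0]] += 1
    return counts.most_common(n)

print(top_clients("/var/log/frontend/access_log", "09/May/2019:07"))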

V.


vkuznet commented 5 years ago

Yuyi, it is not under Lina's responsibility; these are DBS (CherryPy) configuration parameters. Find out how you can set them up in DBS and make an appropriate PR either to the DBS configuration or to the code (if you don't have them as external configuration). V.


amaltaro commented 5 years ago

BTW, updating the number of threads is a DBS configuration parameter (note there are different config files for the different DBS instances): https://github.com/dmwm/deployment/blob/master/dbs/DBSGlobalReader.py#L35

About the response header, you'd need to make changes to the DBS source code, I believe.

We also need to be careful, because if we were sometimes hitting a 9GB RAM footprint with 15 threads, with 200 threads the chances of blowing up the process/node are going to be much higher.

yuyiguo commented 5 years ago

Without studying it further, I thought that was only a configuration file change.

yuyiguo commented 5 years ago

@vkuznet Valentin, I did a lot of work trying to connect an API to the fd surge, but the surge was so quick and each time there were different APIs, so I could not make a firm connection. From yesterday's study of the issue, the top APIs are filechildren, datasets and filelumis, but you saw fileArray today.

I really believe this is a system problem, not a DBS problem. It shows up on DBS because DBS is the most heavily used service in the cmsweb world. It would show up on others too if their usage were as high as DBS's.

belforte commented 5 years ago

Do we know that the surge in FDs is the reason for DBS being unresponsive and eventually restarted by the IT operators? Or is it the other way around: DBS becomes unresponsive for some yet-to-be-found reason and the FDs skyrocket as the FrontEnd tries more and more to connect? The FE dashboard Lina pointed to (https://monit-grafana.cern.ch/d/thT2ibCiz/frontend-servers?refresh=1m&orgId=11&from=now-24h&to=now&var-host=vocms0158&var-host=vocms0162&var-host=vocms0164&var-host=vocms0760&var-port=18443) shows no increase in Apache requests/accesses, while the DBS FDs have those humongous peaks: https://monit-grafana.cern.ch/d/_U6nmxCmk/dbs-global-reader?refresh=1m&orgId=11

If there was no increase in FE traffic, I'd rather not blame remote clients, which also most likely did not change their use pattern "today".

And if the problem is the server not responding, then indeed allowing more cherrypy threads will not help, as Alan pointed out.

vkuznet commented 5 years ago

Yuyi, DBS is not the biggest player on cmsweb according to [1]. Both PhEDEx and CouchDB have 2x more accesses than DBS. And PhEDEx runs on the same nodes as DBS, yet PhEDEx is more stable than DBS.

I know how hard it is to find the issues and I'm helping as much as I can, but I doubt it is instability of the system, since other services run fine on the same nodes.

V.

[1] https://monit-grafana.cern.ch/d/0h4m5ciZk/cmsweb-usage?orgId=11


yuyiguo commented 5 years ago

But neither PhEDEx nor CouchDB is using cherrypy. From the cherrypy ticket, we already saw that this is a cherrypy problem that others had, with no solution since 2014.

vkuznet commented 5 years ago

Stefano, I think Lina can help and provide the timestamps of the lemon alarms to answer your question. I already pointed out the time windows when the fds grew; now we need to correlate this with the lemon alarms.

@h4d4, Lina, do we have a history of them? Can you look them up?


yuyiguo commented 5 years ago

I could not find cherrypy response.headers.connection anywhere. I searched the cherrypy source code. Can someone shed light here?

h4d4 commented 5 years ago

@belforte @vkuznet Valentin, yes, I can check the times of the DBS alarms. Since the latest production upgrade?

Just to clarify, regarding the cmsweb frontends (FE): if there is an overload, lemon triggers 'exception.high_load'. But it has been a while since we got those alarms; actually, the latest alarm showing overload in the frontends was in October 2018. Exceptions: exception.high_load

yuyiguo commented 5 years ago

@h4d4

Lina, could you run netstat -ant | awk '{print $6}' | sort | uniq -c | sort -n on the FEs?

I cannot log on to them. I'd like to see what the fds are there.

Thanks, Yuyi

yuyiguo commented 5 years ago

This may not help us, but for information: I talked with colleagues at Fermilab. It seems that they no longer use cherrypy; instead they are using uwsgi. The main reason for them to switch was that cherrypy did not manage sockets well and there were a lot of CLOSE_WAIT connections left. I was told that CLOSE_WAIT would eventually go away. I guess that might explain why the CLOSE_WAIT connections caused a lot of surging even when the load was low.

yuyiguo commented 5 years ago

There is no response.headers.connection in cherrypy. I am not sure what I missed here.

import cherrypy
dir(cherrypy.response.headers)
['__class__', '__cmp__', '__contains__', '__delattr__', '__delitem__', '__dict__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__gt__', '__hash__', '__init__', '__iter__', '__le__', '__len__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setitem__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', 'clear', 'copy', 'elements', 'encode', 'encode_header_items', 'encodings', 'fromkeys', 'get', 'has_key', 'items', 'iteritems', 'iterkeys', 'itervalues', 'keys', 'output', 'pop', 'popitem', 'protocol', 'setdefault', 'update', 'use_rfc_2047', 'values', 'viewitems', 'viewkeys', 'viewvalues']
dir(cherrypy.response.headers.connection)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>

h4d4 commented 5 years ago

@yuyiguo I just sent you an email with the output of the command you pointed to, since it is a huge list.

vkuznet commented 5 years ago

Yuyi, that is what was suggested by one of the people on the cherrypy issue thread as a remedy for this problem. Do you know how to build/install/run cherrypy under uwsgi?

I also didn't find the response.headers.connection config, but any cherrypy header can be set as follows:

cherrypy.response.headers['Connection'] = 'close'

in any method where you return your results.
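
As a minimal, self-contained illustration of that per-method approach (generic CherryPy, not DBS code):

# Minimal generic CherryPy example of setting the header inside a handler,
# as described above; this is not DBS code.
import cherrypy

class Service(object):
    @cherrypy.expose
    def data(self):
        # ask the client to tear down the TCP connection after this response
        cherrypy.response.headers["Connection"] = "close"
        return "some result"

if __name__ == "__main__":
    cherrypy.quickstart(Service())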


yuyiguo commented 5 years ago

Valentin, I don't know how to run cherrypy with uwsgi.

I am confused by your message:

headers can be set as following:

cherrypy.response.headers['Connection'] = 'close'

in any method where you return your results.

DBS does not directly call cherrypy; it calls WMCore's _addMethod. Here is the simplest DBS API definition:

return dict(dbs_version=self.version, dbs_instance=self.instance)

So where should I add cherrypy.response.headers['Connection'] = 'close'?

WMCore's Root.py configures cherrypy. Should the connection config go there?

yuyiguo commented 5 years ago

Lina got the fd counts for the FEs:

[cmsweb@vocms0158 ~]$ netstat -ant | awk '{print $6}' | sort | uniq -c | sort -n
      1 established)
      1 Foreign
      2 FIN_WAIT2
     19 LISTEN
     25 ESTABLISHED
    385 TIME_WAIT
[cmsweb@vocms0158 ~]$

[cmsweb@vocms0760 ~]$ netstat -ant | awk '{print $6}' | sort | uniq -c | sort -n
      1 established)
      1 FIN_WAIT1
      1 Foreign
      2 SYN_RECV
      6 FIN_WAIT2
     19 LISTEN
     42 ESTABLISHED
    929 TIME_WAIT
[cmsweb@vocms0760 ~]$

[cmsweb@vocms0162 ~]$ netstat -ant | awk '{print $6}' | sort | uniq -c | sort -n
      1 established)
      1 Foreign
      2 SYN_RECV
      6 FIN_WAIT2
     19 LISTEN
     33 ESTABLISHED
    338 TIME_WAIT
[cmsweb@vocms0162 ~]$

[cmsweb@vocms0164 ~]$ netstat -ant | awk '{print $6}' | sort | uniq -c | sort -n
      1 established)
      1 FIN_WAIT1
      1 Foreign
      2 FIN_WAIT2
      3 SYN_RECV
     19 LISTEN
     32 ESTABLISHED
    330 TIME_WAIT
[cmsweb@vocms0164 ~]$

belforte commented 5 years ago

Do we expect any specific correlation between FDs on the FE and the BE? IIUC the problem is that cherrypy leaks FDs, so the FE closes the connection but the BE keeps a stale socket around. Do you know for a fact that those stale sockets are a problem? Did DBS hang because it reached an FD limit?

vkuznet commented 5 years ago

Here is a patch to start WebTools/Root.py under uwsgi: https://github.com/dmwm/WMCore/pull/9189. To make it work in cmsweb we will need to adjust the manage script of the service so that it uses uwsgi with CherryPy and Root.py.

Regarding the cherrypy configuration, I think it should be added to _addMethod, since I don't know if the response object is defined per function or globally. In local tests I placed it inside my method which returned the data.
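
If touching _addMethod or every DBS method turns out to be awkward, a hedged alternative sketch is to attach the header once through a CherryPy hook where the server is configured (e.g. in Root.py); this is generic CherryPy, not an actual WMCore patch.

# Generic CherryPy sketch: force "Connection: close" on every response via a
# before_finalize tool, so individual handler methods need no changes.
# This is not an actual WMCore/Root.py patch.
import cherrypy

def _force_connection_close():
    cherrypy.response.headers["Connection"] = "close"

cherrypy.tools.force_close = cherrypy.Tool("before_finalize", _force_connection_close)
cherrypy.config.update({"tools.force_close.on": True})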

yuyiguo commented 5 years ago

@vkuznet @amaltaro @h4d4 Valentin, thank you so much for producing the uwsgi patch at such lightning speed. I am going to test it on my VM first. Alan, could you please commit it to the 1.1.16-dbs branch?

vkuznet commented 5 years ago

Please keep in mind that there are plenty of uwsgi options, from the number of threads to the choice of thread pool, routing, and auto-reloading of failed Python modules.

I only used/showed the default way of specifying the port and the Python module. The changes to the manage script should be trivial, i.e. set up the configuration environment and run uwsgi. But I have not explored how to set up the individual DBS ports.

I suggest carefully studying the uwsgi documentation later: https://uwsgi-docs.readthedocs.io/en/latest/WSGIquickstart.html
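
For reference, the simplest shape of the uwsgi quickstart pattern looks like the sketch below; the module name, port and process/thread counts are placeholders, not the actual DBS manage-script changes.

# app.py - minimal WSGI module in the form uwsgi expects (generic example,
# not the WMCore Root.py patch). It can be started with something like
#   uwsgi --http :8250 --wsgi-file app.py --processes 2 --threads 10
# where the port and the process/thread counts are placeholders.
def application(environ, start_response):
    start_response("200 OK", [("Content-Type", "text/plain"),
                              ("Connection", "close")])
    return [b"hello from uwsgi\n"]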


h4d4 commented 5 years ago

@vkuznet @yuyiguo Yuyi, maybe what I'm going to say here is nothing new, but I'm checking the lemon alarms since the latest production upgrade, and there is a direct correlation between the 'exception.cmsweb_dbs_is_not_responding' lemon alarms and the 'cmsweb DBS-globalR fds' alerts. This covers the alarms from Wednesday until today; I still need to check the alarms from Tuesday, when the alarms started to show up.

The alerts are being triggered on the same nodes (backends) where the lemon alarms are triggered too. They are generated by the exporters when 'Number of DBS file descriptors is high (above 1000)', and they are sent to my CERN account. I don't know whether you, as the service owner, also receive those alerts; their configuration is under the domain of Valentin's monitoring team.