error: [Errno 24] too many open files

dmwm / DBS

CMS Dataset Bookkeeping Service

Apache License 2.0

7 stars 21 forks source link

error: [Errno 24] too many open files #595

Closed yuyiguo closed 4 years ago

yuyiguo commented 5 years ago

DBS server had "error: [Errno 24] too many open files" around 17/Feb/2019:20:07:04. This made DBS server cannot accept new connections from the client. Below are the email exchanges among developers and cmsweb admin. We will debug the system based on the suggestion below. However, the limit of file descriptor of 1024 is the default and it is too low for DBS. We will ask cmswen admin to increase it first.

Hi,
thanks Lina for providing basic information. So, it means we have very limited
number of open file descriptors on all DBS servers (only 1024, the default
value).

With this basic info/setup you should perform stress tests of DBS server and
monitor during these tests how file descriptors will raise for DBS Global
reader server. To do so you need stress tools and procedure. Here they are:

- generate a list of DBS urls representing different clients. You can make it
via dasgoclient tool, e.g.

# get list of datasets
dasgoclient -query="dataset=/Z*/*/*" > datasets.txt

# convert list of datasets into DBS urls, you can use whatever URL you need
# to create DBS URLs, e.g. cmsweb production, testbed or private VMs
cat datasets.txt | awk '{print "https://cmsweb.cern.ch/dbs/prod/global/DBSReader/datasets?dataset="$1""}' > dbs_urls.txt

- now with list of fully-qualified DBS urls use hey tool to generate distributed load on
  DBS server, e.g.

# run hey_linux benchmark tool with generated list of dbs urls
# I like hey tool since it can take external files with urls and use it in tests
# it also allows to use client certificate to satisfy our frontend auth policy
# but you can use apache ab tool as well if you know how to tweak it
~/public/hey_linux -U ./dbs_urls.txt

# or adjust it with option to have more concurrent clients, e.g.
# run 200 requests with 100 concurrent clients
~/public/hey_linux -U ./dbs_urls.txt -n 200 -c 100

- while running aforementioned tests, open up DBS server window and monitor
number of open file descriptor, e.g. something with bash script
do; then
  sudo lsof -a -p PID
done

Repeat steps above by increasing the load on DBS (number of requests) such that
it should saturate the DBS servers. For instance if you reach 1024 open file
descriptors try more requests.

I used only basic DBS queries (urls) which usually very fast, but you may get
more sophisticated DBS queries which takes longer and therefore will force
DBS servers to keep open more file descriptors.

This is a very basic strategy to identify and debug the problem by simulating it on
DBS server. Of course you don't want to do it on production servers, but it can
be easily done on dedicated VM/DBS instance to simulate the problem. I think
Lina and Yuyi can easily come up with such setup.

If during this exercise you'll reproduce the behavior you can try to "solve" it
by increasing number of file descriptors at system level.

I rather prefer to see this as open DBS issue and report on what has been done
during such simulations instead of exchanging private emails.

Best,
Valentin.

On  0, Lina Marcela Jimenez Becerra <lina.marcela.jimenez.becerra@cern.ch> wrote:
Hello Yuyi, All,
below detailed output of the number of opened fd by each current BD's process, in all productions backends where the service is hosted.
Valentin thanks for the link reference, I used command suggested there.
[cmsweb@vocms0136 ~]$ sudo -H -u _dbs bashs -l -c "/data/srv/current/config/dbs/manage status 'I did read documentation'"
DBSDef is RUNNING, PID 28963
DBSGlobalReader is RUNNING, PID 28971
DBSGlobalWriter is RUNNING, PID 28979
DBSPhys03Reader is RUNNING, PID 28987
DBSPhys03Writer is RUNNING, PID 28999
DBSMigrate is RUNNING, PID 29009
[cmsweb@vocms0136 ~]$ pids=` sudo -H -u _dbs bashs -l -c "/data/srv/current/config/dbs/manage status 'I did read documentation'" | awk '{ORS=" "; print $5}'`
[cmsweb@vocms0136 ~]$ for i in $pids; do echo PID $i '#'opened fd:; (sudo lsof -a -p $i | wc -l); done
PID 28963 #opened fd:
83
PID 28971 #opened fd:
126
PID 28979 #opened fd:
85
PID 28987 #opened fd:
86
PID 28999 #opened fd:
85
PID 29009 #opened fd:
99
[cmsweb@vocms0161 ~]$ sudo -H -u _dbs bashs -l -c "/data/srv/current/config/dbs/manage status 'I did read documentation'"
DBSDef is RUNNING, PID 2235
DBSGlobalReader is RUNNING, PID 2243
DBSGlobalWriter is RUNNING, PID 2251
DBSPhys03Reader is RUNNING, PID 2259
DBSPhys03Writer is RUNNING, PID 2267
DBSMigrate is RUNNING, PID 2275
[cmsweb@vocms0161 ~]$ pids=` sudo -H -u _dbs bashs -l -c "/data/srv/current/config/dbs/manage status 'I did read documentation'" | awk '{ORS=" "; print $5}'`
[cmsweb@vocms0161 ~]$ for i in $pids; do echo PID $i '#'opened fd:; (sudo lsof -a -p $i | wc -l); done
PID 2235 #opened fd:
83
PID 2243 #opened fd:
98
PID 2251 #opened fd:
85
PID 2259 #opened fd:
86
PID 2267 #opened fd:
86
PID 2275 #opened fd:
100
[cmsweb@vocms0163 ~]$ sudo -H -u _dbs bashs -l -c "/data/srv/current/config/dbs/manage status 'I did read documentation'"
DBSDef is RUNNING, PID 16427
DBSGlobalReader is RUNNING, PID 16435
DBSGlobalWriter is RUNNING, PID 16443
DBSPhys03Reader is RUNNING, PID 16451
DBSPhys03Writer is RUNNING, PID 16459
DBSMigrate is RUNNING, PID 16467
[cmsweb@vocms0163 ~]$ pids=` sudo -H -u _dbs bashs -l -c "/data/srv/current/config/dbs/manage status 'I did read documentation'" | awk '{ORS=" "; print $5}'`
[cmsweb@vocms0163 ~]$ for i in $pids; do echo PID $i '#'opened fd:; (sudo lsof -a -p $i | wc -l); done
PID 16427 #opened fd:
83
PID 16435 #opened fd:
126
PID 16443 #opened fd:
85
PID 16451 #opened fd:
85
PID 16459 #opened fd:
86
PID 16467 #opened fd:
100
[cmsweb@vocms0165 ~]$ sudo -H -u _dbs bashs -l -c "/data/srv/current/config/dbs/manage status 'I did read documentation'"
DBSDef is RUNNING, PID 19134
DBSGlobalReader is RUNNING, PID 19142
DBSGlobalWriter is RUNNING, PID 19150
DBSPhys03Reader is RUNNING, PID 19158
DBSPhys03Writer is RUNNING, PID 19166
DBSMigrate is RUNNING, PID 19174
[cmsweb@vocms0165 ~]$ pids=` sudo -H -u _dbs bashs -l -c "/data/srv/current/config/dbs/manage status 'I did read documentation'" | awk '{ORS=" "; print $5}'`
[cmsweb@vocms0165 ~]$ for i in $pids; do echo PID $i '#'opened fd:; (sudo lsof -a -p $i | wc -l); done
PID 19134 #opened fd:
83
PID 19142 #opened fd:
101
PID 19150 #opened fd:
86
PID 19158 #opened fd:
85
PID 19166 #opened fd:
86
PID 19174 #opened fd:
99
Additionally, below the limit for opened files in each backed:
[lijimene@aiadm87 ~]$ broadcast -n "gp-prod" -c "ulimit -a | grep open" -u "cmsweb"
vocms0136:
open files                      (-n) 1024
vocms0161:
open files                      (-n) 1024
vocms0163:
open files                      (-n) 1024
vocms0165:
open files                      (-n) 1024
Best Regards, Lina.
________________________________________
From: Yuyi Guo [yuyi@fnal.gov]
Sent: Tuesday, February 19, 2019 7:12 PM
To: Valentin Y Kuznetsov
Cc: stefano.belforte@gmail.com; Lina Marcela Jimenez Becerra
Subject: Re: global DBS instancr not responding
Valentin,
Yes, we need to check the numbers you mentioned. I cannot log in to DBS server nodes. We need to wait for Lina to get time to check this.
So, Lina, please let us know your finding.
Thanks,
Yuyi
On 2/19/19, 12:06 PM, "Valentin E. Kuznetsov" <vkuznet@gmail.com> wrote:
     Yuyi,
     The 502 Bad Gateway server error response code indicates that the server, while acting as a gateway or proxy, received an invalid response from the upstream server.
     The 503 Service Unavailable server error response code indicates that the server is not ready to handle the request.
     So regardless of who made calls to DBS it seems that upstream server (DBS) was
     not able to serve the requests. We back to square one on this question. Why?
     Did you check number of open file descriptors? How many of those allocated
     by DBS server and how many by ORACLE. Was it DBS server which runs out of
     available file descriptors or underlying DB?
     V.
     On  0, Yuyi Guo <yuyi@fnal.gov> wrote:
     > Hi All,
     >
     > I checked the front end logs and indeed there were a lot of 502/503 errors there. However, it seemed that the sever recovered before the second restarting at 18/Feb/2019 11:05:54. Most of the server 502 errors related to calls from dasgoserver.
     >
     > When a call got the 502/503 error, DBS server would not record this call.  There were some calls made to DBS during the trouble period, so DBS server log looked normal.  Most of the calls made to DBS had http 200 code, a few of them had time out error.
     >
     > Lina, please let us know what your finding on the number of open files.
     >
     > Thanks,
     > Yuyi

h4d4 commented 5 years ago

Hello all,

Sorry for have not been clear, I need to make some clarifications regarding provided value for 'limit for opened files in each backed'[0], this is the default limit value for 'file to open by system calls' for each whole node.

And that default value is set to '1468454' as is shown below, this configuration is already set up on the nodes.

[lijimene@aiadm87 ~]$ broadcast -n "gp-prod" -c "cat /proc/sys/fs/file-max" -u "cmsweb"
vocms0136:
1468646
vocms0161:
1468454
vocms0163:
1468464
vocms0165:
1468454
[lijimene@aiadm87 ~]$

Also a recalling, DBS in hosted in general propose backends, this means that other services are also hosted there[1], as for example crabserver and Reqmgr2.

Yuyi @yuyiguo , Valentin @vkuznet, I'd like to know if at the moment you consider that we should increase the above value. We have an old email thread about this same Error, the email subject is: 'October release of DBS'.

Regarding the proposed tests, let me review this, I think that this should be run first on testbed, what do you think?

Best Regards, Lina.

[0]
[lijimene@aiadm87 ~]$ broadcast -n "gp-prod" -c "ulimit -a | grep open" -u "cmsweb"
vocms0136:
open files (-n) 1024
vocms0161:
open files (-n) 1024
vocms0163:
open files (-n) 1024
vocms0165:
open files (-n) 1024

[1] https://cms-http-group.web.cern.ch/cms-http-group/activity.html

vkuznet commented 5 years ago

Lina, the /rpco/sys/fs/file-max defines a limit on total number of file descriptors we can have on a system, while ulimit shoes soft/hard limits each individual user may have. I think what we need is to increase soft/hard limits for _dbs user who runs DBS services.

Please read about soft/hard/system limits here: https://unix.stackexchange.com/questions/84227/limits-on-the-number-of-file-descriptors and how to change them for particular users here: https://cs.uwaterloo.ca/~brecht/servers/openfiles.html

It seems to me that soft/hard limits for _dbs users should be increased to something like 65535. But it would be nice to find out actual limit through simulation I proposed, i.e. watch how much file descriptors DBS uses with increased load.

Regarding tests, as I wrote it should be done on private VMs and not on production nodes. Since we use testbed for other purposes it is also not a good candidate for tests. But you can change first file descriptors limits on testbed.

Best, Valentin.

On 0, Lina Jiménez notifications@github.com wrote:

Hello all,

Sorry for have not been clear, I need to make some clarifications regarding provided value for 'limit for opened files in each backed'[0], this is the default limit value for 'file to open by system calls' for each whole node.

And that default value is set to '1468454' as is shown below, this configuration is already set up on the nodes.
[lijimene@aiadm87 ~]$ broadcast -n "gp-prod" -c "cat /proc/sys/fs/file-max" -u "cmsweb"
vocms0136:
1468646
vocms0161:
1468454
vocms0163:
1468464
vocms0165:
1468454
[lijimene@aiadm87 ~]$ 
Also a recalling, DBS in hosted in general propose backends, this means that other services are also hosted there[1], as for example crabserver and Reqmgr2.

Yuyi @yuyiguo , Valentin @vkuznet, I'd like to know if at the moment you consider that we should increase the above value. We have an old email thread about this same Error, the email subject is: 'October release of DBS'.

Regarding the proposed tests, let me review this, I think that this should be run first on testbed, what do you think?

Best Regards, Lina.
[0]
[lijimene@aiadm87 ~]$ broadcast -n "gp-prod" -c "ulimit -a | grep open" -u "cmsweb"
vocms0136:
open files (-n) 1024
vocms0161:
open files (-n) 1024
vocms0163:
open files (-n) 1024
vocms0165:
open files (-n) 1024
[1] https://cms-http-group.web.cern.ch/cms-http-group/activity.html

-- You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub: https://github.com/dmwm/DBS/issues/595#issuecomment-465596514

yuyiguo commented 5 years ago

Thanks Lina for finding out more data. I also appreciate Valentin's explanation. Yes, please increase the number for _dbs first while we are preparing the test. I feel it is better to run the simulation test on cmsweb testbed because it is very close to prod. We may run the test between the releases while people are not testing against testbed.

h4d4 commented 5 years ago

Thanks @vkuznet , you rigth with the explanation.

For the increasing, I've set up the value as sugested, I've also used this reference[0]

This is the value before changes:

[lijimene@aiadm02 ~]$ broadcast -n "gp-prod" -c "sudo -H -u _dbs bashs -l -c 'ulimit -aHS' -s '/bin/bash' | grep open" -u "cmsweb"
vocms0136:
open files                      (-n) 1024
vocms0161:
open files                      (-n) 1024
vocms0163:
open files                      (-n) 1024
vocms0165:
open files                      (-n) 1024

And this is after changes:

[lijimene@aiadm02 ~]$ broadcast -n "gp-prod" -c "sudo -H -u _dbs bashs -l -c 'ulimit -aHS' -s '/bin/bash' | grep open" -u "cmsweb"
vocms0136:
open files                      (-n) 65535
vocms0161:
open files                      (-n) 65535
vocms0163:
open files                      (-n) 65535
vocms0165:
open files                      (-n) 65535

I've set up the value into file /etc/security/limits.conf as references sugested. I tested in ad dev node and the value is kept after rebooting.

En reference to the tests, @yuyiguo yes testbed provides a closer environment to production, for example, frontend and backend software stuff are in differentes nodes, such in production. While in dev, both frontend and backend are in the same node. Current testbed phase will be finish for 5th of March, we could use this without affecting current testing release.

Best Regars, Lina. [0] https://easyengine.io/tutorials/linux/increase-open-files-limit

h4d4 commented 5 years ago

@yuyiguo @vkuznet

Hello All, below a service's behaviour report regarding latest issue: 'DBS stuck again' reported on 'cms-service-webtools' The last restart that I did to the service was today(Monday 25th of February), as below:

around 11:47h GVA time in node vocms0136. there the amount of opened files was higher('12160' ) for DBSGlobalReader '24737', compared with the others nodes[1], but it was under the set up limit('65535')[2]. In [0] full output.
around 13:06h GVA time in the rest of the nodes in [1] the amount of opened files

My first thought is that the high amount of opened files on vomcs0136 was not related with this stuck. I'll coming back with more info from the logs.

Best Regards, Lina. [0]

[cmsweb@vocms0136 dbs]$ sudo -H -u _dbs bashs -l -c "/data/srv/current/config/dbs/manage status 'I did read documentation'"
DBSDef is RUNNING, PID 24729
DBSGlobalReader is RUNNING, PID 24737
DBSGlobalWriter is RUNNING, PID 24745
DBSPhys03Reader is RUNNING, PID 24753
DBSPhys03Writer is RUNNING, PID 24761
DBSMigrate is RUNNING, PID 24769
[cmsweb@vocms0136 dbs]$ pids=` sudo -H -u _dbs bashs -l -c "/data/srv/current/config/dbs/manage status 'I did read documentation'" | awk '{ORS=" "; print $5}'`
[cmsweb@vocms0136 dbs]$ for i in $pids; do echo PID $i '#'opened fd:; (sudo lsof -a -p $i | wc -l); done
PID 24729 #opened fd:
83
PID 24737 #opened fd:
12160
PID 24745 #opened fd:
86
PID 24753 #opened fd:
84
PID 24761 #opened fd:
84
PID 24769 #opened fd:
79
[cmsweb@vocms0136 dbs]$

[1]

[cmsweb@vocms0161 dbs]$ sudo -H -u _dbs bashs -l -c "/data/srv/current/config/dbs/manage status 'I did read documentation'"
DBSDef is RUNNING, PID 7172
DBSGlobalReader is RUNNING, PID 7180
DBSGlobalWriter is RUNNING, PID 7188
DBSPhys03Reader is RUNNING, PID 7196
DBSPhys03Writer is RUNNING, PID 7204
DBSMigrate is RUNNING, PID 7212
[cmsweb@vocms0161 dbs]$ pids=` sudo -H -u _dbs bashs -l -c "/data/srv/current/config/dbs/manage status 'I did read documentation'" | awk '{ORS=" "; print $5}'`
[cmsweb@vocms0161 dbs]$ for i in $pids; do echo PID $i '#'opened fd:; (sudo lsof -a -p $i | wc -l); done
PID 7172 #opened fd:
83
PID 7180 #opened fd:
11939
PID 7188 #opened fd:
85
PID 7196 #opened fd:
84
PID 7204 #opened fd:
85
PID 7212 #opened fd:
79
[cmsweb@vocms0161 dbs]$ 

[cmsweb@vocms0165 dbs]$ sudo -H -u _dbs bashs -l -c "/data/srv/current/config/dbs/manage status 'I did read documentation'"
DBSDef is RUNNING, PID 12715
DBSGlobalReader is RUNNING, PID 12723
DBSGlobalWriter is RUNNING, PID 12731
DBSPhys03Reader is RUNNING, PID 12739
DBSPhys03Writer is RUNNING, PID 12747
DBSMigrate is RUNNING, PID 12755
[cmsweb@vocms0165 dbs]$ pids=` sudo -H -u _dbs bashs -l -c "/data/srv/current/config/dbs/manage status 'I did read documentation'" | awk '{ORS=" "; print $5}'`
[cmsweb@vocms0165 dbs]$ for i in $pids; do echo PID $i '#'opened fd:; (sudo lsof -a -p $i | wc -l); done
PID 12715 #opened fd:
83
PID 12723 #opened fd:
11684
PID 12731 #opened fd:
85
PID 12739 #opened fd:
84
PID 12747 #opened fd:
86
PID 12755 #opened fd:
79
[cmsweb@vocms0165 dbs]$

[2]

[lijimene@aiadm90 ~]$ broadcast -n "gp-prod" -c "sudo -H -u _dbs bashs -l -c 'ulimit -aHS' -s '/bin/bash' | grep open" -u "cmsweb"
vocms0136:
open files                      (-n) 65535
vocms0161:
open files                      (-n) 65535
vocms0163:
open files                      (-n) 65535
vocms0165:
open files                      (-n) 65535
[lijimene@aiadm90 ~]$

vkuznet commented 5 years ago

Lina, that's great info. For the first time we see the following:

DBSGlobalReader has high number of open file descriptors, at the level of O(10K)
since you recently raised threshold for number of open fds from 1024 to 65535 the DBS instance didn't exceeded by the time you checked in. I bet if the threshold will be lower then we'll got service stuck because of this error

Yuyi, I think it is time for you to investigate why DBS global reader has that many open fds? what are they? connection to Oracle? You can look at fds by doing listing ls -al /proc//fd and correlate this info with lsof output For instance, here is an example of fds for some of my process

sudo ls -al /proc/20030/fd
total 0
dr-x------. 2 prometheus prometheus  0 Feb 21 22:20 .
dr-xr-xr-x. 9 prometheus prometheus  0 Feb 21 22:20 ..
lr-x------. 1 prometheus prometheus 64 Feb 21 22:20 0 -> /dev/null
lrwx------. 1 prometheus prometheus 64 Feb 21 22:20 1 -> socket:[56351]
lrwx------. 1 prometheus prometheus 64 Feb 21 22:20 10 -> socket:[56403]
lr-x------. 1 prometheus prometheus 64 Feb 21 22:20 11 -> /var/lib/prometheus/01D4FBQ5DD9D78378NBNW3TP41/chunks/000001
lr-x------. 1 prometheus prometheus 64 Feb 21 22:20 12 -> /var/lib/prometheus/01D4BG431ZHKD38KSTQZB0X3XB/index
...

It has socket fd's and fd's pointing to /var/lib area which correspond to its open files used by the process.

We need to start looking into what DBSGlobalReader is doing and understand why it has so many fds. Again I don't know what it should have and what is "large" for it.

I finally got prometheus code to monitor this basic info and I'll send another email with explanation.

Best, Valentin.

On 0, Lina Jiménez notifications@github.com wrote:

@yuyiguo @vkuznet

around 11:47h GVA time in node vocms0136. there the amount of opened files was higher('12160' ) for DBSGlobalReader '24737', compared with the others nodes[1], but it was under the set up limit('65535')[2]. In [0] full output.
around 13:06h GVA time in the rest of the nodes in [1] the amount of opened files

My first thought is that the high amount of opened files on vomcs0136 was not related with this stuck. I'll coming back with more info from the logs.

Best Regards, Lina. [0]

[cmsweb@vocms0136 dbs]$ sudo -H -u _dbs bashs -l -c "/data/srv/current/config/dbs/manage status 'I did read documentation'"
DBSDef is RUNNING, PID 24729
DBSGlobalReader is RUNNING, PID 24737
DBSGlobalWriter is RUNNING, PID 24745
DBSPhys03Reader is RUNNING, PID 24753
DBSPhys03Writer is RUNNING, PID 24761
DBSMigrate is RUNNING, PID 24769
[cmsweb@vocms0136 dbs]$ pids=` sudo -H -u _dbs bashs -l -c "/data/srv/current/config/dbs/manage status 'I did read documentation'" | awk '{ORS=" "; print $5}'`
[cmsweb@vocms0136 dbs]$ for i in $pids; do echo PID $i '#'opened fd:; (sudo lsof -a -p $i | wc -l); done
PID 24729 #opened fd:
83
PID 24737 #opened fd:
12160
PID 24745 #opened fd:
86
PID 24753 #opened fd:
84
PID 24761 #opened fd:
84
PID 24769 #opened fd:
79
[cmsweb@vocms0136 dbs]$

[1]

[cmsweb@vocms0161 dbs]$ sudo -H -u _dbs bashs -l -c "/data/srv/current/config/dbs/manage status 'I did read documentation'"
DBSDef is RUNNING, PID 7172
DBSGlobalReader is RUNNING, PID 7180
DBSGlobalWriter is RUNNING, PID 7188
DBSPhys03Reader is RUNNING, PID 7196
DBSPhys03Writer is RUNNING, PID 7204
DBSMigrate is RUNNING, PID 7212
[cmsweb@vocms0161 dbs]$ pids=` sudo -H -u _dbs bashs -l -c "/data/srv/current/config/dbs/manage status 'I did read documentation'" | awk '{ORS=" "; print $5}'`
[cmsweb@vocms0161 dbs]$ for i in $pids; do echo PID $i '#'opened fd:; (sudo lsof -a -p $i | wc -l); done
PID 7172 #opened fd:
83
PID 7180 #opened fd:
11939
PID 7188 #opened fd:
85
PID 7196 #opened fd:
84
PID 7204 #opened fd:
85
PID 7212 #opened fd:
79
[cmsweb@vocms0161 dbs]$ 

[cmsweb@vocms0165 dbs]$ sudo -H -u _dbs bashs -l -c "/data/srv/current/config/dbs/manage status 'I did read documentation'"
DBSDef is RUNNING, PID 12715
DBSGlobalReader is RUNNING, PID 12723
DBSGlobalWriter is RUNNING, PID 12731
DBSPhys03Reader is RUNNING, PID 12739
DBSPhys03Writer is RUNNING, PID 12747
DBSMigrate is RUNNING, PID 12755
[cmsweb@vocms0165 dbs]$ pids=` sudo -H -u _dbs bashs -l -c "/data/srv/current/config/dbs/manage status 'I did read documentation'" | awk '{ORS=" "; print $5}'`
[cmsweb@vocms0165 dbs]$ for i in $pids; do echo PID $i '#'opened fd:; (sudo lsof -a -p $i | wc -l); done
PID 12715 #opened fd:
83
PID 12723 #opened fd:
11684
PID 12731 #opened fd:
85
PID 12739 #opened fd:
84
PID 12747 #opened fd:
86
PID 12755 #opened fd:
79
[cmsweb@vocms0165 dbs]$

[2]

[lijimene@aiadm90 ~]$ broadcast -n "gp-prod" -c "sudo -H -u _dbs bashs -l -c 'ulimit -aHS' -s '/bin/bash' | grep open" -u "cmsweb"
vocms0136:
open files                      (-n) 65535
vocms0161:
open files                      (-n) 65535
vocms0163:
open files                      (-n) 65535
vocms0165:
open files                      (-n) 65535
[lijimene@aiadm90 ~]$

-- You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub: https://github.com/dmwm/DBS/issues/595#issuecomment-467027300

yuyiguo commented 5 years ago

@h4d4 Lina, Thank you so much for such detailed info. I have one question, When you were checking DBS while people reporting DBS stuck, You found the read server on 0136 was not responding? You did not see it handling queries in the log file, but the rest three readers were fine?

The 0136 reader had 121610 open files, the other two has 11939 and 11684 ( we missed 0163). To me 0136 did not have significant more opened files than the other two.

@vkuznet To answer you questions why the reader has much higher open files, it is because the readers are the most accessed. I will follow up with your suggestions. After that I will check if there is any leaking in DBS code. From the email I got, the earliest report of DBS stuck was 6:05 am CT Feb 25 .

Lina, Would you please give me access to cmsweb prod dbs backends so I can check the ope fds?

Thanks, Yuyi

h4d4 commented 5 years ago

@yuyiguo Yuyi,

would you please try to access any of the nodes vocms01[36,61,63,65], and let me know how it is going? 'yuyi' user is already into interactive users for those nodes[1].

Best Regards, Lina. [0] https://gitlab.cern.ch/ai/it-puppet-hostgroup-vocmsweb/blob/master/data/fqdns/vocms0165.cern.ch.yaml

yuyiguo commented 5 years ago

@h4d4 Lina, Just tried. I was able to access 0136/63/65, but not 61. Thanks, Yuyi

h4d4 commented 5 years ago

@yuyiguo I'm going to check 0161. Regarding this question "When you were checking DBS while people reporting DBS stuck, You found the read server on 0136 was not responding":

DBSGlobalReader was responding queries just before I restarted it, please see below logs, from bottom to top.

INFO:cherrypy.access:[25/Feb/2019:11:45:41] vocms0136.cern.ch 131.169.77.46 "GET /dbs/prod/global/DBSReader/files/?dataset=%2FNeutrino_E-10_gun%2FRunIISummer17PrePremix-MCv2_correctPU_94X_mc2017_realistic_v9-v1%2FGEN-SIM-DIGI-RAW&detail=True&validFileOnly=1 HTTP/1.1" 200 OK [data: 
- in 151500905 out 105780778 us ] [auth: OK "/C=DE/O=GermanGrid/OU=DESY/CN=David Brunner" "" ] [ref: "" "dasgoclient/v02.01.01" ]
INFO:cherrypy.access:[25/Feb/2019:11:45:42] vocms0136.cern.ch 2001:1458:d00:14::a6 "GET /dbs/prod/global/DBSReader/filesummaries/?dataset=%2FDoubleEG%2FRun2016B-03Feb2017_ver2-v2%2FMINIAOD&validFileOnly=1 HTTP/1.1" 200 OK [data: - in 108 out 126402 us ] [auth: OK "/DC=ch/DC=cern/OU
=Organic Units/OU=Users/CN=lijimene/CN=817301/CN=Lina Marcela Jimenez Becerra" "" ] [ref: "" "dasgoserver" ]
INFO:cherrypy.access:[25/Feb/2019:11:45:45] vocms0136.cern.ch 131.169.164.18 "GET /dbs/prod/global/DBSReader/files/?dataset=%2FNeutrino_E-10_gun%2FRunIISummer17PrePremix-MCv2_correctPU_94X_mc2017_realistic_v9-v1%2FGEN-SIM-DIGI-RAW&detail=True&validFileOnly=1 HTTP/1.1" 200 OK [data:
 - in 151500905 out 105549251 us ] [auth: OK "/C=DE/O=GermanGrid/OU=DESY/CN=David Brunner" "" ] [ref: "" "dasgoclient/v02.01.01" ]
INFO:cherrypy.access:[25/Feb/2019:11:45:47] vocms0136.cern.ch 131.169.84.103 "GET /dbs/prod/global/DBSReader/files/?dataset=%2FNeutrino_E-10_gun%2FRunIISummer17PrePremix-MCv2_correctPU_94X_mc2017_realistic_v9-v1%2FGEN-SIM-DIGI-RAW&detail=True&validFileOnly=1 HTTP/1.1" 200 OK [data:
 - in 151500905 out 108557093 us ] [auth: OK "/C=DE/O=GermanGrid/OU=DESY/CN=David Brunner" "" ] [ref: "" "dasgoclient/v02.01.01" ]
INFO:cherrypy.access:[25/Feb/2019:11:45:54] vocms0136.cern.ch 2001:1458:d00:14::a6 "GET /dbs/prod/global/DBSReader/datasets/?dataset=%2FMuonEG%2FRun2018D%2A%2F%2A&dataset_access_type=VALID&detail=True HTTP/1.1" 200 OK [data: - in 4896 out 72884 us ] [auth: OK "/DC=ch/DC=cern/OU=Org
anic Units/OU=Users/CN=lijimene/CN=817301/CN=Lina Marcela Jimenez Becerra" "" ] [ref: "" "dasgoserver" ]
INFO:cherrypy.access:[25/Feb/2019:11:45:56] vocms0136.cern.ch 2001:1458:d00:14::a6 "GET /dbs/prod/global/DBSReader/releaseversions/?dataset=%2FJetHT%2FRun2016B-07Aug17_ver2-v2%2FMINIAOD HTTP/1.1" 200 OK [data: - in 4 out 83410 us ] [auth: OK "/DC=ch/DC=cern/OU=Organic Units/OU=User
s/CN=lijimene/CN=817301/CN=Lina Marcela Jimenez Becerra" "" ] [ref: "" "dasgoserver" ]
INFO:cherrypy.access:[25/Feb/2019:11:46:34] vocms0136.cern.ch 131.169.77.62 "GET /dbs/prod/global/DBSReader/files/?dataset=%2FNeutrino_E-10_gun%2FRunIISummer17PrePremix-MCv2_correctPU_94X_mc2017_realistic_v9-v1%2FGEN-SIM-DIGI-RAW&detail=True&validFileOnly=1 HTTP/1.1" 200 OK [data: 
- in 151500905 out 116154577 us ] [auth: OK "/C=DE/O=GermanGrid/OU=DESY/CN=David Brunner" "" ] [ref: "" "dasgoclient/v02.01.01" ]
INFO:cherrypy.access:[25/Feb/2019:11:46:36] vocms0136.cern.ch 131.169.163.137 "GET /dbs/prod/global/DBSReader/files/?dataset=%2FNeutrino_E-10_gun%2FRunIISummer17PrePremix-MCv2_correctPU_94X_mc2017_realistic_v9-v1%2FGEN-SIM-DIGI-RAW&detail=True&validFileOnly=1 HTTP/1.1" 200 OK [data
: - in 151500905 out 122021117 us ] [auth: OK "/C=DE/O=GermanGrid/OU=DESY/CN=David Brunner" "" ] [ref: "" "dasgoclient/v02.01.01" ]
INFO:cherrypy.access:[25/Feb/2019:11:46:42] vocms0136.cern.ch 131.169.163.128 "GET /dbs/prod/global/DBSReader/files/?dataset=%2FNeutrino_E-10_gun%2FRunIISummer17PrePremix-MCv2_correctPU_94X_mc2017_realistic_v9-v1%2FGEN-SIM-DIGI-RAW&detail=True&validFileOnly=1 HTTP/1.1" 200 OK [data
: - in 151500905 out 112493682 us ] [auth: OK "/C=DE/O=GermanGrid/OU=DESY/CN=David Brunner" "" ] [ref: "" "dasgoclient/v02.01.01" ]
ERROR:cherrypy.error:Mon Feb 25 10:46:48 2019 dbsException-invalid-input2: '/mc-physval-120-DrellYan-LowLumi/FEVT/CMSSW_1_2_0-FEVT-1166809258' does not match regular expression ^/([a-zA-Z0-9]+[a-zA-Z0-9\-\._]*)/([a-zA-Z]+[a-zA-Z0-9_\-\.]*)/([A-Z_\-\.]+)$ key-value pair (dataset, /m
c-physval-120-DrellYan-LowLumi/FEVT/CMSSW_1_2_0-FEVT-1166809258) cannot pass input checking
INFO:cherrypy.access:[25/Feb/2019:11:46:48] vocms0136.cern.ch 2001:1458:d00:14::8b "GET /dbs/prod/global/DBSReader/datasets/?dataset=%2Fmc-physval-120-DrellYan-LowLumi%2FFEVT%2FCMSSW_1_2_0-FEVT-1166809258&dataset_access_type=%2A&detail=True HTTP/1.1" 400 Bad Request [data: - in 113
 out 13535 us ] [auth: OK "/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=lijimene/CN=817301/CN=Lina Marcela Jimenez Becerra" "" ] [ref: "" "dasgoserver" ]
INFO:cherrypy.access:[25/Feb/2019:11:47:07] vocms0136.cern.ch 131.169.84.186 "GET /dbs/prod/global/DBSReader/files/?dataset=%2FNeutrino_E-10_gun%2FRunIISummer17PrePremix-MCv2_correctPU_94X_mc2017_realistic_v9-v1%2FGEN-SIM-DIGI-RAW&detail=True&validFileOnly=1 HTTP/1.1" 200 OK [data:
 - in 151500905 out 105578002 us ] [auth: OK "/C=DE/O=GermanGrid/OU=DESY/CN=David Brunner" "" ] [ref: "" "dasgoclient/v02.01.01" ]
INFO:cherrypy.access:[25/Feb/2019:11:47:09] vocms0136.cern.ch 131.169.164.34 "GET /dbs/prod/global/DBSReader/files/?dataset=%2FNeutrino_E-10_gun%2FRunIISummer17PrePremix-MCv2_correctPU_94X_mc2017_realistic_v9-v1%2FGEN-SIM-DIGI-RAW&detail=True&validFileOnly=1 HTTP/1.1" 200 OK [data:
 - in 151500905 out 101181541 us ] [auth: OK "/C=DE/O=GermanGrid/OU=DESY/CN=David Brunner" "" ] [ref: "" "dasgoclient/v02.01.01" ]
INFO:cherrypy.access:[25/Feb/2019:11:47:12] vocms0136.cern.ch 131.169.163.128 "GET /dbs/prod/global/DBSReader/files/?dataset=%2FNeutrino_E-10_gun%2FRunIISummer17PrePremix-MCv2_correctPU_94X_mc2017_realistic_v9-v1%2FGEN-SIM-DIGI-RAW&detail=True&validFileOnly=1 HTTP/1.1" 200 OK [data
: - in 151500905 out 98030344 us ] [auth: OK "/C=DE/O=GermanGrid/OU=DESY/CN=David Brunner" "" ] [ref: "" "dasgoclient/v02.01.01" ]
INFO:cherrypy.access:[25/Feb/2019:11:47:16] vocms0136.cern.ch 131.169.77.17 "GET /dbs/prod/global/DBSReader/files/?dataset=%2FNeutrino_E-10_gun%2FRunIISummer17PrePremix-MCv2_correctPU_94X_mc2017_realistic_v9-v1%2FGEN-SIM-DIGI-RAW&detail=True&validFileOnly=1 HTTP/1.1" 200 OK [data: 
- in 151500905 out 103407271 us ] [auth: OK "/C=DE/O=GermanGrid/OU=DESY/CN=David Brunner" "" ] [ref: "" "dasgoclient/v02.01.01" ]
INFO:cherrypy.access:[25/Feb/2019:11:47:18] vocms0136.cern.ch 131.169.77.226 "GET /dbs/prod/global/DBSReader/files/?dataset=%2FNeutrino_E-10_gun%2FRunIISummer17PrePremix-MCv2_correctPU_94X_mc2017_realistic_v9-v1%2FGEN-SIM-DIGI-RAW&detail=True&validFileOnly=1 HTTP/1.1" 200 OK [data:
 - in 151500905 out 101622105 us ] [auth: OK "/C=DE/O=GermanGrid/OU=DESY/CN=David Brunner" "" ] [ref: "" "dasgoclient/v02.01.01" ]
ERROR:cherrypy.error:Mon Feb 25 10:47:21 2019 dbsException-invalid-input2: '/mc-physval-120-DrellYan-LowLumi/FEVT/CMSSW_1_2_0-FEVT-1166809258' does not match regular expression ^/([a-zA-Z0-9]+[a-zA-Z0-9\-\._]*)/([a-zA-Z]+[a-zA-Z0-9_\-\.]*)/([A-Z_\-\.]+)$ key-value pair (dataset, /m
c-physval-120-DrellYan-LowLumi/FEVT/CMSSW_1_2_0-FEVT-1166809258) cannot pass input checking
INFO:cherrypy.access:[25/Feb/2019:11:47:21] vocms0136.cern.ch 2001:1458:d00:14::8b "POST /dbs/prod/global/DBSReader/datasetlist HTTP/1.1" 400 Bad Request [data: - in 114 out 25655 us ] [auth: OK "/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=lijimene/CN=817301/CN=Lina Marcela Jimenez
 Becerra" "" ] [ref: "" "dasgoserver" ]
INFO:cherrypy.access:[25/Feb/2019:11:47:29] vocms0136.cern.ch 131.169.84.145 "GET /dbs/prod/global/DBSReader/files/?dataset=%2FNeutrino_E-10_gun%2FRunIISummer17PrePremix-MCv2_correctPU_94X_mc2017_realistic_v9-v1%2FGEN-SIM-DIGI-RAW&detail=True&validFileOnly=1 HTTP/1.1" 200 OK [data:
 - in 151500905 out 106075925 us ] [auth: OK "/C=DE/O=GermanGrid/OU=DESY/CN=David Brunner" "" ] [ref: "" "dasgoclient/v02.01.01" ]
INFO:cherrypy.access:[25/Feb/2019:11:47:32] vocms0136.cern.ch 131.169.77.6 "GET /dbs/prod/global/DBSReader/files/?dataset=%2FNeutrino_E-10_gun%2FRunIISummer17PrePremix-MCv2_correctPU_94X_mc2017_realistic_v9-v1%2FGEN-SIM-DIGI-RAW&detail=True&validFileOnly=1 HTTP/1.1" 200 OK [data: -
 in 151500905 out 108741553 us ] [auth: OK "/C=DE/O=GermanGrid/OU=DESY/CN=David Brunner" "" ] [ref: "" "dasgoclient/v02.01.01" ]
INFO:cherrypy.access:[25/Feb/2019:11:47:33] vocms0136.cern.ch 131.169.77.6 "GET /dbs/prod/global/DBSReader/files/?dataset=%2FNeutrino_E-10_gun%2FRunIISummer17PrePremix-MCv2_correctPU_94X_mc2017_realistic_v9-v1%2FGEN-SIM-DIGI-RAW&detail=True&validFileOnly=1 HTTP/1.1" 200 OK [data: -
 in 151500905 out 107828326 us ] [auth: OK "/C=DE/O=GermanGrid/OU=DESY/CN=David Brunner" "" ] [ref: "" "dasgoclient/v02.01.01" ]
INFO:cherrypy.access:[25/Feb/2019:11:47:36] vocms0136.cern.ch 131.169.77.215 "GET /dbs/prod/global/DBSReader/files/?dataset=%2FNeutrino_E-10_gun%2FRunIISummer17PrePremix-MCv2_correctPU_94X_mc2017_realistic_v9-v1%2FGEN-SIM-DIGI-RAW&detail=True&validFileOnly=1 HTTP/1.1" 200 OK [data:
 - in 151500905 out 109157985 us ] [auth: OK "/C=DE/O=GermanGrid/OU=DESY/CN=David Brunner" "" ] [ref: "" "dasgoclient/v02.01.01" ]
INFO:cherrypy.access:[25/Feb/2019:11:47:38] vocms0136.cern.ch 131.169.77.76 "GET /dbs/prod/global/DBSReader/files/?dataset=%2FNeutrino_E-10_gun%2FRunIISummer17PrePremix-MCv2_correctPU_94X_mc2017_realistic_v9-v1%2FGEN-SIM-DIGI-RAW&detail=True&validFileOnly=1 HTTP/1.1" 200 OK [data: 
- in 151500905 out 108701256 us ] [auth: OK "/C=DE/O=GermanGrid/OU=DESY/CN=David Brunner" "" ] [ref: "" "dasgoclient/v02.01.01" ]
INFO:cherrypy.access:[25/Feb/2019:11:47:46] vocms0136.cern.ch 131.169.84.254 "GET /dbs/prod/global/DBSReader/files/?dataset=%2FNeutrino_E-10_gun%2FRunIISummer17PrePremix-MCv2_correctPU_94X_mc2017_realistic_v9-v1%2FGEN-SIM-DIGI-RAW&detail=True&validFileOnly=1 HTTP/1.1" 200 OK [data:
 - in 151500905 out 107245343 us ] [auth: OK "/C=DE/O=GermanGrid/OU=DESY/CN=David Brunner" "" ] [ref: "" "dasgoclient/v02.01.01" ]
INFO:cherrypy.access:[25/Feb/2019:11:47:47] vocms0136.cern.ch 2001:1458:d00:14::a6 "GET /dbs/prod/global/DBSReader/releaseversions/?dataset=%2FJetHT%2FRun2016B-07Aug17_ver2-v2%2FMINIAOD HTTP/1.1" 200 OK [data: - in 4 out 32045 us ] [auth: OK "/DC=ch/DC=cern/OU=Organic Units/OU=User
s/CN=lijimene/CN=817301/CN=Lina Marcela Jimenez Becerra" "" ] [ref: "" "dasgoserver" ]
INFO:cherrypy.access:[25/Feb/2019:11:47:48] vocms0136.cern.ch 131.169.163.112 "GET /dbs/prod/global/DBSReader/files/?dataset=%2FNeutrino_E-10_gun%2FRunIISummer17PrePremix-MCv2_correctPU_94X_mc2017_realistic_v9-v1%2FGEN-SIM-DIGI-RAW&detail=True&validFileOnly=1 HTTP/1.1" 200 OK [data
: - in 151500905 out 109296392 us ] [auth: OK "/C=DE/O=GermanGrid/OU=DESY/CN=David Brunner" "" ] [ref: "" "dasgoclient/v02.01.01" ]
INFO:cherrypy.access:[25/Feb/2019:11:47:49] vocms0136.cern.ch 2001:1458:d00:14::a6 "GET /dbs/prod/global/DBSReader/files/?dataset=%2FHighMultiplicity%2FRun2015D-v1%2FRECO&detail=True&validFileOnly=1 HTTP/1.1" 200 OK [data: - in 2 out 73047 us ] [auth: OK "/DC=ch/DC=cern/OU=Organic 
Units/OU=Users/CN=lijimene/CN=817301/CN=Lina Marcela Jimenez Becerra" "" ] [ref: "" "dasgoserver" ]
INFO:cherrypy.access:[25/Feb/2019:11:47:57] vocms0136.cern.ch 2001:1458:d00:14::a6 "GET /dbs/prod/global/DBSReader/releaseversions/?dataset=%2FJetHT%2FRun2016B-07Aug17_ver2-v2%2FMINIAOD HTTP/1.1" 200 OK [data: - in 4 out 67963 us ] [auth: OK "/DC=ch/DC=cern/OU=Organic Units/OU=User
s/CN=lijimene/CN=817301/CN=Lina Marcela Jimenez Becerra" "" ] [ref: "" "dasgoserver" ]
INFO:cherrypy.access:[25/Feb/2019:11:48:00] vocms0136.cern.ch 2001:1458:d00:14::8b "GET /dbs/prod/global/DBSReader/datasets/?dataset=%2FMinBias_Hydjet_Drum5F_2018_5p02TeV%2FHINPbPbAutumn18GS-103X_upgrade2018_realistic_HI_v11-v1%2FGEN-SIM&detail=True HTTP/1.1" 200 OK [data: - in 669
 out 63482 us ] [auth: OK "/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=lijimene/CN=817301/CN=Lina Marcela Jimenez Becerra" "" ] [ref: "" "dasgoserver" ]
INFO:cherrypy.access:[25/Feb/2019:11:48:03] vocms0136.cern.ch 2001:1458:d00:14::a6 "GET /dbs/prod/global/DBSReader/datasets/?dataset=%2FVBFHToTauTau%2A%2F%2A18%2A%2FMINIAODSIM&dataset_access_type=VALID&detail=True HTTP/1.1" 200 OK [data: - in 7183 out 272482 us ] [auth: OK "/DC=ch/
DC=cern/OU=Organic Units/OU=Users/CN=lijimene/CN=817301/CN=Lina Marcela Jimenez Becerra" "" ] [ref: "" "dasgoserver" ]
INFO:cherrypy.access:[25/Feb/2019:11:48:40] vocms0136.cern.ch 131.169.77.129 "GET /dbs/prod/global/DBSReader/files/?dataset=%2FNeutrino_E-10_gun%2FRunIISummer17PrePremix-MCv2_correctPU_94X_mc2017_realistic_v9-v1%2FGEN-SIM-DIGI-RAW&detail=True&validFileOnly=1 HTTP/1.1" 200 OK [data:
 - in 151500905 out 114096627 us ] [auth: OK "/C=DE/O=GermanGrid/OU=DESY/CN=David Brunner" "" ] [ref: "" "dasgoclient/v02.01.01" ]
ERROR:root:Missing componentDir and General section in config
ERROR:root:Going to trust you to know what you're doing.
CherryPy Checker:
The Application mounted at u'/dbs/prod/global/DBSReader' has an empty config.

h4d4 commented 5 years ago

@yuyiguo Yuyi,

now username 'yuyi' should have accesses to vocms0161. Let me know in case that you need root privileges.

Best Regards, Lina.

belforte commented 5 years ago

One CRAB user reported a problem [1] with crabcache service (which does not talk to DBS at all) at same time as this was ongoing, I could reproduce it (repeatedly) later on both on cmsweb and cmsweb-testbed. [2] The problem present itself as an odd error inside cherrypy leading to HTTP 500. The problem is gone today that also DBS work. I did not investigate in great details, only thing I can say is that simple queries to crabcache worked, while the failing ones all required a largish (tens of MB's) file to be uploaded to the cache, note that we will never upload anything larger than 100MB. I can not say it this is accidental correlation, if there was a common cause, or if simply DBS problem was creating so many long-running cmsweb FE/BE connections to make FE->crabcache slow down to the point of triggering a poorly handled error. I post this information in case it may help to hunt for the problem.

[1] https://hypernews.cern.ch/HyperNews/CMS/get/computing-tools/4624.html https://hypernews.cern.ch/HyperNews/CMS/get/computing-tools/4629/2/1.html

[2] from log on vocms055:

belforte@vocms055/~> grep  -A13 belfor /build/srv-logs/vocms0741/crabcache/crabcache-20190224.log
[24/Feb/2019:22:03:48] vocms0741.cern.ch 131.225.189.232 "PUT /crabcache/file HTTP/1.1" 200 OK [data: 41905410 in 115 out 922477150 us ] [auth: OK "/DC=org/DC=terena/DC=tcs/C=IT/O=Istituto Nazionale di Fisica Nucleare/CN=Stefano Belforte belforte@infn.it" "" ] [ref: "" "PycURL/7.43.0.1 libcurl/7.59.0 OpenSSL/1.0.2d zlib/1.2.8" ]
[24/Feb/2019:22:03:48]  Traceback (most recent call last):
  File "/data/srv/beHG1902e/sw/slc7_amd64_gcc630/external/cherrypy/5.4.0-comp2/lib/python2.7/site-packages/cherrypy/_cpwsgi.py", line 180, in trap
    return func(*args, **kwargs)
  File "/data/srv/beHG1902e/sw/slc7_amd64_gcc630/external/cherrypy/5.4.0-comp2/lib/python2.7/site-packages/cherrypy/_cpwsgi.py", line 99, in __call__
    return self.nextapp(environ, start_response)
  File "/data/srv/beHG1902e/sw/slc7_amd64_gcc630/external/cherrypy/5.4.0-comp2/lib/python2.7/site-packages/cherrypy/_cpwsgi.py", line 411, in tail
    return self.response_class(environ, start_response, self.cpapp)
  File "/data/srv/beHG1902e/sw/slc7_amd64_gcc630/external/cherrypy/5.4.0-comp2/lib/python2.7/site-packages/cherrypy/_cpwsgi.py", line 266, in __init__
    self.close()
  File "/data/srv/beHG1902e/sw/slc7_amd64_gcc630/external/cherrypy/5.4.0-comp2/lib/python2.7/site-packages/cherrypy/_cpwsgi.py", line 288, in close
    if streaming and is_closable_iterator(self.iter_response):
AttributeError: 'AppResponse' object has no attribute 'iter_response'

the client log for the failed command shows that it was executing e.g. (client ran on a different time zone, and I have multiple such messages, ignore the time stamp)

DEBUG 2019-02-24 14:48:20,126:   Uploading archive /uscms_data/d2/belforte/crab_20190224_144747/inputs/3710c06d-8d32-4e33-a3b3-03d57d868681default.tgz (39 MB) to the CRAB cache. Using URI https://cmsweb.cern.ch/crabcache
INFO 2019-02-24 15:03:48,795:    The server answered with an error.
DEBUG 2019-02-24 15:03:48,795:   ERROR: Internal Server Error (500):  'Unrecoverable error in the server.
Traceback (most recent call last):
  File "/data/srv/beHG1902e/sw/slc7_amd64_gcc630/external/cherrypy/5.4.0-comp2/lib/python2.7/site-packages/cherrypy/_cpwsgi.py", line 180, in trap
    return func(*args, **kwargs)
  File "/data/srv/beHG1902e/sw/slc7_amd64_gcc630/external/cherrypy/5.4.0-comp2/lib/python2.7/site-packages/cherrypy/_cpwsgi.py", line 99, in __call__
    return self.nextapp(environ, start_response)
  File "/data/srv/beHG1902e/sw/slc7_amd64_gcc630/external/cherrypy/5.4.0-comp2/lib/python2.7/site-packages/cherrypy/_cpwsgi.py", line 411, in tail
    return self.response_class(environ, start_response, self.cpapp)
  File "/data/srv/beHG1902e/sw/slc7_amd64_gcc630/external/cherrypy/5.4.0-comp2/lib/python2.7/site-packages/cherrypy/_cpwsgi.py", line 266, in __init__
    self.close()
  File "/data/srv/beHG1902e/sw/slc7_amd64_gcc630/external/cherrypy/5.4.0-comp2/lib/python2.7/site-packages/cherrypy/_cpwsgi.py", line 288, in close
    if streaming and is_closable_iterator(self.iter_response):
AttributeError: 'AppResponse' object has no attribute 'iter_response'

DEBUG 2019-02-24 15:03:48,796:   Command failed with URI: https://cmsweb.cern.ch:8443/crabcache/file
DEBUG 2019-02-24 15:03:48,796:        Input data: [('inputfile', (10, '/uscms_data/d2/belforte/crab_20190224_144747/inputs/3710c06d-8d32-4e33-a3b3-03d57d868681default.tgz')), ('hashkey', '67a942612289077c72134faf4b88f5ca03fcc1fbc8f6b5b39c6eff423d939e64'), ('newchecksum', '2')]
DEBUG 2019-02-24 15:03:48,796:        Request headers: {'Content-Length': '1113', 'Set-Cookie': 'cms-auth=a11ac7c1bc76ff625d4c1219254a94a3a26c81b21fc0133faa67785f76ea2985dbbf0e83188563db;path=/;secure;httponly;expires=Thu, 01-Jan-1970 00:00:01 GMT', 'Server': '0.0.0.0', 'Connection': 'close', 'Date': 'Sun, 24 Feb 2019 21:03:48 GMT', 'Content-Type': 'text/plain', 'CMS-Server-Time': 'D=922820160 t=1551041305918821'}

belforte commented 5 years ago

@h4d4 do I understand correctly that in the end the problem was "solved" by simply restarting all of DBS BackEnds sort of "yet another time" even if you did that already during the weekend ?

h4d4 commented 5 years ago

@belforte Stefano, yes it was exactly what happened.

belforte commented 5 years ago

Thanks Lina.

I have no idea if it will help, or making things work, but do you and Valentin want to follow up on the idea to enable others to issue that restart, in case it is all that's needed ? Tommaso asked whether CRC's could/should have the needed rights, but when you restarted and it did not work, I did not pursue this. In any case, even if "us past CRC's like Tom and myself" had the rights, I am not sure what the right way to do is nor would tamper with CMSWEB unless instructed to do so.

yuyiguo commented 5 years ago

Yes, Lina I can access 0161 now.

yuyiguo commented 5 years ago

@h4d4 @belforte Lina, Stefano, I am a bit confused about what you two discussed. Lina showed that before she restarted DBS servers, DBS reads already handled requests. Stefano found Crab could not connect to DBS at all and crab started to connect to DBS only after the second DBS restart this morning. What I missed here?

belforte commented 5 years ago

I am similarly confused. And I still encourage not to consider this a CRAB problem. I repeatedly tried a simple dasgoclient command and it was hanging. At least this was the case Sunday. I did not check between the two restarts (one server only first time, all of them second time, IIUC) this morning since it was night for me. I used dasgoclient since I hope it is easier for everybody else to do the same and verify if DBS is answering or not on CMSWEB w/o asking CRAB people, I can surely use direct DBS Client as well...

h4d4 commented 5 years ago

@yuyiguo @belforte
What I meant to say, is that yes, the Issue reported by users about failed connections to DBS was solved after today restarted. And the point is that the service was handled requests.

vkuznet commented 5 years ago

Stefano, if I recall correctly the CMS operator on duty can restart cmsweb data-services. Did something changed? I think what's is missing is proper monitoring and alerting that particular data-service is not responding. The way we monitor cmsweb services is via ServerMonitor daemon which runs on every node and perform an action based on service monitoring script. For DBS the service monitoring script is based on context of this file: https://github.com/dmwm/deployment/blob/master/dbs/monitoring.ini and what it only pings DBS server at this URL http://localhost:8250/dbs/

My bet that many issues with DBS service may not be caught via this ping command, e.g. the DBS HTTP server may be fine, but underneath it can struggle with particular requests. I think the issue here is to define proper command against DBS server to check the health of the system. I think it is better to ping particular DBS query, e.g. http://localhost:8250/dbs/prod/global/DBSReader/datasets?dataset=/a/b/c (with valid dataset name) which will cause DBS server to perform the query. If such command failed or stalled the ServerMonitor daemon will notice and issue proper alert.

Best, Valentin.

On 0, Stefano Belforte notifications@github.com wrote:

Thanks Lina.

I have no idea if it will help, or making things work, but do you and Valentin want to follow up on the idea to enable others to issue that restart, in case it is all that's needed ? Tommaso asked whether CRC's could/should have the needed rights, but when you restarted and it did not work, I did not pursue this. In any case, even if "us past CRC's like Tom and myself" had the rights, I am not sure what the right way to do is nor would tamper with CMSWEB unless instructed to do so.

-- You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub: https://github.com/dmwm/DBS/issues/595#issuecomment-467144509

h4d4 commented 5 years ago

@yuyiguo Another comment, sorry if this confused more, but is just to know. An user reported[0] this type of error

And looking into the logs this is treated as below, that is an client side error HTTP 400: ERROR:cherrypy.error:Mon Feb 25 12:12:24 2019 dbsException-invalid-input2: '/MET' does not match regular expression ^/([a-zA-Z0-9]+[a-zA-Z0-9\-\._]*)/([a-zA-Z]+[a-zA-Z0-9_\-\.]*)/([A-Z_\-\.]+)$ key-value pair (dataset, /MET) cannot pass input checking

Best Regards, Lina. [0] https://cern.service-now.com/nav_to.do?uri=%2Fincident.do%3Fsys_id%3D22eb0571dbfbaf40ae659ec4db96196a%26sysparm_stack%3Dincident_list.do%3Fsysparm_query%3Dactive%3Dtrue [1]

match regular expression ^/([a-zA-Z0-9]+[a-zA-Z0-9\-\._]*)/([a-zA-Z]+[a-zA-Z0-9_\-\.]*)/([A-Z_\-\.]+)$ key-value pair (dataset, /GluGluHToZZTo4L_M125_14TeV_powheg2_JHUgenV702_pythia8/PhaseIIFall16DR82-NoPU_90X_upgrade2023_realistic_v1-v1/GEN-SIM-RECO#80674502-e1b6-11e6-933c-002590494c06) cannot pass input checking
./DBSGlobalReader-20190225.log:8953:ERROR:cherrypy.error:Mon Feb 25 11:38:23 2019 dbsException-invalid-input2: '/Tau/*/*/*' does not match regular expression ^/(\*|[a-zA-Z\*][a-zA-Z0-9_\*\-]{0,100})(/(\*|[a-zA-Z0-9_\.\-\*]{1,199})){0,1}(/(\*|[A-Z\-\*]{1,50})){0,1}$ key-value pair (dataset, /Tau/*/*/*) cannot pass input checking
./DBSGlobalReader-20190225.log:9367:ERROR:cherrypy.error:Mon Feb 25 11:57:54 2019 dbsException-invalid-input2: '/MET' does not match regular expression ^/([a-zA-Z0-9]+[a-zA-Z0-9\-\._]*)/([a-zA-Z]+[a-zA-Z0-9_\-\.]*)/([A-Z_\-\.]+)$ key-value pair (dataset, /MET) cannot pass input checking
./DBSGlobalReader-20190225.log:11856:ERROR:cherrypy.error:Mon Feb 25 12:12:24 2019 dbsException-invalid-input2: '/MET' does not match regular expression ^/([a-zA-Z0-9]+[a-zA-Z0-9\-\._]*)/([a-zA-Z]+[a-zA-Z0-9_\-\.]*)/([A-Z_\-\.]+)$ key-value pair (dataset, /MET) cannot pass input checking

belforte commented 5 years ago

@h4d4 this is fine. The user did a mistake and the server is telling him/her so. i.e. working as designed.

belforte commented 5 years ago

I clsoed that SNOW ticket. CMS users should report problem to our HN, not make noise with multiple unappropriate report channels. Actually they should read our documentation, so they will know how to use services and how to report problems. Lina, please do not waste your time on those things.

belforte commented 5 years ago

@vkuznet about restart by operator I suspect that you are confusing CERN IT operators with CMS people. Indeed CERN IT on duty operators receives Lemon alarms (like swap full, or the process not running that you indicate) and in cases like DBS where CMS provided proper information, can issue some simple commands to restart things. As you say, when processes are running, yet not doing what we like, it is for CMS people to make a call on what to do. Currently only Lina acts in that situation, which is putting her on a sort of 24x7 on-call duty which is really bad for her, and insufficient for us. The question was whether people like me, Tommaso, Dyma, Stephan,.... can help. We have O(10) experienced people in this CMS Computing Run Coordinator list which maybe can also restart the service (but I am not sure if we have the proper authorizations nor the instructions). Most of those people have no understanding of how things work (think of me e.g.) but can be somehow trusted not to do things foolishly and at least to document and report what they do, if they ever do. SImply a way to help cover during off hours.

Hope I made myself more clear, sorry for hijacking the Issue.

vkuznet commented 5 years ago

I think you partially answered your question. If we grant CERN IT operators permissions to restart our service I don't see any reason why we can't grant such rights to CMS experts. The only question is a policy which should define

who are considered to be CMS expert
how rights should be handled (issued and revoked) and by whom
which privilege to give to such people

I think it should be management decision and not cmsweb one. The rights comes with responsibilities and if decided to issue those rights at least L1/L2's should agree on that.

Regarding the instructions, we do have common command for ALL cmsweb services, i.e. our procedure is standardized. The only command anyone who deal with cmsweb needs to know is how to use manage scripts. I think we provide this instructions to CERN IT and the same instructions can be used by CMS experts. Lina knows where they are.

Best, Valentin.

On 0, Stefano Belforte notifications@github.com wrote:

@vkuznet about restart by operator I suspect that you are confusing CERN IT operators with CMS people. Indeed CERN IT on duty operators receives Lemon alarms (like swap full, or the process not running that you indicate) and in cases like DBS where CMS provided proper information, can issue some simple commands to restart things. As you say, when processes are running, yet not doing what we like, it is for CMS people to make a call on what to do. Currently only Lina acts in that situation, which is putting her on a sort of 24x7 on-call duty which is really bad for her, and insufficient for us. The question was whether people like me, Tommaso, Dyma, Stephan,.... can help. We have O(10) experienced people in this CMS Computing Run Coordinator list which maybe can also restart the service (but I am not sure if we have the proper authorizations nor the instructions). Most of those people have no understanding of how things work (think of me e.g.) but can be somehow trusted not to do things foolishly and at least to document and report what they do, if they ever do. SImply a way to help cover during off hours.

Hope I made myself more clear, sorry for hijacking the Issue.

-- You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub: https://github.com/dmwm/DBS/issues/595#issuecomment-467178246

belforte commented 5 years ago

@h4d4 about "was it running or was it not running when restarted" seems to me that the sequence of events is:

DBS hangs
you get mail on saturday and restart it (I do not know if all servers, or one)
it still does not work
on monday around 9am CERN time you restart vocms0136
on monday around 12 (13?) CERB time you restart all DBS backends
you say that logs shows that at 5. DBS was answering queries (did I get it right ?)

so the question is whether 4. solved and you did 5. because someone was still, but wrongly, complaining. Or whether something else happened between 4. and 5. which made DBS (and crabcache !) work again. Or whether it simply needed some time. Or whether me or Yuyi did not understood correctly. Or whatever else.. but 6. points to "5. was not needed".

I was sleeping between 4. and 5. ! But can add these data points now from production CRAB TaskWorker logs : (note this is the CRAB thing which runs outside CMSWEB and queries it)

as of current time: Mon Feb 25 22:17:53 CET 2019
Last HTTP Error 502 from DBS : 2019-02-25 13:36:35,857
First successful DBS call of the day: 2019-02-25 05:42:59,300

Indeed the picture which I see now is that DBS was answering to just a tiny fraction of queries, until it started to work again. I do not know if it is important that I spend time trying to get error/success fraction from log sort of for every hour, but the following which lists the first 50 successful calls is quite telling:

belforte@crab-prod-tw01/logs> grep DBSDataDiscovery,267:Got twlog.txt|head -50
2019-02-25 05:42:59,300:DEBUG:DBSDataDiscovery,267:Got 92 files
2019-02-25 09:20:31,102:DEBUG:DBSDataDiscovery,267:Got 300 files
2019-02-25 10:58:45,801:DEBUG:DBSDataDiscovery,267:Got 10 files
2019-02-25 12:33:31,317:DEBUG:DBSDataDiscovery,267:Got 6 files
2019-02-25 12:38:30,797:DEBUG:DBSDataDiscovery,267:Got 6 files
2019-02-25 12:44:04,612:DEBUG:DBSDataDiscovery,267:Got 6 files
2019-02-25 12:46:00,414:DEBUG:DBSDataDiscovery,267:Got 6 files
2019-02-25 12:48:56,034:DEBUG:DBSDataDiscovery,267:Got 6 files
2019-02-25 12:53:20,633:DEBUG:DBSDataDiscovery,267:Got 1023 files
2019-02-25 13:07:30,010:DEBUG:DBSDataDiscovery,267:Got 4 files
2019-02-25 13:07:30,465:DEBUG:DBSDataDiscovery,267:Got 3 files
2019-02-25 13:07:30,868:DEBUG:DBSDataDiscovery,267:Got 1 files
2019-02-25 13:07:48,899:DEBUG:DBSDataDiscovery,267:Got 5 files
2019-02-25 13:07:53,694:DEBUG:DBSDataDiscovery,267:Got 46 files
2019-02-25 13:07:55,980:DEBUG:DBSDataDiscovery,267:Got 46 files
2019-02-25 13:07:56,264:DEBUG:DBSDataDiscovery,267:Got 46 files
2019-02-25 13:08:00,308:DEBUG:DBSDataDiscovery,267:Got 46 files
2019-02-25 13:08:10,032:DEBUG:DBSDataDiscovery,267:Got 46 files
2019-02-25 13:08:13,149:DEBUG:DBSDataDiscovery,267:Got 46 files
2019-02-25 13:08:14,054:DEBUG:DBSDataDiscovery,267:Got 46 files
2019-02-25 13:08:16,519:DEBUG:DBSDataDiscovery,267:Got 3 files
2019-02-25 13:08:16,519:DEBUG:DBSDataDiscovery,267:Got 3 files
2019-02-25 13:08:20,640:DEBUG:DBSDataDiscovery,267:Got 4 files
2019-02-25 13:08:21,999:DEBUG:DBSDataDiscovery,267:Got 46 files
2019-02-25 13:08:25,921:DEBUG:DBSDataDiscovery,267:Got 46 files
2019-02-25 13:08:51,976:DEBUG:DBSDataDiscovery,267:Got 2522 files
2019-02-25 13:08:52,081:DEBUG:DBSDataDiscovery,267:Got 46 files
2019-02-25 13:08:52,130:DEBUG:DBSDataDiscovery,267:Got 46 files
2019-02-25 13:08:52,156:DEBUG:DBSDataDiscovery,267:Got 46 files
2019-02-25 13:08:55,694:DEBUG:DBSDataDiscovery,267:Got 41 files
2019-02-25 13:09:01,663:DEBUG:DBSDataDiscovery,267:Got 46 files
2019-02-25 13:09:01,663:DEBUG:DBSDataDiscovery,267:Got 46 files
2019-02-25 13:09:05,099:DEBUG:DBSDataDiscovery,267:Got 46 files
2019-02-25 13:09:06,112:DEBUG:DBSDataDiscovery,267:Got 46 files
2019-02-25 13:09:06,908:DEBUG:DBSDataDiscovery,267:Got 46 files
2019-02-25 13:09:09,069:DEBUG:DBSDataDiscovery,267:Got 2530 files
2019-02-25 13:09:10,067:DEBUG:DBSDataDiscovery,267:Got 46 files
2019-02-25 13:09:18,327:DEBUG:DBSDataDiscovery,267:Got 46 files
2019-02-25 13:09:22,252:DEBUG:DBSDataDiscovery,267:Got 46 files
2019-02-25 13:09:22,462:DEBUG:DBSDataDiscovery,267:Got 2527 files
2019-02-25 13:09:44,935:DEBUG:DBSDataDiscovery,267:Got 46 files
2019-02-25 13:09:47,016:DEBUG:DBSDataDiscovery,267:Got 4 files
2019-02-25 13:09:47,041:DEBUG:DBSDataDiscovery,267:Got 6 files
2019-02-25 13:10:28,604:DEBUG:DBSDataDiscovery,267:Got 7 files
2019-02-25 13:10:29,091:DEBUG:DBSDataDiscovery,267:Got 46 files
2019-02-25 13:10:29,147:DEBUG:DBSDataDiscovery,267:Got 46 files
2019-02-25 13:10:29,174:DEBUG:DBSDataDiscovery,267:Got 46 files
2019-02-25 13:10:34,634:DEBUG:DBSDataDiscovery,267:Got 71 files
2019-02-25 13:10:34,634:DEBUG:DBSDataDiscovery,267:Got 71 files
2019-02-25 13:10:40,221:DEBUG:DBSDataDiscovery,267:Got 46 files
belforte@crab-prod-tw01/logs>

yuyiguo commented 5 years ago

Valentin suggested to test against the localhost and I was thinking on the same line. Some test against the local host directly using below query: [0] http://localhost:8250/dbs/prod/global/DBSReader/datatierss?data_tier=RAW We can use the same one against the front end too. [1] https://cmsweb.cern.ch/dbs/prod/global/DBSReader/datatierss?data_tier=RAW

The order of the test should be [1], then [0] on each nodes.

vkuznet commented 5 years ago

Yuyi, the test against cmsweb will not work because of authentication, client ( in this case ServeMonitor daemon) needs support for x509 certificates , I doubt it can do that. But you can use dasgoclient to do the work, but in this case you need infrastructure to support x509 proxy and write your own monitoring daemon.

On Feb 25, 2019, at 17:15, Yuyi Guo notifications@github.com wrote:

Valentin suggested to test against the localhost and I was thinking on the same line. Some test against the local host directly using below query: [0] http://localhost:8250/dbs/prod/global/DBSReader/datatierss?data_tier=RAW We can use the same one against the front end too. [1] https://cmsweb.cern.ch/dbs/prod/global/DBSReader/datatierss?data_tier=RAW

The order of the test should be [1], then [0] on each nodes.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or mute the thread.

yuyiguo commented 5 years ago

Sorry, the correct urls are: [1] https://cmsweb.cern.ch/dbs/prod/global/DBSReader/datatiers?data_tier_name=RAW [0]http://localhost:8252/dbs/prod/global/DBSReader/datatiers?data_tier_name=RAW

I tried below on my VM: sudo -u root curl http://localhost:8252/dbs/dev/global/DBSReader/datatiers?data_tier_name=RAW {"exception": 403, "message": "You are not allowed to access this resource.", "type": "HTTPError"} I replaced root with _dbs, and others, I got the same error. What is needed to run the curl?

yuyiguo commented 5 years ago

I also tried use my cert that works with the front end, but not with the directly access to backend.

vkuznet commented 5 years ago

Yuyi, you don't need sudo for curl call. The later makes HTTP call to remote server, while former execute local UNIX process, so it is irrelevant in this case.

And, your issue comes from the fact that all of DBS APIs are secured under WMCore REST server, see https://github.com/dmwm/DBS/blob/master/Server/Python/src/dbs/web/DBSReaderModel.py#L96

The authzfunc in your case is authInsert, see https://github.com/dmwm/DBS/blob/master/Server/Python/src/dbs/web/DBSReaderModel.py#L54 which checks roles of incoming HTTP requests. The roles are setup by frontend server based on user certificate. Therefore in order to use curl call to localhost you must supply in your curl request the roles or create DBS un-secure method.

I don't recall from my head how roles are set in HTTP requests, but you can find out them either by inspecting frontend cmsauth module or use your DBS code and insert printouts and make curl call to https endpoint to see them.

yuyiguo commented 5 years ago

Thanks @vkuznet for the comments. I realized myself too this morning that the denying access is due to the role, not OS permission. You confirmed it. So looking into the code, I found a back door for the role [1]. In order to fake it, I need a new version of curl. So I did installation, then setup proper path and library. After all this, I was able to directly access the backend server while I am on the server node.

sudo -u root yum --enablerepo=cern install curl-openssl-7.57.0-1.1.el7.cern export PATH=/opt/shibboleth/bin:$PATH export LD_LIBRARY_PATH=/opt/shibboleth/lib64 curl -H "cms-auth-status":"NONE" http://localhost:8252/dbs/dev/global/DBSReader/datatiers?data_tier_name=RAW [{"create_by": "CMSDBS", "creation_date": 1394471870, "data_tier_id": 100, "data_tier_name": "RAW"}]

[1] https://github.com/dmwm/WMCore/blob/357520d553330071808e1453a89bcd4d4561d39d/src/python/WMCore/REST/Auth.py#L27

@h4d4 Could you do install curl-openssl on DBS prod and int nodes? We need to run this kind of test next time if we have situation like yesterday,

Thanks, Yuyi

vkuznet commented 5 years ago

Yuyi, I'm glad you figure it out, it's good to know the option we may explore for testing WMCore based services. Now, I suggest that you should adopt this query to ServerMonitor. If read ServerMonitor code [1] correctly it provides fake auth headers, including cms-auth-status, when it does its job. Therefore I think you should only need to change your deployment/dbs/monitoring.ini file and change

# The ping test fetches the provided URL and look for the following perl regex
PING_URL="http://localhost:8250/dbs/"
PING_REGEX="Welcome to the DMWM web framework"

# The ping test fetches the provided URL and look for the following perl regex
PING_URL="http://localhost:8250/dbs/<DBSINSTANCE>/global/DBSReader/datatiers?data_tier_name=RAW"
PING_REGEX=".*data_tier_id.*data_tier_name.*RAW"

But I think you need 3 different monitoring.ini files, one for dev, one for testebed and one for production. You'll also need to change DBSINSTANCE accordingly.

Since you need three monitoring.ini files you'll need

add template monitoring.ini file
change deploy script to adjust template monitoring.ini file to particular backend (see examples how we adjust backend.txt files for frontend [2]) and add similar logic your dbs/deploy file which will use proper monitoring.ini file with proper DBS instance.

Doing this way you can use datatiers DBS API to monitor DBS servers on all backends. The ServerMonitor will make this HTTP calls and if pattern will not match it will trigger alarm which goes to lemon.

Having proper version of curl on all backend nodes it helpful for manual debugging regardless of the service, therefore I would suggest to adjust all backend/frontend nodes to use newer version of curl.

The only question I have here is why do you need shibboleth path?

Best, Valentin.

[1] https://github.com/dmwm/deployment/blob/master/admin/ServerMonitor [2] https://github.com/dmwm/deployment/blob/master/frontend/deploy#L48

On 0, Yuyi Guo notifications@github.com wrote:

Thanks @vkuznet for the comments. I realized myself too this morning that the denying access is due to the role, not OS permission. You confirmed it. So looking into the code, I found a back door for the role [1]. In order to fake it, I need a new version of curl. So I did installation, then setup proper path and library. After all this, I was able to directly access the backend server while I am on the server node.

sudo -u root yum --enablerepo=cern install curl-openssl-7.57.0-1.1.el7.cern export PATH=/opt/shibboleth/bin:$PATH export LD_LIBRARY_PATH=/opt/shibboleth/lib64 curl -H "cms-auth-status":"NONE" http://localhost:8252/dbs/dev/global/DBSReader/datatiers?data_tier_name=RAW [{"create_by": "CMSDBS", "creation_date": 1394471870, "data_tier_id": 100, "data_tier_name": "RAW"}]

[1] https://github.com/dmwm/WMCore/blob/357520d553330071808e1453a89bcd4d4561d39d/src/python/WMCore/REST/Auth.py#L27

@h4d4 Could you do install curl-openssl on DBS prod and int nodes? We need to run this kind of test next time if we have situation like yesterday,

Thanks, Yuyi

-- You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub: https://github.com/dmwm/DBS/issues/595#issuecomment-467557098

vkuznet commented 5 years ago

Yuyi, could you please edit original posting in this issue that it will be properly formatted. So far you cut and paste ASCII email I sent, here it is rendered as Markdown, therefore you should properly adjust/replace pound signs and put in triple quotes the commands I listed in email. So far it is really hard to read initial message.

yuyiguo commented 5 years ago

@vkuznet The default yum install put crul-openssl into /opt/shibboleth/bin. I don't know how to edit comments already published in github issues. If I cut past them as new, the time line will be wrong. I

belforte commented 5 years ago

@yuyiguo you can only edit comments which you wrote. Click on the three dots top right of the comment header, a menu wll pop up which includes Edit

vkuznet commented 5 years ago

Go to top of the page and click on three dots in your original post. Since you're the owner it will give you option "Edit", click on it and edit the post.

On 0, Yuyi Guo notifications@github.com wrote:

@vkuznet The default yum install put crul-openssl into /opt/shibboleth/bin. I don't know how to edit comments already published in github issues. If I cut past them as new, the time line will be wrong. I

-- You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub: https://github.com/dmwm/DBS/issues/595#issuecomment-467600387

belforte commented 5 years ago

@yuyiguo the easier way will likely be to put three single quotes before and after the text that you pasted (in separate lines)

yuyiguo commented 5 years ago

Thanks @belforte @vkuznet updated with 3 single quotes.

yuyiguo commented 5 years ago

@amaltaro , Alan, I haven't heard you on the issue that DBS severed requests on and off over last weekend and even Monday morning. How was WM agents during that time? Did you get a lot of failed jobs? Thanks, Yuyi

amaltaro commented 5 years ago

The impact on WMAgent is not easily visible, and in general it keeps retrying until it succeeds. The consequences for the Central Production services might be (skipping unified): ) new workflows will failed to get splitted by global workqueue (not a big deal if you have a large backlog). It's a problem if you have super urgent small requests though ) workflows might failed to get acquired by the agents (local workqueue). Delaying the work acquisition process *) workflows might take longer to be announced, since DBS3Upload is probably failing to inject data into DBS.

So in summary, it's bad of course but not something that will cause failures requiring much more effort to sort them out (or recover whatever problems were caused).

yuyiguo commented 5 years ago

Thanks Alan. Do you or others have data to show at what time period over last weekend workflows experienced a lot failing to split, upload to DBS3 or any actions needed to contact to DBS? I am trying to figure out why some of the DBS clients could not connect to DBS after Lina restarted DBS during the weekend and some are OK.

amaltaro commented 5 years ago

Yes, it should be doable. Global workqueue isn't continuously splitting work because it depends on what workflows are put in the system. But maybe you can have an idea by looking at the global workqueue logs, for instance:

amaltaro@vocms055:/build/srv-logs/vocms0740 $ grep -r 'Proxy Error' workqueue/reqmgrInteractionTask-20190223.log | wc -l
19
amaltaro@vocms055:/build/srv-logs/vocms0740 $ grep -r 'Proxy Error' workqueue/reqmgrInteractionTask-20190224.log | wc -l
68

Otherwise I can also have a look at an agent or two.

belforte commented 5 years ago

On 27/02/2019 09:32, Alan Malta Rodrigues wrote:

*) new workflows will failed to get splitted by global workqueue (not a big deal if you have a large backlog).

Yeah.. which is all of the difference with CRAB where we get a continous stream of new small (by P&R standard) requests.

*) workflows might take longer to be announced, since DBS3Upload is probably failing to inject data into DBS.

maybe not, since writing is handlds by a different server instance.

Stefano

yuyiguo commented 5 years ago

Thanks Alan, I will look at the global workqueue logs.

yuyiguo commented 5 years ago

Hello Yuyi,

regarding this alarm saying that DBS was down in '01/Mar/2019' around  '05:26:28' GVA time (time in which ticket was opened).
And that therefore, the service was restarted around '05:31:32'

The 'correlation expression: (vocms0161:8065:1[5]_<_6)', that triggers the alarm says that at that moment only 5 of 6 process of the service where running. A process for each instance of the service.

But what I found into services instances logs, taking as reference this expression 
"ERROR:root:Missing componentDir and General section in config", 
(that as you know means that the service is being restarted), 
is that there are not signal of problems or errors with the queries. 

The only that I found weird is there into logs for '****** DBSGlobalReader-20190301.log'[0], there
are not logs entries between '01/Mar/2019:05:13:43' and '01/Mar/2019:05:30:30' just after service reboot.

I'm adding below, logs of each instance, please let me know in case that you found more interpretations for the logs.

I've attached the logs for all instances, corresponding with the reboot.
I've not reported this in the issue https://github.com/dmwm/DBS/issues/595
because, I don't know if this is should be the correct thread.

Best Regards, Lina.
[0] '****** DBSGlobalReader-20190301.log
INFO:cherrypy.access:[01/Mar/2019:05:13:43] vocms0161.cern.ch 148.187.39.58 "GET /dbs/prod/global/DBSReader/filechildren/?logical_file_name=%2Fstore%2Fdata%2FRun2018B%2FJetHT%2FAOD%2F17Sep2018-v1%2F00005%2F4B1D487A-0A8F-734F-A311-3701D447F2AC.root HTTP/1.1" 200 OK [data: - in 2 out 3920 us ] [auth: OK "/C=DE/O=GermanGrid/OU=uni-hamburg/CN=Viktor Gerhard Kutzner" "" ] [ref: "" "dasgoclient/v02.01.01" ]
ERROR:root:Missing componentDir and General section in config
ERROR:root:Going to trust you to know what you're doing.
CherryPy Checker:
The Application mounted at u'/dbs/prod/global/DBSReader' has an empty config.

INFO:cherrypy.access:[01/Mar/2019:05:30:30] vocms0161.cern.ch 2001:1458:d00:1a::7b "GET /dbs/prod/global/DBSReader/filesummaries?validFileOnly=1&dataset=%2FVBFHToEEG_M125_MLL-0To60_Dalitz_13TeV_madgraph_pythia8%2FRunIISummer16DR80Premix-PUMoriond17_80X_mcRun2_asymptotic_2016_TrancheIV_v6-v1%2FAODSIM HTTP/1.1" 200 OK [data: - in 100 out 119906 us ] [auth: OK "/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=vlimant/CN=652956/CN=Jean-Roch Vlimant" "" ] [ref: "" "DBSClient/3.5.1/" ]

From: CERN Service Desk [service-desk@cern.ch]
Sent: Friday, March 01, 2019 5:32 AM
To: cms-service-webtools (Announcements from IT and VOC to Webtools. Discussions am...)
Subject: INC1930225 "[GNI] exception.cmsweb_dbs_is_down in vocmsweb/backend"

***Please do not reply to this email unless you are unsatisfied with the proposed solution***
Dear GNI,
There is a proposed solution for you.

Ticket No: INC1930225, Opened: 01-03-2019 05:26:28
Short description: [GNI] exception.cmsweb_dbs_is_down in vocmsweb/backend
Message from Pascal Marc Sicault:
service restarted
Should you require any further assistance, please inform us within 5 working days through the CERN Service Portal, or reply directly to this email. Otherwise, the system will automatically close the ticket on 08-03-2019, and it will be necessary to create a new ticket.

Kind Regards, CERN Service Desk

The list of the affected CIs follows: 
vocms0161

yuyiguo commented 5 years ago

@h4d4 Lina, The trigger of restarting DBS server on vocms0161 was because it failed to responding to the pin? I saw that the log file had no entries for about 15 minutes before the restarting. It would be nice that we could test it with the local curl call to DBS server directly before the restarting next time.

Did you get chance to install the new version of curl on dbs production nodes yet?

Thanks, Yuyi

h4d4 commented 5 years ago

@yuyiguo Yuyi, I'd say that this absence of logs entries in DBS DBSGlobalReader logs, was what trigered the alarm [0], because in the file that I've attached you, in the email, the other DB's intances has log entries during this time stamp. Then CERN IT's operators have responded to this alarm, restarting the service[1]. They follow procedure described here [2].

I've maybe missed the message to make test with latest curl version, I'll check above entries.

Best Regards, Lina. [0] https://cern.service-now.com/service-portal/view-incident.do?n=INC1930225 [1] https://cern.service-now.com/service-portal/view-incident.do?n=INC1930225 [2] https://cern.service-now.com/service-portal/article.do?n=KB0002246

h4d4 commented 5 years ago

@yuyiguo Yuyi,

In reference to:

'curl' installation for DBS's servers nodes, the command is already installed there.
necessary tests to run when DBS got stuck

Based in to above entries I understood that I should run: curl -H "cms-auth-status":"NONE" http://localhost:8252/dbs/prod/global/DBSReader/datatiers?data_tier_name=RAW Therefore, just to test this sentence for production, I've run it as is shown in [0].

Now my question is, which output is accepted, which one no? and which actions should I take based on the output?

Best Regards, Lina.

[lijimene@aiadm87 ~]$ broadcast -n "gp-prod" -c "curl -H "cms-auth-status":"NONE" http://localhost:8252/dbs/prod/global/DBSReader/datatiers?data_tier_name=RAW" -u "cmsweb"

vocms0136:
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   140  100   140    0     0   3807      0 --:--:-- --:--:-- --:--:--  3888[{"create_by": "/DC=org/DC=doegrids/OU=People/CN=Yuyi Guo 899208", "creation_date": 1176750788, "data_tier_id": 1, "data_tier_name": "RAW"}]

vocms0161:
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   140  100   140    0     0  15042      0 --:--:-- --:--:-- --:--:-- 17500[{"create_by": "/DC=org/DC=doegrids/OU=People/CN=Yuyi Guo 899208", "creation_date": 1176750788, "data_tier_id": 1, "data_tier_name": "RAW"}]
vocms0163:
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   140  100   140    0     0   5566      0 --:--:-- --:--:-- --:--:--  5833[{"create_by": "/DC=org/DC=doegrids/OU=People/CN=Yuyi Guo 899208", "creation_date": 1176750788, "data_tier_id": 1, "data_tier_name": "RAW"}]
vocms0165:
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   140  100   140    0     0  15968      0 --:--:-- --:--:-- --:--:-- 17500[{"create_by": "/DC=org/DC=doegrids/OU=People/CN=Yuyi Guo 899208", "creation_date": 1176750788, "data_tier_id": 1, "data_tier_name": "RAW"}]
[lijimene@aiadm87 ~]$

vkuznet commented 5 years ago

Lina, my recommendation is to check for presence "RAW" in curl output. If it is there it means DBS instance is responding with actual JSON. At this level we don't care how many RAW tiers we have, we just need to get a JSON. But if RAW is not in an output then it is a problem. SO, you need to trigger events/alerts when there is no RAW in an output. Best, Valentin.

On 0, Lina Jiménez notifications@github.com wrote:

@yuyi Yuyi,

In reference to:

'curl' installation for DBS's servers nodes, the command is already installed there.
necessary tests to run when DBS got stuck

Now my question is, which output is accepted, which one no? and which actions should I take based on the output?

Best Regards, Lina.

[lijimene@aiadm87 ~]$ broadcast -n "gp-prod" -c "curl -H "cms-auth-status":"NONE" http://localhost:8252/dbs/prod/global/DBSReader/datatiers?data_tier_name=RAW" -u "cmsweb"

vocms0136:
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   140  100   140    0     0   3807      0 --:--:-- --:--:-- --:--:--  3888[{"create_by": "/DC=org/DC=doegrids/OU=People/CN=Yuyi Guo 899208", "creation_date": 1176750788, "data_tier_id": 1, "data_tier_name": "RAW"}]

vocms0161:
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   140  100   140    0     0  15042      0 --:--:-- --:--:-- --:--:-- 17500[{"create_by": "/DC=org/DC=doegrids/OU=People/CN=Yuyi Guo 899208", "creation_date": 1176750788, "data_tier_id": 1, "data_tier_name": "RAW"}]
vocms0163:
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   140  100   140    0     0   5566      0 --:--:-- --:--:-- --:--:--  5833[{"create_by": "/DC=org/DC=doegrids/OU=People/CN=Yuyi Guo 899208", "creation_date": 1176750788, "data_tier_id": 1, "data_tier_name": "RAW"}]
vocms0165:
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   140  100   140    0     0  15968      0 --:--:-- --:--:-- --:--:-- 17500[{"create_by": "/DC=org/DC=doegrids/OU=People/CN=Yuyi Guo 899208", "creation_date": 1176750788, "data_tier_id": 1, "data_tier_name": "RAW"}]
[lijimene@aiadm87 ~]$

-- You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub: https://github.com/dmwm/DBS/issues/595#issuecomment-469245253