Closed yuyiguo closed 4 years ago
Hello all,
Sorry for have not been clear, I need to make some clarifications regarding provided value for 'limit for opened files in each backed'[0], this is the default limit value for 'file to open by system calls' for each whole node.
And that default value is set to '1468454' as is shown below, this configuration is already set up on the nodes.
[lijimene@aiadm87 ~]$ broadcast -n "gp-prod" -c "cat /proc/sys/fs/file-max" -u "cmsweb"
vocms0136:
1468646
vocms0161:
1468454
vocms0163:
1468464
vocms0165:
1468454
[lijimene@aiadm87 ~]$
Also a recalling, DBS in hosted in general propose backends, this means that other services are also hosted there[1], as for example crabserver and Reqmgr2.
Yuyi @yuyiguo , Valentin @vkuznet, I'd like to know if at the moment you consider that we should increase the above value. We have an old email thread about this same Error, the email subject is: 'October release of DBS'.
Regarding the proposed tests, let me review this, I think that this should be run first on testbed, what do you think?
Best Regards, Lina.
[0]
[lijimene@aiadm87 ~]$ broadcast -n "gp-prod" -c "ulimit -a | grep open" -u "cmsweb"
vocms0136:
open files (-n) 1024
vocms0161:
open files (-n) 1024
vocms0163:
open files (-n) 1024
vocms0165:
open files (-n) 1024
[1] https://cms-http-group.web.cern.ch/cms-http-group/activity.html
Lina, the /rpco/sys/fs/file-max defines a limit on total number of file descriptors we can have on a system, while ulimit shoes soft/hard limits each individual user may have. I think what we need is to increase soft/hard limits for _dbs user who runs DBS services.
Please read about soft/hard/system limits here: https://unix.stackexchange.com/questions/84227/limits-on-the-number-of-file-descriptors and how to change them for particular users here: https://cs.uwaterloo.ca/~brecht/servers/openfiles.html
It seems to me that soft/hard limits for _dbs users should be increased to something like 65535. But it would be nice to find out actual limit through simulation I proposed, i.e. watch how much file descriptors DBS uses with increased load.
Regarding tests, as I wrote it should be done on private VMs and not on production nodes. Since we use testbed for other purposes it is also not a good candidate for tests. But you can change first file descriptors limits on testbed.
Best, Valentin.
On 0, Lina Jiménez notifications@github.com wrote:
Hello all,
Sorry for have not been clear, I need to make some clarifications regarding provided value for 'limit for opened files in each backed'[0], this is the default limit value for 'file to open by system calls' for each whole node.
And that default value is set to '1468454' as is shown below, this configuration is already set up on the nodes.
[lijimene@aiadm87 ~]$ broadcast -n "gp-prod" -c "cat /proc/sys/fs/file-max" -u "cmsweb" vocms0136: 1468646 vocms0161: 1468454 vocms0163: 1468464 vocms0165: 1468454 [lijimene@aiadm87 ~]$
Also a recalling, DBS in hosted in general propose backends, this means that other services are also hosted there[1], as for example crabserver and Reqmgr2.
Yuyi @yuyiguo , Valentin @vkuznet, I'd like to know if at the moment you consider that we should increase the above value. We have an old email thread about this same Error, the email subject is: 'October release of DBS'.
Regarding the proposed tests, let me review this, I think that this should be run first on testbed, what do you think?
Best Regards, Lina.
[0] [lijimene@aiadm87 ~]$ broadcast -n "gp-prod" -c "ulimit -a | grep open" -u "cmsweb" vocms0136: open files (-n) 1024 vocms0161: open files (-n) 1024 vocms0163: open files (-n) 1024 vocms0165: open files (-n) 1024
[1] https://cms-http-group.web.cern.ch/cms-http-group/activity.html
-- You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub: https://github.com/dmwm/DBS/issues/595#issuecomment-465596514
Thanks Lina for finding out more data. I also appreciate Valentin's explanation. Yes, please increase the number for _dbs first while we are preparing the test. I feel it is better to run the simulation test on cmsweb testbed because it is very close to prod. We may run the test between the releases while people are not testing against testbed.
Thanks @vkuznet , you rigth with the explanation.
For the increasing, I've set up the value as sugested, I've also used this reference[0]
This is the value before changes:
[lijimene@aiadm02 ~]$ broadcast -n "gp-prod" -c "sudo -H -u _dbs bashs -l -c 'ulimit -aHS' -s '/bin/bash' | grep open" -u "cmsweb"
vocms0136:
open files (-n) 1024
vocms0161:
open files (-n) 1024
vocms0163:
open files (-n) 1024
vocms0165:
open files (-n) 1024
And this is after changes:
[lijimene@aiadm02 ~]$ broadcast -n "gp-prod" -c "sudo -H -u _dbs bashs -l -c 'ulimit -aHS' -s '/bin/bash' | grep open" -u "cmsweb"
vocms0136:
open files (-n) 65535
vocms0161:
open files (-n) 65535
vocms0163:
open files (-n) 65535
vocms0165:
open files (-n) 65535
I've set up the value into file /etc/security/limits.conf
as references sugested.
I tested in ad dev node and the value is kept after rebooting.
En reference to the tests, @yuyiguo yes testbed provides a closer environment to production, for example, frontend and backend software stuff are in differentes nodes, such in production. While in dev, both frontend and backend are in the same node. Current testbed phase will be finish for 5th of March, we could use this without affecting current testing release.
Best Regars, Lina. [0] https://easyengine.io/tutorials/linux/increase-open-files-limit
@yuyiguo @vkuznet
Hello All, below a service's behaviour report regarding latest issue: 'DBS stuck again' reported on 'cms-service-webtools' The last restart that I did to the service was today(Monday 25th of February), as below:
around 11:47h GVA time in node vocms0136. there the amount of opened files was higher('12160' ) for DBSGlobalReader '24737', compared with the others nodes[1], but it was under the set up limit('65535')[2]. In [0] full output.
around 13:06h GVA time in the rest of the nodes in [1] the amount of opened files
My first thought is that the high amount of opened files on vomcs0136 was not related with this stuck. I'll coming back with more info from the logs.
Best Regards, Lina. [0]
[cmsweb@vocms0136 dbs]$ sudo -H -u _dbs bashs -l -c "/data/srv/current/config/dbs/manage status 'I did read documentation'"
DBSDef is RUNNING, PID 24729
DBSGlobalReader is RUNNING, PID 24737
DBSGlobalWriter is RUNNING, PID 24745
DBSPhys03Reader is RUNNING, PID 24753
DBSPhys03Writer is RUNNING, PID 24761
DBSMigrate is RUNNING, PID 24769
[cmsweb@vocms0136 dbs]$ pids=` sudo -H -u _dbs bashs -l -c "/data/srv/current/config/dbs/manage status 'I did read documentation'" | awk '{ORS=" "; print $5}'`
[cmsweb@vocms0136 dbs]$ for i in $pids; do echo PID $i '#'opened fd:; (sudo lsof -a -p $i | wc -l); done
PID 24729 #opened fd:
83
PID 24737 #opened fd:
12160
PID 24745 #opened fd:
86
PID 24753 #opened fd:
84
PID 24761 #opened fd:
84
PID 24769 #opened fd:
79
[cmsweb@vocms0136 dbs]$
[1]
[cmsweb@vocms0161 dbs]$ sudo -H -u _dbs bashs -l -c "/data/srv/current/config/dbs/manage status 'I did read documentation'"
DBSDef is RUNNING, PID 7172
DBSGlobalReader is RUNNING, PID 7180
DBSGlobalWriter is RUNNING, PID 7188
DBSPhys03Reader is RUNNING, PID 7196
DBSPhys03Writer is RUNNING, PID 7204
DBSMigrate is RUNNING, PID 7212
[cmsweb@vocms0161 dbs]$ pids=` sudo -H -u _dbs bashs -l -c "/data/srv/current/config/dbs/manage status 'I did read documentation'" | awk '{ORS=" "; print $5}'`
[cmsweb@vocms0161 dbs]$ for i in $pids; do echo PID $i '#'opened fd:; (sudo lsof -a -p $i | wc -l); done
PID 7172 #opened fd:
83
PID 7180 #opened fd:
11939
PID 7188 #opened fd:
85
PID 7196 #opened fd:
84
PID 7204 #opened fd:
85
PID 7212 #opened fd:
79
[cmsweb@vocms0161 dbs]$
[cmsweb@vocms0165 dbs]$ sudo -H -u _dbs bashs -l -c "/data/srv/current/config/dbs/manage status 'I did read documentation'"
DBSDef is RUNNING, PID 12715
DBSGlobalReader is RUNNING, PID 12723
DBSGlobalWriter is RUNNING, PID 12731
DBSPhys03Reader is RUNNING, PID 12739
DBSPhys03Writer is RUNNING, PID 12747
DBSMigrate is RUNNING, PID 12755
[cmsweb@vocms0165 dbs]$ pids=` sudo -H -u _dbs bashs -l -c "/data/srv/current/config/dbs/manage status 'I did read documentation'" | awk '{ORS=" "; print $5}'`
[cmsweb@vocms0165 dbs]$ for i in $pids; do echo PID $i '#'opened fd:; (sudo lsof -a -p $i | wc -l); done
PID 12715 #opened fd:
83
PID 12723 #opened fd:
11684
PID 12731 #opened fd:
85
PID 12739 #opened fd:
84
PID 12747 #opened fd:
86
PID 12755 #opened fd:
79
[cmsweb@vocms0165 dbs]$
[2]
[lijimene@aiadm90 ~]$ broadcast -n "gp-prod" -c "sudo -H -u _dbs bashs -l -c 'ulimit -aHS' -s '/bin/bash' | grep open" -u "cmsweb"
vocms0136:
open files (-n) 65535
vocms0161:
open files (-n) 65535
vocms0163:
open files (-n) 65535
vocms0165:
open files (-n) 65535
[lijimene@aiadm90 ~]$
Lina, that's great info. For the first time we see the following:
sudo ls -al /proc/20030/fd
total 0
dr-x------. 2 prometheus prometheus 0 Feb 21 22:20 .
dr-xr-xr-x. 9 prometheus prometheus 0 Feb 21 22:20 ..
lr-x------. 1 prometheus prometheus 64 Feb 21 22:20 0 -> /dev/null
lrwx------. 1 prometheus prometheus 64 Feb 21 22:20 1 -> socket:[56351]
lrwx------. 1 prometheus prometheus 64 Feb 21 22:20 10 -> socket:[56403]
lr-x------. 1 prometheus prometheus 64 Feb 21 22:20 11 -> /var/lib/prometheus/01D4FBQ5DD9D78378NBNW3TP41/chunks/000001
lr-x------. 1 prometheus prometheus 64 Feb 21 22:20 12 -> /var/lib/prometheus/01D4BG431ZHKD38KSTQZB0X3XB/index
...
It has socket fd's and fd's pointing to /var/lib area which correspond to its open files used by the process.
We need to start looking into what DBSGlobalReader is doing and understand why it has so many fds. Again I don't know what it should have and what is "large" for it.
I finally got prometheus code to monitor this basic info and I'll send another email with explanation.
Best, Valentin.
On 0, Lina Jiménez notifications@github.com wrote:
@yuyiguo @vkuznet
Hello All, below a service's behaviour report regarding latest issue: 'DBS stuck again' reported on 'cms-service-webtools' The last restart that I did to the service was today(Monday 25th of February), as below:
around 11:47h GVA time in node vocms0136. there the amount of opened files was higher('12160' ) for DBSGlobalReader '24737', compared with the others nodes[1], but it was under the set up limit('65535')[2]. In [0] full output.
around 13:06h GVA time in the rest of the nodes in [1] the amount of opened files
My first thought is that the high amount of opened files on vomcs0136 was not related with this stuck. I'll coming back with more info from the logs.
Best Regards, Lina. [0]
[cmsweb@vocms0136 dbs]$ sudo -H -u _dbs bashs -l -c "/data/srv/current/config/dbs/manage status 'I did read documentation'" DBSDef is RUNNING, PID 24729 DBSGlobalReader is RUNNING, PID 24737 DBSGlobalWriter is RUNNING, PID 24745 DBSPhys03Reader is RUNNING, PID 24753 DBSPhys03Writer is RUNNING, PID 24761 DBSMigrate is RUNNING, PID 24769 [cmsweb@vocms0136 dbs]$ pids=` sudo -H -u _dbs bashs -l -c "/data/srv/current/config/dbs/manage status 'I did read documentation'" | awk '{ORS=" "; print $5}'` [cmsweb@vocms0136 dbs]$ for i in $pids; do echo PID $i '#'opened fd:; (sudo lsof -a -p $i | wc -l); done PID 24729 #opened fd: 83 PID 24737 #opened fd: 12160 PID 24745 #opened fd: 86 PID 24753 #opened fd: 84 PID 24761 #opened fd: 84 PID 24769 #opened fd: 79 [cmsweb@vocms0136 dbs]$
[1]
[cmsweb@vocms0161 dbs]$ sudo -H -u _dbs bashs -l -c "/data/srv/current/config/dbs/manage status 'I did read documentation'" DBSDef is RUNNING, PID 7172 DBSGlobalReader is RUNNING, PID 7180 DBSGlobalWriter is RUNNING, PID 7188 DBSPhys03Reader is RUNNING, PID 7196 DBSPhys03Writer is RUNNING, PID 7204 DBSMigrate is RUNNING, PID 7212 [cmsweb@vocms0161 dbs]$ pids=` sudo -H -u _dbs bashs -l -c "/data/srv/current/config/dbs/manage status 'I did read documentation'" | awk '{ORS=" "; print $5}'` [cmsweb@vocms0161 dbs]$ for i in $pids; do echo PID $i '#'opened fd:; (sudo lsof -a -p $i | wc -l); done PID 7172 #opened fd: 83 PID 7180 #opened fd: 11939 PID 7188 #opened fd: 85 PID 7196 #opened fd: 84 PID 7204 #opened fd: 85 PID 7212 #opened fd: 79 [cmsweb@vocms0161 dbs]$ [cmsweb@vocms0165 dbs]$ sudo -H -u _dbs bashs -l -c "/data/srv/current/config/dbs/manage status 'I did read documentation'" DBSDef is RUNNING, PID 12715 DBSGlobalReader is RUNNING, PID 12723 DBSGlobalWriter is RUNNING, PID 12731 DBSPhys03Reader is RUNNING, PID 12739 DBSPhys03Writer is RUNNING, PID 12747 DBSMigrate is RUNNING, PID 12755 [cmsweb@vocms0165 dbs]$ pids=` sudo -H -u _dbs bashs -l -c "/data/srv/current/config/dbs/manage status 'I did read documentation'" | awk '{ORS=" "; print $5}'` [cmsweb@vocms0165 dbs]$ for i in $pids; do echo PID $i '#'opened fd:; (sudo lsof -a -p $i | wc -l); done PID 12715 #opened fd: 83 PID 12723 #opened fd: 11684 PID 12731 #opened fd: 85 PID 12739 #opened fd: 84 PID 12747 #opened fd: 86 PID 12755 #opened fd: 79 [cmsweb@vocms0165 dbs]$
[2]
[lijimene@aiadm90 ~]$ broadcast -n "gp-prod" -c "sudo -H -u _dbs bashs -l -c 'ulimit -aHS' -s '/bin/bash' | grep open" -u "cmsweb" vocms0136: open files (-n) 65535 vocms0161: open files (-n) 65535 vocms0163: open files (-n) 65535 vocms0165: open files (-n) 65535 [lijimene@aiadm90 ~]$
-- You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub: https://github.com/dmwm/DBS/issues/595#issuecomment-467027300
@h4d4 Lina, Thank you so much for such detailed info. I have one question, When you were checking DBS while people reporting DBS stuck, You found the read server on 0136 was not responding? You did not see it handling queries in the log file, but the rest three readers were fine?
The 0136 reader had 121610 open files, the other two has 11939 and 11684 ( we missed 0163). To me 0136 did not have significant more opened files than the other two.
@vkuznet To answer you questions why the reader has much higher open files, it is because the readers are the most accessed. I will follow up with your suggestions. After that I will check if there is any leaking in DBS code. From the email I got, the earliest report of DBS stuck was 6:05 am CT Feb 25 .
Lina, Would you please give me access to cmsweb prod dbs backends so I can check the ope fds?
Thanks, Yuyi
@yuyiguo Yuyi,
would you please try to access any of the nodes vocms01[36,61,63,65], and let me know how it is going? 'yuyi' user is already into interactive users for those nodes[1].
Best Regards, Lina. [0] https://gitlab.cern.ch/ai/it-puppet-hostgroup-vocmsweb/blob/master/data/fqdns/vocms0165.cern.ch.yaml
@h4d4 Lina, Just tried. I was able to access 0136/63/65, but not 61. Thanks, Yuyi
@yuyiguo I'm going to check 0161. Regarding this question "When you were checking DBS while people reporting DBS stuck, You found the read server on 0136 was not responding":
DBSGlobalReader was responding queries just before I restarted it, please see below logs, from bottom to top.
INFO:cherrypy.access:[25/Feb/2019:11:45:41] vocms0136.cern.ch 131.169.77.46 "GET /dbs/prod/global/DBSReader/files/?dataset=%2FNeutrino_E-10_gun%2FRunIISummer17PrePremix-MCv2_correctPU_94X_mc2017_realistic_v9-v1%2FGEN-SIM-DIGI-RAW&detail=True&validFileOnly=1 HTTP/1.1" 200 OK [data:
- in 151500905 out 105780778 us ] [auth: OK "/C=DE/O=GermanGrid/OU=DESY/CN=David Brunner" "" ] [ref: "" "dasgoclient/v02.01.01" ]
INFO:cherrypy.access:[25/Feb/2019:11:45:42] vocms0136.cern.ch 2001:1458:d00:14::a6 "GET /dbs/prod/global/DBSReader/filesummaries/?dataset=%2FDoubleEG%2FRun2016B-03Feb2017_ver2-v2%2FMINIAOD&validFileOnly=1 HTTP/1.1" 200 OK [data: - in 108 out 126402 us ] [auth: OK "/DC=ch/DC=cern/OU
=Organic Units/OU=Users/CN=lijimene/CN=817301/CN=Lina Marcela Jimenez Becerra" "" ] [ref: "" "dasgoserver" ]
INFO:cherrypy.access:[25/Feb/2019:11:45:45] vocms0136.cern.ch 131.169.164.18 "GET /dbs/prod/global/DBSReader/files/?dataset=%2FNeutrino_E-10_gun%2FRunIISummer17PrePremix-MCv2_correctPU_94X_mc2017_realistic_v9-v1%2FGEN-SIM-DIGI-RAW&detail=True&validFileOnly=1 HTTP/1.1" 200 OK [data:
- in 151500905 out 105549251 us ] [auth: OK "/C=DE/O=GermanGrid/OU=DESY/CN=David Brunner" "" ] [ref: "" "dasgoclient/v02.01.01" ]
INFO:cherrypy.access:[25/Feb/2019:11:45:47] vocms0136.cern.ch 131.169.84.103 "GET /dbs/prod/global/DBSReader/files/?dataset=%2FNeutrino_E-10_gun%2FRunIISummer17PrePremix-MCv2_correctPU_94X_mc2017_realistic_v9-v1%2FGEN-SIM-DIGI-RAW&detail=True&validFileOnly=1 HTTP/1.1" 200 OK [data:
- in 151500905 out 108557093 us ] [auth: OK "/C=DE/O=GermanGrid/OU=DESY/CN=David Brunner" "" ] [ref: "" "dasgoclient/v02.01.01" ]
INFO:cherrypy.access:[25/Feb/2019:11:45:54] vocms0136.cern.ch 2001:1458:d00:14::a6 "GET /dbs/prod/global/DBSReader/datasets/?dataset=%2FMuonEG%2FRun2018D%2A%2F%2A&dataset_access_type=VALID&detail=True HTTP/1.1" 200 OK [data: - in 4896 out 72884 us ] [auth: OK "/DC=ch/DC=cern/OU=Org
anic Units/OU=Users/CN=lijimene/CN=817301/CN=Lina Marcela Jimenez Becerra" "" ] [ref: "" "dasgoserver" ]
INFO:cherrypy.access:[25/Feb/2019:11:45:56] vocms0136.cern.ch 2001:1458:d00:14::a6 "GET /dbs/prod/global/DBSReader/releaseversions/?dataset=%2FJetHT%2FRun2016B-07Aug17_ver2-v2%2FMINIAOD HTTP/1.1" 200 OK [data: - in 4 out 83410 us ] [auth: OK "/DC=ch/DC=cern/OU=Organic Units/OU=User
s/CN=lijimene/CN=817301/CN=Lina Marcela Jimenez Becerra" "" ] [ref: "" "dasgoserver" ]
INFO:cherrypy.access:[25/Feb/2019:11:46:34] vocms0136.cern.ch 131.169.77.62 "GET /dbs/prod/global/DBSReader/files/?dataset=%2FNeutrino_E-10_gun%2FRunIISummer17PrePremix-MCv2_correctPU_94X_mc2017_realistic_v9-v1%2FGEN-SIM-DIGI-RAW&detail=True&validFileOnly=1 HTTP/1.1" 200 OK [data:
- in 151500905 out 116154577 us ] [auth: OK "/C=DE/O=GermanGrid/OU=DESY/CN=David Brunner" "" ] [ref: "" "dasgoclient/v02.01.01" ]
INFO:cherrypy.access:[25/Feb/2019:11:46:36] vocms0136.cern.ch 131.169.163.137 "GET /dbs/prod/global/DBSReader/files/?dataset=%2FNeutrino_E-10_gun%2FRunIISummer17PrePremix-MCv2_correctPU_94X_mc2017_realistic_v9-v1%2FGEN-SIM-DIGI-RAW&detail=True&validFileOnly=1 HTTP/1.1" 200 OK [data
: - in 151500905 out 122021117 us ] [auth: OK "/C=DE/O=GermanGrid/OU=DESY/CN=David Brunner" "" ] [ref: "" "dasgoclient/v02.01.01" ]
INFO:cherrypy.access:[25/Feb/2019:11:46:42] vocms0136.cern.ch 131.169.163.128 "GET /dbs/prod/global/DBSReader/files/?dataset=%2FNeutrino_E-10_gun%2FRunIISummer17PrePremix-MCv2_correctPU_94X_mc2017_realistic_v9-v1%2FGEN-SIM-DIGI-RAW&detail=True&validFileOnly=1 HTTP/1.1" 200 OK [data
: - in 151500905 out 112493682 us ] [auth: OK "/C=DE/O=GermanGrid/OU=DESY/CN=David Brunner" "" ] [ref: "" "dasgoclient/v02.01.01" ]
ERROR:cherrypy.error:Mon Feb 25 10:46:48 2019 dbsException-invalid-input2: '/mc-physval-120-DrellYan-LowLumi/FEVT/CMSSW_1_2_0-FEVT-1166809258' does not match regular expression ^/([a-zA-Z0-9]+[a-zA-Z0-9\-\._]*)/([a-zA-Z]+[a-zA-Z0-9_\-\.]*)/([A-Z_\-\.]+)$ key-value pair (dataset, /m
c-physval-120-DrellYan-LowLumi/FEVT/CMSSW_1_2_0-FEVT-1166809258) cannot pass input checking
INFO:cherrypy.access:[25/Feb/2019:11:46:48] vocms0136.cern.ch 2001:1458:d00:14::8b "GET /dbs/prod/global/DBSReader/datasets/?dataset=%2Fmc-physval-120-DrellYan-LowLumi%2FFEVT%2FCMSSW_1_2_0-FEVT-1166809258&dataset_access_type=%2A&detail=True HTTP/1.1" 400 Bad Request [data: - in 113
out 13535 us ] [auth: OK "/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=lijimene/CN=817301/CN=Lina Marcela Jimenez Becerra" "" ] [ref: "" "dasgoserver" ]
INFO:cherrypy.access:[25/Feb/2019:11:47:07] vocms0136.cern.ch 131.169.84.186 "GET /dbs/prod/global/DBSReader/files/?dataset=%2FNeutrino_E-10_gun%2FRunIISummer17PrePremix-MCv2_correctPU_94X_mc2017_realistic_v9-v1%2FGEN-SIM-DIGI-RAW&detail=True&validFileOnly=1 HTTP/1.1" 200 OK [data:
- in 151500905 out 105578002 us ] [auth: OK "/C=DE/O=GermanGrid/OU=DESY/CN=David Brunner" "" ] [ref: "" "dasgoclient/v02.01.01" ]
INFO:cherrypy.access:[25/Feb/2019:11:47:09] vocms0136.cern.ch 131.169.164.34 "GET /dbs/prod/global/DBSReader/files/?dataset=%2FNeutrino_E-10_gun%2FRunIISummer17PrePremix-MCv2_correctPU_94X_mc2017_realistic_v9-v1%2FGEN-SIM-DIGI-RAW&detail=True&validFileOnly=1 HTTP/1.1" 200 OK [data:
- in 151500905 out 101181541 us ] [auth: OK "/C=DE/O=GermanGrid/OU=DESY/CN=David Brunner" "" ] [ref: "" "dasgoclient/v02.01.01" ]
INFO:cherrypy.access:[25/Feb/2019:11:47:12] vocms0136.cern.ch 131.169.163.128 "GET /dbs/prod/global/DBSReader/files/?dataset=%2FNeutrino_E-10_gun%2FRunIISummer17PrePremix-MCv2_correctPU_94X_mc2017_realistic_v9-v1%2FGEN-SIM-DIGI-RAW&detail=True&validFileOnly=1 HTTP/1.1" 200 OK [data
: - in 151500905 out 98030344 us ] [auth: OK "/C=DE/O=GermanGrid/OU=DESY/CN=David Brunner" "" ] [ref: "" "dasgoclient/v02.01.01" ]
INFO:cherrypy.access:[25/Feb/2019:11:47:16] vocms0136.cern.ch 131.169.77.17 "GET /dbs/prod/global/DBSReader/files/?dataset=%2FNeutrino_E-10_gun%2FRunIISummer17PrePremix-MCv2_correctPU_94X_mc2017_realistic_v9-v1%2FGEN-SIM-DIGI-RAW&detail=True&validFileOnly=1 HTTP/1.1" 200 OK [data:
- in 151500905 out 103407271 us ] [auth: OK "/C=DE/O=GermanGrid/OU=DESY/CN=David Brunner" "" ] [ref: "" "dasgoclient/v02.01.01" ]
INFO:cherrypy.access:[25/Feb/2019:11:47:18] vocms0136.cern.ch 131.169.77.226 "GET /dbs/prod/global/DBSReader/files/?dataset=%2FNeutrino_E-10_gun%2FRunIISummer17PrePremix-MCv2_correctPU_94X_mc2017_realistic_v9-v1%2FGEN-SIM-DIGI-RAW&detail=True&validFileOnly=1 HTTP/1.1" 200 OK [data:
- in 151500905 out 101622105 us ] [auth: OK "/C=DE/O=GermanGrid/OU=DESY/CN=David Brunner" "" ] [ref: "" "dasgoclient/v02.01.01" ]
ERROR:cherrypy.error:Mon Feb 25 10:47:21 2019 dbsException-invalid-input2: '/mc-physval-120-DrellYan-LowLumi/FEVT/CMSSW_1_2_0-FEVT-1166809258' does not match regular expression ^/([a-zA-Z0-9]+[a-zA-Z0-9\-\._]*)/([a-zA-Z]+[a-zA-Z0-9_\-\.]*)/([A-Z_\-\.]+)$ key-value pair (dataset, /m
c-physval-120-DrellYan-LowLumi/FEVT/CMSSW_1_2_0-FEVT-1166809258) cannot pass input checking
INFO:cherrypy.access:[25/Feb/2019:11:47:21] vocms0136.cern.ch 2001:1458:d00:14::8b "POST /dbs/prod/global/DBSReader/datasetlist HTTP/1.1" 400 Bad Request [data: - in 114 out 25655 us ] [auth: OK "/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=lijimene/CN=817301/CN=Lina Marcela Jimenez
Becerra" "" ] [ref: "" "dasgoserver" ]
INFO:cherrypy.access:[25/Feb/2019:11:47:29] vocms0136.cern.ch 131.169.84.145 "GET /dbs/prod/global/DBSReader/files/?dataset=%2FNeutrino_E-10_gun%2FRunIISummer17PrePremix-MCv2_correctPU_94X_mc2017_realistic_v9-v1%2FGEN-SIM-DIGI-RAW&detail=True&validFileOnly=1 HTTP/1.1" 200 OK [data:
- in 151500905 out 106075925 us ] [auth: OK "/C=DE/O=GermanGrid/OU=DESY/CN=David Brunner" "" ] [ref: "" "dasgoclient/v02.01.01" ]
INFO:cherrypy.access:[25/Feb/2019:11:47:32] vocms0136.cern.ch 131.169.77.6 "GET /dbs/prod/global/DBSReader/files/?dataset=%2FNeutrino_E-10_gun%2FRunIISummer17PrePremix-MCv2_correctPU_94X_mc2017_realistic_v9-v1%2FGEN-SIM-DIGI-RAW&detail=True&validFileOnly=1 HTTP/1.1" 200 OK [data: -
in 151500905 out 108741553 us ] [auth: OK "/C=DE/O=GermanGrid/OU=DESY/CN=David Brunner" "" ] [ref: "" "dasgoclient/v02.01.01" ]
INFO:cherrypy.access:[25/Feb/2019:11:47:33] vocms0136.cern.ch 131.169.77.6 "GET /dbs/prod/global/DBSReader/files/?dataset=%2FNeutrino_E-10_gun%2FRunIISummer17PrePremix-MCv2_correctPU_94X_mc2017_realistic_v9-v1%2FGEN-SIM-DIGI-RAW&detail=True&validFileOnly=1 HTTP/1.1" 200 OK [data: -
in 151500905 out 107828326 us ] [auth: OK "/C=DE/O=GermanGrid/OU=DESY/CN=David Brunner" "" ] [ref: "" "dasgoclient/v02.01.01" ]
INFO:cherrypy.access:[25/Feb/2019:11:47:36] vocms0136.cern.ch 131.169.77.215 "GET /dbs/prod/global/DBSReader/files/?dataset=%2FNeutrino_E-10_gun%2FRunIISummer17PrePremix-MCv2_correctPU_94X_mc2017_realistic_v9-v1%2FGEN-SIM-DIGI-RAW&detail=True&validFileOnly=1 HTTP/1.1" 200 OK [data:
- in 151500905 out 109157985 us ] [auth: OK "/C=DE/O=GermanGrid/OU=DESY/CN=David Brunner" "" ] [ref: "" "dasgoclient/v02.01.01" ]
INFO:cherrypy.access:[25/Feb/2019:11:47:38] vocms0136.cern.ch 131.169.77.76 "GET /dbs/prod/global/DBSReader/files/?dataset=%2FNeutrino_E-10_gun%2FRunIISummer17PrePremix-MCv2_correctPU_94X_mc2017_realistic_v9-v1%2FGEN-SIM-DIGI-RAW&detail=True&validFileOnly=1 HTTP/1.1" 200 OK [data:
- in 151500905 out 108701256 us ] [auth: OK "/C=DE/O=GermanGrid/OU=DESY/CN=David Brunner" "" ] [ref: "" "dasgoclient/v02.01.01" ]
INFO:cherrypy.access:[25/Feb/2019:11:47:46] vocms0136.cern.ch 131.169.84.254 "GET /dbs/prod/global/DBSReader/files/?dataset=%2FNeutrino_E-10_gun%2FRunIISummer17PrePremix-MCv2_correctPU_94X_mc2017_realistic_v9-v1%2FGEN-SIM-DIGI-RAW&detail=True&validFileOnly=1 HTTP/1.1" 200 OK [data:
- in 151500905 out 107245343 us ] [auth: OK "/C=DE/O=GermanGrid/OU=DESY/CN=David Brunner" "" ] [ref: "" "dasgoclient/v02.01.01" ]
INFO:cherrypy.access:[25/Feb/2019:11:47:47] vocms0136.cern.ch 2001:1458:d00:14::a6 "GET /dbs/prod/global/DBSReader/releaseversions/?dataset=%2FJetHT%2FRun2016B-07Aug17_ver2-v2%2FMINIAOD HTTP/1.1" 200 OK [data: - in 4 out 32045 us ] [auth: OK "/DC=ch/DC=cern/OU=Organic Units/OU=User
s/CN=lijimene/CN=817301/CN=Lina Marcela Jimenez Becerra" "" ] [ref: "" "dasgoserver" ]
INFO:cherrypy.access:[25/Feb/2019:11:47:48] vocms0136.cern.ch 131.169.163.112 "GET /dbs/prod/global/DBSReader/files/?dataset=%2FNeutrino_E-10_gun%2FRunIISummer17PrePremix-MCv2_correctPU_94X_mc2017_realistic_v9-v1%2FGEN-SIM-DIGI-RAW&detail=True&validFileOnly=1 HTTP/1.1" 200 OK [data
: - in 151500905 out 109296392 us ] [auth: OK "/C=DE/O=GermanGrid/OU=DESY/CN=David Brunner" "" ] [ref: "" "dasgoclient/v02.01.01" ]
INFO:cherrypy.access:[25/Feb/2019:11:47:49] vocms0136.cern.ch 2001:1458:d00:14::a6 "GET /dbs/prod/global/DBSReader/files/?dataset=%2FHighMultiplicity%2FRun2015D-v1%2FRECO&detail=True&validFileOnly=1 HTTP/1.1" 200 OK [data: - in 2 out 73047 us ] [auth: OK "/DC=ch/DC=cern/OU=Organic
Units/OU=Users/CN=lijimene/CN=817301/CN=Lina Marcela Jimenez Becerra" "" ] [ref: "" "dasgoserver" ]
INFO:cherrypy.access:[25/Feb/2019:11:47:57] vocms0136.cern.ch 2001:1458:d00:14::a6 "GET /dbs/prod/global/DBSReader/releaseversions/?dataset=%2FJetHT%2FRun2016B-07Aug17_ver2-v2%2FMINIAOD HTTP/1.1" 200 OK [data: - in 4 out 67963 us ] [auth: OK "/DC=ch/DC=cern/OU=Organic Units/OU=User
s/CN=lijimene/CN=817301/CN=Lina Marcela Jimenez Becerra" "" ] [ref: "" "dasgoserver" ]
INFO:cherrypy.access:[25/Feb/2019:11:48:00] vocms0136.cern.ch 2001:1458:d00:14::8b "GET /dbs/prod/global/DBSReader/datasets/?dataset=%2FMinBias_Hydjet_Drum5F_2018_5p02TeV%2FHINPbPbAutumn18GS-103X_upgrade2018_realistic_HI_v11-v1%2FGEN-SIM&detail=True HTTP/1.1" 200 OK [data: - in 669
out 63482 us ] [auth: OK "/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=lijimene/CN=817301/CN=Lina Marcela Jimenez Becerra" "" ] [ref: "" "dasgoserver" ]
INFO:cherrypy.access:[25/Feb/2019:11:48:03] vocms0136.cern.ch 2001:1458:d00:14::a6 "GET /dbs/prod/global/DBSReader/datasets/?dataset=%2FVBFHToTauTau%2A%2F%2A18%2A%2FMINIAODSIM&dataset_access_type=VALID&detail=True HTTP/1.1" 200 OK [data: - in 7183 out 272482 us ] [auth: OK "/DC=ch/
DC=cern/OU=Organic Units/OU=Users/CN=lijimene/CN=817301/CN=Lina Marcela Jimenez Becerra" "" ] [ref: "" "dasgoserver" ]
INFO:cherrypy.access:[25/Feb/2019:11:48:40] vocms0136.cern.ch 131.169.77.129 "GET /dbs/prod/global/DBSReader/files/?dataset=%2FNeutrino_E-10_gun%2FRunIISummer17PrePremix-MCv2_correctPU_94X_mc2017_realistic_v9-v1%2FGEN-SIM-DIGI-RAW&detail=True&validFileOnly=1 HTTP/1.1" 200 OK [data:
- in 151500905 out 114096627 us ] [auth: OK "/C=DE/O=GermanGrid/OU=DESY/CN=David Brunner" "" ] [ref: "" "dasgoclient/v02.01.01" ]
ERROR:root:Missing componentDir and General section in config
ERROR:root:Going to trust you to know what you're doing.
CherryPy Checker:
The Application mounted at u'/dbs/prod/global/DBSReader' has an empty config.
@yuyiguo Yuyi,
now username 'yuyi' should have accesses to vocms0161. Let me know in case that you need root privileges.
Best Regards, Lina.
One CRAB user reported a problem [1] with crabcache service (which does not talk to DBS at all) at same time as this was ongoing, I could reproduce it (repeatedly) later on both on cmsweb and cmsweb-testbed. [2] The problem present itself as an odd error inside cherrypy leading to HTTP 500. The problem is gone today that also DBS work. I did not investigate in great details, only thing I can say is that simple queries to crabcache worked, while the failing ones all required a largish (tens of MB's) file to be uploaded to the cache, note that we will never upload anything larger than 100MB. I can not say it this is accidental correlation, if there was a common cause, or if simply DBS problem was creating so many long-running cmsweb FE/BE connections to make FE->crabcache slow down to the point of triggering a poorly handled error. I post this information in case it may help to hunt for the problem.
[1] https://hypernews.cern.ch/HyperNews/CMS/get/computing-tools/4624.html https://hypernews.cern.ch/HyperNews/CMS/get/computing-tools/4629/2/1.html
[2] from log on vocms055:
belforte@vocms055/~> grep -A13 belfor /build/srv-logs/vocms0741/crabcache/crabcache-20190224.log
[24/Feb/2019:22:03:48] vocms0741.cern.ch 131.225.189.232 "PUT /crabcache/file HTTP/1.1" 200 OK [data: 41905410 in 115 out 922477150 us ] [auth: OK "/DC=org/DC=terena/DC=tcs/C=IT/O=Istituto Nazionale di Fisica Nucleare/CN=Stefano Belforte belforte@infn.it" "" ] [ref: "" "PycURL/7.43.0.1 libcurl/7.59.0 OpenSSL/1.0.2d zlib/1.2.8" ]
[24/Feb/2019:22:03:48] Traceback (most recent call last):
File "/data/srv/beHG1902e/sw/slc7_amd64_gcc630/external/cherrypy/5.4.0-comp2/lib/python2.7/site-packages/cherrypy/_cpwsgi.py", line 180, in trap
return func(*args, **kwargs)
File "/data/srv/beHG1902e/sw/slc7_amd64_gcc630/external/cherrypy/5.4.0-comp2/lib/python2.7/site-packages/cherrypy/_cpwsgi.py", line 99, in __call__
return self.nextapp(environ, start_response)
File "/data/srv/beHG1902e/sw/slc7_amd64_gcc630/external/cherrypy/5.4.0-comp2/lib/python2.7/site-packages/cherrypy/_cpwsgi.py", line 411, in tail
return self.response_class(environ, start_response, self.cpapp)
File "/data/srv/beHG1902e/sw/slc7_amd64_gcc630/external/cherrypy/5.4.0-comp2/lib/python2.7/site-packages/cherrypy/_cpwsgi.py", line 266, in __init__
self.close()
File "/data/srv/beHG1902e/sw/slc7_amd64_gcc630/external/cherrypy/5.4.0-comp2/lib/python2.7/site-packages/cherrypy/_cpwsgi.py", line 288, in close
if streaming and is_closable_iterator(self.iter_response):
AttributeError: 'AppResponse' object has no attribute 'iter_response'
the client log for the failed command shows that it was executing e.g. (client ran on a different time zone, and I have multiple such messages, ignore the time stamp)
DEBUG 2019-02-24 14:48:20,126: Uploading archive /uscms_data/d2/belforte/crab_20190224_144747/inputs/3710c06d-8d32-4e33-a3b3-03d57d868681default.tgz (39 MB) to the CRAB cache. Using URI https://cmsweb.cern.ch/crabcache
INFO 2019-02-24 15:03:48,795: The server answered with an error.
DEBUG 2019-02-24 15:03:48,795: ERROR: Internal Server Error (500): 'Unrecoverable error in the server.
Traceback (most recent call last):
File "/data/srv/beHG1902e/sw/slc7_amd64_gcc630/external/cherrypy/5.4.0-comp2/lib/python2.7/site-packages/cherrypy/_cpwsgi.py", line 180, in trap
return func(*args, **kwargs)
File "/data/srv/beHG1902e/sw/slc7_amd64_gcc630/external/cherrypy/5.4.0-comp2/lib/python2.7/site-packages/cherrypy/_cpwsgi.py", line 99, in __call__
return self.nextapp(environ, start_response)
File "/data/srv/beHG1902e/sw/slc7_amd64_gcc630/external/cherrypy/5.4.0-comp2/lib/python2.7/site-packages/cherrypy/_cpwsgi.py", line 411, in tail
return self.response_class(environ, start_response, self.cpapp)
File "/data/srv/beHG1902e/sw/slc7_amd64_gcc630/external/cherrypy/5.4.0-comp2/lib/python2.7/site-packages/cherrypy/_cpwsgi.py", line 266, in __init__
self.close()
File "/data/srv/beHG1902e/sw/slc7_amd64_gcc630/external/cherrypy/5.4.0-comp2/lib/python2.7/site-packages/cherrypy/_cpwsgi.py", line 288, in close
if streaming and is_closable_iterator(self.iter_response):
AttributeError: 'AppResponse' object has no attribute 'iter_response'
DEBUG 2019-02-24 15:03:48,796: Command failed with URI: https://cmsweb.cern.ch:8443/crabcache/file
DEBUG 2019-02-24 15:03:48,796: Input data: [('inputfile', (10, '/uscms_data/d2/belforte/crab_20190224_144747/inputs/3710c06d-8d32-4e33-a3b3-03d57d868681default.tgz')), ('hashkey', '67a942612289077c72134faf4b88f5ca03fcc1fbc8f6b5b39c6eff423d939e64'), ('newchecksum', '2')]
DEBUG 2019-02-24 15:03:48,796: Request headers: {'Content-Length': '1113', 'Set-Cookie': 'cms-auth=a11ac7c1bc76ff625d4c1219254a94a3a26c81b21fc0133faa67785f76ea2985dbbf0e83188563db;path=/;secure;httponly;expires=Thu, 01-Jan-1970 00:00:01 GMT', 'Server': '0.0.0.0', 'Connection': 'close', 'Date': 'Sun, 24 Feb 2019 21:03:48 GMT', 'Content-Type': 'text/plain', 'CMS-Server-Time': 'D=922820160 t=1551041305918821'}
@h4d4 do I understand correctly that in the end the problem was "solved" by simply restarting all of DBS BackEnds sort of "yet another time" even if you did that already during the weekend ?
@belforte Stefano, yes it was exactly what happened.
Thanks Lina.
I have no idea if it will help, or making things work, but do you and Valentin want to follow up on the idea to enable others to issue that restart, in case it is all that's needed ? Tommaso asked whether CRC's could/should have the needed rights, but when you restarted and it did not work, I did not pursue this. In any case, even if "us past CRC's like Tom and myself" had the rights, I am not sure what the right way to do is nor would tamper with CMSWEB unless instructed to do so.
Yes, Lina I can access 0161 now.
@h4d4 @belforte Lina, Stefano, I am a bit confused about what you two discussed. Lina showed that before she restarted DBS servers, DBS reads already handled requests. Stefano found Crab could not connect to DBS at all and crab started to connect to DBS only after the second DBS restart this morning. What I missed here?
I am similarly confused. And I still encourage not to consider this a CRAB problem. I repeatedly tried a simple dasgoclient command and it was hanging. At least this was the case Sunday. I did not check between the two restarts (one server only first time, all of them second time, IIUC) this morning since it was night for me. I used dasgoclient since I hope it is easier for everybody else to do the same and verify if DBS is answering or not on CMSWEB w/o asking CRAB people, I can surely use direct DBS Client as well...
@yuyiguo @belforte
What I meant to say, is that yes, the Issue reported by users about failed connections to DBS was solved after today restarted. And the point is that the service was handled requests.
Stefano, if I recall correctly the CMS operator on duty can restart cmsweb data-services. Did something changed? I think what's is missing is proper monitoring and alerting that particular data-service is not responding. The way we monitor cmsweb services is via ServerMonitor daemon which runs on every node and perform an action based on service monitoring script. For DBS the service monitoring script is based on context of this file: https://github.com/dmwm/deployment/blob/master/dbs/monitoring.ini and what it only pings DBS server at this URL http://localhost:8250/dbs/
My bet that many issues with DBS service may not be caught via this ping command, e.g. the DBS HTTP server may be fine, but underneath it can struggle with particular requests. I think the issue here is to define proper command against DBS server to check the health of the system. I think it is better to ping particular DBS query, e.g. http://localhost:8250/dbs/prod/global/DBSReader/datasets?dataset=/a/b/c (with valid dataset name) which will cause DBS server to perform the query. If such command failed or stalled the ServerMonitor daemon will notice and issue proper alert.
Best, Valentin.
On 0, Stefano Belforte notifications@github.com wrote:
Thanks Lina.
I have no idea if it will help, or making things work, but do you and Valentin want to follow up on the idea to enable others to issue that restart, in case it is all that's needed ? Tommaso asked whether CRC's could/should have the needed rights, but when you restarted and it did not work, I did not pursue this. In any case, even if "us past CRC's like Tom and myself" had the rights, I am not sure what the right way to do is nor would tamper with CMSWEB unless instructed to do so.
-- You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub: https://github.com/dmwm/DBS/issues/595#issuecomment-467144509
@yuyiguo Another comment, sorry if this confused more, but is just to know. An user reported[0] this type of error
And looking into the logs this is treated as below, that is an client side error HTTP 400:
ERROR:cherrypy.error:Mon Feb 25 12:12:24 2019 dbsException-invalid-input2: '/MET' does not match regular expression ^/([a-zA-Z0-9]+[a-zA-Z0-9\-\._]*)/([a-zA-Z]+[a-zA-Z0-9_\-\.]*)/([A-Z_\-\.]+)$ key-value pair (dataset, /MET) cannot pass input checking
Best Regards, Lina. [0] https://cern.service-now.com/nav_to.do?uri=%2Fincident.do%3Fsys_id%3D22eb0571dbfbaf40ae659ec4db96196a%26sysparm_stack%3Dincident_list.do%3Fsysparm_query%3Dactive%3Dtrue [1]
match regular expression ^/([a-zA-Z0-9]+[a-zA-Z0-9\-\._]*)/([a-zA-Z]+[a-zA-Z0-9_\-\.]*)/([A-Z_\-\.]+)$ key-value pair (dataset, /GluGluHToZZTo4L_M125_14TeV_powheg2_JHUgenV702_pythia8/PhaseIIFall16DR82-NoPU_90X_upgrade2023_realistic_v1-v1/GEN-SIM-RECO#80674502-e1b6-11e6-933c-002590494c06) cannot pass input checking
./DBSGlobalReader-20190225.log:8953:ERROR:cherrypy.error:Mon Feb 25 11:38:23 2019 dbsException-invalid-input2: '/Tau/*/*/*' does not match regular expression ^/(\*|[a-zA-Z\*][a-zA-Z0-9_\*\-]{0,100})(/(\*|[a-zA-Z0-9_\.\-\*]{1,199})){0,1}(/(\*|[A-Z\-\*]{1,50})){0,1}$ key-value pair (dataset, /Tau/*/*/*) cannot pass input checking
./DBSGlobalReader-20190225.log:9367:ERROR:cherrypy.error:Mon Feb 25 11:57:54 2019 dbsException-invalid-input2: '/MET' does not match regular expression ^/([a-zA-Z0-9]+[a-zA-Z0-9\-\._]*)/([a-zA-Z]+[a-zA-Z0-9_\-\.]*)/([A-Z_\-\.]+)$ key-value pair (dataset, /MET) cannot pass input checking
./DBSGlobalReader-20190225.log:11856:ERROR:cherrypy.error:Mon Feb 25 12:12:24 2019 dbsException-invalid-input2: '/MET' does not match regular expression ^/([a-zA-Z0-9]+[a-zA-Z0-9\-\._]*)/([a-zA-Z]+[a-zA-Z0-9_\-\.]*)/([A-Z_\-\.]+)$ key-value pair (dataset, /MET) cannot pass input checking
@h4d4 this is fine. The user did a mistake and the server is telling him/her so. i.e. working as designed.
I clsoed that SNOW ticket. CMS users should report problem to our HN, not make noise with multiple unappropriate report channels. Actually they should read our documentation, so they will know how to use services and how to report problems. Lina, please do not waste your time on those things.
@vkuznet about restart by operator I suspect that you are confusing CERN IT operators with CMS people. Indeed CERN IT on duty operators receives Lemon alarms (like swap full, or the process not running that you indicate) and in cases like DBS where CMS provided proper information, can issue some simple commands to restart things. As you say, when processes are running, yet not doing what we like, it is for CMS people to make a call on what to do. Currently only Lina acts in that situation, which is putting her on a sort of 24x7 on-call duty which is really bad for her, and insufficient for us. The question was whether people like me, Tommaso, Dyma, Stephan,.... can help. We have O(10) experienced people in this CMS Computing Run Coordinator list which maybe can also restart the service (but I am not sure if we have the proper authorizations nor the instructions). Most of those people have no understanding of how things work (think of me e.g.) but can be somehow trusted not to do things foolishly and at least to document and report what they do, if they ever do. SImply a way to help cover during off hours.
Hope I made myself more clear, sorry for hijacking the Issue.
I think you partially answered your question. If we grant CERN IT operators permissions to restart our service I don't see any reason why we can't grant such rights to CMS experts. The only question is a policy which should define
I think it should be management decision and not cmsweb one. The rights comes with responsibilities and if decided to issue those rights at least L1/L2's should agree on that.
Regarding the instructions, we do have common command for ALL cmsweb services, i.e. our procedure is standardized. The only command anyone who deal with cmsweb needs to know is how to use manage scripts. I think we provide this instructions to CERN IT and the same instructions can be used by CMS experts. Lina knows where they are.
Best, Valentin.
On 0, Stefano Belforte notifications@github.com wrote:
@vkuznet about restart by operator I suspect that you are confusing CERN IT operators with CMS people. Indeed CERN IT on duty operators receives Lemon alarms (like swap full, or the process not running that you indicate) and in cases like DBS where CMS provided proper information, can issue some simple commands to restart things. As you say, when processes are running, yet not doing what we like, it is for CMS people to make a call on what to do. Currently only Lina acts in that situation, which is putting her on a sort of 24x7 on-call duty which is really bad for her, and insufficient for us. The question was whether people like me, Tommaso, Dyma, Stephan,.... can help. We have O(10) experienced people in this CMS Computing Run Coordinator list which maybe can also restart the service (but I am not sure if we have the proper authorizations nor the instructions). Most of those people have no understanding of how things work (think of me e.g.) but can be somehow trusted not to do things foolishly and at least to document and report what they do, if they ever do. SImply a way to help cover during off hours.
Hope I made myself more clear, sorry for hijacking the Issue.
-- You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub: https://github.com/dmwm/DBS/issues/595#issuecomment-467178246
@h4d4 about "was it running or was it not running when restarted" seems to me that the sequence of events is:
so the question is whether 4. solved and you did 5. because someone was still, but wrongly, complaining. Or whether something else happened between 4. and 5. which made DBS (and crabcache !) work again. Or whether it simply needed some time. Or whether me or Yuyi did not understood correctly. Or whatever else.. but 6. points to "5. was not needed".
I was sleeping between 4. and 5. ! But can add these data points now from production CRAB TaskWorker logs : (note this is the CRAB thing which runs outside CMSWEB and queries it)
Indeed the picture which I see now is that DBS was answering to just a tiny fraction of queries, until it started to work again. I do not know if it is important that I spend time trying to get error/success fraction from log sort of for every hour, but the following which lists the first 50 successful calls is quite telling:
belforte@crab-prod-tw01/logs> grep DBSDataDiscovery,267:Got twlog.txt|head -50
2019-02-25 05:42:59,300:DEBUG:DBSDataDiscovery,267:Got 92 files
2019-02-25 09:20:31,102:DEBUG:DBSDataDiscovery,267:Got 300 files
2019-02-25 10:58:45,801:DEBUG:DBSDataDiscovery,267:Got 10 files
2019-02-25 12:33:31,317:DEBUG:DBSDataDiscovery,267:Got 6 files
2019-02-25 12:38:30,797:DEBUG:DBSDataDiscovery,267:Got 6 files
2019-02-25 12:44:04,612:DEBUG:DBSDataDiscovery,267:Got 6 files
2019-02-25 12:46:00,414:DEBUG:DBSDataDiscovery,267:Got 6 files
2019-02-25 12:48:56,034:DEBUG:DBSDataDiscovery,267:Got 6 files
2019-02-25 12:53:20,633:DEBUG:DBSDataDiscovery,267:Got 1023 files
2019-02-25 13:07:30,010:DEBUG:DBSDataDiscovery,267:Got 4 files
2019-02-25 13:07:30,465:DEBUG:DBSDataDiscovery,267:Got 3 files
2019-02-25 13:07:30,868:DEBUG:DBSDataDiscovery,267:Got 1 files
2019-02-25 13:07:48,899:DEBUG:DBSDataDiscovery,267:Got 5 files
2019-02-25 13:07:53,694:DEBUG:DBSDataDiscovery,267:Got 46 files
2019-02-25 13:07:55,980:DEBUG:DBSDataDiscovery,267:Got 46 files
2019-02-25 13:07:56,264:DEBUG:DBSDataDiscovery,267:Got 46 files
2019-02-25 13:08:00,308:DEBUG:DBSDataDiscovery,267:Got 46 files
2019-02-25 13:08:10,032:DEBUG:DBSDataDiscovery,267:Got 46 files
2019-02-25 13:08:13,149:DEBUG:DBSDataDiscovery,267:Got 46 files
2019-02-25 13:08:14,054:DEBUG:DBSDataDiscovery,267:Got 46 files
2019-02-25 13:08:16,519:DEBUG:DBSDataDiscovery,267:Got 3 files
2019-02-25 13:08:16,519:DEBUG:DBSDataDiscovery,267:Got 3 files
2019-02-25 13:08:20,640:DEBUG:DBSDataDiscovery,267:Got 4 files
2019-02-25 13:08:21,999:DEBUG:DBSDataDiscovery,267:Got 46 files
2019-02-25 13:08:25,921:DEBUG:DBSDataDiscovery,267:Got 46 files
2019-02-25 13:08:51,976:DEBUG:DBSDataDiscovery,267:Got 2522 files
2019-02-25 13:08:52,081:DEBUG:DBSDataDiscovery,267:Got 46 files
2019-02-25 13:08:52,130:DEBUG:DBSDataDiscovery,267:Got 46 files
2019-02-25 13:08:52,156:DEBUG:DBSDataDiscovery,267:Got 46 files
2019-02-25 13:08:55,694:DEBUG:DBSDataDiscovery,267:Got 41 files
2019-02-25 13:09:01,663:DEBUG:DBSDataDiscovery,267:Got 46 files
2019-02-25 13:09:01,663:DEBUG:DBSDataDiscovery,267:Got 46 files
2019-02-25 13:09:05,099:DEBUG:DBSDataDiscovery,267:Got 46 files
2019-02-25 13:09:06,112:DEBUG:DBSDataDiscovery,267:Got 46 files
2019-02-25 13:09:06,908:DEBUG:DBSDataDiscovery,267:Got 46 files
2019-02-25 13:09:09,069:DEBUG:DBSDataDiscovery,267:Got 2530 files
2019-02-25 13:09:10,067:DEBUG:DBSDataDiscovery,267:Got 46 files
2019-02-25 13:09:18,327:DEBUG:DBSDataDiscovery,267:Got 46 files
2019-02-25 13:09:22,252:DEBUG:DBSDataDiscovery,267:Got 46 files
2019-02-25 13:09:22,462:DEBUG:DBSDataDiscovery,267:Got 2527 files
2019-02-25 13:09:44,935:DEBUG:DBSDataDiscovery,267:Got 46 files
2019-02-25 13:09:47,016:DEBUG:DBSDataDiscovery,267:Got 4 files
2019-02-25 13:09:47,041:DEBUG:DBSDataDiscovery,267:Got 6 files
2019-02-25 13:10:28,604:DEBUG:DBSDataDiscovery,267:Got 7 files
2019-02-25 13:10:29,091:DEBUG:DBSDataDiscovery,267:Got 46 files
2019-02-25 13:10:29,147:DEBUG:DBSDataDiscovery,267:Got 46 files
2019-02-25 13:10:29,174:DEBUG:DBSDataDiscovery,267:Got 46 files
2019-02-25 13:10:34,634:DEBUG:DBSDataDiscovery,267:Got 71 files
2019-02-25 13:10:34,634:DEBUG:DBSDataDiscovery,267:Got 71 files
2019-02-25 13:10:40,221:DEBUG:DBSDataDiscovery,267:Got 46 files
belforte@crab-prod-tw01/logs>
Valentin suggested to test against the localhost and I was thinking on the same line. Some test against the local host directly using below query: [0] http://localhost:8250/dbs/prod/global/DBSReader/datatierss?data_tier=RAW We can use the same one against the front end too. [1] https://cmsweb.cern.ch/dbs/prod/global/DBSReader/datatierss?data_tier=RAW
The order of the test should be [1], then [0] on each nodes.
Yuyi, the test against cmsweb will not work because of authentication, client ( in this case ServeMonitor daemon) needs support for x509 certificates , I doubt it can do that. But you can use dasgoclient to do the work, but in this case you need infrastructure to support x509 proxy and write your own monitoring daemon.
On Feb 25, 2019, at 17:15, Yuyi Guo notifications@github.com wrote:
Valentin suggested to test against the localhost and I was thinking on the same line. Some test against the local host directly using below query: [0] http://localhost:8250/dbs/prod/global/DBSReader/datatierss?data_tier=RAW We can use the same one against the front end too. [1] https://cmsweb.cern.ch/dbs/prod/global/DBSReader/datatierss?data_tier=RAW
The order of the test should be [1], then [0] on each nodes.
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or mute the thread.
Sorry, the correct urls are: [1] https://cmsweb.cern.ch/dbs/prod/global/DBSReader/datatiers?data_tier_name=RAW [0]http://localhost:8252/dbs/prod/global/DBSReader/datatiers?data_tier_name=RAW
I tried below on my VM: sudo -u root curl http://localhost:8252/dbs/dev/global/DBSReader/datatiers?data_tier_name=RAW {"exception": 403, "message": "You are not allowed to access this resource.", "type": "HTTPError"} I replaced root with _dbs, and others, I got the same error. What is needed to run the curl?
I also tried use my cert that works with the front end, but not with the directly access to backend.
Yuyi, you don't need sudo for curl call. The later makes HTTP call to remote server, while former execute local UNIX process, so it is irrelevant in this case.
And, your issue comes from the fact that all of DBS APIs are secured under WMCore REST server, see https://github.com/dmwm/DBS/blob/master/Server/Python/src/dbs/web/DBSReaderModel.py#L96
The authzfunc in your case is authInsert, see https://github.com/dmwm/DBS/blob/master/Server/Python/src/dbs/web/DBSReaderModel.py#L54 which checks roles of incoming HTTP requests. The roles are setup by frontend server based on user certificate. Therefore in order to use curl call to localhost you must supply in your curl request the roles or create DBS un-secure method.
I don't recall from my head how roles are set in HTTP requests, but you can find out them either by inspecting frontend cmsauth module or use your DBS code and insert printouts and make curl call to https endpoint to see them.
Thanks @vkuznet for the comments. I realized myself too this morning that the denying access is due to the role, not OS permission. You confirmed it. So looking into the code, I found a back door for the role [1]. In order to fake it, I need a new version of curl. So I did installation, then setup proper path and library. After all this, I was able to directly access the backend server while I am on the server node.
sudo -u root yum --enablerepo=cern install curl-openssl-7.57.0-1.1.el7.cern
export PATH=/opt/shibboleth/bin:$PATH
export LD_LIBRARY_PATH=/opt/shibboleth/lib64
curl -H "cms-auth-status":"NONE" http://localhost:8252/dbs/dev/global/DBSReader/datatiers?data_tier_name=RAW
[{"create_by": "CMSDBS", "creation_date": 1394471870, "data_tier_id": 100, "data_tier_name": "RAW"}]
@h4d4 Could you do install curl-openssl on DBS prod and int nodes? We need to run this kind of test next time if we have situation like yesterday,
Thanks, Yuyi
Yuyi, I'm glad you figure it out, it's good to know the option we may explore for testing WMCore based services. Now, I suggest that you should adopt this query to ServerMonitor. If read ServerMonitor code [1] correctly it provides fake auth headers, including cms-auth-status, when it does its job. Therefore I think you should only need to change your deployment/dbs/monitoring.ini file and change
# The ping test fetches the provided URL and look for the following perl regex
PING_URL="http://localhost:8250/dbs/"
PING_REGEX="Welcome to the DMWM web framework"
to
# The ping test fetches the provided URL and look for the following perl regex
PING_URL="http://localhost:8250/dbs/<DBSINSTANCE>/global/DBSReader/datatiers?data_tier_name=RAW"
PING_REGEX=".*data_tier_id.*data_tier_name.*RAW"
But I think you need 3 different monitoring.ini files, one for dev, one for testebed and one for production. You'll also need to change DBSINSTANCE accordingly.
Since you need three monitoring.ini files you'll need
Doing this way you can use datatiers DBS API to monitor DBS servers on all backends. The ServerMonitor will make this HTTP calls and if pattern will not match it will trigger alarm which goes to lemon.
Having proper version of curl on all backend nodes it helpful for manual debugging regardless of the service, therefore I would suggest to adjust all backend/frontend nodes to use newer version of curl.
The only question I have here is why do you need shibboleth path?
Best, Valentin.
[1] https://github.com/dmwm/deployment/blob/master/admin/ServerMonitor [2] https://github.com/dmwm/deployment/blob/master/frontend/deploy#L48
On 0, Yuyi Guo notifications@github.com wrote:
Thanks @vkuznet for the comments. I realized myself too this morning that the denying access is due to the role, not OS permission. You confirmed it. So looking into the code, I found a back door for the role [1]. In order to fake it, I need a new version of curl. So I did installation, then setup proper path and library. After all this, I was able to directly access the backend server while I am on the server node.
sudo -u root yum --enablerepo=cern install curl-openssl-7.57.0-1.1.el7.cern
export PATH=/opt/shibboleth/bin:$PATH
export LD_LIBRARY_PATH=/opt/shibboleth/lib64
curl -H "cms-auth-status":"NONE" http://localhost:8252/dbs/dev/global/DBSReader/datatiers?data_tier_name=RAW
[{"create_by": "CMSDBS", "creation_date": 1394471870, "data_tier_id": 100, "data_tier_name": "RAW"}]@h4d4 Could you do install curl-openssl on DBS prod and int nodes? We need to run this kind of test next time if we have situation like yesterday,
Thanks, Yuyi
-- You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub: https://github.com/dmwm/DBS/issues/595#issuecomment-467557098
Yuyi, could you please edit original posting in this issue that it will be properly formatted. So far you cut and paste ASCII email I sent, here it is rendered as Markdown, therefore you should properly adjust/replace pound signs and put in triple quotes the commands I listed in email. So far it is really hard to read initial message.
@vkuznet The default yum install put crul-openssl into /opt/shibboleth/bin. I don't know how to edit comments already published in github issues. If I cut past them as new, the time line will be wrong. I
@yuyiguo you can only edit comments which you wrote. Click on the three dots top right of the comment header, a menu wll pop up which includes Edit
Go to top of the page and click on three dots in your original post. Since you're the owner it will give you option "Edit", click on it and edit the post.
On 0, Yuyi Guo notifications@github.com wrote:
@vkuznet The default yum install put crul-openssl into /opt/shibboleth/bin. I don't know how to edit comments already published in github issues. If I cut past them as new, the time line will be wrong. I
-- You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub: https://github.com/dmwm/DBS/issues/595#issuecomment-467600387
@yuyiguo the easier way will likely be to put three single quotes before and after the text that you pasted (in separate lines)
Thanks @belforte @vkuznet updated with 3 single quotes.
@amaltaro , Alan, I haven't heard you on the issue that DBS severed requests on and off over last weekend and even Monday morning. How was WM agents during that time? Did you get a lot of failed jobs? Thanks, Yuyi
The impact on WMAgent is not easily visible, and in general it keeps retrying until it succeeds. The consequences for the Central Production services might be (skipping unified): ) new workflows will failed to get splitted by global workqueue (not a big deal if you have a large backlog). It's a problem if you have super urgent small requests though ) workflows might failed to get acquired by the agents (local workqueue). Delaying the work acquisition process *) workflows might take longer to be announced, since DBS3Upload is probably failing to inject data into DBS.
So in summary, it's bad of course but not something that will cause failures requiring much more effort to sort them out (or recover whatever problems were caused).
Thanks Alan. Do you or others have data to show at what time period over last weekend workflows experienced a lot failing to split, upload to DBS3 or any actions needed to contact to DBS? I am trying to figure out why some of the DBS clients could not connect to DBS after Lina restarted DBS during the weekend and some are OK.
Yes, it should be doable. Global workqueue isn't continuously splitting work because it depends on what workflows are put in the system. But maybe you can have an idea by looking at the global workqueue logs, for instance:
amaltaro@vocms055:/build/srv-logs/vocms0740 $ grep -r 'Proxy Error' workqueue/reqmgrInteractionTask-20190223.log | wc -l
19
amaltaro@vocms055:/build/srv-logs/vocms0740 $ grep -r 'Proxy Error' workqueue/reqmgrInteractionTask-20190224.log | wc -l
68
Otherwise I can also have a look at an agent or two.
On 27/02/2019 09:32, Alan Malta Rodrigues wrote:
*) new workflows will failed to get splitted by global workqueue (not a big deal if you have a large backlog).
Yeah.. which is all of the difference with CRAB where we get a continous stream of new small (by P&R standard) requests.
*) workflows might take longer to be announced, since DBS3Upload is probably failing to inject data into DBS.
maybe not, since writing is handlds by a different server instance.
Stefano
Thanks Alan, I will look at the global workqueue logs.
Hello Yuyi,
regarding this alarm saying that DBS was down in '01/Mar/2019' around '05:26:28' GVA time (time in which ticket was opened).
And that therefore, the service was restarted around '05:31:32'
The 'correlation expression: (vocms0161:8065:1[5]_<_6)', that triggers the alarm says that at that moment only 5 of 6 process of the service where running. A process for each instance of the service.
But what I found into services instances logs, taking as reference this expression
"ERROR:root:Missing componentDir and General section in config",
(that as you know means that the service is being restarted),
is that there are not signal of problems or errors with the queries.
The only that I found weird is there into logs for '****** DBSGlobalReader-20190301.log'[0], there
are not logs entries between '01/Mar/2019:05:13:43' and '01/Mar/2019:05:30:30' just after service reboot.
I'm adding below, logs of each instance, please let me know in case that you found more interpretations for the logs.
I've attached the logs for all instances, corresponding with the reboot.
I've not reported this in the issue https://github.com/dmwm/DBS/issues/595
because, I don't know if this is should be the correct thread.
Best Regards, Lina.
[0] '****** DBSGlobalReader-20190301.log
INFO:cherrypy.access:[01/Mar/2019:05:13:43] vocms0161.cern.ch 148.187.39.58 "GET /dbs/prod/global/DBSReader/filechildren/?logical_file_name=%2Fstore%2Fdata%2FRun2018B%2FJetHT%2FAOD%2F17Sep2018-v1%2F00005%2F4B1D487A-0A8F-734F-A311-3701D447F2AC.root HTTP/1.1" 200 OK [data: - in 2 out 3920 us ] [auth: OK "/C=DE/O=GermanGrid/OU=uni-hamburg/CN=Viktor Gerhard Kutzner" "" ] [ref: "" "dasgoclient/v02.01.01" ]
ERROR:root:Missing componentDir and General section in config
ERROR:root:Going to trust you to know what you're doing.
CherryPy Checker:
The Application mounted at u'/dbs/prod/global/DBSReader' has an empty config.
INFO:cherrypy.access:[01/Mar/2019:05:30:30] vocms0161.cern.ch 2001:1458:d00:1a::7b "GET /dbs/prod/global/DBSReader/filesummaries?validFileOnly=1&dataset=%2FVBFHToEEG_M125_MLL-0To60_Dalitz_13TeV_madgraph_pythia8%2FRunIISummer16DR80Premix-PUMoriond17_80X_mcRun2_asymptotic_2016_TrancheIV_v6-v1%2FAODSIM HTTP/1.1" 200 OK [data: - in 100 out 119906 us ] [auth: OK "/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=vlimant/CN=652956/CN=Jean-Roch Vlimant" "" ] [ref: "" "DBSClient/3.5.1/" ]
From: CERN Service Desk [service-desk@cern.ch]
Sent: Friday, March 01, 2019 5:32 AM
To: cms-service-webtools (Announcements from IT and VOC to Webtools. Discussions am...)
Subject: INC1930225 "[GNI] exception.cmsweb_dbs_is_down in vocmsweb/backend"
***Please do not reply to this email unless you are unsatisfied with the proposed solution***
Dear GNI,
There is a proposed solution for you.
Ticket No: INC1930225, Opened: 01-03-2019 05:26:28
Short description: [GNI] exception.cmsweb_dbs_is_down in vocmsweb/backend
Message from Pascal Marc Sicault:
service restarted
Should you require any further assistance, please inform us within 5 working days through the CERN Service Portal, or reply directly to this email. Otherwise, the system will automatically close the ticket on 08-03-2019, and it will be necessary to create a new ticket.
Kind Regards, CERN Service Desk
The list of the affected CIs follows:
vocms0161
@h4d4 Lina, The trigger of restarting DBS server on vocms0161 was because it failed to responding to the pin? I saw that the log file had no entries for about 15 minutes before the restarting. It would be nice that we could test it with the local curl call to DBS server directly before the restarting next time.
Did you get chance to install the new version of curl on dbs production nodes yet?
Thanks, Yuyi
@yuyiguo Yuyi, I'd say that this absence of logs entries in DBS DBSGlobalReader logs, was what trigered the alarm [0], because in the file that I've attached you, in the email, the other DB's intances has log entries during this time stamp. Then CERN IT's operators have responded to this alarm, restarting the service[1]. They follow procedure described here [2].
I've maybe missed the message to make test with latest curl version, I'll check above entries.
Best Regards, Lina. [0] https://cern.service-now.com/service-portal/view-incident.do?n=INC1930225 [1] https://cern.service-now.com/service-portal/view-incident.do?n=INC1930225 [2] https://cern.service-now.com/service-portal/article.do?n=KB0002246
@yuyiguo Yuyi,
In reference to:
Based in to above entries I understood that I should run:
curl -H "cms-auth-status":"NONE" http://localhost:8252/dbs/prod/global/DBSReader/datatiers?data_tier_name=RAW
Therefore, just to test this sentence for production, I've run it as is shown in [0].
Now my question is, which output is accepted, which one no? and which actions should I take based on the output?
Best Regards, Lina.
[lijimene@aiadm87 ~]$ broadcast -n "gp-prod" -c "curl -H "cms-auth-status":"NONE" http://localhost:8252/dbs/prod/global/DBSReader/datatiers?data_tier_name=RAW" -u "cmsweb"
vocms0136:
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 140 100 140 0 0 3807 0 --:--:-- --:--:-- --:--:-- 3888[{"create_by": "/DC=org/DC=doegrids/OU=People/CN=Yuyi Guo 899208", "creation_date": 1176750788, "data_tier_id": 1, "data_tier_name": "RAW"}]
vocms0161:
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 140 100 140 0 0 15042 0 --:--:-- --:--:-- --:--:-- 17500[{"create_by": "/DC=org/DC=doegrids/OU=People/CN=Yuyi Guo 899208", "creation_date": 1176750788, "data_tier_id": 1, "data_tier_name": "RAW"}]
vocms0163:
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 140 100 140 0 0 5566 0 --:--:-- --:--:-- --:--:-- 5833[{"create_by": "/DC=org/DC=doegrids/OU=People/CN=Yuyi Guo 899208", "creation_date": 1176750788, "data_tier_id": 1, "data_tier_name": "RAW"}]
vocms0165:
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 140 100 140 0 0 15968 0 --:--:-- --:--:-- --:--:-- 17500[{"create_by": "/DC=org/DC=doegrids/OU=People/CN=Yuyi Guo 899208", "creation_date": 1176750788, "data_tier_id": 1, "data_tier_name": "RAW"}]
[lijimene@aiadm87 ~]$
Lina, my recommendation is to check for presence "RAW" in curl output. If it is there it means DBS instance is responding with actual JSON. At this level we don't care how many RAW tiers we have, we just need to get a JSON. But if RAW is not in an output then it is a problem. SO, you need to trigger events/alerts when there is no RAW in an output. Best, Valentin.
On 0, Lina Jiménez notifications@github.com wrote:
@yuyi Yuyi,
In reference to:
- 'curl' installation for DBS's servers nodes, the command is already installed there.
- necessary tests to run when DBS got stuck
Based in to above entries I understood that I should run:
curl -H "cms-auth-status":"NONE" http://localhost:8252/dbs/prod/global/DBSReader/datatiers?data_tier_name=RAW
Therefore, just to test this sentence for production, I've run it as is shown in [0].Now my question is, which output is accepted, which one no? and which actions should I take based on the output?
Best Regards, Lina.
[lijimene@aiadm87 ~]$ broadcast -n "gp-prod" -c "curl -H "cms-auth-status":"NONE" http://localhost:8252/dbs/prod/global/DBSReader/datatiers?data_tier_name=RAW" -u "cmsweb" vocms0136: % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 100 140 100 140 0 0 3807 0 --:--:-- --:--:-- --:--:-- 3888[{"create_by": "/DC=org/DC=doegrids/OU=People/CN=Yuyi Guo 899208", "creation_date": 1176750788, "data_tier_id": 1, "data_tier_name": "RAW"}] vocms0161: % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 100 140 100 140 0 0 15042 0 --:--:-- --:--:-- --:--:-- 17500[{"create_by": "/DC=org/DC=doegrids/OU=People/CN=Yuyi Guo 899208", "creation_date": 1176750788, "data_tier_id": 1, "data_tier_name": "RAW"}] vocms0163: % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 100 140 100 140 0 0 5566 0 --:--:-- --:--:-- --:--:-- 5833[{"create_by": "/DC=org/DC=doegrids/OU=People/CN=Yuyi Guo 899208", "creation_date": 1176750788, "data_tier_id": 1, "data_tier_name": "RAW"}] vocms0165: % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 100 140 100 140 0 0 15968 0 --:--:-- --:--:-- --:--:-- 17500[{"create_by": "/DC=org/DC=doegrids/OU=People/CN=Yuyi Guo 899208", "creation_date": 1176750788, "data_tier_id": 1, "data_tier_name": "RAW"}] [lijimene@aiadm87 ~]$
-- You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub: https://github.com/dmwm/DBS/issues/595#issuecomment-469245253
DBS server had "error: [Errno 24] too many open files" around 17/Feb/2019:20:07:04. This made DBS server cannot accept new connections from the client. Below are the email exchanges among developers and cmsweb admin. We will debug the system based on the suggestion below. However, the limit of file descriptor of 1024 is the default and it is too low for DBS. We will ask cmswen admin to increase it first.