dmwm / CRABServer

15 stars 38 forks source link

Wrong VO group when resubmitting CRAB jobs #7288

Closed mapellidario closed 2 years ago

mapellidario commented 2 years ago

[copying from crabclient issue]

Dear Crab developers,

I have encountered a problem with the resubmitted jobs on crab. For the resubmitted jobs, the write request is rejected because the role doesn’t include dcms group as it was the case for the submitted jobs. For reference, please look at this grafana plot [1] where the drop to an efficiency of 0% for the ASO activity is higly correlated with my resubmission of jobs.

[1] https://monit-grafana.cern.ch/d/CIjJHKdGk/fts-transfers?from=now-7d&orgId=20&to=now&var-bin=1h&var-dst_country=All&var-dst_site=All&var-filters=data.dst_hostname|%3D|cmswebdav-kit.gridka.de&var-fts_server=All&var-group_by=activity&var-include=&var-protocol=All&var-src_country=All&var-src_site=All&var-staging=All&var-vo=cms&viewPanel=2

Best regards, Komal Tauqeer

mapellidario commented 2 years ago

As pointed out by Wa, this topic has been discussed also on "Facilities and Services" mattermost channel, here

Moreover, this has been reported on cms-talk: https://cms-talk.web.cern.ch/t/write-request-rejected-for-the-resubmitted-crab-jobs/11423/3

belforte commented 2 years ago

smells like a bug, I will try to reproduce and investigate

mapellidario commented 2 years ago

By simply looking at the FTS transfers logs I can not get any useful information.

Information about the tests

NB: the FTS logs do not last for long!

belforte commented 2 years ago

Dario, the proxy-info dump above shows a proxy w/o role nor group. Is that the proxy from lxplus which you used to submit ? By the way the original posting by the user was wrongly using the "voms role" term, voms proxies can have Roles and always list all possible groups. What people do when they "use a voms group" is actually to set one of the listed groups as PrimaryGroups. The relevant action is e.g.

voms-proxy-init -voms cms:/cms/itcms

and the way to verify is to look s at the voms FQAN's (I do not remember what that acronym means)

belforte@lxplus7111/~> voms-proxy-info -fqan
/cms/itcms/Role=NULL/Capability=NULL
/cms/ALARM/Role=NULL/Capability=NULL
/cms/TEAM/Role=NULL/Capability=NULL
/cms/integration/Role=NULL/Capability=NULL
/cms/Role=NULL/Capability=NULL

the first line is the "Primary" one and you can see that it lists /cms/itcms group, while my usual proxy whould be

belforte@lxplus7111/~> voms-proxy-info -fqan
/cms/Role=NULL/Capability=NULL
/cms/ALARM/Role=NULL/Capability=NULL
/cms/TEAM/Role=NULL/Capability=NULL
/cms/integration/Role=NULL/Capability=NULL
/cms/itcms/Role=NULL/Capability=NULL
belforte@lxplus7111/~> 

i.e. the first line is /cms/Role=NULL/Capability=NULL like in your output above

belforte commented 2 years ago

I have asked the original user to send details from one task, so at least we have some logs to start from https://github.com/dmwm/CRABClient/issues/5167#issuecomment-1159442255

mapellidario commented 2 years ago

Thanks Stefano for the explanation about how groups and roles in proxies work. I was a bit puzzled indeed by always finding all the groups and did not notice that the order was important!

I submitted a new task [1] with a proxy with /cms/itcms as primary group [2], but i still can not find any difference between regular jobs and resubmitted jobs in the FTS transfer logs.

[1]

[2]

> voms-proxy-info -fqan
/cms/itcms/Role=NULL/Capability=NULL
/cms/TEAM/Role=NULL/Capability=NULL
/cms/integration/Role=NULL/Capability=NULL
/cms/Role=NULL/Capability=NULL
belforte commented 2 years ago

I do not think that FTS logs have information about group. The thing to do is to check the proxy in the scheduler SPOOL_DIR and trust FTS Client to do the right things with it. I can also review the fts_transfer code to make sure that it does not manipulate the proxy's FQANs.

I still would like to check the directory of the jobs where the problem was reported.

belforte commented 2 years ago

adding @KomalTauqeer

I looked at TaskWorker for last task processed for user ktauqeer : 220531_103059:ktauqeer_crab_SingleElectron_2016H_v1

That task used this file as proxy

[crab3@crab-prod-tw01 ktauqeer]$ ls -l /data/certs/creds/4b13923881663e4ad95c500552796b82ea253a2d
-rw-------. 1 crab3 crab3 13116 Jun  6 13:39 /data/certs/creds/4b13923881663e4ad95c500552796b82ea253a2d

which has since expired, but had the correct VOMS group

[crab3@crab-prod-tw01 ktauqeer]$ voms-proxy-info -file /data/certs/creds/4b13923881663e4ad95c500552796b82ea253a2d -all
subject   : /C=DE/O=GermanGrid/OU=KIT/CN=Komal Tauqeer/CN=1701240947/CN=1458080835/CN=2000922402/CN=2168686702
issuer    : /C=DE/O=GermanGrid/OU=KIT/CN=Komal Tauqeer/CN=1701240947/CN=1458080835/CN=2000922402
identity  : /C=DE/O=GermanGrid/OU=KIT/CN=Komal Tauqeer/CN=1701240947/CN=1458080835/CN=2000922402
type      : RFC compliant proxy
strength  : 2048 bits
path      : /data/certs/creds/4b13923881663e4ad95c500552796b82ea253a2d
timeleft  : 0:00:00
key usage : Digital Signature, Key Encipherment, Data Encipherment
=== VO cms extension information ===
VO        : cms
subject   : /C=DE/O=GermanGrid/OU=KIT/CN=Komal Tauqeer
issuer    : /DC=ch/DC=cern/OU=computers/CN=lcg-voms2.cern.ch
attribute : /cms/dcms/Role=NULL/Capability=NULL
attribute : /cms/Role=NULL/Capability=NULL
timeleft  : 0:00:00
uri       : lcg-voms2.cern.ch:15002
[crab3@crab-prod-tw01 ktauqeer]$

that task was executed on condor scheduler vocms0144, and in there the proxy used by condor and the FTS submission script is

/home/grid/ktauqeer/220531_103059:ktauqeer_crab_SingleElectron_2016H_v1/SPOOL_DIR/4b13923881663e4ad95c500552796b82ea253a2d

which gives

belforte@vocms0144/SPOOL_DIR> sudo voms-proxy-info -all -file 4b13923881663e4ad95c500552796b82ea253a2d
subject   : /C=DE/O=GermanGrid/OU=KIT/CN=Komal Tauqeer/CN=1701240947/CN=1458080835/CN=13493643/CN=3737603453/CN=1715894546
issuer    : /C=DE/O=GermanGrid/OU=KIT/CN=Komal Tauqeer/CN=1701240947/CN=1458080835/CN=13493643/CN=3737603453
identity  : /C=DE/O=GermanGrid/OU=KIT/CN=Komal Tauqeer/CN=1701240947/CN=1458080835/CN=13493643/CN=3737603453
type      : RFC compliant proxy
strength  : 2048 bits
path      : 4b13923881663e4ad95c500552796b82ea253a2d
timeleft  : 15:54:31
key usage : Digital Signature, Key Encipherment
=== VO cms extension information ===
VO        : cms
subject   : /C=DE/O=GermanGrid/OU=KIT/CN=Komal Tauqeer
issuer    : /DC=ch/DC=cern/OU=computers/CN=voms2.cern.ch
attribute : /cms/Role=NULL/Capability=NULL
attribute : /cms/dcms/Role=NULL/Capability=NULL
timeleft  : 15:54:31
uri       : voms2.cern.ch:15002

Indeed the VOMS group information was lost. This explains the writing failures.

But I do not know (yet) how/why the proxy on the scheduler may have been reset to lack the group info.

I suspect something in the proxy renewal process and will try to dig details.

belforte commented 2 years ago

Something does not work as expected in proxy renewal. I think the problem was not the resubmit command, but that enough time has passed since initial submission. Of course I may also be misinterpreting logs... as usual this is not my code. I need to start by debugging proxy renewal in a test server and making sure that it does what's expected.

belforte commented 2 years ago

I looked at recent Jenkins vogroup test https://cmsweb-testbed.cern.ch/crabserver/ui/task/220621_074442%3Acmsbot_crab_voGroup the proxy file on the scheduler is correct (the test used itcms, not dcms, but this should not matter)

[crabtw@vocms0155 cluster84016664.proc0.subproc0]$ voms-proxy-info -all -file 17dcc40dd5623dcaa1ea25ecd018aeba2713e105
subject   : /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=cmsbot/CN=545661/CN=Robot: CMS Build Bot/CN=35588243/CN=614560011/CN=549004962/CN=3412173751/CN=1799792336
issuer    : /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=cmsbot/CN=545661/CN=Robot: CMS Build Bot/CN=35588243/CN=614560011/CN=549004962/CN=3412173751
identity  : /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=cmsbot/CN=545661/CN=Robot: CMS Build Bot/CN=35588243/CN=614560011/CN=549004962/CN=3412173751
type      : RFC compliant proxy
strength  : 2048 bits
path      : 17dcc40dd5623dcaa1ea25ecd018aeba2713e105
timeleft  : 147:42:39
key usage : Digital Signature, Key Encipherment
=== VO cms extension information ===
VO        : cms
subject   : /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=cmsbot/CN=545661/CN=Robot: CMS Build Bot
issuer    : /DC=ch/DC=cern/OU=computers/CN=lcg-voms2.cern.ch
attribute : /cms/itcms/Role=NULL/Capability=NULL
attribute : /cms/Role=NULL/Capability=NULL
timeleft  : 147:42:39
uri       : lcg-voms2.cern.ch:15002
[crabtw@vocms0155 cluster84016664.proc0.subproc0]$
belforte commented 2 years ago

I did a crab resubmit --force --jobidd=1 and the job in that task was resubmitted, but the proxy file on the condor scheduler was not modified (kept old date). Which agrees with my understanding of how crab resubmit work. So a problem in proxy renewal is confirmed as the first suspect.

belforte commented 2 years ago

I think I found it.

It looks like "something has changed in HTCondor" which went un-noticed. The relevant code is https://github.com/dmwm/CRABServer/blob/87c538a02bccc3898be16177c84a565f167aa23f/src/python/TaskWorker/Actions/Recurring/RenewRemoteProxies.py#L178-L186 and what I find running this on the above task is that the if does not work

(Pdb) ad['CRAB_UserGroup']
classad.classad.Value.Undefined
(Pdb) ad['CRAB_UserGroup'] != classad.Value.Undefined
undefined != undefined

while things would work using not == instead of !=

(Pdb) ad['CRAB_UserGroup'] == classad.Value.Undefined
True
(Pdb) not (ad['CRAB_UserGroup'] == classad.Value.Undefined)
False
(Pdb) 

or maybe it is due to the move to python3. I think that classAd language has been reviewed in reced HTC versions, and here it looks to me that it is anyhow a matter of how the == and != operators are implemented in the classAd class https://github.com/dmwm/CRABServer/blob/87c538a02bccc3898be16177c84a565f167aa23f/src/python/TaskWorker/Actions/Recurring/RenewRemoteProxies.py#L8

I will test the change from

ad['CRAB_UserGroup'] != classad.Value.Undefined

to

not (ad['CRAB_UserGroup'] == classad.Value.Undefined)

and same for d['CRAB_UserRole'] but would be much better to have a proper understanding of the correct way to use classAds in python.

belforte commented 2 years ago

At first sight I could not find the needed detail in https://htcondor.readthedocs.io/en/latest/apis/python-bindings/api/classad.html

I have to go by "trial and error" here, it is fragile. Things may break again at any time in the future.

Maybe @mmascher or @shaleem know more ? Otherwise we may have to reach up to Madison.

belforte commented 2 years ago

hmm... this also works, and look cleaner:

(Pdb) ad['CRAB_UserGroup'] is not classad.Value.Undefined
False
(Pdb) 
belforte commented 2 years ago

OK problem diagnosis confirmed and I verified that the change in previous comment works. But need to make sure that the new format will keep working !

belforte commented 2 years ago

python3 documentation would suggest to use != operator in this case https://realpython.com/python-is-identity-vs-equality/#comparing-the-python-comparison-operators so I am not comfortable.

belforte commented 2 years ago

I have made the needed changes, but have also written to HTC support. Let's see. OTOH the change is very small and safe, so I am tempted to update RenewRemoteProxies.py in production and see how things go.

belforte commented 2 years ago

pointer to thread in HTC support forum: https://lists.cs.wisc.edu/archive/htcondor-users/2022-June/msg00039.shtml

belforte commented 2 years ago

here's the answer from HTCondor:

using "is" and "is not" does what you want, and is the correct syntax going forward.

read more at https://lists.cs.wisc.edu/archive/htcondor-users/2022-June/msg00041.shtml and in particular https://www-auth.cs.wisc.edu/lists/htcondor-users/2021-May/msg00025.shtml also the final follow up https://lists.cs.wisc.edu/archive/htcondor-users/2022-June/msg00045.shtml with a promise to make this clear in the documentation.

it is amazing how every thing winds up to Brian !!!

So my PR https://github.com/dmwm/CRABServer/pull/7308 is "validated".