Closed mapellidario closed 2 years ago
As pointed out by Wa, this topic has been discussed also on "Facilities and Services" mattermost channel, here
Moreover, this has been reported on cms-talk: https://cms-talk.web.cern.ch/t/write-request-rejected-for-the-resubmitted-crab-jobs/11423/3
smells like a bug, I will try to reproduce and investigate
By simply looking at the FTS transfers logs I can not get any useful information.
Information about the tests
220616_123040:dmapelli_crab_20220616_143038
, crabserver ui grafana 861
.> voms-proxy-info --all
subject : /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=dmapelli/CN=817881/CN=Dario Mapelli/CN=340218198
issuer : /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=dmapelli/CN=817881/CN=Dario Mapelli
identity : /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=dmapelli/CN=817881/CN=Dario Mapelli
type : RFC3820 compliant impersonation proxy
strength : 2048
path : /tmp/x509up_u99307
timeleft : 10:10:07
key usage : Digital Signature, Key Encipherment
=== VO cms extension information ===
VO : cms
subject : /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=dmapelli/CN=817881/CN=Dario Mapelli
issuer : /DC=ch/DC=cern/OU=computers/CN=voms2.cern.ch
attribute : /cms/Role=NULL/Capability=NULL
attribute : /cms/TEAM/Role=NULL/Capability=NULL
attribute : /cms/integration/Role=NULL/Capability=NULL
attribute : /cms/itcms/Role=NULL/Capability=NULL
timeleft : 10:10:07
uri : voms2.cern.ch:15002
NB: the FTS logs do not last for long!
Dario, the proxy-info dump above shows a proxy w/o role nor group. Is that the proxy from lxplus which you used to submit ? By the way the original posting by the user was wrongly using the "voms role" term, voms proxies can have Roles and always list all possible groups. What people do when they "use a voms group" is actually to set one of the listed groups as PrimaryGroups. The relevant action is e.g.
voms-proxy-init -voms cms:/cms/itcms
and the way to verify is to look s at the voms FQAN's (I do not remember what that acronym means)
belforte@lxplus7111/~> voms-proxy-info -fqan
/cms/itcms/Role=NULL/Capability=NULL
/cms/ALARM/Role=NULL/Capability=NULL
/cms/TEAM/Role=NULL/Capability=NULL
/cms/integration/Role=NULL/Capability=NULL
/cms/Role=NULL/Capability=NULL
the first line is the "Primary" one and you can see that it lists /cms/itcms group, while my usual proxy whould be
belforte@lxplus7111/~> voms-proxy-info -fqan
/cms/Role=NULL/Capability=NULL
/cms/ALARM/Role=NULL/Capability=NULL
/cms/TEAM/Role=NULL/Capability=NULL
/cms/integration/Role=NULL/Capability=NULL
/cms/itcms/Role=NULL/Capability=NULL
belforte@lxplus7111/~>
i.e. the first line is /cms/Role=NULL/Capability=NULL
like in your output above
I have asked the original user to send details from one task, so at least we have some logs to start from https://github.com/dmwm/CRABClient/issues/5167#issuecomment-1159442255
Thanks Stefano for the explanation about how groups and roles in proxies work. I was a bit puzzled indeed by always finding all the groups and did not notice that the order was important!
I submitted a new task [1] with a proxy with /cms/itcms
as primary group [2], but i still can not find any difference between regular jobs and resubmitted jobs in the FTS transfer logs.
[1]
220620_090606:dmapelli_crab_20220620_110604
[2]
> voms-proxy-info -fqan
/cms/itcms/Role=NULL/Capability=NULL
/cms/TEAM/Role=NULL/Capability=NULL
/cms/integration/Role=NULL/Capability=NULL
/cms/Role=NULL/Capability=NULL
I do not think that FTS logs have information about group. The thing to do is to check the proxy in the scheduler SPOOL_DIR and trust FTS Client to do the right things with it. I can also review the fts_transfer code to make sure that it does not manipulate the proxy's FQANs.
I still would like to check the directory of the jobs where the problem was reported.
adding @KomalTauqeer
I looked at TaskWorker for last task processed for user ktauqeer
: 220531_103059:ktauqeer_crab_SingleElectron_2016H_v1
That task used this file as proxy
[crab3@crab-prod-tw01 ktauqeer]$ ls -l /data/certs/creds/4b13923881663e4ad95c500552796b82ea253a2d
-rw-------. 1 crab3 crab3 13116 Jun 6 13:39 /data/certs/creds/4b13923881663e4ad95c500552796b82ea253a2d
which has since expired, but had the correct VOMS group
[crab3@crab-prod-tw01 ktauqeer]$ voms-proxy-info -file /data/certs/creds/4b13923881663e4ad95c500552796b82ea253a2d -all
subject : /C=DE/O=GermanGrid/OU=KIT/CN=Komal Tauqeer/CN=1701240947/CN=1458080835/CN=2000922402/CN=2168686702
issuer : /C=DE/O=GermanGrid/OU=KIT/CN=Komal Tauqeer/CN=1701240947/CN=1458080835/CN=2000922402
identity : /C=DE/O=GermanGrid/OU=KIT/CN=Komal Tauqeer/CN=1701240947/CN=1458080835/CN=2000922402
type : RFC compliant proxy
strength : 2048 bits
path : /data/certs/creds/4b13923881663e4ad95c500552796b82ea253a2d
timeleft : 0:00:00
key usage : Digital Signature, Key Encipherment, Data Encipherment
=== VO cms extension information ===
VO : cms
subject : /C=DE/O=GermanGrid/OU=KIT/CN=Komal Tauqeer
issuer : /DC=ch/DC=cern/OU=computers/CN=lcg-voms2.cern.ch
attribute : /cms/dcms/Role=NULL/Capability=NULL
attribute : /cms/Role=NULL/Capability=NULL
timeleft : 0:00:00
uri : lcg-voms2.cern.ch:15002
[crab3@crab-prod-tw01 ktauqeer]$
that task was executed on condor scheduler vocms0144, and in there the proxy used by condor and the FTS submission script is
/home/grid/ktauqeer/220531_103059:ktauqeer_crab_SingleElectron_2016H_v1/SPOOL_DIR/4b13923881663e4ad95c500552796b82ea253a2d
which gives
belforte@vocms0144/SPOOL_DIR> sudo voms-proxy-info -all -file 4b13923881663e4ad95c500552796b82ea253a2d
subject : /C=DE/O=GermanGrid/OU=KIT/CN=Komal Tauqeer/CN=1701240947/CN=1458080835/CN=13493643/CN=3737603453/CN=1715894546
issuer : /C=DE/O=GermanGrid/OU=KIT/CN=Komal Tauqeer/CN=1701240947/CN=1458080835/CN=13493643/CN=3737603453
identity : /C=DE/O=GermanGrid/OU=KIT/CN=Komal Tauqeer/CN=1701240947/CN=1458080835/CN=13493643/CN=3737603453
type : RFC compliant proxy
strength : 2048 bits
path : 4b13923881663e4ad95c500552796b82ea253a2d
timeleft : 15:54:31
key usage : Digital Signature, Key Encipherment
=== VO cms extension information ===
VO : cms
subject : /C=DE/O=GermanGrid/OU=KIT/CN=Komal Tauqeer
issuer : /DC=ch/DC=cern/OU=computers/CN=voms2.cern.ch
attribute : /cms/Role=NULL/Capability=NULL
attribute : /cms/dcms/Role=NULL/Capability=NULL
timeleft : 15:54:31
uri : voms2.cern.ch:15002
Indeed the VOMS group information was lost. This explains the writing failures.
But I do not know (yet) how/why the proxy on the scheduler may have been reset to lack the group info.
I suspect something in the proxy renewal process and will try to dig details.
Something does not work as expected in proxy renewal. I think the problem was not the resubmit command, but that enough time has passed since initial submission. Of course I may also be misinterpreting logs... as usual this is not my code. I need to start by debugging proxy renewal in a test server and making sure that it does what's expected.
I looked at recent Jenkins vogroup test https://cmsweb-testbed.cern.ch/crabserver/ui/task/220621_074442%3Acmsbot_crab_voGroup the proxy file on the scheduler is correct (the test used itcms, not dcms, but this should not matter)
[crabtw@vocms0155 cluster84016664.proc0.subproc0]$ voms-proxy-info -all -file 17dcc40dd5623dcaa1ea25ecd018aeba2713e105
subject : /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=cmsbot/CN=545661/CN=Robot: CMS Build Bot/CN=35588243/CN=614560011/CN=549004962/CN=3412173751/CN=1799792336
issuer : /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=cmsbot/CN=545661/CN=Robot: CMS Build Bot/CN=35588243/CN=614560011/CN=549004962/CN=3412173751
identity : /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=cmsbot/CN=545661/CN=Robot: CMS Build Bot/CN=35588243/CN=614560011/CN=549004962/CN=3412173751
type : RFC compliant proxy
strength : 2048 bits
path : 17dcc40dd5623dcaa1ea25ecd018aeba2713e105
timeleft : 147:42:39
key usage : Digital Signature, Key Encipherment
=== VO cms extension information ===
VO : cms
subject : /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=cmsbot/CN=545661/CN=Robot: CMS Build Bot
issuer : /DC=ch/DC=cern/OU=computers/CN=lcg-voms2.cern.ch
attribute : /cms/itcms/Role=NULL/Capability=NULL
attribute : /cms/Role=NULL/Capability=NULL
timeleft : 147:42:39
uri : lcg-voms2.cern.ch:15002
[crabtw@vocms0155 cluster84016664.proc0.subproc0]$
I did a crab resubmit --force --jobidd=1
and the job in that task was resubmitted, but the proxy file on the condor scheduler was not modified (kept old date). Which agrees with my understanding of how crab resubmit
work. So a problem in proxy renewal is confirmed as the first suspect.
I think I found it.
It looks like "something has changed in HTCondor" which went un-noticed.
The relevant code is
https://github.com/dmwm/CRABServer/blob/87c538a02bccc3898be16177c84a565f167aa23f/src/python/TaskWorker/Actions/Recurring/RenewRemoteProxies.py#L178-L186
and what I find running this on the above task is that the if
does not work
(Pdb) ad['CRAB_UserGroup']
classad.classad.Value.Undefined
(Pdb) ad['CRAB_UserGroup'] != classad.Value.Undefined
undefined != undefined
while things would work using not ==
instead of !=
(Pdb) ad['CRAB_UserGroup'] == classad.Value.Undefined
True
(Pdb) not (ad['CRAB_UserGroup'] == classad.Value.Undefined)
False
(Pdb)
or maybe it is due to the move to python3. I think that classAd language has been reviewed in reced HTC versions, and here it looks to me that it is anyhow a matter of how the ==
and !=
operators are implemented in the classAd
class
https://github.com/dmwm/CRABServer/blob/87c538a02bccc3898be16177c84a565f167aa23f/src/python/TaskWorker/Actions/Recurring/RenewRemoteProxies.py#L8
I will test the change from
ad['CRAB_UserGroup'] != classad.Value.Undefined
to
not (ad['CRAB_UserGroup'] == classad.Value.Undefined)
and same for d['CRAB_UserRole']
but would be much better to have a proper understanding of the correct way to use classAds in python.
At first sight I could not find the needed detail in https://htcondor.readthedocs.io/en/latest/apis/python-bindings/api/classad.html
I have to go by "trial and error" here, it is fragile. Things may break again at any time in the future.
Maybe @mmascher or @shaleem know more ? Otherwise we may have to reach up to Madison.
hmm... this also works, and look cleaner:
(Pdb) ad['CRAB_UserGroup'] is not classad.Value.Undefined
False
(Pdb)
OK problem diagnosis confirmed and I verified that the change in previous comment works. But need to make sure that the new format will keep working !
python3 documentation would suggest to use !=
operator in this case https://realpython.com/python-is-identity-vs-equality/#comparing-the-python-comparison-operators
so I am not comfortable.
I have made the needed changes, but have also written to HTC support. Let's see. OTOH the change is very small and safe, so I am tempted to update RenewRemoteProxies.py in production and see how things go.
pointer to thread in HTC support forum: https://lists.cs.wisc.edu/archive/htcondor-users/2022-June/msg00039.shtml
here's the answer from HTCondor:
using "is" and "is not" does what you want, and is the correct syntax going forward.
read more at https://lists.cs.wisc.edu/archive/htcondor-users/2022-June/msg00041.shtml and in particular https://www-auth.cs.wisc.edu/lists/htcondor-users/2021-May/msg00025.shtml also the final follow up https://lists.cs.wisc.edu/archive/htcondor-users/2022-June/msg00045.shtml with a promise to make this clear in the documentation.
it is amazing how every thing winds up to Brian !!!
So my PR https://github.com/dmwm/CRABServer/pull/7308 is "validated".
[copying from crabclient issue]
Dear Crab developers,
I have encountered a problem with the resubmitted jobs on crab. For the resubmitted jobs, the write request is rejected because the role doesn’t include dcms group as it was the case for the submitted jobs. For reference, please look at this grafana plot [1] where the drop to an efficiency of 0% for the ASO activity is higly correlated with my resubmission of jobs.
[1] https://monit-grafana.cern.ch/d/CIjJHKdGk/fts-transfers?from=now-7d&orgId=20&to=now&var-bin=1h&var-dst_country=All&var-dst_site=All&var-filters=data.dst_hostname|%3D|cmswebdav-kit.gridka.de&var-fts_server=All&var-group_by=activity&var-include=&var-protocol=All&var-src_country=All&var-src_site=All&var-staging=All&var-vo=cms&viewPanel=2
Best regards, Komal Tauqeer