dmwm / WMCore

Core workflow management components for CMS.

Error changing site state #5300

Closed amaltaro closed 10 years ago

amaltaro commented 10 years ago

For the record, since I have no time to look at this issue now: we had CNAF in the RC DB as Down (but with normal tasks/thresholds), and moving it back to Normal throws errors.

[vocms142] /data/srv/wmagent/current > $manage execute-agent wmagent-resource-control --site-name=T1_IT_CNAF --normal
Executing wmagent-resource-control --site-name=T1_IT_CNAF --normal ...
ERROR:root:Cannot find siteName T1_IT_CNAF in the sitelist
ERROR:root:Cannot find siteName T1_IT_CNAF in the sitelist
ERROR:root:Cannot find siteName T1_IT_CNAF in the sitelist
ERROR:root:Cannot find siteName T1_IT_CNAF in the sitelist
ERROR:root:Cannot find siteName T1_IT_CNAF in the sitelist
ERROR:root:Cannot find siteName T1_IT_CNAF in the sitelist
ERROR:root:Cannot find siteName T1_IT_CNAF in the sitelist
ERROR:root:Cannot find siteName T1_IT_CNAF in the sitelist
ERROR:root:Cannot find siteName T1_IT_CNAF in the sitelist
ERROR:root:Cannot find siteName T1_IT_CNAF in the sitelist
[vocms142] /data/srv/wmagent/current > $manage execute-agent wmagent-resource-control --site-name=T1_IT_CNAF -p
Executing wmagent-resource-control --site-name=T1_IT_CNAF -p ...
Thresholds and current status for all sites:

T1_IT_CNAF - 0 running, 0 pending, 2000 running slots total, 2000 pending slots total, Site is Normal:
etc etc etc

In the end, the operation is properly performed and the site is in Normal state. Agent version was 0.9.95b + patches.

lucacopa commented 10 years ago

When you change the state of a site, the condor plugin tries to update the list of sites where each job can run. It checks two ClassAds: ExtDESIRED_Sites (where the job can run) and DESIRED_Sites (where the job will run). Basically, it updates the list of sites in DESIRED_Sites: if the site is moved to Down, Drain or Aborted (exclude=True passed), it removes the site from that list, and if it is the only site in the list, it appends the job to a list of jobs to kill.

When the site is moved to Normal (exclude=False), I don't understand why we are doing this: https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/BossAir/Plugins/PyCondorPlugin.py#L742

To me it sounds like it should be: if siteName not in desiredSites and siteName in extDesiredSites, then append the site to the DESIRED_Sites list (that is, add the site name back to the list where the job will run if it was removed before). As it stands, the ERROR is logged for every job where the site is not in the desiredSites list.

@tsarangi Does it make sense?
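A rough sketch of the logic proposed above, for illustration only. This is not the PyCondorPlugin code: the function name `update_desired_sites` and its return convention are hypothetical, and only the exclude flag and the desiredSites / extDesiredSites lists come from the discussion.

```python
def update_desired_sites(desired_sites, ext_desired_sites, site_name, exclude):
    """Return (new_desired_sites, kill_job) for a single job.

    Hypothetical helper, not part of WMCore. desired_sites mirrors the
    DESIRED_Sites ClassAd and ext_desired_sites mirrors ExtDESIRED_Sites.
    """
    desired = list(desired_sites)  # work on a copy, leave the input untouched

    if exclude:
        # Site moved to Down, Drain or Aborted: take it out of the list.
        if site_name not in desired:
            return desired, False
        remaining = [site for site in desired if site != site_name]
        if not remaining:
            # It was the only site the job could run at: mark the job to be killed.
            return desired, True
        return remaining, False

    # Site moved back to Normal: re-add it only if the job was originally
    # allowed to run there and the site is currently missing from the list.
    if site_name not in desired and site_name in ext_desired_sites:
        desired.append(site_name)
    return desired, False
```

With this condition, a job that never had the site in ExtDESIRED_Sites is simply left alone when the site goes back to Normal, instead of producing an error per job.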

tsarangi commented 10 years ago

@lucacopa BUG, BUG, BUG... I can see that... :-) Go ahead with the fix...