dmwm / WMCore

Core workflow management components for CMS.
Apache License 2.0

Some CMS@Home jobs end up with the wrong HTCondor requirements #9736

Closed: ivanreid closed this issue 4 years ago

ivanreid commented 4 years ago

Impact of the bug
This bug impacts CMS@Home jobs. They end up with the requirement (AuthenticatedIdentity =!= "volunteer-node@cern.ch"), which prevents them from being sent to Volunteer machines. These jobs sit in the "pending" queue for five days until a time-out leads to their resubmission, usually with the desired requirements. We use cmsweb-testbed.cern.ch and the agent is vocms0267.cern.ch.

Describe the bug
CMS@Home jobs should have the HTCondor requirements:

( TARGET.Arch == "X86_64" ) && ( TARGET.OpSys == "LINUX" ) && ( TARGET.Disk >= RequestDisk ) && ( TARGET.Memory >= RequestMemory ) && ( TARGET.Cpus >= RequestCpus ) && ( TARGET.HasFileTransfer )

However, some end up with:

( ( ( REQUIRED_OS is "any" ) || ( GLIDEIN_REQUIRED_OS is "any" ) || stringListMember(GLIDEIN_REQUIRED_OS,REQUIRED_OS) ) && ( AuthenticatedIdentity isnt "volunteer-node@cern.ch" ) ) && ( TARGET.Arch == "X86_64" ) && ( TARGET.OpSys == "LINUX" ) && ( TARGET.Disk >= RequestDisk ) && ( TARGET.Memory >= RequestMemory ) && ( TARGET.Cpus >= RequestCpus ) && ( TARGET.HasFileTransfer )

The added requirements would appear to come from https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/BossAir/Plugins/SimpleCondorPlugin.py#L135
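For reference, the extra clause appears to correspond to a default requirements string that the plugin builds for non-volunteer jobs. A rough Python sketch of the kind of assignment involved (illustrative only, not the exact WMCore code; the real expression lives at the line linked above):

```python
# Illustrative sketch of the default requirements string discussed here;
# "is"/"isnt" in the unparsed form above correspond to the =?= / =!= operators.
reqStr = ('((REQUIRED_OS =?= "any") || (GLIDEIN_REQUIRED_OS =?= "any") || '
          'stringListMember(GLIDEIN_REQUIRED_OS, REQUIRED_OS)) && '
          '(AuthenticatedIdentity =!= "volunteer-node@cern.ch")')
```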

Initially, we thought that jobs with the added requirements were those which had exceeded the condor retry limit and been passed back to WMAgent, which resubmitted them with the addition, but we now have evidence that some of the initial job submissions already have them. Jobs with the added requirements stay in the pending queue and are not distributed on request to Volunteer machines running BOINC tasks. If these are the only jobs in the queue, the request fails and the BOINC task terminates with an error. This then has consequences for the Volunteer (failed tasks generate no credit, and repeated failures lead to an automatic decrease in the daily quota of tasks the machine can run).

There is a five-day timeout in the pending queue, after which WMAgent resubmits the job. As far as we can tell these resubmissions usually have the correct job requirements, as after five days jobs in the particular workflow start running again, and the pending queue decreases. This five-day delay is somewhat excessive for our requirements (we aim to have jobs that run for 1-2 hours), and puts constraints on our workflow submissions.

How to reproduce it
Submit a workflow to T3_CH_Volunteer and observe its behaviour with WMStats:

python inject-test-wfs.py -u https://cmsweb-testbed.cern.ch -m DMWM -f TC_SLC7.json -c IDR_CMS_Home -r IDR_CMS_Home -t cms-home -a DMWM_Test -p CMS_Home_IDRv5d -s "T3_CH_Volunteer"

Expected behavior
We expect that production jobs submitted to T3_CH_Volunteer will all be able to run on Volunteer machines. Note that post-production jobs (merge, log-collect, etc.) run on a VM farm at T3_CH_CMSAtHome; even if these jobs had the added requirement they would still run, since T3_CH_CMSAtHome machines do not authenticate as "volunteer-node@cern.ch".

amaltaro commented 4 years ago

@khurtado Kenyi, could you please have a look at this long standing issue?

AFAIU, production jobs submitted to the condor pool will have this

AuthenticatedIdentity isnt "volunteer-node@cern.ch"

in the requirements expression. However, when the agent retries a job (another condor_submit), the expression above apparently isn't present in the requirements string, which then allows those jobs to happily run on volunteer resources.

I cannot see any flaw in the SimpleCondorPlugin, so I wouldn't rule out something internal to condor.

khurtado commented 4 years ago

@amaltaro I can't find any flaws either. Not sure what is going on, but given that the setting below seems to be the culprit and that it was meant to be temporary (until HLT added the Require_OS logic to their configuration), I wonder if we can remove it already. I tried to check on pilots going to that resource, but it is exclusively running covid19 jobs for now, so there is no way to check. I heard from James that there will be a request to put this resource back in the pool next week though. Should we just wait for now?

https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/BossAir/Plugins/SimpleCondorPlugin.py#L131-L138

amaltaro commented 4 years ago

But then don't we rely on someone else setting the requirement string for us (SI folks)?

Yes, I think we can wait. Just in case, I have lowered the Pending timeout from 5 days to 1 day, then restarted JobStatusLite on vocms0267. @ivanreid FYI

ivanreid commented 4 years ago

Thanks, Alan, I'll see how that affects our pending queue. Just a slight correction to your first comment -- production jobs shouldn't have AuthenticatedIdentity isnt "volunteer-node@cern.ch". When they do, however it's assigned, they won't run in Volunteers' BOINC tasks. I submitted a new batch of 1,000 jobs last night, hoping that Federica would be able to check their requirements for the "bad" extras before they started running, but that doesn't seem to have been possible. Jobs from that batch are now running despite WMStats saying that there are still nearly 400 jobs pending in the previous batch. This implies a number of blocked jobs in the pending queue for that workflow.

ivanreid commented 4 years ago

The shortened time-out has had a positive effect on our "pending" queues. Workflows older than the current one have pending numbers in single digits (the newest of them was submitted 2-1/2 days ago). I'm still left puzzling how the "default" requirements created in https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/BossAir/Plugins/SimpleCondorPlugin.py#L135 get ANDed with the desired requirements. SimpleCondorPlugin always replaces self.reqStr rather than adding to it.
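For reference, one quick way to inspect what the schedd actually stored for pending jobs is the snippet below (a small sketch using the htcondor Python bindings, assuming they are available on the agent node; it only reads the queue):

```python
import htcondor

schedd = htcondor.Schedd()
# Print the unevaluated Requirements expression of a few pending (idle) jobs,
# to check whether the extra AuthenticatedIdentity clause was ANDed in.
ads = schedd.query("JobStatus == 1", ["ClusterId", "ProcId", "Requirements"])
for ad in ads[:5]:
    print(ad["ClusterId"], ad["ProcId"], ad.get("Requirements"))
```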

ivanreid commented 4 years ago

On reflection, one day is probably somewhat inefficient. Currently we are completing ~60 jobs/hour, or less than 1,500 per day. Since our pending queue size is set at 2,000 this means some jobs will be requeued even if they are runnable. Perhaps two days would be more suitable (until we get more volunteers running jobs :-).

amaltaro commented 4 years ago

Pending timeout increased to 2 days. Ivan, did you mean to close this issue? Or was it a mistake?

ivanreid commented 4 years ago

Thanks Alan, I notice we have been getting several hundred JobKills. Sorry about closing it prematurely, clicked on the wrong button again...

khurtado commented 4 years ago

Ok, it seems we can remove the following lines now: https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/BossAir/Plugins/SimpleCondorPlugin.py#L135-L137 as HLT already has that logic:

START = ifthenelse(WMAgent_AgentName =!= undefined,true,undefined) && ifthenelse(DESIRED_Sites =!= undefined,stringListMember(GLIDEIN_CMSSite,DESIRED_Sites),undefined) && (isUndefined(REQUIRED_OS) || (GLIDEIN_REQUIRED_OS =?= "any") || (REQUIRED_OS =?= "any") || (REQUIRED_OS =?= GLIDEIN_REQUIRED_OS))

@amaltaro Do we still need this last line? https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/BossAir/Plugins/SimpleCondorPlugin.py#L138

In the past, SimpleCondorPlugin would handle the stripping of x509 proxies for jobs matching T3_CH_Volunteer. This logic has since been moved to the schedds. Considering T3_CH_Volunteer pilots have:

stringListMember(GLIDEIN_CMSSite,DESIRED_Sites,",") =?= true

Do we still need that line 138, or is it obsolete now? If obsolete, we basically wouldn't need self.reqStr at all, so we could either get rid of it along with the following: https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/BossAir/Plugins/SimpleCondorPlugin.py#L139-L140, or just initialize it to None and keep lines 139-140. Is config.bossAir.condorRequirementsString still used, by the way?
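As a side note on the pilot expression quoted above, the string-list matching can be checked in isolation with the standalone classad Python module (a small sketch with made-up site values, not taken from a real pilot):

```python
import classad

# Hypothetical values: the pilot advertises its CMS site name and the job
# carries a comma-separated DESIRED_Sites list.
expr = classad.ExprTree(
    'stringListMember("T3_CH_Volunteer", "T3_CH_CMSAtHome,T3_CH_Volunteer", ",")')
print(expr.eval())   # True: the pilot site is in the job's DESIRED_Sites list
```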

Though removing self.reqStr would solve this issue, we should probably still try to understand what's going on, in case this is a symptom of something else (e.g. a condor bug?).

amaltaro commented 4 years ago

Kenyi, if we are 200% sure that the schedds are stripping the user certificate from jobs meant to execute at T3_CH_Volunteer resources, then I see no reason to keep that logic in the SimpleCondorPlugin.

We might have to synchronize some tests with Ivan though, he has a few cores serving CMS@Home jobs, so he could help us check things from inside the job environment/runtime as well.

ivanreid commented 4 years ago


Fine; let me know when you want to do this. Current workflows have O(40 hrs) still to run, but I'm not sure how many in the pending queue are waiting on the two-day time-out.


ivanreid commented 4 years ago

I submitted a new workflow last night, as we were in danger of running low. Today, Federica was able to analyse all the pending jobs on vocms0267. As far as I can tell, none of the jobs in the new batch have been presented to a volunteer task -- they all have "CMS_JobRetryCount = 0". At the time, 1,848 of the (2,001) pending jobs were from this new batch, but 53 of them have the added requirement that's the subject of this issue. This confirms that some jobs (~3%) are initially created with the wrong requirement, not just jobs that are resubmitted for too many failures.

khurtado commented 4 years ago

Okay, investigating this:

1) I can confirm resubmitted jobs are not the only jobs affected. In the example below, 69 out of 117 jobs had "CMS_JobRetryCount = 0".

[khurtado@vocms0267 ~]$ condor_q -constraint 'JobStatus == 1 && regexp(".AuthenticatedIdentity.", unparse(Requirements))' -af:hr CMS_JobRetryCount | sort | uniq -c
     69 0
     42 1
      5 2
      1 3

2) None of these problematic jobs have a ProcId = 0

[khurtado@vocms0267 ~]$ condor_q -constraint 'JobStatus == 1 && regexp(".AuthenticatedIdentity.", unparse(Requirements)) && ProcId == 0' -af:r ClusterId ProcId DESIRED_Sites | wc -l
0
[khurtado@vocms0267 ~]$ condor_q -constraint 'JobStatus == 1 && regexp(".AuthenticatedIdentity.", unparse(Requirements)) && ProcId != 0' -af:r ClusterId ProcId DESIRED_Sites | wc -l
117

3) If you look at the ClusterId of the jobs with issues, you can see that the very first job (ProcId = 0) usually has DESIRED_Sites (or possibleSites) set to T3_CH_CMSAtHome, which is why it runs and the others do not.

[khurtado@vocms0267 ~]$ condor_history 306313 -af:h ClusterId ProcId DESIRED_Sites
ClusterId ProcId DESIRED_Sites
306313    0      T3_CH_CMSAtHome
[khurtado@vocms0267 ~]$ condor_q 306313 -af:h ClusterId ProcId DESIRED_Sites
ClusterId ProcId DESIRED_Sites
306313    1      T3_CH_Volunteer
306313    2      T3_CH_Volunteer

Not sure what this means, but it's as if only the first job in the cluster were considered when deciding whether or not to add reqStr for the whole ClusterId.

We do make this change in a loop though, resetting the ad every time: https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/BossAir/Plugins/SimpleCondorPlugin.py#L518-L520
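For context, a rough sketch of that per-job pattern (illustrative job list and a minimal cluster ad, not the actual WMCore code), which seems to reproduce the behaviour above when only some proc ads in a cluster carry a Requirements attribute:

```python
import classad
import htcondor

schedd = htcondor.Schedd()

# Abbreviated form of the default requirements string discussed in this issue.
reqStr = '(AuthenticatedIdentity =!= "volunteer-node@cern.ch")'

# Minimal cluster ad; a real submission would carry many more attributes.
clusterAd = classad.ClassAd({
    "JobUniverse": 5,          # vanilla universe
    "Cmd": "/bin/sleep",
    "Arguments": "300",
})

# One proc ad per job, rebuilt ("reset") on every loop iteration, with the
# extra Requirements clause only for jobs not destined for T3_CH_Volunteer.
procAds = []
for site in ["T3_CH_CMSAtHome", "T3_CH_Volunteer", "T3_CH_Volunteer"]:
    ad = classad.ClassAd({"DESIRED_Sites": site})
    if site != "T3_CH_Volunteer":
        ad["Requirements"] = classad.ExprTree(reqStr)
    procAds.append((ad, 1))

clusterId = schedd.submitMany(clusterAd, procAds)
print("submitted cluster", clusterId)
# Observation in this issue: the Requirements set on proc 0 appears to end up
# on every proc of the cluster, instead of only on the jobs that set it.
```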

khurtado commented 4 years ago

OK. I think this is actually a condor bug. If I isolate the issue and submit 3 jobs in a cluster, the results differ depending on whether I modify 'Requirements' starting with the very first job in the list or only on later jobs; neither gives the desired behavior. In this issue, we match on the very first job, but 'Requirements' is then modified for all jobs within the same clusterId.

I have documented the bug below: https://htcondor-wiki.cs.wisc.edu/index.cgi/tktview?tn=7715

In any case, the PR below actually gets rid of the need to modify 'Requirements': https://github.com/dmwm/WMCore/pull/9767

So it should fix the issue, unless config.BossAir.condorRequirementsString is actually set here: https://github.com/dmwm/WMCore/pull/9767/files#diff-db6fcfce53a1897a040949c2657d39f3R131-R132

@amaltaro Is there still a use case for setting config.BossAir.condorRequirementsString? If so, we would need a solution for the condor ticket above. Otherwise, we can just get rid of that feature, while still following up with the condor team to have that bug fixed (or to learn how to properly modify Requirements, in case we are doing it wrong in the code).

khurtado commented 4 years ago

@amaltaro @ivanreid #9767 should be ready to test. Let me know when the patch can be applied to vocms0267.

ivanreid commented 4 years ago

It can be done any time as far as I'm concerned. Do I need to run down or kill the queues, or can it be patched "live"? I had to submit a new workflow this morning; it seems that the popular SixTrack application on LHC@Home ran out of jobs, and a number of volunteers have CMS@Home set as their fall-back application. We went from 160 to 280 running jobs overnight and our queues nearly ran dry.