Jobs not removed from site disabled

vlimant commented 7 years ago

https://cms-gwmsmon.cern.ch/prodview/fabozzi_HIRun2015-HIOniaPeripheral30100-02May2016_758p4_170306_123931_1702

Aren't the jobs supposed to be removed?

amaltaro commented 7 years ago

Pending jobs are supposed to be removed IF they were assigned to only one site (the site that went into drain/down). Otherwise the job is only edited (condor_qedit) and that site is taken out of its DESIRED_Sites.
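
For illustration only, the rule above can be sketched as a small decision function. This is not the WMAgent code: the function name, the return values, and the assumption that DESIRED_Sites is a comma-separated string are all hypothetical, and T2_XX_Example is a made-up site name.

```python
def action_for_pending_job(desired_sites, disabled_site):
    """Decide what to do with one pending (Idle) job when a site goes to drain/down.

    desired_sites: comma-separated site list (assumed format of DESIRED_Sites).
    Returns a (action, remaining_sites) tuple; the action names are hypothetical.
    """
    sites = [s.strip() for s in desired_sites.split(",") if s.strip()]
    if disabled_site not in sites:
        # The job never targeted the disabled site: nothing to do.
        return ("keep", sites)
    remaining = [s for s in sites if s != disabled_site]
    if not remaining:
        # The disabled site was the only option: remove the job.
        return ("remove", [])
    # Other sites are still available: only edit the site list.
    return ("edit", remaining)


print(action_for_pending_job("T0_CH_CERN", "T0_CH_CERN"))
print(action_for_pending_job("T0_CH_CERN,T2_XX_Example", "T0_CH_CERN"))
```

The first call corresponds to the "remove" case and the second to the "edit" case described above.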

About this one job on vocms0130:

21927.182 1 256253 /data/srv/wmagent/v1.1.0.patch1/install/wmagent/JobCreator/JobCache/fabozzi_HIRun2015-HIOniaPeripheral30100-02May2016_758p4_170306_123931_1702/DataProcessing/JobCollection_240586_0/job_2605529/condor.21927.182.log T0_CH_CERN

The short answer is that it was not submitted to condor after the site changed status, and I have no clue what exactly happened.

A longer answer is that this job was actually submitted on

QDate = 1489386335
cmst1@vocms0130:/data/srv/wmagent/current $ date -d @1489386335
Mon Mar 13 07:25:35 CET 2017

?!?!! But it has been in its current status (Idle) for ~3 days. Thus my guess is that it was running when the site changed status, it failed to run to completion, and condor somehow took care of it.

Scanning AgentStatusWatcher, there were no condor issues when killing jobs on either the drain or the down change. Scanning JobSubmitter, there was no T0 job submission after the site went into down status.

I suggest you contact the glideinWMS team to figure out what exactly happened to this job (likely by diving into the condor logs/history and whatnot).

vlimant commented 7 years ago

@amaltaro what do you mean by "condor took care of it"? Why is the agent not removing this job from the queue? If the agent is still waiting for that job (I think it is), it should take care of it to move the workflow forward, regardless of what happened in HTCondor, which we do indeed need to understand a bit better.

amaltaro commented 7 years ago

Just to make it crystal clear, the agent only edits/removes jobs that are in JobStatus=1 (Idle); everything else is left untouched when a site changes its status.

The second important point is that the agent acts on a snapshot basis, i.e., once it sees the site status change, it fetches the Idle jobs in condor and edits/removes them. After that (and the local DB update), the agent does not do anything else to that site or its jobs.
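
To make the snapshot behaviour concrete, here is a minimal sketch (not the agent's code). The job dictionaries, the job ids, and the function name are hypothetical; only the JobStatus codes (1 = Idle, 2 = Running) are HTCondor's.

```python
IDLE, RUNNING = 1, 2  # HTCondor JobStatus codes


def snapshot_cleanup(jobs, site):
    """One-shot pass at the moment the site status changes.

    Returns the ids of the Idle jobs that target `site`; jobs in any
    other state are not touched.
    """
    return [j["id"] for j in jobs if j["status"] == IDLE and site in j["sites"]]


# Hypothetical queue content when the site goes down.
jobs = [
    {"id": "1.0", "status": IDLE, "sites": ["T0_CH_CERN"]},
    {"id": "1.1", "status": RUNNING, "sites": ["T0_CH_CERN"]},
]
acted_on = snapshot_cleanup(jobs, "T0_CH_CERN")

# Later, the running job falls back to Idle (e.g. after a disconnection).
jobs[1]["status"] = IDLE
leftover = [j["id"] for j in jobs if j["status"] == IDLE and j["id"] not in acted_on]
print(acted_on, leftover)
```

Since the cleanup runs only once, a job that was running at snapshot time and returns to Idle afterwards is never looked at again.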

I assume you're asking for the agent to keep looking for jobs matching that site, and to continuously edit/remove anything that has a site in Drain/Abort/Down? That's of course possible, but honestly I don't think this is the best approach, especially considering:

- instabilities on the condor schedd
- almost half of the sites in the production agents are in drain

Diego is going to investigate this job. I'm almost sure it was running when the site changed its status and it somehow reappeared as idle later on (pilot dying or something like that).

vlimant commented 7 years ago

Hey, you are probably right about the history of those jobs, but it turns out that this is a valid lifecycle for the job (HTCondor has the right to send a running job back to Idle, for various reasons). We are ending up with a dangling workflow, waiting for a handful of jobs that will never run and that will be removed by the agent in 5 days ... I don't think this is too good either. We have to find a way to deal with that.

ddavila0 commented 7 years ago

From a quick look at the job logs, the job was running at the time the site was set to "down". The job started to run on Mon Mar 20 07:41:24 CET 2017, then the schedd died on Tue Mar 21 10:40:04 CET 2017, causing a disconnection, and the job was set back to Idle.

vlimant commented 7 years ago

https://cms-gwmsmon.cern.ch/prodview/pdmvserv_SMP-PhaseIIFall16DR82-00026_00044_v0__170316_145557_1828

vlimant commented 7 years ago

thanks @ddavila0 ! so @amaltaro what's next ?

amaltaro commented 7 years ago

I'll discuss this issue with @ticoann and probably set an automatic lookup & cleanup for sites that are not "enabled". It shall come in the next WMAgent release (April)
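
As a rough idea of what such a periodic lookup & cleanup could look like (a sketch under assumptions, not a design decision): the job and site-status structures, the status strings, the action names, and the made-up site T2_XX_Example are all hypothetical, and unknown sites are treated as not enabled.

```python
IDLE = 1  # HTCondor JobStatus code for Idle


def sweep(jobs, site_status):
    """Periodic pass over the queue, returning {job_id: (action, usable_sites)}.

    Only Idle jobs that list at least one non-enabled site get an entry.
    """
    actions = {}
    for job in jobs:
        if job["status"] != IDLE:
            continue  # running/held/etc. jobs are left alone
        usable = [s for s in job["sites"] if site_status.get(s) == "enabled"]
        if len(usable) == len(job["sites"]):
            continue  # every desired site is enabled
        # No usable site left: remove; otherwise just shrink the site list.
        actions[job["id"]] = ("edit", usable) if usable else ("remove", [])
    return actions


site_status = {"T0_CH_CERN": "down", "T2_XX_Example": "enabled"}
jobs = [
    {"id": "1.0", "status": IDLE, "sites": ["T0_CH_CERN"]},
    {"id": "1.1", "status": IDLE, "sites": ["T0_CH_CERN", "T2_XX_Example"]},
    {"id": "1.2", "status": 2, "sites": ["T0_CH_CERN"]},
]
print(sweep(jobs, site_status))
```

Run on every polling cycle, a pass like this would also catch jobs that went back to Idle after the site status changed.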

amaltaro commented 3 years ago

Due to the lack of activity and/or prioritization of this issue, I consider it no longer relevant and we shall close it. Thanks!

dmwm / WMCore

Jobs not removed from site disabled #7753