Closed vlimant closed 3 years ago
Pending jobs are supposed to be removed only if they were assigned to a single site (the site that went into drain/down). Otherwise the job is only qedited, i.e. that site is taken out of its DESIRED_Sites.
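The rule above can be sketched in a few lines. This is only an illustration, not the actual WMAgent code: the function name `plan_for_pending_job` and the returned action labels are hypothetical, and it assumes DESIRED_Sites is stored as a comma-separated string.

```python
def plan_for_pending_job(desired_sites, unavailable_site):
    """Decide what to do with one pending job when a site goes to drain/down.

    desired_sites: comma-separated site list (assumed format of DESIRED_Sites).
    Returns a (action, new_desired_sites) tuple; the action labels are made up
    for this sketch.
    """
    sites = [s.strip() for s in desired_sites.split(",") if s.strip()]
    if unavailable_site not in sites:
        # The job never targeted that site, so nothing changes.
        return ("keep", desired_sites)
    remaining = [s for s in sites if s != unavailable_site]
    if not remaining:
        # The job could only run at the unavailable site: remove it.
        return ("remove", None)
    # Other sites are still usable: only edit the site list.
    return ("edit", ",".join(remaining))


single_site = plan_for_pending_job("T0_CH_CERN", "T0_CH_CERN")
# "T2_XX_Other" is a placeholder site name, not a real one.
multi_site = plan_for_pending_job("T0_CH_CERN,T2_XX_Other", "T0_CH_CERN")
```

Here `single_site` comes out as a removal and `multi_site` as an edit that leaves only the placeholder site.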
About this one job on vocms0130:
21927.182 1 256253 /data/srv/wmagent/v1.1.0.patch1/install/wmagent/JobCreator/JobCache/fabozzi_HIRun2015-HIOniaPeripheral30100-02May2016_758p4_170306_123931_1702/DataProcessing/JobCollection_240586_0/job_2605529/condor.21927.182.log T0_CH_CERN
The short answer is that it was not submitted to condor after the site changed status, and I have no clue what exactly happened.
A longer answer is that this job was actually submitted on
QDate = 1489386335
cmst1@vocms0130:/data/srv/wmagent/current $ date -d @1489386335
Mon Mar 13 07:25:35 CET 2017
But it has been in its current status (Idle) for ~3 days?! So my guess is that it was running when the site changed status, it failed to run to completion, and condor somehow took care of it.
Scanning AgentStatusWatcher, there were no condor issues when killing jobs on either the drain or the down change. Scanning JobSubmitter, there were no T0 job submissions after the site went into down status.
I suggest you contact the glideinWMS team to figure out what exactly happened to this job (likely by diving into the condor logs/history and whatnot).
@amaltaro what do you mean by "condor took care of it"? Why is the agent not removing this job from the queue? If the agent is still waiting for that job (I think it is), it should take care of it to move the workflow forward, regardless of what happened in htcondor, which we do indeed need to figure out a bit more.
Just to make it crystal clear, the agent only edits/removes jobs that are in JobStatus=1 (Idle); everything else is left untouched when a site changes its status.
The second important point is that the agent acts on a snapshot basis, i.e., once it sees the site status change, it fetches the Idle jobs in condor and edits/removes them. After that (and the local DB update), the agent does not do anything else to that site or its jobs.
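To illustrate why a snapshot can miss a job, here is a small self-contained sketch. It is not the agent's code: `snapshot_cleanup`, the dictionary layout, and the job ids `job-A`/`job-B` are all made up for the example. Only the JobStatus codes (1 = Idle, 2 = Running) are HTCondor's.

```python
IDLE, RUNNING = 1, 2  # HTCondor JobStatus codes


def snapshot_cleanup(queue, site):
    """Return the ids of jobs a one-off pass would edit/remove for `site`.

    Only jobs that are Idle at the moment of the snapshot are considered.
    """
    return [
        job["id"]
        for job in queue
        if job["JobStatus"] == IDLE and site in job["DESIRED_Sites"].split(",")
    ]


# Hypothetical queue at the moment the site status changes.
queue = [
    {"id": "job-A", "JobStatus": RUNNING, "DESIRED_Sites": "T0_CH_CERN"},
    {"id": "job-B", "JobStatus": IDLE, "DESIRED_Sites": "T0_CH_CERN"},
]
handled = snapshot_cleanup(queue, "T0_CH_CERN")

# Later, the running job is sent back to Idle (e.g. a lost connection).
queue[0]["JobStatus"] = IDLE
leftover = [
    job["id"]
    for job in queue
    if job["JobStatus"] == IDLE and job["id"] not in handled
]
```

In this toy run only `job-B` is handled by the snapshot, while `job-A` ends up Idle afterwards with nobody looking at it again.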
I assume you're asking for the agent to keep looking for jobs matching that site, and continuously edit/remove anything that has a site in Drain/Abort/Down (?) That's of course possible, but honestly I don't think this is the best approach, especially considering:
Diego is going to investigate this job. I'm almost sure it was running when the site changed its status and it somehow reappeared as idle later on (pilot dying or something like that).
Hey, you are probably right about the history of those jobs, but it turns out that this is a valid lifecycle for the job (htcondor has the right to send a running job back to idle, for various reasons), and we end up with a dangling workflow, waiting for a handful of jobs that will never run and that will only be removed by the agent in 5 days ... I don't think this is too good either. We have to find a way to deal with that.
From a quick look at the job's logs, it can be seen that the job was running at the time the site was set to "down". The job started to run on Mon Mar 20 07:41:24 CET 2017, then the schedd died on Tue Mar 21 10:40:04 CET 2017, causing a disconnection, after which the job was set to Idle.
Thanks @ddavila0! So @amaltaro, what's next?
I'll discuss this issue with @ticoann and probably set an automatic lookup & cleanup for sites that are not "enabled". It shall come in the next WMAgent release (April)
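One possible shape for such a periodic lookup & cleanup is sketched below, purely as an assumption about how it could work and not as what WMAgent ships. The function name `sweep`, the status strings, the dictionary layout, and the placeholder site `T2_XX_Other` are hypothetical.

```python
IDLE = 1  # HTCondor JobStatus code for Idle


def sweep(queue, site_status):
    """Plan edits/removals for Idle jobs that target non-enabled sites.

    Meant to be called on every polling cycle, so that a job falling back
    to Idle after the site status change is still picked up.
    """
    blocked = {site for site, status in site_status.items() if status != "enabled"}
    actions = {}
    for job in queue:
        if job["JobStatus"] != IDLE:
            continue  # running/held jobs are left alone
        sites = job["DESIRED_Sites"].split(",")
        usable = [s for s in sites if s not in blocked]
        if len(usable) == len(sites):
            continue  # no blocked site involved
        if usable:
            actions[job["id"]] = ("edit", ",".join(usable))
        else:
            actions[job["id"]] = ("remove", None)
    return actions


# Hypothetical state: one site down, one placeholder site enabled.
status = {"T0_CH_CERN": "down", "T2_XX_Other": "enabled"}
queue = [
    {"id": "job-A", "JobStatus": IDLE, "DESIRED_Sites": "T0_CH_CERN"},
    {"id": "job-B", "JobStatus": IDLE, "DESIRED_Sites": "T0_CH_CERN,T2_XX_Other"},
    {"id": "job-C", "JobStatus": 2, "DESIRED_Sites": "T0_CH_CERN"},
]
planned = sweep(queue, status)
```

Because the sweep looks at the current queue each time rather than at a one-off snapshot, `job-C` would be planned for removal on a later cycle if it ever dropped back to Idle.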
Due to the lack of activity and/or prioritization of this issue, I consider it no longer relevant and we shall close it. Thanks!
https://cms-gwmsmon.cern.ch/prodview/fabozzi_HIRun2015-HIOniaPeripheral30100-02May2016_758p4_170306_123931_1702
Aren't the jobs supposed to be removed?