Looking in the schedd's again, condor_qedit does not time out every time, only for a fraction of post jobs. As per the elog entry, the error manifests itself as:
Fri, 07 Jun 2019 13:48:47 CEST(+0000):WARNING:HTCondorUtils Subprocess with PID 4051161 (executed in AuthenticatedSubprocess) timed out. Killing it.
So the current theory is that maybe at times the schedd is busy, or a few postjobs are running at the same time, and the operation does not complete within the 60 sec timeout hardcoded in HTCondorUtils.AuthenticatedSubprocess.
Looking at the PJ code: https://github.com/dmwm/CRABServer/blob/3d6d3f5a9ad8bd1e392961cdf48dca74398dac44/src/python/TaskWorker/Actions/PostJob.py#L2649-L2656
One simple improvement could be to set all ads in a single schedd.edit call, which should save time. Also, I do not know if it matters, but we do not need the authenticated subprocess with X509 authentication: since the PostJob runs with the same userId as the job which it tries to edit, it is possible that local unix authentication is faster.
Then, we should check the frequency of the problem. If it is low enough, we may be able to get along with simply a few retries, I'd say up to 5, with random, increasing intervals.
But this is a somewhat critical operation: if we fail, not only is the job status incorrect in the dashboard, but we may again end up leaving the job in the queue forever, which is not good. So we need some way to ensure that the job is eventually removed from the queue anyhow. See also this note https://cms-logbook.cern.ch/elog/GlideInWMS/6929 At the very least we need to catch the failure to report the status update, although if we can't report an exit code nor set a classAd, I have no good idea at the moment on how to communicate this information.
I find it more and more appealing not to mess with HTCondor, not touch LeaveJobInQueue, and separately report to MONIT, even via a STOMP call from the schedd. As long as job bookkeeping is correct, we can always write the new status information to some file and e.g. talk to MONIT in the task_wrapper. An alternative is to make PostJobs sort of try forever to update the classAds.
We also need to be aware that we should move from https://github.com/dmwm/CRABServer/blob/3d6d3f5a9ad8bd1e392961cdf48dca74398dac44/src/python/HTCondorUtils.py#L42 to the WMCore version as per #5706, where the timeout is currently hardcoded to 1 hour! https://github.com/dmwm/WMCore/blob/038b4ca738513a1fde5c30964b9a5531a74ccc47/src/python/WMCore/Services/PyCondor/PyCondorUtils.py#L28
@lecriste you may surely have a chat with @amaltaro about the timeout
we may also want to increase priority of #5706 possibly reassigning it
Differently from the CLI condor_qedit, the python binding schedd.edit takes only one ad at a time: https://htcondor-python.readthedocs.io/en/latest/htcondor.html#htcondor.Schedd.edit :-(
I have no idea if using schedd.transaction around schedd.edit will help or not; that's something where Condor expert input is needed.
@bbockelm, do you have any recommendation here?
In section 1.3.2 ('Managing Jobs') of your HTCondor Python Documentation, the description of the Schedd.edit() method reads:
Change an attribute for a set of jobs to a given expression. If invoked within a transaction, multiple calls to edit() are visible atomically.
What does 'atomically' mean exactly?
I have no idea if using schedd.transaction around schedd.edit will help or not; that's something where Condor expert input is needed.
I gave it a try and things look better: the timeout showed up again for some postJobs (32 out of 8938), but all of them correctly edited their job ClassAds and so the corresponding jobs could move to the history. No job was Removed.
From the new graph I added to the dashboard, one can see the fast-increasing number of Completed jobs in the brown schedd, corresponding to CRAB jobs entering the postJob step, which starts to slowly decrease once most of the task has completed (i.e. no new Completed jobs) and the postJobs terminate (letting HTCondor remove the jobs from the queue): https://monit-grafana.cern.ch/render/d-solo/000000185/cmsops-crabmetrics?orgId=11&from=1560263162699&to=1560285999063&var-Schedds=All&var-T1_Sites=All&panelId=20&width=1000&height=500&tz=Europe%2FZurich
More details: the timeouts occurred in bunches, at 18:12:07, 20:14:31, 21:16:10, 22:18:10, 23:19:20 CEST(+0000) on vocms0196, and all such postJob logs have the usual warning:
Tue, 11 Jun 2019 17:36:35 CEST(+0000):WARNING:PostJob Unable to determine ASO start time from job report. Will use ASO start time = 2019-06-11 17:36:35.582054 (1560267395).
and this Traceback:
Tue, 11 Jun 2019 18:12:18 CEST(+0000):ERROR:PostJob Failure during post-job execution.
Traceback (most recent call last):
File "TaskWorker/Actions/PostJob.py", line 1574, in execute
retval, retmsg = self.execute_internal()
File "TaskWorker/Actions/PostJob.py", line 1973, in execute_internal
return self.check_retry_count(mostCommon(self.stageout_exit_codes, 60324)), retmsg
File "TaskWorker/Actions/PostJob.py", line 2764, in check_retry_count
self.set_state_ClassAds('COOLOFF', exitCode=exitCode)
File "TaskWorker/Actions/PostJob.py", line 2658, in set_state_ClassAds
self.logger.info("====== Finished to update classAds.")
File "HTCondorUtils.py", line 75, in __exit__
self.wpipe.close()
IOError: [Errno 32] Broken pipe
@lecriste -
What does 'atomically' mean exactly?
It means that all the edits show up at once. Other people doing condor_q will either see none of the edits or all of them.
Also, each "transaction" requires an fsync on the filesystem. If you do 1000 edits all separately, that's 1000 fsyncs. If you do 1000 edits inside a single transaction, that's 1 fsync.
@bbockelm thanks. I guess the question was not so much what atomically means in general, but rather whether the text meant "updated together in a single atomic transaction" or "each of them is updated in a single atomic transaction". Thanks for the tip about fsync, we surely want to minimize those on our schedd's.
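To make the difference concrete, here is a minimal sketch with the python bindings (hedged: the job id and the CRAB_PostJobStatus ad are purely illustrative; CRAB_PostJobLastUpdate and LeaveJobInQueue are the ads the PostJob actually edits) of separate edits versus a single transaction:

```python
import time

import classad
import htcondor

schedd = htcondor.Schedd()
job_id = "1234.0"  # hypothetical ClusterId.ProcId of the job to edit

# Separate edits: each edit() issued outside a transaction is committed
# on its own, i.e. one fsync on the schedd per attribute.
#   schedd.edit([job_id], "CRAB_PostJobStatus", '"FINISHED"')
#   schedd.edit([job_id], "CRAB_PostJobLastUpdate", str(time.time()))

# Single transaction: the same edits are committed together, so the
# schedd does one fsync and condor_q sees either none or all of them.
with schedd.transaction():
    schedd.edit([job_id], "CRAB_PostJobStatus", '"FINISHED"')
    schedd.edit([job_id], "CRAB_PostJobLastUpdate", str(time.time()))
    schedd.edit([job_id], "LeaveJobInQueue", classad.ExprTree("false"))
```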
@lecriste it looks to me that we want to do all of:
we also asked Madison, here's from Jaime:
On 14/06/2019 23:07, Jaime Frey wrote:
> Here are a few thoughts on how to make your current setup work:
> * Have the post script retry a few times if condor_qedit fails. It should sleep some time in between attempts.
> * Set the initial leave_in_queue expression so that the job can leave the queue after it’s been in Completed status for more than a certain amount of time (say 1 hour).
> * Change the too-many-completed-jobs alert to ignore jobs whose leave_in_queue evaluates to true.
>
> Even with these changes, this data from the post script can still get lost. I’ll have to think some more about a better solution.
>
> - Jaime
So I think we still want to do 1.-3. as per the previous comment in https://github.com/dmwm/CRABServer/issues/5907#issuecomment-501450029. To be specific I suggest the following retry policy, but any similar one will do; the important things are to increase the delay at each iteration and to have a random component (for attempts 1-4 I suggest logging a Warning). See the sketch below.
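A minimal sketch of such a retry policy (numbers are illustrative; the delays just need to grow with the attempt number and contain a random component):

```python
import logging
import random
import time

def edit_with_retries(do_edit, max_attempts=5, base_delay=60):
    """Call do_edit() up to max_attempts times with increasing, randomized delays.

    do_edit is expected to perform the schedd.transaction()/schedd.edit()
    calls; any exception is treated as a failed attempt.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            do_edit()
            return True
        except Exception as exc:
            if attempt == max_attempts:
                logging.error("ClassAd edit failed after %d attempts: %s", attempt, exc)
                return False
            # e.g. roughly 1-2 min, then 2-4 min, 3-6 min, 4-8 min
            delay = attempt * base_delay * (1 + random.random())
            logging.warning("ClassAd edit attempt %d failed (%s), retrying in %.0f s",
                            attempt, exc, delay)
            time.sleep(delay)
```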
AND when submitting a job, set LeaveJobInQueue to an expression which evaluates to False 26 (or 30?) hours after the job is completed, along the lines of
LeaveJobInQueue = (JobStatus == 4 || JobStatus == 3) && (time() - EnteredCurrentStatus < 30*3600)
here is an example of such code https://github.com/dmwm/CRABServer/blob/3c59bc12256b99de2414692731c93463cad64ae9/src/python/TaskWorker/Actions/DagmanCreator.py#L169-L176
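For illustration only, a hedged sketch of how such an expression could be attached at submit time through the python bindings (the real CRAB code writes it into the JDL built by DagmanCreator, as in the link above; the executable, arguments, and timings here are placeholders):

```python
import htcondor

sub = htcondor.Submit({
    "executable": "/bin/sleep",
    "arguments": "300",
    # Keep a Completed/Removed job in the queue for at most 30 hours,
    # so a PostJob that never manages to reset the ad cannot pin the
    # job in the queue forever.
    "leave_in_queue": "(JobStatus == 4 || JobStatus == 3) && "
                      "(time() - EnteredCurrentStatus < 30 * 3600)",
})

with htcondor.Schedd().transaction() as txn:
    sub.queue(txn)
```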
By the way, I suggest increasing the timeout from 1 min to 3 min, even if I suspect that when the schedd is busy we get an error before the timeout.
A few other thoughts:
* Make sure LeaveJobInQueue is set to true.
* Have LeaveJobInQueue evaluate to false if no updates from a post script have occurred in 2 hours.
* There is no need for the AuthenticatedSubprocess here; that will reduce the load on the host (by not forking the process).

+1
With the current code PostJob managed to set/edit the ClassAds without any failure (for 2 jobs out of 9000 all five attempts were needed). The exception raised is always this one:
Traceback (most recent call last):
File "TaskWorker/Actions/PostJob.py", line 2657, in set_state_ClassAds
with self.schedd.transaction():
RuntimeError: Failed to connect to schedd.
Now I will work on the expression for LeaveJobInQueue.
I would have been happy with 2 failures overall. But 2/9k jobs needing 5 retries worries me when looking at 200k~400k jobs/day. Were the 9k jobs on a single task/schedd ? Which schedd ? Was it very loaded and busy ?
We can surely start with this as a way to test the grafana dashboard and give users a more or less working fallback. Then maybe we've been unlucky. But if this error rate stays we need to do something.
What is the failure rate with zero retries?
As best as I can understand, this is a reasonable amount of load and we shouldn’t see many issues.
Might be useful to set D_FULLDEBUG on the schedd and try to correlate with the transaction failures.
Is it possible to enable the condor debugging on the python client as well?
Is any of this code threaded? Periodically there have been reports of threading issues here.
I would have been happy with 2 failures overall. But 2/9k jobs needing 5 retries worries me when looking at 200k~400k jobs/day. Were the 9k jobs on a single task/schedd ?
Yes: 190619_184040:lecriste_crab_NotAutomaticSplitt_data_preprod_noAuthSubpr_randomRetry_noMaxIdleForReal_catchExc_highP
Which schedd ?
vocms0199
Was it very loaded and busy ?
Very loaded for sure, as I am sending all my tasks there on purpose (see the idle and completed+removed job graphs), although not many running jobs: https://monit-grafana.cern.ch/d/000000185/cmsops-crabmetrics?orgId=11&from=1560875549249&to=1561048349250&refresh=1m&var-Schedds=All&var-T1_Sites=All
We can surely start with this as a way to test the grafana dashboard and give users a more or less working fallback.
Agree.
Then maybe we've been unlucky. But if this error rate stays we need to do something.
Looks like increasing the number of attempts can be a solution.
What is the failure rate with zero retries?
175 out of 9000. Out of these 175, the number of jobs succeeding at the:
As best as I can understand, this is a reasonable amount of load and we shouldn’t see many issues.
Might be useful to set D_FULLDEBUG on the schedd and try to correlate with the transaction failures.
Is it possible to enable the condor debugging on the python client as well?
Need to check.
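A hedged sketch of what enabling client-side debugging typically looks like with the python bindings (to be verified; assumes the TOOL_DEBUG knob applies to the client side):

```python
import htcondor

# Raise the client-side logging level and send the bindings' debug
# output to stderr before talking to the schedd.
htcondor.param["TOOL_DEBUG"] = "D_FULLDEBUG"
htcondor.enable_debug()

schedd = htcondor.Schedd()
# subsequent schedd.transaction()/schedd.edit() calls should now log
# their client-side activity to stderr
```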
Is any of this code threaded?
I don't think so.
Periodically there have been reports of threading issues here.
no threads.
but several PostJob scripts may be running at the same time. Maybe we can use the JobRouter to sort of aggregate changes over 10 min intervals and then do the classAd editing with the best technology? I guess we start like this since the monitoring transition is high priority after all, and we need to get more statistics.
but several PostJob scripts may be running at the same time.
Yes, currently we have a limit of 20 postJob scripts per task running simultaneously.
Looking at schedd duty cycle for vocms0199 from https://cms-htcondor-monitor.web.cern.ch/cms-htcondor-monitor/letts/aperezca/HTML/Schedds/crab3@vocms0199.cern.ch_status_168h.html
I am not that surprised that some condor_* command failed. The question at this point is rather why DC was so bad for so long. Are the test jobs "odd" or is all that condor_qedit the reason ? If the latter, this is a no go :-(
The exception is raised at the first line here:
with self.schedd.transaction():
    for param in params:
        self.schedd.edit([self.dag_jobid], param, str(params[param]))
    self.schedd.edit([self.dag_jobid], 'CRAB_PostJobLastUpdate', str(time.time()))
    # Once state ClassAds have been updated, let HTCondor remove the job from the queue
    self.schedd.edit([self.dag_jobid], "LeaveJobInQueue", classad.ExprTree("false"))
so it might even not execute any edit.
These are the test tasks I submitted to vocms0199: https://monit-grafana.cern.ch/d/cmsTMGlobal/cms-tasks-monitoring-globalview?orgId=11&from=1560501891763&to=1561106691764&refresh=5m&var-user=lecriste&var-site=All&var-current_url=%2Fd%2FcmsTMDetail%2Fcms_task_monitoring&var-task=All
Yes, the exception means that it can't talk to the schedd. The question is whether the other thousands of calls that instead succeed are the reason for the duty cycle at 99%. The schedd can't work properly with that DC.
(sent from Stefano's mobile)
Hi,
This is strange. Are there other tests ongoing? Is the PostJob code here being run multiple times?
These should be about the same amount of work as a shadow update, which every running job does every few minutes.
Can we try again with the transaction code but skipping the edits?
When you start the next test, leave a note in the ticket and I will try to log in to the schedd and take a look.
Brian
This is strange. Are there other tests ongoing?
There is a total of 3k running jobs from standard user tasks on average, see vocms0199 here: https://monit-grafana.cern.ch/d/000000185/cmsops-crabmetrics?orgId=11&refresh=1m&from=1560945268469&to=1561118068469&var-Schedds=All&var-T1_Sites=All&panelId=1&fullscreen
Is the PostJob code here being run multiple times?
These should be about the same amount of work as a shadow update, which every running job does every few minutes.
Can we try again with the transaction code but skipping the edits?
Here is the task: https://cmsweb-testbed.cern.ch/crabserver/ui/task/190621_114801%3Alecriste_crab_NotAutomaticSplitt_data_preprod_noAuthSubpr_randomRetry_noMaxIdle_BrianTest_highP
When you start the next test, leave a note in the ticket and I will try to log in to the schedd and take a look.
It will take a while for the first jobs to run.
@bbockelm, does it make sense to try the NonDurable TransactionFlag?

class htcondor.TransactionFlags
Flags affecting the characteristics of a transaction.
NonDurable: Non-durable transactions are changes that may be lost when the condor_schedd crashes. NonDurable is used for performance, as it eliminates extra fsync() calls.
@lecriste - that would provide a small improvement, but I'm still confused as to why this would be a noticeable load in the first place.
@belforte - where did you get the DC duty cycle graph from? I don't see it in @lecriste's links.
ARGH. I think I see the problem; it is here:
https://github.com/htcondor/htcondor/blob/master/src/python-bindings/schedd.cpp#L1834
It's missing the SetAttribute_NoAck flag. That means we have a round trip between the schedd and the python client for each attribute set (which is probably killing performance).
If the variant where you "create transaction and then skip edits" tests out fine, then we will have the condor dev get you guys a special build.
ARGH. I think I see the problem; it is here:
https://github.com/htcondor/htcondor/blob/master/src/python-bindings/schedd.cpp#L1834
It's missing the SetAttribute_NoAck flag. That means we have a round trip between the schedd and the python client for each attribute set (which is probably killing performance). If the variant where you "create transaction and then skip edits" tests out fine, then we will have the condor dev get you guys a special build.
Difficult to say now: I need to send a few more tasks and wait for a significant amount of jobs to complete in order to "reproduce" the same load.
Brian, I have put a URL below the plot. This is the stuff which Antonio put together. Sorry, I am in flight today, can't do much.
Recent DC still looks bad. @lecriste you should make sure that it is normalized before submitting again. https://cms-htcondor-monitor.web.cern.ch/cms-htcondor-monitor/letts/aperezca/HTML/Schedds/crab3@vocms0199.cern.ch_status_24h.html
Recent duty cycle on vocms0199 is reasonable as of 6 am today. Now it seems the right time to submit tests.
@bbockelm in order to confirm your current hypothesis that the extra load is generated by a round trip for each attribute rather than one overall, maybe we should use a transaction with one single ad edit, rather than an empty one? In that case I'd start by only resetting LeaveJobInQueue, so we can verify that job/task handling is correct, and then it will only be a matter of adding what's useful for monitoring once we have a patched condor. Of course we can do it in steps: empty - one ad edit - multiple ad edits.
For the sake of a controlled test (and me being busy with family) I'd rather not submit load myself. @lecriste if it gets impossible/difficult for you to keep working on this, let me know and send instructions on what you did so I may try to reproduce it.
I have just submitted two more tasks (with 9k jobs each), one with no ad edit and one with a single ad edit (LeaveJobInQueue = false), to the same schedd.
In the task with no ad edit that I submitted last night (when DC was still high), 5 jobs out of 9k needed all five attempts to execute these lines:
with self.schedd.transaction(htcondor.TransactionFlags.NonDurable):
    #for param in params:
    #    self.schedd.edit([self.dag_jobid], param, str(params[param]))
    #self.schedd.edit([self.dag_jobid], 'CRAB_PostJobLastUpdate', str(time.time()))
    ## Once state ClassAds have been updated, let HTCondor remove the job from the queue
    #self.schedd.edit([self.dag_jobid], "LeaveJobInQueue", classad.ExprTree("false"))
    self.logger.info(" -----> Finished %s -----", msg)
    self.logger.info("====== Finished to update ClassAds.")
    break
I still believe that errors have to be expected when the duty cycle is >98%.
Can you do one task at a time? Or two of the same kind? Otherwise we do not learn anything. The point is to find what drives the schedd into overload, not which command works in spite of that.
Duty cycle reached 1 again after the recent submission of the two tasks: I am not killing the current tasks in case @bbockelm wants to debug more.
I will submit another single task (with no ad edit) as soon as DC becomes normal again.
In case it helps, I noticed a few jobs (for instance 24805218.0) being removed with
RemoveReason = "The system macro SYSTEM_PERIODIC_REMOVE expression '( ( ( JobUniverse =!= 7 ) && ( JobUniverse =!= 12 ) && ( ( ( NumJobStarts > 9 ) =?= true ) || ( ( NumShadowStarts > 19 ) =?= true ) ) ) ) || ( ( DiskUsage > 27000000 ) =?= true )' evaluated to TRUE"
and indeed:
NumShadowStarts = 28
I have no experience with problems with NumShadowStarts, can't guess what's going on. But maybe it is a consequence of schedd being too busy to do all that's asked of it.
The duty cycle became reasonable again as of 7 am. It seems correlated to the trend in this graph: https://monit-grafana.cern.ch/d/000000185/cmsops-crabmetrics?orgId=11&from=1561208249917&to=1561294649917&refresh=5m&var-Schedds=All&var-T1_Sites=All&panelId=20&fullscreen
which was due mostly to Removed jobs.
I have just submitted another task with no ad edit: https://cmsweb-testbed.cern.ch/crabserver/ui/task/190623_125129%3Alecriste_crab_NotAutomaticSplitt_data_preprod_noAuthSubpr_noMaxIdleAndPJ_BrianTest_NonDurable_normDC2_highP
vocms0199 appears to be fully unresponsive. I logged in and found that the / partition is 100% full.
belforte@vocms0199/condor> df -h /
Filesystem      Size  Used Avail Use% Mounted on
/dev/vda1       160G  160G   20K 100% /
belforte@vocms0199/condor>
On the non-trivial side, one week ago we had the classAd editing enabled on all schedd's, and while some were failing (so that we ended up with too many classAds in "completed" due to LeaveJobInQueue not being turned to false), many succeeded and we had no DutyCycle problem. Why all those problems on vocms0199 now? Is it really the condor_qedit?
I submitted a single task on a "clean" schedd (vocms0194) with this code: https://github.com/dmwm/CRABServer/blob/d8c924540e5b57aa779dff425ce0975a37c40778/src/python/TaskWorker/Actions/PostJob.py#L2658-L2666
Only one postJob out of 9k failed on the first attempt with the usual error:
Traceback (most recent call last):
  File "TaskWorker/Actions/PostJob.py", line 2658, in set_state_ClassAds
    with self.schedd.transaction(htcondor.TransactionFlags.NonDurable):
RuntimeError: Failed to connect to schedd.
Here is the trend of completed jobs in the queue: https://monit-grafana.cern.ch/d/000000185/cmsops-crabmetrics?orgId=11&refresh=5m&from=1561473394311&to=1561646194311&var-Schedds=crab3%40vocms0194.cern.ch&var-T1_Sites=All&panelId=20&fullscreen
and this is the Duty Cycle:
Completed jobs started to leave the queue today around 13h, when this expression evaluated to False:
LeaveJobInQueue = ifThenElse(( JobStatus =?= 4 || JobStatus =?= 3 ) && ( time() - EnteredCurrentStatus < 30 * 60 * 60 ),true,false)
Do I understand correctly that this is good?
(sent from Stefano's mobile)
Well, the DC close to 1 when the postJob scripts started (at most 20 simultaneously) does not look good to me. And all Schedd.edit() calls were disabled, only the Schedd.transaction() was executed. According to @bbockelm this should not happen?
ah.. that's not so good. I could not read the time correlation in the plot, sorry. I guess you want to compare with not calling the schedd transaction: is that the reason for the high DutyCycle, or is it simply due to so many jobs completing at the same time? 9k jobs completing in a couple of hours (or less) is not usual. Again... when all jobs were doing all the edits even w/o the transaction, we suffered from the fraction which failed and stuck in the queue, not from high DC.
ah.. that's not so good. I could not read the time correlation in the plot, sorry. I guess you want to compare with not calling the schedd transaction: is that the reason for the high DutyCycle, or is it simply due to so many jobs completing at the same time?
The aim of the test was: let's try with the lightest configuration, i.e. a Schedd.transaction() without Schedd.edit(), and see how the schedd reacts, as you suggested too.
9k jobs completing in a couple of hours (or less) is not usual. Again... when all jobs were doing all the edits even w/o the transaction, we suffered from the fraction which failed and stuck in the queue, not from high DC.
Didn't we say that jobs stuck in the queue (Completed or Removed) have no effect? I do not remember the DC trend with that situation.
see https://cms-logbook.cern.ch/elog/GlideInWMS/6927 and the full thread there. Also related to #5906