dmwm / CRABServer


PostJob fails to do condor_qedit to set status ads and reset LeaveJobInQueue #5907

Closed. belforte closed this issue 5 years ago.

belforte commented 5 years ago

see https://cms-logbook.cern.ch/elog/GlideInWMS/6927 and the full thread there. Also related to #5906

belforte commented 5 years ago

Looking in the schedds again, condor_qedit does not time out every time, only for a fraction of post jobs. As per the elog entry, the error manifests itself as:

Fri, 07 Jun 2019 13:48:47 CEST(+0000):WARNING:HTCondorUtils Subprocess with PID 4051161 (executed in AuthenticatedSubprocess) timed out. Killing it.

So the current theory is that at times the schedd is busy, or a few PostJobs are running at the same time, and the operation does not complete within the 60-second timeout hardcoded in HTCondorUtils.AuthenticatedSubprocess.

Looking at the PJ code: https://github.com/dmwm/CRABServer/blob/3d6d3f5a9ad8bd1e392961cdf48dca74398dac44/src/python/TaskWorker/Actions/PostJob.py#L2649-L2656

One simple improvement could be to set all ads in a single schedd.edit call, which should save time. Also, I do not know if it matters, but we do not need the authenticated subprocess with X509 authentication: since the PostJob runs with the same userId as the job it tries to edit, local Unix authentication may be faster.

Then we should check the frequency of the problem. If it is low enough, we may be able to get by with simply a few retries, I'd say up to 5, with random, increasing intervals.

But this is a somewhat critical operation: if we fail, not only is the job status wrong in the dashboard, but we may again end up leaving the job in the queue forever, which is not good. So we need some way to ensure that the job is eventually removed from the queue anyhow. See also this note https://cms-logbook.cern.ch/elog/GlideInWMS/6929 At the very least we need to catch the failure to report the status update, although if we can't report an exit code nor set a classAd, I have no good idea at the moment on how to communicate this information.

I find more and more appealing the idea of not messing with HTCondor: do not touch LeaveJobInQueue and separately report to MONIT, even via a STOMP call from the schedd. As long as job bookkeeping is correct, we can always write the new status information to some file and e.g. talk to MONIT in the task_wrapper. An alternative is to make PostJobs try more or less forever to update the classAds.
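As an illustration of the "write the status to a file and report it later" idea, a minimal sketch (the file name, directory and record layout are assumptions, not existing CRAB code):

import json
import os
import time

def record_final_status(status, exit_code, out_dir="task_process"):
    """Append the terminal job status to a local JSON-lines file, so that a
    later step (e.g. the task wrapper) can forward it to MONIT instead of
    relying on the PostJob editing classAds on a busy schedd."""
    record = {
        "status": status,        # e.g. "FINISHED" or "COOLOFF"
        "exit_code": exit_code,
        "timestamp": int(time.time()),
    }
    if not os.path.isdir(out_dir):
        os.makedirs(out_dir)
    # One small JSON document per line, appended by each PostJob.
    with open(os.path.join(out_dir, "status_updates.jsonl"), "a") as fd:
        fd.write(json.dumps(record) + "\n")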

belforte commented 5 years ago

We also need to be aware that we should move from https://github.com/dmwm/CRABServer/blob/3d6d3f5a9ad8bd1e392961cdf48dca74398dac44/src/python/HTCondorUtils.py#L42 to the WMCore version as per #5706, where the timeout is currently hardcoded to 1 hour! https://github.com/dmwm/WMCore/blob/038b4ca738513a1fde5c30964b9a5531a74ccc47/src/python/WMCore/Services/PyCondor/PyCondorUtils.py#L28

@lecriste you may surely have a chat with @amaltaro about the timeout

We may also want to increase the priority of #5706, possibly reassigning it.

belforte commented 5 years ago

Differently from the condor_qedit CLI, the Python binding Schedd.edit takes only one attribute at a time https://htcondor-python.readthedocs.io/en/latest/htcondor.html#htcondor.Schedd.edit :-(

belforte commented 5 years ago

I have no idea if using schedd.transaction around schedd.edit will help or not, that's something where Condor expert input is needed.

lecriste commented 5 years ago

I have no idea if using schedd.transaction around schedd.edit will help or not, that's something where Condor expert input is needed.

@bbockelm, do you have any recommendation here? In section 1.3.2 ('Managing Jobs') of your HTCondor Python Documentation, the description of the Schedd.edit() method reads:

Change an attribute for a set of jobs to a given expression. If invoked within a transaction, multiple calls to edit() are visible atomically.

What does 'atomically' mean exactly?

lecriste commented 5 years ago

I have no idea if using schedd.transaction around schedd.edit will help or not, that's something where Condor expert input is needed.

I gave it a try and things look better: the timeout showed up again for some PostJobs (32 out of 8938), but all of them eventually edited their job ClassAds correctly, and so the corresponding jobs could move to the history. No job was Removed.

From the new graph I added to the dashboard, one can see the fast-increasing number of Completed jobs in the brown schedd, corresponding to CRAB jobs entering the PostJob step; it starts to slowly decrease once most of the task has completed (i.e. no new Completed jobs) and the PostJobs terminate (letting HTCondor remove the jobs from the queue): https://monit-grafana.cern.ch/render/d-solo/000000185/cmsops-crabmetrics?orgId=11&from=1560263162699&to=1560285999063&var-Schedds=All&var-T1_Sites=All&panelId=20&width=1000&height=500&tz=Europe%2FZurich

lecriste commented 5 years ago

More details: the timeouts occurred in bunches, at 18:12:07, 20:14:31, 21:16:10, 22:18:10, 23:19:20 CEST(+0000) on vocms0196, and all such PostJob logs have the usual warning: Tue, 11 Jun 2019 17:36:35 CEST(+0000):WARNING:PostJob Unable to determine ASO start time from job report. Will use ASO start time = 2019-06-11 17:36:35.582054 (1560267395).

and this Traceback:

Tue, 11 Jun 2019 18:12:18 CEST(+0000):ERROR:PostJob Failure during post-job execution.
Traceback (most recent call last):
  File "TaskWorker/Actions/PostJob.py", line 1574, in execute
    retval, retmsg = self.execute_internal()
  File "TaskWorker/Actions/PostJob.py", line 1973, in execute_internal
    return self.check_retry_count(mostCommon(self.stageout_exit_codes, 60324)), retmsg
  File "TaskWorker/Actions/PostJob.py", line 2764, in check_retry_count
    self.set_state_ClassAds('COOLOFF', exitCode=exitCode)
  File "TaskWorker/Actions/PostJob.py", line 2658, in set_state_ClassAds
    self.logger.info("====== Finished to update classAds.")
  File "HTCondorUtils.py", line 75, in __exit__
    self.wpipe.close()
IOError: [Errno 32] Broken pipe
bbockelm commented 5 years ago

@lecriste -

What does 'atomically' mean exactly?

It means that all the edits show up at once. Other people doing condor_q will either see none of the edits or all of them.

Also, each "transaction" requires a fsync on the filesystem. If you do 1000 edits all separately, that's 1000 fsync's. If you do 1000 edits inside a single transaction, that's 1 fsync.

belforte commented 5 years ago

@bbockelm thanks. I guess the question was not so much what atomically means in general, but rather whether the text meant "updated together in a single atomic transaction" or "each of them is updated in its own atomic transaction". Thanks for the tip about fsync, we surely want to minimize those on our schedds.

belforte commented 5 years ago

@lecriste it looks to me that we want to do all of:

  1. use transaction
  2. extend timeout
  3. retry with exponential backoff and random intervals, and possibly we may decide to keep trying "forever"
belforte commented 5 years ago

We also asked Madison; here's from Jaime:

On 14/06/2019 23:07, Jaime Frey wrote:
> Here are a few thoughts on how to make your current setup work:
> * Have the post script retry a few times if condor_qedit fails. It should sleep some time in between attempts.
> * Set the initial leave_in_queue expression so that the job can leave the queue after it’s been in Completed status for more than a certain amount of time (say 1 hour).
> * Change the too-many-completed-jobs alert to ignore jobs whose leave_in_queue evaluates to true.
> 
> Even with these changes, this data from the post script can still get lost. I’ll have to think some more about a better solution.
> 
>   - Jaime
belforte commented 5 years ago

So I think we still want to do 1.-3. as per the previous comment in https://github.com/dmwm/CRABServer/issues/5907#issuecomment-501450029. To be specific, I suggest the following retry policy, but any similar one will do; the important things are to increase the delay at each iteration and to have a random component:

  1. try
  2. wait random interval in [0-1min] and try again
  3. wait random interval in [0-2min] and try again
  4. wait random interval in [2-5min] and try again
  5. wait random interval in [2-10min] and try again
  6. fail with error in PJ log

(for 1-4 I suggest logging a Warning)

AND, when submitting a job, set LeaveJobInQueue to an expression which evaluates to False 26 (or 30?) hours after the job is completed, along the lines of

leavejobinqueue = (jobstatus==4 || jobstatus==3) && (time() - enteredcurrentstatus < 30*3600)

Here is an example of such code: https://github.com/dmwm/CRABServer/blob/3c59bc12256b99de2414692731c93463cad64ae9/src/python/TaskWorker/Actions/DagmanCreator.py#L169-L176
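A minimal sketch of such a retry loop, following the intervals above (the function name and the do_edit callable are illustrative, not existing PostJob code):

import random
import time

# (min, max) seconds to wait before retries 2-5, as in the schedule above.
RETRY_WINDOWS = [(0, 60), (0, 120), (120, 300), (120, 600)]

def edit_classads_with_retries(do_edit, logger):
    """Call do_edit() up to five times, sleeping a random, increasing
    interval between attempts. Return True on success, False otherwise."""
    for attempt in range(1 + len(RETRY_WINDOWS)):
        try:
            do_edit()
            return True
        except RuntimeError as ex:
            if attempt < len(RETRY_WINDOWS):
                logger.warning("classAd edit attempt %d failed (%s); will retry", attempt + 1, ex)
                time.sleep(random.uniform(*RETRY_WINDOWS[attempt]))
            else:
                logger.error("classAd edit failed after %d attempts: %s", attempt + 1, ex)
    return False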

belforte commented 5 years ago

By the way, I suggest increasing the timeout from 1 min to 3 min, even if I suspect that when the schedd is busy we get an error before the timeout.

bbockelm commented 5 years ago

A few other thoughts:

  1. We should make sure to tweak the monitoring so these jobs don't count as being in "completed" state. We can exclude jobs from that monitoring expression if LeaveJobInQueue is set to true.
    • For extra credit, one can have LeaveJobInQueue evaluate to false if no updates from a post script have occurred in 2 hours (see the sketch after this list).
  2. Don't do the AuthenticatedSubprocess here; that will reduce the load on the host (by not forking the process).
  3. Once we reduce to a single transaction, if there are still timeouts we need to investigate further: the schedd shouldn't be unresponsive for that long.
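For point 1's extra credit, the submit-time expression could look roughly like the sketch below. It is only an illustration: it assumes the PostJob stamps a CRAB_PostJobLastUpdate timestamp (as in the edit code quoted later in this thread) and falls back to EnteredCurrentStatus for jobs whose post script never checked in.

# Sketch of a LeaveJobInQueue expression that also expires when the post
# script has not updated CRAB_PostJobLastUpdate for 2 hours.
leave_in_queue = (
    "(JobStatus == 4 || JobStatus == 3)"
    " && (time() - EnteredCurrentStatus < 30 * 3600)"
    " && (time() - ifThenElse(isUndefined(CRAB_PostJobLastUpdate),"
    " EnteredCurrentStatus, CRAB_PostJobLastUpdate) < 2 * 3600)"
)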
belforte commented 5 years ago

+1

lecriste commented 5 years ago

With the current code, the PostJob managed to set/edit the ClassAds without any final failure (for 2 jobs out of 9000 all five attempts were needed). The exception raised is always this one:

Traceback (most recent call last):
  File "TaskWorker/Actions/PostJob.py", line 2657, in set_state_ClassAds
    with self.schedd.transaction():
RuntimeError: Failed to connect to schedd.

Now I will work on the expression for LeaveJobInQueue.

belforte commented 5 years ago

I would have been happy with 2 failures overall. But 2/9k jobs needing 5 retries worries me when looking at 200k~400k jobs/day. Were the 9k jobs on a single task/schedd ? Which schedd ? Was it very loaded and busy ?

We can surely start with this as a way to test the Grafana dashboard and give users a more or less working fallback. Then maybe we've been unlucky. But if this error rate stays we need to do something.

bbockelm commented 5 years ago

What is the failure rate with zero retries?

As best as I can understand, this is a reasonable amount of load and we shouldn’t see many issues.

Might be useful to set D_FULLDEBUG on the schedd and try to correlate with the transaction failures.

Is it possible to enable the condor debugging on the python client as well?

Is any of this code threaded? Periodically there have been reports of threading issues here.
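For the Python-client side, a minimal sketch of how the condor debugging could be enabled (assuming the standard htcondor bindings; on the schedd itself D_FULLDEBUG goes into the daemon configuration, e.g. SCHEDD_DEBUG = D_FULLDEBUG):

import htcondor

# Raise the verbosity level used by the bindings, then route their debug
# output to stderr, to correlate client-side failures with the schedd log.
htcondor.param["TOOL_DEBUG"] = "D_FULLDEBUG"
htcondor.enable_debug()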

lecriste commented 5 years ago

I would have been happy with 2 failures overall. But 2/9k jobs needing 5 retries worries me when looking at 200k~400k jobs/day. Were the 9k jobs on a single task/schedd ?

Yes: 190619_184040:lecriste_crab_NotAutomaticSplitt_data_preprod_noAuthSubpr_randomRetry_noMaxIdleForReal_catchExc_highP

Which schedd ?

vocms0199

Was it very loaded and busy ?

Very loaded for sure, as I am sending all my tasks there on purpose (see idle and completed+removed job graphs), although not many running jobs: https://monit-grafana.cern.ch/d/000000185/cmsops-crabmetrics?orgId=11&from=1560875549249&to=1561048349250&refresh=1m&var-Schedds=All&var-T1_Sites=All

We can surely start with this as a way to test the Grafana dashboard and give users a more or less working fallback.

Agree.

Then maybe we've been unlucky. But if this error rate stays we need to do something.

Looks like increasing the number of attempts can be a solution.

lecriste commented 5 years ago

What is the failure rate with zero retries?

175 out of 9000. Out of these 175, the number of jobs succeeding at the:

  1. first retry: 141
  2. second retry: 23
  3. third retry: 9
  4. fourth retry: 2

As best as I can understand, this is a reasonable amount of load and we shouldn’t see many issues.

Might be useful to set D_FULLDEBUG on the schedd and try to correlate with the transaction failures.

Is it possible to enable the condor debugging on the python client as well?

Need to check.

Is any of this code threaded?

I don't think so.

Periodically there have been reports of threading issues here.

belforte commented 5 years ago

no threads.

belforte commented 5 years ago

But several PostJob scripts may be running at the same time. Maybe we can use the JobRouter to aggregate changes over 10 min intervals and then do the classAd editing with the best technology? I guess we start like this, since the monitoring transition is high priority at last, and we need to get more statistics.

lecriste commented 5 years ago

But several PostJob scripts may be running at the same time.

Yes, currently we have a limit of 20 postJob scripts per task running simultaneously.

belforte commented 5 years ago

Looking at the schedd duty cycle for vocms0199 (screenshot from 2019-06-21 10-16-49) from https://cms-htcondor-monitor.web.cern.ch/cms-htcondor-monitor/letts/aperezca/HTML/Schedds/crab3@vocms0199.cern.ch_status_168h.html

I am not that surprised that some condor_* command failed. The question at this point is rather why the DC was so bad for so long. Are the test jobs "odd", or is all that condor_qedit the reason? If the latter, this is a no-go :-(

lecriste commented 5 years ago

The exception is raised at the first line here:

            with self.schedd.transaction():
                for param in params:
                    self.schedd.edit([self.dag_jobid], param, str(params[param]))
                self.schedd.edit([self.dag_jobid], 'CRAB_PostJobLastUpdate', str(time.time()))
                # Once state ClassAds have been updated, let HTCondor remove the job from the queue
                self.schedd.edit([self.dag_jobid], "LeaveJobInQueue", classad.ExprTree("false"))

So it might not even execute any edit.

These are the test tasks I submitted to vocms0199: https://monit-grafana.cern.ch/d/cmsTMGlobal/cms-tasks-monitoring-globalview?orgId=11&from=1560501891763&to=1561106691764&refresh=5m&var-user=lecriste&var-site=All&var-current_url=%2Fd%2FcmsTMDetail%2Fcms_task_monitoring&var-task=All

belforte commented 5 years ago

Yes, the exception means that it can't talk to the schedd. The question is whether the other thousands of calls that instead succeed are the reason for the duty cycle at 99%. The schedd can't work properly with that DC.

bbockelm commented 5 years ago

Hi,

This is strange. Are there other tests ongoing? Is the PostJob code here being run multiple times?

These should be about the same amount of work as a shadow update, which every running job does every few minutes.

Can we try again with the transaction code but skipping the edits?

When you start the next test, leave a note in the ticket and I will try to log in to the schedd and take a look.

Brian

lecriste commented 5 years ago

This is strange. Are there other tests ongoing?

There is a total of 3k running jobs from standard user tasks on average; see vocms0199 here: https://monit-grafana.cern.ch/d/000000185/cmsops-crabmetrics?orgId=11&refresh=1m&from=1560945268469&to=1561118068469&var-Schedds=All&var-T1_Sites=All&panelId=1&fullscreen

Is the PostJob code here being run multiple times?

These should be about the same amount of work as a shadow update, which every running job does every few minutes.

Can we try again with the transaction code but skipping the edits?

Here is the task: https://cmsweb-testbed.cern.ch/crabserver/ui/task/190621_114801%3Alecriste_crab_NotAutomaticSplitt_data_preprod_noAuthSubpr_randomRetry_noMaxIdle_BrianTest_highP

When you start the next test, leave a note in the ticket and I will try to log in to the schedd and take a look.

It will take a while for the first jobs to run.

lecriste commented 5 years ago

@bbockelm, does it make sense to try the NonDurable TransactionFlag?

class htcondor.TransactionFlags: flags affecting the characteristics of a transaction.
NonDurable: non-durable transactions are changes that may be lost when the condor_schedd crashes. NonDurable is used for performance, as it eliminates extra fsync() calls.

bbockelm commented 5 years ago

@lecriste - that would provide a small improvement, but I'm still confused as to why this would be a noticeable load in the first place.

@belforte - where did you get the DC (duty cycle) graph from? I don't see it in @lecriste's links.

lecriste commented 5 years ago

https://cms-htcondor-monitor.web.cern.ch/cms-htcondor-monitor/letts/aperezca/HTML/Schedds/crab3@vocms0199.cern.ch_status_168h.html

bbockelm commented 5 years ago

ARGH. I think I see the problem; it is here:

https://github.com/htcondor/htcondor/blob/master/src/python-bindings/schedd.cpp#L1834

It's missing the SetAttribute_NoAck flag. That means we have a round trip between the schedd and the Python client for each attribute set (which is probably what is killing performance).

If the variant where you "create transaction and then skip edits" tests out fine, then we will have the condor dev get you guys a special build.

lecriste commented 5 years ago

ARGH. I think I see the problem; it is here:

https://github.com/htcondor/htcondor/blob/master/src/python-bindings/schedd.cpp#L1834

It's missing the SetAttribute_NoAck flag. That means we have a round trip between the schedd and the Python client for each attribute set (which is probably what is killing performance).

If the variant where you "create transaction and then skip edits" tests out fine, then we will have the condor dev get you guys a special build.

Difficult to say right now: I need to send a few more tasks and wait for a significant number of jobs to complete in order to "reproduce" the same load.

belforte commented 5 years ago

Brian, I have put a URL below the plot. This is the stuff which Antonio put together. Sorry, I am in flight today. Can't do much.

belforte commented 5 years ago

The recent DC still looks bad. @lecriste you should make sure that it is back to normal before submitting again. https://cms-htcondor-monitor.web.cern.ch/cms-htcondor-monitor/letts/aperezca/HTML/Schedds/crab3@vocms0199.cern.ch_status_24h.html

belforte commented 5 years ago

The recent duty cycle on vocms0199 is reasonable since 6am today. Now seems the right time to submit tests. (Screenshot from 2019-06-22 12-02-34)

belforte commented 5 years ago

@bbockelm in order to confirm your current hypothesis that the extra load is generated by a round trip for each attribute rather than one overall, maybe we should use a transaction with one single ad edit, rather than an empty one? In that case I'd start by only resetting LeaveJobInQueue, so we can verify that job/task handling is correct, and then it will only be a matter of adding what's useful for monitoring once we have a patched condor. Of course we can do it in steps: empty, one ad edit, multiple ad edits.

For the sake of a controlled test (and me being busy with family) I'd rather not submit load myself. @lecriste if it gets impossible/difficult for you to keep working on this, let me know and send instructions on what you did so I may try to reproduce it.

lecriste commented 5 years ago

I have just submitted two more tasks (with 9k jobs each) to the same schedd, one with no ad edit and one with a single ad edit (LeaveJobInQueue = false). In the task with no ad edit that I submitted last night (when the DC was still high), 5 jobs out of 9k needed all five attempts to execute these lines:

            with self.schedd.transaction(htcondor.TransactionFlags.NonDurable):
                #for param in params:
                #    self.schedd.edit([self.dag_jobid], param, str(params[param]))
                #self.schedd.edit([self.dag_jobid], 'CRAB_PostJobLastUpdate', str(time.time()))
                ## Once state ClassAds have been updated, let HTCondor remove the job from the queue
                #self.schedd.edit([self.dag_jobid], "LeaveJobInQueue", classad.ExprTree("false"))
                self.logger.info("       -----> Finished %s -----", msg)
                self.logger.info("====== Finished to update ClassAds.")
                break
belforte commented 5 years ago

I still believe that errors are to be expected when the duty cycle is >98%.

belforte commented 5 years ago

Can you do one task at a time? Or two of the same kind? Otherwise we do not learn anything. The point is to find what drives the schedd into overload, not which command works in spite of that.

lecriste commented 5 years ago

The duty cycle reached 1 again after the recent submission of two tasks (see image). I am not killing the current tasks in case @bbockelm wants to debug more.

I will submit another single task (with no ad edit) as soon as DC becomes normal again.

lecriste commented 5 years ago

In case it helps, I noticed a few jobs (for instance 24805218.0) being removed with

RemoveReason = "The system macro SYSTEM_PERIODIC_REMOVE expression '( ( ( JobUniverse =!= 7 ) && ( JobUniverse =!= 12 ) && ( ( ( NumJobStarts > 9 ) =?= true ) || ( ( NumShadowStarts > 19 ) =?= true ) ) ) ) || ( ( DiskUsage > 27000000 ) =?= true )' evaluated to TRUE"

and indeed:

NumShadowStarts = 28

belforte commented 5 years ago

I have no experience with NumShadowStarts problems and can't guess what's going on. But maybe it is a consequence of the schedd being too busy to do all that's asked of it.

lecriste commented 5 years ago

The duty cycle became reasonable again as of 7am. It seems correlated to the trend in this graph: https://monit-grafana.cern.ch/d/000000185/cmsops-crabmetrics?orgId=11&from=1561208249917&to=1561294649917&refresh=5m&var-Schedds=All&var-T1_Sites=All&panelId=20&fullscreen

which was due mostly to Removed jobs.

I have just submitted another task with no ad edit: https://cmsweb-testbed.cern.ch/crabserver/ui/task/190623_125129%3Alecriste_crab_NotAutomaticSplitt_data_preprod_noAuthSubpr_noMaxIdleAndPJ_BrianTest_NonDurable_normDC2_highP

belforte commented 5 years ago

vocms0199 appears to be fully unresponsive. I logged in and found that the / partition is 100% full.

belforte@vocms0199/condor> df -h /
Filesystem      Size  Used Avail Use% Mounted on
/dev/vda1       160G  160G   20K 100% /
belforte@vocms0199/condor>

On the non-trivial side, one week ago we had the classAd editing enabled on all schedds and, while some edits were failing (so that we ended up with too many classAds in "complete" due to LeaveJobInQueue not being turned to false), many succeeded and we had no duty cycle problem. Why all those problems on vocms0199 now? Is it really the condor_qedit?

lecriste commented 5 years ago

I submitted a single task on a "clean" schedd (vocms0194) with this code: https://github.com/dmwm/CRABServer/blob/d8c924540e5b57aa779dff425ce0975a37c40778/src/python/TaskWorker/Actions/PostJob.py#L2658-L2666

Only one postJob out of 9k failed on the first attempt with the usual error:

Traceback (most recent call last):
  File "TaskWorker/Actions/PostJob.py", line 2658, in set_state_ClassAds
    with self.schedd.transaction(htcondor.TransactionFlags.NonDurable):
RuntimeError: Failed to connect to schedd.

Here is the trend of completed jobs in the queue: https://monit-grafana.cern.ch/d/000000185/cmsops-crabmetrics?orgId=11&refresh=5m&from=1561473394311&to=1561646194311&var-Schedds=crab3%40vocms0194.cern.ch&var-T1_Sites=All&panelId=20&fullscreen

and this is the Duty Cycle (see image).

Completed jobs started to leave the queue today around 13h, when this expression evaluated to False:

LeaveJobInQueue = ifThenElse(( JobStatus =?= 4 || JobStatus =?= 3 ) && ( time() - EnteredCurrentStatus < 30 * 60 * 60 ),true,false)

belforte commented 5 years ago

Do I understand correctly that this means things are good?

lecriste commented 5 years ago

Well, the DC being close to 1 when the PostJob scripts started (at most 20 simultaneously) does not look good to me. And all Schedd.edit() calls were disabled; only the Schedd.transaction() was executed. According to @bbockelm this should not happen?

belforte commented 5 years ago

Ah.. that's not so good. I could not read the time correlation in the plot, sorry. I guess you want to compare with not calling the schedd transaction: is that the reason for the high duty cycle, or is it simply due to so many jobs completing at the same time? 9k jobs completing in a couple of hours (or less) is not usual. Again... when all jobs were doing all the edits, even without the transaction, we suffered from the fraction which failed and stuck in the queue, not from a high DC.

lecriste commented 5 years ago

Ah.. that's not so good. I could not read the time correlation in the plot, sorry. I guess you want to compare with not calling the schedd transaction: is that the reason for the high duty cycle, or is it simply due to so many jobs completing at the same time?

The aim of the test was: let's try with the lightest configuration, i.e. a Schedd.transaction() without Schedd.edit(), and see how the schedd reacts, as you suggested too.

9k jobs completing in a couple of hours (or less) is not usual. Again... when all jobs were doing all the edits, even without the transaction, we suffered from the fraction which failed and stuck in the queue, not from a high DC.

Didn't we say that jobs stuck in the queue (Completed or Removed) have no effect? I do not remember the DC trend with that situation.