dmwm / CRABServer


PostJob fails to do condor_qedit to set status ads and reset LeaveJobInQueue #5907

Closed belforte closed 5 years ago

belforte commented 5 years ago

see https://cms-logbook.cern.ch/elog/GlideInWMS/6927 and the full thread there. Also related to #5906

belforte commented 5 years ago

The completed jobs in the queue should not cause trouble, but so many jobs finishing together can briefly strain the schedd. How long did the DC peak last?

On Jun 28, 2019 01:06, Leonardo Cristella notifications@github.com wrote: ah.. that's not so good. I could not read the time correlation in the plot, sorry. I guess you want to compare with not calling the schedd transaction; is that the reason of the high DutyCycle, or is it simply due to so many jobs completing at the same time?

The aim of the test was: let's try with the lightest configuration, i.e. a Schedd.transaction() without Schedd.edit(), and see how the schedd reacts, as you suggested too.

9k jobs completing in a couple of hours (or less) is not usual. Again... when all jobs were doing all the edits, even w/o the transaction, we suffered from the fraction which failed and got stuck in the queue, not from a high DC.

Didn't we say that jobs stuck in the queue (Completed or Removed) have no effect? I do not remember the DC trend with that situation.


bbockelm commented 5 years ago

Ok, I think we have done all the footwork possible here. We should bump this up to the HTCondor folks (@belforte - can you make sure to bring it up at the next weekly meeting?). There's at least one obvious bug, although it's not-so-obvious to me why the "empty transaction" would be problematic.
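
For concreteness, the "empty transaction" being discussed here is presumably along the lines of the following minimal sketch with the HTCondor python bindings (an illustration under that assumption, not the actual PostJob code):

    import htcondor

    # Open and immediately commit a transaction without any Schedd.edit(),
    # i.e. the "lightest configuration" tested above, to see how the schedd
    # reacts to the transaction alone.
    schedd = htcondor.Schedd()
    with schedd.transaction() as txn:
        pass  # no edits performed inside the transaction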

lecriste commented 5 years ago

I submitted another task enabling only one Schedd.edit():

                self.schedd.edit([self.dag_jobid], "LeaveJobInQueue", classad.ExprTree("false"))

and the Duty Cycle shows the same trend as in the previous test, with a 2x longer peak (~2 hours); see the attached Duty Cycle plot.

In this test Completed jobs are leaving the queue as postJobs run (at most 20 simultaneously) because of the LeaveJobInQueue Schedd.edit(): https://monit-grafana.cern.ch/d/000000185/cmsops-crabmetrics?orgId=11&from=1561785922017&to=1561829122017&refresh=5m&var-Schedds=crab3%40vocms0194.cern.ch&var-T1_Sites=All&panelId=20&fullscreen

belforte commented 5 years ago

@lecriste what about errors? Can you do one last thing and submit the same task via the production TW, i.e. with the current PJ, to check that the peak in DutyCycle is really due to the call to the python binding? I'd also be curious what the result is when forking a condor_qedit.
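
For reference, "forking a condor_qedit" from the PostJob could look roughly like the sketch below; the job id, attribute and error handling are illustrative assumptions, not existing code:

    import subprocess

    # Call the external condor_qedit tool instead of the python bindings,
    # to compare the impact on the schedd DutyCycle. "123456.0" stands for
    # the cluster.proc id of the DAG job whose ad should be edited.
    cmd = ["condor_qedit", "123456.0", "LeaveJobInQueue", "False"]
    try:
        subprocess.check_output(cmd, stderr=subprocess.STDOUT)
    except subprocess.CalledProcessError as exc:
        print("condor_qedit failed with exit code %d: %s" % (exc.returncode, exc.output))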

lecriste commented 5 years ago

@lecriste what about errors?

I sent three identical tasks to collect more statistics:

  1. 2 postJobs needed a second attempt,
  2. 4 postJobs needed a second attempt,
  3. half of the postJobs are still to run, but 14 have already needed a second attempt.

Can you do one last thing and submit the same task via the production TW, i.e. with the current PJ, to check that the peak in DutyCycle is really due to the call to the python binding?

I sent it via preprod TW to be allowed to set maxIdle = -1, as in the previous tests: https://cmsweb-testbed.cern.ch/crabserver/ui/task/190630_221756%3Alecriste_crab_NotAutomaticSplitt_data_preprod_oldPJ_noMaxIdle_highP

I'd also be curious what the result is when forking a condor_qedit.
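
As a side note on the postJobs that needed a second attempt above, a minimal sketch of retrying the LeaveJobInQueue edit is shown below; the retry policy, delay and exception type are assumptions, not the actual PostJob logic:

    import time
    import classad
    import htcondor

    def reset_leave_job_in_queue(schedd, dag_jobid, retries=1, delay=30):
        """Try the LeaveJobInQueue edit, retrying if the schedd refuses it."""
        for attempt in range(retries + 1):
            try:
                schedd.edit([dag_jobid], "LeaveJobInQueue", classad.ExprTree("false"))
                return True
            except RuntimeError as exc:  # failures are assumed to surface as RuntimeError
                print("Schedd.edit attempt %d failed: %s" % (attempt + 1, exc))
                if attempt < retries:
                    time.sleep(delay)
        return False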

belforte commented 5 years ago

Honestly, I am feeling reasonably optimistic. The Recent Daemon Core Duty Cycle over the last 24h shows no particularly worrying structure (screenshot from 2019-07-01 12:39), up until 2am this morning, when the last task (the one w/o the call to the binding) also completed (screenshot from 2019-07-01 12:41).

At some point it would be good to have the same kind of task-progress-over-time view also in Grafana; it should not be difficult, I hope.

I am inclined to go ahead with this, starting by simply editing LeaveJobInQueue, to verify that we have a working process to keep jobs around until the PJ completes. Then we address with the HTCondor developers the bug that Brian mentioned, and we can get the status updates in.

We also have the possibility to use the JobRouter to do the ad edit; it would scale better.

Each thing that we try requires time in order to test at scale in production, but I do not see us as being without options yet.

Of course, if sending a message to AMQ in some way would be easier, I have nothing against it. Surely not the complex document retrieve-edit-replace in ES which Valentin developed, though; we need something simple where we push a few key/value pairs and the MONIT services aggregate them. Did you talk with them about possibilities here?

bbockelm commented 5 years ago

Don't forget that there are other workflows besides AMQ - such as reporting to es-cms.cern.ch - that use the raw classads. If we can make things work, there's power in the simplicity of "everything goes through the ClassAd".

lecriste commented 5 years ago

I am inclined to go ahead with this, starting by simply editing LeaveJobInQueue, to verify that we have a working process to keep jobs around until the PJ completes.

Shall we put this in production then? With such code I think we should increase the threshold on the number of completed jobs above which a schedd is flagged as critical.

Then we address with the HTCondor developers the bug that Brian mentioned, and we can get the status updates in.

We also have the possibility to use the JobRouter to do the ad edit; it would scale better.

Each thing that we try requires time in order to test at scale in production, but I do not see us as being without options yet.

Of course, if sending a message to AMQ in some way would be easier, I have nothing against it. Surely not the complex document retrieve-edit-replace in ES which Valentin developed, though; we need something simple where we push a few key/value pairs and the MONIT services aggregate them. Did you talk with them about possibilities here?

Yes, they seem fine with us sending a document per postJob over a separate connection. We asked for the credentials to use on the schedds (https://cern.service-now.com/service-portal/view-request.do?n=RQF1346205) and they want to know whom to contact in case of problems.
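
A hedged sketch of what "a document per postJob" pushed over a separate connection might look like with a plain STOMP client is given below; broker host, port, topic, credentials and document fields are all placeholders, not the agreed MONIT setup:

    import json
    import stomp  # plain stomp.py client; a CMSMonitoring wrapper could be used instead

    # Placeholder broker endpoint and credentials (to be replaced by the ones
    # requested in RQF1346205) and a minimal key/value document per postJob.
    conn = stomp.Connection([("amq-broker.example.cern.ch", 61313)])
    conn.connect("USERNAME", "PASSWORD", wait=True)

    doc = {
        "schedd": "crab3@vocms0194.cern.ch",  # illustrative schedd name
        "dag_jobid": "123456.0",              # cluster.proc of the DAG job
        "postjob_status": "finished",
        "retries": 1,
    }
    conn.send(destination="/topic/cms.crab.postjob", body=json.dumps(doc))
    conn.disconnect()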

bbockelm commented 5 years ago

With such code I think we should increase the threshold on the number of completed jobs above which a schedd is flagged as critical.

I think it's probably better to just filter these jobs out completely (apart from jobs like these, a 'C' state usually means an overload).

That said, if you do record these separately, you now have a nice metric for the number of postjobs running!

Yes, they seem fine with us sending a document per postJob over a separate connection.

But this means other sources of monitoring are potentially bad. Let's not go here unless we have to. Again, there's lots of simplifying power if all the monitoring is consolidated to a single source.

belforte commented 5 years ago

hey @bbockelm

I think it's probably better to just filter these jobs out completely (apart from jobs like these, a 'C' state usually means an overload).

I'd love to, but don't we need a somewhat expensive query for that? The current expression only uses numbers already available as schedd ads.
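
For illustration, counting the Completed jobs that are only kept in the queue by LeaveJobInQueue would need a per-job query along the lines of the sketch below; the constraint is an assumption, not the expression used by the current monitoring:

    import htcondor

    # Count Completed jobs (JobStatus == 4) that are still in the queue only
    # because LeaveJobInQueue is true, i.e. jobs waiting for their postJob.
    schedd = htcondor.Schedd()
    waiting_for_postjob = schedd.query(
        'JobStatus == 4 && LeaveJobInQueue =?= true',
        ["ClusterId", "ProcId"],
    )
    print("jobs kept in queue for postJobs: %d" % len(waiting_for_postjob))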