dmwm / WMCore

Core workflow management components for CMS.
Apache License 2.0

Set PostJobPrio1 for WMAgent jobs #5688

Closed bbockelm closed 9 years ago

bbockelm commented 9 years ago

The PostJobPrio1 attribute provides a secondary sorting key for HTCondor. This allows you to, for example, first sort jobs by priority and then by task injection date. Doing this orders the completion of tasks of the same priority. For example, suppose tasks A and B have the same priority: currently, both might be 50% done after 24 hours; with this change, task A may be 100% done and B 0% done.

The current behavior gets much worse if, instead of 2 tasks, you consider 30 tasks in a campaign.

Ordering like this provides a significant shortening of task tails. Further, due to implementation details, it significantly decreases matching costs in HTCondor.
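
To illustrate the ordering described above, here is a minimal sketch. It is a toy model, not HTCondor or WMCore code, and it assumes that a larger value wins for both JobPrio and PostJobPrio1. The `Job` tuple and the injection sequence numbers are made up for the example.

```python
# Toy model only: shows how a secondary key changes the order in which
# jobs of equal priority are considered. Not HTCondor or WMCore code.
from collections import namedtuple

Job = namedtuple("Job", ["task", "job_prio", "post_job_prio1"])

def run_order(jobs):
    """Sort so that larger JobPrio wins, then larger PostJobPrio1 (assumed)."""
    return sorted(jobs, key=lambda j: (-j.job_prio, -j.post_job_prio1))

# Tasks A and B share a priority; A was injected first. The hypothetical
# injection sequence numbers 1 and 2 are negated so the earlier task wins.
jobs = [Job("B", 100, -2), Job("A", 100, -1), Job("B", 100, -2), Job("A", 100, -1)]
ordered = [j.task for j in run_order(jobs)]
print(ordered)  # ['A', 'A', 'B', 'B']
```

Without the secondary key, the A and B jobs would stay interleaved, which is the "both 50% done" case.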

See the corresponding CRAB3 ticket: https://github.com/dmwm/CRABServer/issues/4705

bbockelm commented 9 years ago

@amaltaro @tsarangi @ericvaandering @ticoann - you guys may find this one interesting / useful.

hufnagel commented 9 years ago

Hm, so we just need to change the jdl

PostJobPrio1 = TaskName

or maybe if we want to prioritize whole workflows and not individual tasks

PostJobPrio1 = WorkflowName

The problem with the latter is that job priority between tasks in a workflow can still differ. Would this interfere with identical PostJobPrio1?

bbockelm commented 9 years ago

Well, it must be mapped to an integer, not a string - it can't literally be the task name!

Job priority is still respected first and foremost. So, within the same workflow, if we boost the priority of merging, merges will always go first.

However, listening to today's discussion at O&C week, a more appropriate way to prioritize might be based on when the block was acquired by WMAgent. This would reduce the time per block, but still allow the WMAgent to effectively time-share between tasks of the same priority.

hufnagel commented 9 years ago

So some hash from string to id would be needed...
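
A minimal sketch of such a mapping, assuming a CRC32 checksum is acceptable as the hash. The function name and the example task name are hypothetical. Python's built-in `hash()` is avoided because, for strings, it is randomized per process by default.

```python
import zlib

def name_to_int(name):
    """Map a task or workflow name to a stable, non-negative 31-bit integer."""
    return zlib.crc32(name.encode("utf-8")) & 0x7FFFFFFF

# Hypothetical task name, for illustration only.
value = name_to_int("/SomeWorkflow/Processing")
print(value)
```

Note that a hash like this only groups jobs with the same name together. The resulting numbers carry no information about injection order, so the ordering between different workflows would be arbitrary.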

Block acquiring time won't do much for you. Multi-task workflows (and all Ops workflows are basically multi-task, since you have at least one merge step) only differentiate between different blocks in the top-level task. The merge step sees the output of the previous blocks' processing as a unified set of files; they all go into the same fileset. The information about which block they came from is lost, or at least would be very hard to trace back.

As such, you could use block acquiring time to prioritize the initial block processing only. Given that we only feed more data when needed, I am not sure how much that really buys us. You put a higher emphasis on processing in the order you inject data, but you don't get much benefit for the whole request processing nor does this emphasis on order of injection really survive to the merge step.

Also, there are workflows that don't start with a block and I am not only talking about MC here. There are multi-task workflows where later steps have data as input, this would not be fed from a block.

There is the Tier0, where none of the tasks with data input are fed from a block.

hufnagel commented 9 years ago

It's a policy decision, but IMO, given two requests of the same priority, the one injected first should win and run to completion, and the other request should only get resources if the first request can't fill all available resources. At least that's the policy I want for the Tier0; it might not be what we want for Ops.

Modulo internal differences in the request of course, merge etc jobs with higher internal priority of the second request would still jump over processing jobs from the first request.

bbockelm commented 9 years ago

Hi Dirk,

Based on your comment, I believe I am using the phrase "block acquiring time" incorrectly. I meant when the "chunk of jobs" is brought into the WMAgent (from request manager).

I think that matches what you want for the T0.

Brian

hufnagel commented 9 years ago

Ah, ok, that makes more sense, yes. Still not exactly what I want for the Tier0, because we don't control when we feed data into the system. Usually data arrives in run order in the Tier0 and we want to process in that order too. Sometimes (part of) a run's data can arrive late, though, and in that case it should still benefit from the priority boost for earlier runs.

For the Tier0, setting PostJobPrio1 to run number might make the most sense. Express and Repack are still going to have higher assigned priority than PromptReco, but otherwise we really do not need to prioritize between different data in the same run.

bbockelm commented 9 years ago

@hufnagel - does this mean you're going to implement? I'm most likely not the right person to code this up...

hufnagel commented 9 years ago

Well, I know what I want for the Tier0, but this should be implemented in a way that works for the general WMAgent and only to second order is usable or customizable for me.

I am not 100% sure which information about the workflow/request is actually easily available at submission time (which is where you modify the JDL). The concept of "bunch of jobs injected into the agent together" might not make sense at that point anymore. I do know that you have access to the task name and workflow name; of the two, the latter would make more sense.

hufnagel commented 9 years ago

If we want to use some parameters that aren't easily available at submission time, we would have to modify a lot of code, as passing this information along is not exactly trivial...

ticoann commented 9 years ago

Sorry for joining the conversation late. I was just wondering whether PostJobPrio1 is really necessary, since eventually what is needed is a priority comparison. If the priority can be defined before the job is submitted, and all the conditions are known at that moment, I am not sure we need a hierarchy of priorities, since that hierarchy can be mapped to a single number.

I think it is more a question of what the policy for the priority mapping will be. In terms of policy, I think we can separate that from the code itself. Right before the JDL is created, we can have a function that takes the policy definition and the conditions available at that time (currently workflow name, task name, and initial priority, but we could add other things). The policy can also be loaded at that time, or reloaded periodically, so the priority can be applied on the fly when the policy changes (or the T0 and others can have different policies). We can implement that structure with the current policy and define (improve) the policy later. (Maybe this went a little off topic.)
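
A rough sketch of what that separation could look like, purely as an illustration. The registry, the policy names, and the idea of returning a (JobPrio, PostJobPrio1) pair are all assumptions made for this example, not existing WMCore code.

```python
# Hypothetical policy registry: policies are plain functions looked up by
# name, so the agent could switch policy without touching the submit code.
POLICIES = {}

def policy(name):
    """Decorator that registers a priority policy under the given name."""
    def register(func):
        POLICIES[name] = func
        return func
    return register

@policy("default")
def keep_priority(workflow, task, priority):
    # Keep the request priority and apply no secondary ordering.
    return priority, 0

@policy("by_sequence")
def by_sequence(workflow, task, priority, sequence=0):
    # Assumed: 'sequence' is some injection counter known at submit time;
    # it is negated so that earlier work gets the larger secondary value.
    return priority, -sequence

def priorities_for(policy_name, workflow, task, priority, **conditions):
    """Evaluate the named policy right before the JDL would be written."""
    return POLICIES[policy_name](workflow, task, priority, **conditions)

print(priorities_for("default", "wf", "/wf/Processing", 7000))
print(priorities_for("by_sequence", "wf", "/wf/Processing", 7000, sequence=12))
```

The Tier0 and Ops agents could then select different policy names from their configuration.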

bbockelm commented 9 years ago

There are definitely cases where a hierarchy is needed because the hierarchy cannot be mapped to a single number. Look at the linked CRAB3 ticket, for example. What mapping function would you propose?

Further, there are costs to having a more dynamic range in JobPrio: as far as the negotiator is concerned, it must handle each unique JobPrio separately (meaning there's a per-prio overhead). By encoding schedd-local information into PostJobPrio1, this all stays internal to the schedd.

ticoann commented 9 years ago

Hi Brian, if we have a dynamic range of priority on both levels, we probably can't map to a single number. But if we have a finite range on both levels of the hierarchy, we can: a certain number of leading digits belongs to the first level (JobPrio) and the trailing digits belong to the second level (PostJobPrio1).

I wasn't suggesting a dynamic range in JobPrio. In the agent we have priority numbers like 10000 and 7000. We are not using the whole range, so we could divide it up for the hierarchy. Anyway, if it makes more sense to have a hierarchy, that would be fine too. (The priority mapping function can return 2 integers.)
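
The digit-splitting idea can be sketched as follows, assuming the secondary level is limited to four decimal digits. The constant and function names are made up for the example.

```python
SECONDARY_RANGE = 10000  # assumed: secondary values fall in [0, 9999]

def pack(job_prio, secondary):
    """Combine two bounded priority levels into one integer."""
    if not 0 <= secondary < SECONDARY_RANGE:
        raise ValueError("secondary value out of range")
    return job_prio * SECONDARY_RANGE + secondary

def unpack(combined):
    """Recover (job_prio, secondary) from a packed value."""
    return divmod(combined, SECONDARY_RANGE)

print(pack(10000, 42))                    # 100000042
print(unpack(pack(7000, 9999)))           # (7000, 9999)
print(pack(7000, 9999) < pack(10000, 0))  # True: first level always dominates
```

The trade-off raised earlier in the thread still applies: every distinct packed value becomes a distinct JobPrio that the negotiator has to handle.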

hufnagel commented 9 years ago

Question: the way I understood it, PostJobPrio1 isn't actually a priority itself, but provides a secondary prioritization based on when that PostJobPrio1 is defined in the system. Is that correct? Or do we actually have to set meaningful numbers?

ericvaandering commented 9 years ago

Using the assignment or request time would make sense. As Dirk says, this is only easy if the data is there at the job level. I also understand this as a secondary ranking that only applies when the priority is equal. The idea is to preferentially finish up workflows or blocks of work that have been started.

hufnagel commented 9 years ago

I talked with Brian about this at the OSG all hands meeting. One idea would be to set PostJobPrio1 to the negative task id. The idea here is that task ids are assigned sequentially as new work is expanded into the agent, so you automatically get an ordering where earlier tasks have higher secondary priority, which is exactly what we want.

hufnagel commented 9 years ago

I am going to implement something along these lines; it's worth exploring, I think. I just have to figure out if the task id is available at submission time. The task name certainly is, so the id should be there too, or if not, it shouldn't be hard to get it.

hufnagel commented 9 years ago

Ok, that was a complete failure. WMBS does not record the task independently, it's just a text field inside the wmbs_workflow table. Back to the drawing board...

hufnagel commented 9 years ago

Ok, wmbs_workflow has multiple entries with the same workflow name, one for each task. So I can use the negative workflow id belonging to a certain task name and that should work.

In terms of effort, it's adding one returned row to a query and then passing one additional parameter through a couple of JobSubmitter layers to the condor plugins, where that parameter can then be used to set PostJobPrio1 in the JDL. Will work on it tomorrow.
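
As a sketch of the last step only: the helper name is hypothetical, and it assumes the attribute is written with HTCondor's `+Attribute = value` submit syntax. This is not the actual plugin code.

```python
def post_job_prio1_line(workflow_id):
    """Build a JDL line that sets PostJobPrio1 to the negated workflow id.

    Assumes ids grow with injection order, so earlier work gets the larger
    (less negative) value.
    """
    return "+PostJobPrio1 = %d" % -int(workflow_id)

print(post_job_prio1_line(1234))  # +PostJobPrio1 = -1234
```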

bbockelm commented 9 years ago

I've been thinking more on this one.

Would it make more sense to set PostJobPrio1 to

-stringListSize(DESIRED_Sites)

and then use PostJobPrio2 as below?

This way, workflows that are limited to a single site are preferred over those that can run anywhere.

(Tired of seeing GEN-SIM jobs crowd out DIGI-RECO on the T1s as they are the same prio.)
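
For illustration, a small Python approximation of what that expression evaluates to, assuming DESIRED_Sites is a comma-separated string. The function name is made up, and the site names are only examples.

```python
def site_count_prio(desired_sites):
    """Approximate -stringListSize(DESIRED_Sites) for a comma-separated list."""
    sites = desired_sites.replace(",", " ").split()
    return -len(sites)

single = site_count_prio("T1_US_FNAL")
anywhere = site_count_prio("T1_US_FNAL, T2_CH_CERN, T2_US_Nebraska")
print(single, anywhere)   # -1 -3
print(single > anywhere)  # True: the single-site job gets the larger value
```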

On Mar 26, 2015, at 7:17 PM, Dirk Hufnagel notifications@github.com wrote:

Ok, wmbs_workflow has multiple entries with the same workflow name, one for each task. So I can use the negative workflow id belonging to a certain task name and that should work.

hufnagel commented 9 years ago

Whether it makes sense is a policy decision; someone else should comment on it. Kind of makes sense to me... In terms of implementation, this is trivial: the list of possible locations is available at submission time, of course.

hufnagel commented 9 years ago

Ok, I have the negative task_id/workflow_id working now. Modifying this to the two-pronged approach with the number of possible sites for a job would be easy as soon as we make a decision on this.

bbockelm commented 9 years ago

@hufnagel - can we get this merged up?

ericvaandering commented 9 years ago

We are in the usual pre-CMSWeb release merge freeze. Besides that, Seangchan may be out of commission and Dirk is probably traveling.