dmwm / WMCore

Core workflow management components for CMS.
Apache License 2.0

Add support to specify job requirements #2958

Closed PerilousApricot closed 9 years ago

PerilousApricot commented 12 years ago

Hey all,

It would be really rad (but not critical... well, maybe critical) to add support in WMSpec and BA for specifying requirements on jobs submitted to the cluster. There could be a common set of options like memory/disk/walltime/# of cores that would be translated into scheduler-specific directives (of course with some sort of sensible defaults, so that specs without this information wouldn't be rejected).

Two goals:

1) For schedulers that keep track of resources and slay jobs that get too big, there needs to be some way to request more memory for some workflows (looking at you, HI processing jobs)

2) For schedulers that care about walltime, a lot of efficiency could be gained by specifying that some jobs (for instance, LogArchive/cleanup jobs) are going to require very little walltime, so they can be backfilled ahead of earlier jobs.

I'd be willing to knock in the changes to WMSpec and the schedulers I have access to, and someone else could fix the other schedulers later (if nothing happens with them, the behavior would be the same as now, so it wouldn't be a problem if other people never got on it)
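
Roughly what I have in mind, as an untested sketch (the attribute names, the defaults and the Condor translation below are just illustrative, not an existing WMCore or scheduler API):

```python
# Per-task resource hints with sensible defaults; each submitter plugin would
# translate them into its own scheduler directives. Names and numbers here are
# placeholders, not real WMSpec attributes.

DEFAULT_REQUIREMENTS = {"memoryMB": 2000, "diskKB": 20000000, "wallTimeSecs": 8 * 3600}

def resolveRequirements(taskRequirements=None):
    """Merge whatever the spec provides with the defaults, so specs without
    this information are still accepted."""
    merged = dict(DEFAULT_REQUIREMENTS)
    merged.update(taskRequirements or {})
    return merged

def toCondorSubmitAttrs(req):
    """One possible translation into HTCondor-style submit attributes."""
    return {
        "request_memory": req["memoryMB"],            # MB
        "request_disk": req["diskKB"],                # KB
        "+MaxWallTimeMins": req["wallTimeSecs"] // 60,
    }

if __name__ == "__main__":
    # e.g. an HI processing task asking for more memory and a longer walltime
    hiTask = {"memoryMB": 4000, "wallTimeSecs": 30 * 3600}
    print(toCondorSubmitAttrs(resolveRequirements(hiTask)))
```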

DMWMBot commented 12 years ago

mnorman: I'm a bit wary of adding what comes down to an arbitrary set of variables for matchmaking purposes, although it is possible. I don't think that the walltime idea is a good one, because we really do want jobs to run in the order we tell them to (Merge having priority, etc.), and not attempt to reschedule anything based on what are essentially educated guesses.

Also, I don't like # of cores, since that's handled completely differently in Condor (one of Condor's major weaknesses if you ask me). We shouldn't add features that only work some of the time.

PerilousApricot commented 12 years ago

meloam: It's not an arbitrary set of variables, it's just three well-defined ones.

I agree about nCores, I was just tossing out examples.

Walltime hints let schedulers that care about running length backfill jobs ahead of other ones, making things more efficient. Right now, for instance, we tell the scheduler that a logCollect job is going to take a whole day, when it could really be slid into some other window of time. Schedulers will still apply priorities, and if they don't care about walltime they simply won't do anything with the hint.

DMWMBot commented 12 years ago

mnorman: Why are you telling schedulers how long the jobs are expected to take? I don't think this is universally supported.

PerilousApricot commented 12 years ago

meloam: You're right, it's not universally supported. On schedulers where it's not supported, it's not necessary (and will be ignored). But on schedulers that do support it, leaving it out means they will vomit or overschedule jobs. Right now, sites make global assumptions about what resources jobs will take, and will actively slay jobs that exceed those assumptions. On those schedulers we're wasting resources and unnecessarily slaying jobs.

hufnagel commented 12 years ago

hufnagel: First a counter-argument to keep things in perspective, then a proposal.

We always try to schedule jobs that run the canonical 8 hours to a day maybe. If we know ahead of time that a job would take 16GB to run and 1 week to complete, passing this information to the submission systems is kind of pointless because what you really want to do is not to create such jobs in the first place.

You could argue that even with all somewhat sane jobs the execution layer could still do some optimizations if it could distinguish between different types of jobs, but that's a second order optimization and in no way critical. Things should always work without this or something is broken further upstream.

That being said, it would be good if we could define some parameters in the spec for a step and then jobs from that step would pass these parameters to the submission system. We don't even have to agree on what these would be; they can be generic, freely definable parameters that look different in each submitter. When you define a spec, you have a rough idea where it will or could run and what parameters might be useful. Like for the Tier0, when I create the specs I would put the LSF resource reservation parameter in there and the LsfSubmitter could honor it. If the parameter isn't present, a submitter default would be used (or the submission system might have its own default built in). If a parameter is specified that the submission system does not know, it's ignored.
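
Something like this rough sketch is what I mean (class and key names are invented, not actual WMCore code): each submitter only looks at the keys it knows about, applies its own default when a key is missing, and silently ignores everything else:

```python
# Free-form, per-step parameters carried in the spec; interpretation is left
# entirely to the individual submitter plugin.
specParams = {
    "LsfResourceReq": "rusage[mem=1000]",   # only an LSF submitter cares
    "CondorRequestMemory": 4000,            # only a Condor submitter cares
}

class LsfSubmitterSketch(object):
    defaults = {"LsfResourceReq": "rusage[mem=500]"}

    def submitterOptions(self, params):
        # honour the parameter if present, otherwise fall back to the
        # submitter default; unknown keys are simply never looked at
        value = params.get("LsfResourceReq", self.defaults["LsfResourceReq"])
        return ["-R", value]

print(LsfSubmitterSketch().submitterOptions(specParams))
```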

PerilousApricot commented 12 years ago

meloam: Replying to [comment:6 hufnagel]:

> First a counter-argument to keep things in perspective, then a proposal.

> We always try to schedule jobs that run the canonical 8 hours to a day maybe. If we know ahead of time that a job would take 16GB to run and 1 week to complete, passing this information to the submission systems is kind of pointless because what you really want to do is not to create such jobs in the first place.

Cool, but the major problem with that is there are cases where you just can't subdivide further. It might be easy to say, "the HI guys screwed up, they don't get to use the data until they change things around to fit in 8 hours and 2GB/core", but that won't go over too well.

They have a lot of analyses and even production crap that exceeds 2GB/core. Or they generate lumisections of data where the detector vomited way too many events out to file, and even at the finest-grained lumisplitting it ends up taking 30 hours (I seem to remember that number being tossed around) to process.

> You could argue that even with all somewhat sane jobs the execution layer could still do some optimizations if it could distinguish between different types of jobs, but that's a second order optimization and in no way critical. Things should always work without this or something is broken further upstream.

I disagree. As more algorithms start to blow up (especially with the higher pileup data coming in) we'll basically have two good choices and one bad choice moving forward: this patch (which is pretty much done, I just need to figure out how to test it); globally getting people to modify their gatekeepers/schedulers to assume that all CMS jobs will have a higher CPU utilization (running fewer jobs/machine); or some byzantine system where people toss different jobs to different WMAgents configured with different options in their JobSubmitters.

Also, the efficiency's a pretty big problem when you're trying to run in a multicore-scheduling system (because the scheduler has to insert idle time for cores if it has a deadline to get a multicore job onto a node but there are no jobs short enough to fit in). We have some molecular biology dudes who run 64-node (512-core) jobs, and when the scheduler is trying to vacate nodes, a lot of cores end up being really idle.

> That being said, it would be good if we could define some parameters in the spec for a step and then jobs from that step would pass these parameters to the submission system. We don't even have to agree on what these would be; they can be generic, freely definable parameters that look different in each submitter. When you define a spec, you have a rough idea where it will or could run and what parameters might be useful. Like for the Tier0, when I create the specs I would put the LSF resource reservation parameter in there and the LsfSubmitter could honor it. If the parameter isn't present, a submitter default would be used (or the submission system might have its own default built in). If a parameter is specified that the submission system does not know, it's ignored.

I think that's a good idea. I wanted, at least as a first pass, to try and find some commonly used options so we could do the "translation" once (since everyone has their own way of doing it), but I guess some sort of way to pass down parameters to the submitter would work for "other" cases. I can stick it in the cached job pickle in the creator and have it get picked up by the submitter, but I don't know much about LSF, so I might need help putting the info in the right place for the scheduler to pick it up.
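
Roughly the pickle idea, as an untested sketch (the file layout and key names are invented):

```python
import pickle

# creator side: drop the per-task submitter parameters into the cached job
# pickle alongside the rest of the job data
job = {"name": "LogCollect-1",
       "sandbox": "/path/to/sandbox.tar.bz2",
       "submitterParams": {"wallTimeSecs": 1800}}
with open("job.pkl", "wb") as handle:
    pickle.dump(job, handle)

# submitter side: read them back; a missing key keeps today's behaviour
with open("job.pkl", "rb") as handle:
    cachedJob = pickle.load(handle)
params = cachedJob.get("submitterParams", {})
print(params)
```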

hufnagel commented 12 years ago

hufnagel: > Cool, but the major problem with that is there are cases where you just can't subdivide further. It might be easy to say, "the HI guys screwed up, they don't get to use the data until they change things around to fit in 8 hours and 2GB/core", but that won't go over too well.

Well, we should still shoot for jobs, to first order, being able to run without doing anything special. If you absolutely have to attach a "reserve 8GB of memory and allow a wall time of 2 weeks" notice to a job, you automatically increase your chances of things breaking quite significantly. Special cases should remain just that: special cases.

> They have a lot of analyses and even production crap that exceeds 2GB/core. Or they generate lumisections of data where the detector vomited way too many events out to file, and even at the finest-grained lumisplitting it ends up taking 30 hours (I seem to remember that number being tossed around) to process.

IMO, for the most part these should be fixed in the job splitting. Or before the request is made. If we go down the route of "you can get away with any stupidly large and memory hungry request because the resource needs were specified upfront" then things will break. This needs to be special case only and a second order optimization. If it becomes the norm, things won't work.

>> You could argue that even with all somewhat sane jobs the execution layer could still do some optimizations if it could distinguish between different types of jobs, but that's a second order optimization and in no way critical. Things should always work without this or something is broken further upstream.

> I disagree. As more algorithms start to blow up (especially with the higher pileup data coming in)

Then we need to fix the algorithms or, if that is not possible, handle it in the job splitting. We simply get hosed royally if all of a sudden 75% of all CMS jobs need 3GB of memory to run. No amount of resource allocation tweaking will save us.

Even if you could do what you wanted and survive, given the hardware we have available, we would maybe make use of 50% of our total CPU. So we lose by default.

>> That being said, it would be good if we could define some parameters in the spec for a step and then jobs from that step would pass these parameters to the submission system. We don't even have to agree on what these would be; they can be generic, freely definable parameters that look different in each submitter.

> I think that's a good idea. I wanted, at least as a first pass, to try and find some commonly used options so we could do the "translation" once

Wouldn't worry about it. Thing is, once we provide the capability, we can just add options to one submitter and see how they behave. Different submitters will have different options that are more or less useful. As things stabilize and we find options that are generally useful, we could unify in a second step, but even that isn't strictly needed; there is no problem putting multiple options into the same spec, and different submitters will use their respective ones.

Having a simple way to use this capability and testing out some things is more important than agreeing on common options (and you can't do that anyways until you know what options will or won't work).

PerilousApricot commented 12 years ago

meloam: I guess we can agree to disagree on how necessary it is. Either way, the patch will be awesome and I'm sure it won't have any bugs.

hufnagel commented 12 years ago

hufnagel: Ah, what youthful optimism... You see, while I agree that our patch will work and we'll be able to pass options to the submission system and to the grid, I also think that most (i.e. almost all) grid sites will simply ignore these options.

PerilousApricot commented 12 years ago

meloam: Then I'll rewrite the grid too

stuartw commented 12 years ago

swakef: Replying to [comment:10 hufnagel]:

> Ah, what youthful optimism... You see, while I agree that our patch will work and we'll be able to pass options to the submission system and to the grid, I also think that most (i.e. almost all) grid sites will simply ignore these options.

Note that allowing users to pass variables down to the scheduler has recently been implemented by CREAM CEs (for gLite submission). I believe the allowed variables are limited, but I suspect the list will grow over time.

DMWMBot commented 12 years ago

mnorman: Replying to [comment:10 hufnagel]:

> Ah, what youthful optimism... You see, while I agree that our patch will work and we'll be able to pass options to the submission system and to the grid, I also think that most (i.e. almost all) grid sites will simply ignore these options.

This is my concern. I don't want to have to store more on the disk cache and in the JobSubmitter memory cache for features that may only work on one site. If anything we should be simplifying the submission routines, not complicating them. This is especially true when we don't even know if inserting any of the features will do anything.

hufnagel commented 12 years ago

hufnagel: > This is my concern. I don't want to have to store more on the disk cache and in the JobSubmitter memory cache for features that may only work on one site. If anything we should be simplifying the submission routines, not complicating them. This is especially true when we don't even know if inserting any of the features will do anything.

I'll definitely make good use of this for the Tier0. Resource reservation in LSF is heavily used by Tier0Ops to keep things working and they'll want to keep it. But that's only one number, so it shouldn't add significant overhead.

In any case, adding the capability in itself will not add much/any overhead at all. For that you actually have to start using it for real and define some options in the spec. And we shouldn't do that until there is a good use case and the options are actually needed and useful.

DMWMBot commented 12 years ago

mnorman: Okay, if Melo is going to do this, I'm going to pass it off. Here are the requirements:

1) If this is general purpose, the total size of the arguments passed in must be limited to 1kB. That's the size of the entire dictionary. This check has to be made at assignment time to prevent users from abusing it.

2) It has to be understood that, by default, we will ignore all things put into the arguments section, and that they will only be enacted by special arrangement with the operators running the Agent and the site admins.

3) All arguments passed into the dictionary will have to be properly validated and vetted. This probably means that we will have to disallow anything that requires escape characters. Also it may mean that we accept numeric arguments only. I'll leave this to Melo to implement and to Lassi to approve (a rough sketch of the kind of check I mean is below).

4) Because the use of this varies from site to site and agent to agent, it will be up to the local operators and site admins to tell their users how this works. Central documentation will probably be a disaster, so we'll have to depend on local communication channels.

That's all. I guess I'm turning it over now.
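
For points 1) and 3), a rough sketch of the kind of assignment-time check I mean (the function name and the exact character whitelist are just placeholders):

```python
import json
import re

MAX_ARGS_SIZE = 1024                            # 1kB for the whole dictionary
SAFE_VALUE = re.compile(r"^[A-Za-z0-9_.\-]+$")  # no escape characters, quotes or newlines

def validateSubmitterArgs(args):
    """Reject the arguments dictionary at assignment time if it is too big
    or contains anything that isn't a plain alphanumeric-ish token."""
    if len(json.dumps(args)) > MAX_ARGS_SIZE:
        raise ValueError("Submitter arguments exceed the 1kB limit")
    for key, value in args.items():
        if not SAFE_VALUE.match(str(value)):
            raise ValueError("Unsafe submitter argument: %s=%r" % (key, value))
    return args
```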

hufnagel commented 12 years ago

hufnagel: A few comments:

> 1) If this is general purpose, the total size of the arguments passed in must be limited to 1kB. That's the size of the entire dictionary. This check has to be made at assignment time to prevent users from abusing it.

Agreed. Enough for testing and it protects us against abuse. If we eventually need more we have to find another way to support this.

> 2) It has to be understood that, by default, we will ignore all things put into the arguments section, and that they will only be enacted by special arrangement with the operators running the Agent and the site admins.

Hm, why would the sites care? Isn't this strictly between the submitter plugin and the WMSpec writer? The arguments would then just configure the job submission differently, but still within "already existing and supported" boundaries. As always, sites are then somewhat free to use or ignore job options that are passed from the WMS layer (as is already the case). Enforcing certain options, passing them down to the local batch layer and making sure they get used there is not our job; it's up to FacOps to push for that if we want it generally supported.

> 3) All arguments passed into the dictionary will have to be properly validated and vetted. This probably means that we will have to disallow anything that requires escape characters. Also it may mean that we accept numeric arguments only. I'll leave this to Melo to implement and to Lassi to approve.

Again, why? All options passed from the submitter to the grid or batch layer need to be well tested and supported, but what we put into the spec can be a free-for-all. You either support it in the submitter or you don't.

And why numeric arguments? Just support a generic way to pass a dictionary to the submitter. For the LSF submitter, for instance, I want strings. That allows configuring certain jobs to run on certain types of batch nodes (this will likely get used for the PromptCalibration at some point).

> 4) Because the use of this varies from site to site and agent to agent, it will be up to the local operators and site admins to tell their users how this works. Central documentation will probably be a disaster, so we'll have to depend on local communication channels.

Stop caring about that part. All we should provide is the "capability" to pass options to the submission layer. What these options are and do, and whether and to what extent sites have to support them, are DataOps and FacOps problems. Not ours.

hufnagel commented 12 years ago

hufnagel: Basically:

1) A way to pass task-dependent options from the spec to the submitter

Limited to 1kB in total length; should be generic (dictionary, attributes, etc.).

2) Modifications in submitters to support options

Highly dependent on the submitter and use case. Different submitters can support completely different options to do different things. Any option implemented needs to be generally supported by the submission system and the infrastructure (grid, batch) behind it and not cause any problems. Apart from the usual operator errors of course (system failed because it did what I asked it to do, not what I wanted it to do).

The two parts are not correlated in any way. This ticket should be closed when 1 is possible. For the second part, any option to be added to any submitter would then be a separate ticket and needs to be evaluated standalone.

DMWMBot commented 12 years ago

mnorman: Answers in order of questions asked:

2) The sites care because they have to implement the options, and not all options are implemented by default. I'm putting this out there as a public statement that if you submit a job that says "Pass value X" to the submitter, but the underlying factory/batch system ignores value X, then this is not an agent problem. To ensure that what you're passing in actually does what you want, you have to talk to the site admin.

3) Because I haven't done, and don't want to do, a full security validation of all the batch systems. I don't even know how this works with Condor. If you can put in any string you want, you can switch off the strong authentication and restrictions normally placed on a job. Does this allow the job to reverse-compromise the Condor frontend submitter? I think the answer is no, but if Lassi asked me, I couldn't back that up. I also can't prove that you don't have the ability to force a job to run as a local fork and run it on the submitter instead of on a worker node. Or that the underlying batch systems handle escape characters correctly.

This is a CYOA move for me. The idea of passing unfiltered and unchecked code directly to the batch systems for internal execution bothers me, and if it bothers me, I think the security people may be distinctly more than bothered. I intend to avoid this by at least removing escape characters, and possibly even line returns, from the input.

4) This is another thing I'm just making clear at the outset. If I receive a request saying "It's good that you've implemented this, now document everything that you can do with it", I'm going to pass it on to Melo and ignore it (and hopefully he will ignore it too). I'm just setting my prerequisites out there where people can see.

PerilousApricot commented 12 years ago

meloam: If someone had access to the scheduler (which they necessarily need, either locally or via the grid) then they can already do whatever they want. If they run off to idiotville and submit bad jobs, then they already had that option.

I agree with Dirk -- if (.*)Ops makes a boned request, then it's their own problem. If someone tries to come to me later and argue that it's my fault they used a tool wrong, I'll just print out that mail and frame it on my wall.

evansde77 commented 12 years ago

evansde: Well, this thread turned into a monster... I would look at adding a few predefined parameters for well-known settings that can be mapped to appropriate batch system quantities like walltime, numCPUs, memory, and then have the submission layer translate that into the appropriate options/JDL insert or whatever, as opposed to doing some sort of pass-through of up to 1kB of miscellaneous scheduler pap. Narrow the scope of the problem to something manageable, implement and test that, and see if you need anything more. Also a comment on Dirk's note that we should just provide *Ops with whatever capability: that's kind of naive. The first thing they will do is use it, blow the crap out of something, and then demand we support it, change it, etc. Just like they always do. So anything you expose like this, especially as it will sound like something that "optimises" things, had better be well controlled and rock solid.

DMWMBot commented 12 years ago

mnorman: Replying to [comment:19 meloam]:

> If someone had access to the scheduler (which they necessarily need, either locally or via the grid) then they can already do whatever they want. If they run off to idiotville and submit bad jobs, then they already had that option.

This is not actually true. For the glidein system the submitters attached to the Frontend are actually very well controlled: unless you have login privileges on the WMAgent node, you don't have access to the submitter. Even in CRAB3, where glExec forces us to have UIDs for each user, users don't actually have access to the submitter itself. Users can run jobs on the grid at large, but they can't get in the door to the WMS factories with their own certificate.

In theory this is unnecessary, because the submitter itself SHOULD be armored enough that if you did have access to it you could not compromise the system. However, the fact that they've taken pains to isolate it makes me very wary of allowing users to speak to it directly.

hufnagel commented 12 years ago

hufnagel: > Well, this thread turned into a monster...

> I would look at adding a few predefined parameters for well-known settings that can be mapped to appropriate batch system quantities like walltime, numCPUs, memory, and then have the submission layer translate that into the appropriate options/JDL insert or whatever, as opposed to doing some sort of pass-through of up to 1kB of miscellaneous scheduler pap.

I need generic for the Tier0. One option I want to be able to pass through is:

"select[(model==nc_07_1 || model==e4_07_1) && type==SLC4_64] rusage[pool=33000, mem=1000]"

This is an actual option I have used in the past for Tier0 testing. If you don't support that, I have to find hacks to get around it.
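
For concreteness, a rough sketch of how that string could end up on the bsub command line (untested; the queue name and job script are made up, and the real LsfSubmitter handling would look different):

```python
import subprocess

resourceReq = ("select[(model==nc_07_1 || model==e4_07_1) && type==SLC4_64] "
               "rusage[pool=33000, mem=1000]")

# pass the spec-supplied string straight through to LSF's -R option
cmd = ["bsub", "-q", "cmst0", "-R", resourceReq, "runJob.sh"]
print(" ".join(cmd))
# subprocess.call(cmd)   # left commented out: illustration only
```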

> Narrow the scope of the problem to something manageable, implement and test that and see if you need anything more.

Being able to pass misc. crap to the submitters is easily manageable. I would at this point punt on actually doing anything at all with the crap in the submitters. The request was way too vague to do anything with it.

There should be separate tickets for "every" option they want implemented and how they need to be passed down in the grid system, with sign-off from FacOps that the option is safe to use. Otherwise we'll end up debugging grid middleware.

> Also a comment on Dirk's note that we should just provide *Ops with whatever capability: that's kind of naive,

I didn't say that. I said I need the freedom to pass anything to a submitter. What we then actually support in the submitter is a completely different ballgame.

DMWMBot commented 12 years ago

mnorman: Replying to [comment:22 hufnagel]:

> I didn't say that. I said I need the freedom to pass anything to a submitter. What we then actually support in the submitter is a completely different ballgame.

We're still ultimately responsible for handling security for the submitter - which means that we then have to put in a lot of time making sure that everyone who writes a submitter does it according to spec for that submitter type. That basically requires some form of expert vetting, which I hoped to avoid.

It's possible that we can do validation that eliminates non-alphanumeric characters for general use, and then make that triggerable in the config. That would allow the T0 to run in unvalidated Insecure Mode, while preserving validation by default for the rest of the Agent instances.
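
Something along those lines, as an untested sketch (the flag and function names are invented, not the actual agent config):

```python
import re

ALPHANUMERIC = re.compile(r"^[A-Za-z0-9]+$")

def filterSubmitterArgs(args, validateArgs=True):
    """Keep only purely alphanumeric values for general use; an agent
    configured with validateArgs=False (e.g. the T0) passes everything through."""
    if not validateArgs:
        return dict(args)
    return dict((key, value) for key, value in args.items()
                if ALPHANUMERIC.match(str(value)))

# default agents: the LSF resource string gets dropped, the plain value survives
print(filterSubmitterArgs({"queue": "cmsprod", "resourceReq": "rusage[mem=1000]"}))
# a T0 agent running without validation keeps everything
print(filterSubmitterArgs({"resourceReq": "rusage[mem=1000]"}, validateArgs=False))
```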

evansde77 commented 12 years ago

evansde: Actually, it sounds like you need to write a specific submitter for the Tier 0, since you have to support a bunch of gymnastics for that particular single-site use case, as opposed to making high-level "generic" demands on the entire system.

hufnagel commented 12 years ago

hufnagel: Replying to [comment:24 evansde]:

> Actually, it sounds like you need to write a specific submitter for the Tier 0, since you have to support a bunch of gymnastics for that particular single-site use case, as opposed to making high-level "generic" demands on the entire system.

Yes, that was always my plan.

I fully agree on the generic grid-type submitters; what we allow there should be very restrictive. My comment was more of a technical nature, about where in the code you would put in these roadblocks.

Two-level parameter validation scheme, different inheritance chain for the Tier0 submitter: I don't really care how it's done technically, as long as I can embed some options in the spec for a task and extract them at submission time. They don't even have to look as generic as my example; I can also map some complex LSF flags to a common option name and use that. I still need some flexibility to define and use new options though; they can't be completely locked down.

amaltaro commented 9 years ago

I guess we can close this, since we now specify in the job classAds the expected walltime, expected disk usage, number of cores, memory required, etc. @hufnagel @ticoann, should we close it?

amaltaro commented 9 years ago

ping

ticoann commented 9 years ago

This might still be relevant. Since we are having a meeting about MaxWallTimeMin in a couple of days, I can create a new issue after finding out what comes out of the meeting and reference this issue.

amaltaro commented 9 years ago

I don't remember what came out of that discussion, but we do specify several parameters now. Shall we close it?