dmwm / WMCore

Core workflow management components for CMS.
Apache License 2.0

Update the spec parameters (MaxMemory, etc) for running workflow. #8622

Open ticoann opened 6 years ago

ticoann commented 6 years ago

As discussed with Alan,

  1. add a timestamp to the ReqMgr CouchDB record whenever the spec file is updated.
  2. add a new WMBS table with the workflow id and its last-updated timestamp.
  3. have the job updater compare the timestamp in WMBS (the table above) with the one in the ReqMgr2 CouchDB; if the ReqMgr2 CouchDB record is newer, update the specs on disk (JobCache, sandbox).
  4. update the WMBS table with the new timestamp.
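Steps 3 and 4 above could be sketched roughly as follows. This is a minimal sketch only: the function names and the callable hooks are hypothetical, not actual WMCore APIs.

```python
def sync_spec_if_stale(wmbs_timestamp, couch_timestamp, workflow_id,
                       refresh_spec_on_disk, update_wmbs_timestamp):
    """Hypothetical job-updater step: refresh the cached spec files
    (JobCache, sandbox) when the ReqMgr2 CouchDB record is newer than
    the timestamp recorded in the WMBS table.

    The two callables stand in for the real disk-refresh and
    WMBS-update operations, which are assumptions here."""
    if couch_timestamp > wmbs_timestamp:
        # Step 3: pull the updated spec and rewrite the on-disk copies.
        refresh_spec_on_disk(workflow_id)
        # Step 4: record the new timestamp in WMBS.
        update_wmbs_timestamp(workflow_id, couch_timestamp)
        return True
    return False
```

The comparison itself is cheap; the cost lies entirely in the injected refresh operation, which only runs when the CouchDB record is actually newer.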
amaltaro commented 6 years ago

See #8646 for further details

amaltaro commented 5 years ago

Given that we initially thought of these changes mostly in terms of resource requirements (which could later be extended to site lists, etc.), it would be interesting to know how Unified tweaks workflows/jobs in order to make better use of grid resources.

@vlimant can you give us a brief explanation of how it's done in Unified (live resource-requirement updates)? Which services/APIs are used, and are all workflows under this monitoring, or only those configured for it? I'm trying to evaluate how much we'd gain by implementing this in WMCore...

amaltaro commented 5 years ago

@vlimant I'm planning to work on this ticket in the coming weeks, unless you think the Unified mechanism is good enough and we don't need it. So your input and answers to the questions I asked above would be highly appreciated.

vlimant commented 5 years ago

@amaltaro there are several other candidates for unified integration already (#8914, #8921, #8920, #8324, ...) ; I believe those are the ones we put together as first thing of the integration.

The mechanism for classad tweaking in Unified is all in https://github.com/CMSCompOps/WmAgentScripts/blob/master/Unified/equalizor.py and depends on gwmsmon (although everything can be retrieved from ES directly). It will likely require further documentation of what exactly is done.

amaltaro commented 5 years ago

Ok, none of the issues you pointed out are straightforward, but eventually we have to get them started... If you can get this equalizor properly documented, it will certainly be helpful in the near future.

sharad1126 commented 4 years ago

@amaltaro, according to a short discussion with James this morning, these computations (memory tuning) are somewhat expensive and increase the load on the schedd. He mentioned that @todor-ivanov tried implementing something like this in the CRAB3 schedds, which made the schedds slower. So the best place to implement this might be directly at the HTCondor level (probably in a schedd attached to the negotiator), and it could be a feature request to the HTCondor developers. Maybe we can ask about this in the next HTCondor developers meeting. @dpiparo FYI

bbockelm commented 4 years ago

@sharad1126 - I’m not sure that comment makes much sense. Without knowing the exact thing Alan is planning, it could be almost no load - or very expensive.

In fact, if done right, this could be much more efficient than the current system because one could affect all idle jobs in a single transaction instead of doing it one-by-one (like Unified does today).
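The single-transaction idea can be illustrated by building one constraint that matches every idle job of a workflow and editing them all in one qedit call, instead of looping over jobs as Unified does. This is a sketch under assumptions: the `WMAgent_RequestName` classad attribute follows the WMAgent job classads, and the commented-out edit call uses the `htcondor` Python bindings' `Schedd.edit` API.

```python
def idle_jobs_constraint(request_name):
    """Build an HTCondor constraint matching all idle (JobStatus == 1)
    jobs of a given workflow, so one bulk edit transaction can update
    them all at once instead of job-by-job.

    The WMAgent_RequestName attribute name is an assumption taken from
    WMAgent job classads."""
    return 'JobStatus == 1 && WMAgent_RequestName == "%s"' % request_name

# With the htcondor Python bindings, the bulk edit would look roughly like:
#   import htcondor
#   schedd = htcondor.Schedd()
#   schedd.edit(idle_jobs_constraint("my_request"), "RequestMemory", "4000")
# or, on the command line:
#   condor_qedit -constraint 'JobStatus == 1 && ...' RequestMemory 4000
```

Only idle (not running) jobs are matched, so the edit never touches jobs whose resources have already been claimed.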

sharad1126 commented 4 years ago

@bbockelm I discussed this with @amaltaro, and then with James Letts, and James told me exactly what I mentioned in the comment above. Of course it is a good idea to get this done, as it would help us make the system more efficient.

amaltaro commented 4 years ago

What I have in mind is actually an update of the workflow spec file, such that jobs still waiting in the global workqueue (or waiting for the agent job splitting) could use the up-to-date parameters, thus stopping the use of the JobRouter.

In the next phase of this tuning, we could also update jobs pending in the local condor queue (basically the same process as is done for RequestPriority/JobPrio).

I believe these two approaches are not tightly coupled and can be delivered in separate stages.
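The first stage described above can be sketched as a whitelisted update of the spec's resource-requirement parameters. This is purely illustrative: the real spec is a WMWorkload object rather than a plain dict, and the parameter names in `allowed` are assumptions drawn from the issue title (MaxMemory, etc.).

```python
def update_spec_parameters(spec, updates,
                           allowed=("MaxMemory", "TimePerEvent", "Multicore")):
    """Apply a whitelisted set of resource-requirement updates to a
    workflow spec (represented here as a plain dict for illustration).
    Jobs still waiting in the global workqueue would then be created
    with the up-to-date parameters.

    Returns the subset of updates that were actually applied, so the
    caller can log or persist them."""
    applied = {}
    for key, value in updates.items():
        if key in allowed:
            spec[key] = value
            applied[key] = value
    return applied
```

Restricting the update to a whitelist keeps the first stage confined to resource requirements, matching the idea that site lists and other parameters could be extended later as separate steps.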