dmwm / WMCore

Core workflow management components for CMS.
Apache License 2.0

Support SiteWhitelist/SiteBlacklist update for active workflows #8323

Open amaltaro opened 6 years ago

amaltaro commented 6 years ago

Impact of the new feature: ReqMgr2, Global WorkQueue, Local WorkQueue

Is your feature request related to a problem? Please describe. This is especially important for long-lived workflows, where sites might come and go and further tweaks to the site lists could become necessary.

Describe the solution you'd like: Support updates to the SiteWhitelist and SiteBlacklist for workflows that have already been assigned (i.e., in a status between assigned and running-closed).

There are two steps that can be taken for this: 1) when a request is updated, only propagate the new site lists to global workqueue elements that are sitting in Available status; 2) when a request is updated, propagate it to local workqueue and to jobs sitting pending in condor.
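As a rough illustration of step 1), the sketch below updates the site lists only on elements that are still in Available status and leaves everything else untouched. It models workqueue elements as plain dictionaries; the function name `propagate_site_lists` and the keys `RequestName`, `Status`, `SiteWhitelist` and `SiteBlacklist` are assumptions made for this example, not the actual Global WorkQueue API.

```python
def propagate_site_lists(elements, request_name, site_whitelist, site_blacklist):
    """Update site lists on the Available elements of one request.

    `elements` is a list of dicts standing in for workqueue elements
    (hypothetical layout, for illustration only). Returns the number of
    elements that were updated.
    """
    updated = 0
    for elem in elements:
        if elem["RequestName"] != request_name:
            continue
        # Elements already acquired by an agent are deliberately skipped
        if elem["Status"] != "Available":
            continue
        elem["SiteWhitelist"] = list(site_whitelist)
        elem["SiteBlacklist"] = list(site_blacklist)
        updated += 1
    return updated


if __name__ == "__main__":
    wq_elements = [
        {"RequestName": "wf_A", "Status": "Available",
         "SiteWhitelist": ["T1_US_FNAL"], "SiteBlacklist": []},
        {"RequestName": "wf_A", "Status": "Running",
         "SiteWhitelist": ["T1_US_FNAL"], "SiteBlacklist": []},
    ]
    count = propagate_site_lists(wq_elements, "wf_A", ["T2_CH_CERN"], ["T1_US_FNAL"])
    print(count, wq_elements)
```

Step 2) would need an equivalent pass over local workqueue elements and over the jobs pending in condor, which this sketch does not attempt.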

The following tickets have been materialized for option 1) above:

Describe alternatives you've considered: In addition to the steps above, I think we will have to update the relational database as well, including jobs already created but queued in JobSubmitter.

Additional context: None

amaltaro commented 1 year ago

Based on feedback from Hasan, this feature will be useful in diverse scenarios, such as: 1) sites that are planning a destructive rebuild of their storage (i.e., without preserving any data on it). Those sites could then be removed from the site lists well in advance, giving the affected workflows time to potentially complete; 2) workflows that might be misbehaving and/or causing site issues; 3) sites going into, or coming out of, a long downtime period.

This is actually a meta issue that will require, at the very least, the following developments:

It is important to mention that there are many places where a data race / race condition can happen, such as:

I will be creating the other issues in the coming days and will set this one as a meta issue.

amaltaro commented 1 year ago

From a P&R discussion last week, this got demoted in favor of wmcore_pileup developments. Now medium prio.

amaltaro commented 2 months ago

Based on the O&C weekly meeting discussion that took place today, it looks like the P&R team would be happy if we could deliver the initial sub-optimal feature in Q3/2024, mentioned in the issue description as: """ 1) when a request is updated, only propagate the new site lists to global workqueue elements that are sitting in Available status. """

The sub-tickets that we need to consider are:

  1. Relocation of the relevant input data. At the moment there is no mechanism in MSTransferor to make it re-evaluate a given workflow/input data placement, so we will have to implement a new mechanism to automate this step. This feature itself can be implemented with different levels of quality:
     a) trigger a new data placement with the new site list;
     b) in addition to the new data placement, also trigger a data deletion for the site that has been removed from the site whitelist, if any;
     c) similar to a), but instead of making a new rule, we could consume the rule ids already persisted in the database (via MSTransferor/MSMonitor) and update their RSE expression accordingly.
     NOTE: this item itself could easily spawn 2 or 3 tickets, depending on the decisions we make and advice from DM experts...
  2. Change ReqMgr2 behavior such that it allows updates to the fields SiteWhitelist and SiteBlacklist for workflows that are in a state between assigned and running-closed. The update should be rejected, though, if the state is staged, such that we can avoid a data race condition (global workqueue with an outdated workflow spec). This update needs to be reflected in both the JSON document and the workload spec object.
  3. Coupled to the ReqMgr2 action listed in item 2., we also need to make a call to Global WorkQueue and update every single workqueue element that is in status Available, for that given workflow. We should likely also update the workload spec persisted in the workqueue (_inbox) database.
  4. It is not clear to me whether we would have to update the workload spec that has already been downloaded in a given agent, in case the workflow is already running. This requires further investigation. If needed, then we need to implement it in one of the components. How would we detect that the spec file has changed?
  5. Optional: do we want to keep a history of such changes? If so, which information needs to be persisted? Probably DN, timestamp, list of sites added, list of sites removed. Anything else?
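Items 2 and 5 above could be sketched roughly as follows: a check that accepts site list updates only in the allowed request states, plus a helper that builds the history record suggested in item 5. The set of intermediate states between assigned and running-closed, the function names and the record keys are all assumptions made for this example and would need to be checked against the actual ReqMgr2 state machine.

```python
import time

# Assumed request states between assigned and running-closed.
# "staged" is left out on purpose, to avoid the data race with a
# global workqueue that holds an outdated workflow spec (item 2).
UPDATABLE_STATES = {"assigned", "staging", "acquired",
                    "running-open", "running-closed"}


def can_update_site_lists(status):
    """Return True if a site list update is accepted in this state."""
    return status in UPDATABLE_STATES


def build_history_record(dn, old_sites, new_sites, timestamp=None):
    """Build a hypothetical audit record for one site list change (item 5)."""
    old_set, new_set = set(old_sites), set(new_sites)
    return {
        "DN": dn,
        "Timestamp": int(time.time()) if timestamp is None else timestamp,
        "SitesAdded": sorted(new_set - old_set),
        "SitesRemoved": sorted(old_set - new_set),
    }


if __name__ == "__main__":
    if can_update_site_lists("running-open"):
        record = build_history_record("/DC=ch/CN=example user",
                                      ["T1_US_FNAL", "T2_CH_CERN"],
                                      ["T2_CH_CERN", "T2_DE_DESY"])
        print(record)
```

Storing the added and removed sites as separate lists keeps each record small and makes it easy to replay or revert a change later, if that ever becomes a requirement.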

With potentially the 5 items above, we can deliver a very first version of this feature, which will update site lists in any work that has NOT yet been acquired by any agents. WorkQueue elements and jobs already materialized in the agents would go through the system without considering the site list update.
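For the open question in item 4 (how to detect that the spec file has changed), one possible approach is to compare a checksum recorded when the agent downloaded the spec against a checksum of the current upstream spec. This is only a sketch of the idea: the function names are hypothetical, and it assumes the spec is serialized to the same bytes whenever its content is unchanged, which would have to be verified for the real workload spec format.

```python
import hashlib


def spec_checksum(spec_bytes):
    """Return a SHA-256 hex digest of the serialized workload spec."""
    return hashlib.sha256(spec_bytes).hexdigest()


def spec_changed(cached_checksum, current_spec_bytes):
    """Compare the checksum stored at download time with the current spec."""
    return spec_checksum(current_spec_bytes) != cached_checksum


if __name__ == "__main__":
    downloaded = b"SiteWhitelist=T1_US_FNAL"
    cached = spec_checksum(downloaded)  # stored by the agent at download time
    upstream = b"SiteWhitelist=T2_CH_CERN"
    print(spec_changed(cached, downloaded), spec_changed(cached, upstream))
```

An alternative that avoids depending on serialization details would be an explicit revision counter or last-modified timestamp kept alongside the spec.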

I appreciate any feedback that people might have, especially for functionality/services that I might be missing here.

hassan11196 commented 2 months ago

Hello @amaltaro,

Thank you for providing the list of sub-tickets.

I just want to confirm my understanding of the description. In point #2, you mentioned that ReqMgr will be modified to allow updates to SiteWhitelist and SiteBlacklist between the 'assigned' and 'running-closed' states but not in the 'staged' state. However, from my understanding, a workflow transitions to 'running-closed' once all its Work Queue Elements (WQEs) are picked up by an agent. This implies that changing the SiteWhitelist when the workflow is in the 'running-closed' state would not actually affect where the jobs run. Is that correct?

I understand that this is something that would be tackled in the second part of the issue description, i.e.:

2. when a request is updated, propagate it to local workqueue and to jobs sitting pending in condor.

haozturk commented 2 months ago

Thanks a lot @amaltaro and @anpicci! This is much needed. It's reasonable to approach this request in two steps and focus on the first step in Q3. We'll probably discuss each step in its own issue, but let me make a quick comment on 1.c: you cannot update (update-rule) the RSE expression of a rule and keep the same rule id. You can change the RSE expression by "moving" (move-rule) a rule, which creates a new rule.

amaltaro commented 2 months ago

Hi @haozturk @hassan11196, thank you (and Andrea) for your prompt feedback. Both of your points are valid and they will be considered when we materialize these 5 points into their own GH tickets. Once that is done, I will also update the initial description of this issue, such that it becomes a meta-issue and we can track all of the sub-items to be developed. Thanks again!