Open amaltaro opened 6 years ago
From feedback from Hasan, this feature will be useful in diverse scenarios, such as: 1) sites that are planning a destructive remake of their storage (not saving any data in it). They would be able to remove those sites from the site list well in advance, giving time for those workflows to potentially complete. 2) workflows that might be misbehaving and/or causing site issues. 3) sites going into, or coming out of, a long downtime period.
This is actually a meta issue that will require, at the very least, the following developments:
It is important to mention that there are many places where a data race / race condition can happen, such as:
I will be creating the other issues in the coming days and will set this one as a meta issue.
From a P&R discussion last week, this got demoted in favor of wmcore_pileup developments. Now medium prio.
Based on the O&C weekly meeting discussion that took place today, it looks like the P&R team would be happy if we could deliver the initial sub-optimal feature in Q3/2024, mentioned in the issue description as: """ 1) when a request is updated, only propagate the new site lists to global workqueue elements that are sitting in Available status. """
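To make the scope of that initial feature concrete, here is a minimal sketch of the "Available only" propagation. It is not the WMCore implementation: workqueue elements are modeled as plain dicts, and the function name `propagate_site_lists` and the exact keys used (`RequestName`, `Status`, `SiteWhitelist`, `SiteBlacklist`) are assumptions made for illustration.

```python
from copy import deepcopy


def propagate_site_lists(elements, request_name, whitelist, blacklist):
    """Return (new_elements, n_changed) for a hypothetical list of element dicts.

    Only elements of the given request that are still in 'Available' status
    receive the new site lists; everything else is copied through unchanged.
    """
    new_elements = []
    n_changed = 0
    for elem in elements:
        elem = deepcopy(elem)  # never mutate the caller's documents
        if elem["RequestName"] == request_name and elem["Status"] == "Available":
            elem["SiteWhitelist"] = list(whitelist)
            elem["SiteBlacklist"] = list(blacklist)
            n_changed += 1
        new_elements.append(elem)
    return new_elements, n_changed


# Illustrative data only; the request and site names are made up.
elements = [
    {"RequestName": "wf_A", "Status": "Available",
     "SiteWhitelist": ["T2_X"], "SiteBlacklist": []},
    {"RequestName": "wf_A", "Status": "Acquired",
     "SiteWhitelist": ["T2_X"], "SiteBlacklist": []},
    {"RequestName": "wf_B", "Status": "Available",
     "SiteWhitelist": ["T2_X"], "SiteBlacklist": []},
]
updated, n_changed = propagate_site_lists(elements, "wf_A", ["T2_Y"], ["T2_X"])
```

In this sketch the element already in `Acquired` keeps its old site lists, which matches the "sub-optimal" behavior described above: work already picked up is not touched.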
The list of sub-tickets that we need to consider is:
SiteWhitelist and SiteBlacklist for workflows that are in a state between assigned and running-closed. Do not accept it, though, if the state is staged, such that we can avoid a data race condition (global workqueue with an outdated workflow spec). This update needs to reflect both the JSON document as well as the workload spec object.
Available, for that given workflow. We should likely also update the workload spec persisted in the workqueue (_inbox) database.

With potentially the 5 items above, we can deliver a very first version of this feature, which will update site lists in any work that has NOT yet been acquired by any agents. WorkQueue elements and jobs already materialized in the agents would go through the system without considering the site list update.
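The status gate described above could look roughly like the following sketch. It is only an illustration: the status ordering in `STATUS_ORDER` is my assumption about the states between assigned and running-closed, the request document and the workload spec are modeled as plain dicts, and the function names are hypothetical.

```python
# Assumed ordering of the request statuses between assigned and running-closed.
# This list is an assumption for illustration, not copied from ReqMgr2.
STATUS_ORDER = ["assigned", "staging", "staged", "acquired",
                "running-open", "running-closed"]

# Statuses inside that window where the update is refused, to avoid racing
# with the global workqueue picking up an outdated workflow spec.
BLOCKED_STATUSES = {"staged"}


def can_update_site_lists(status):
    """True if a site list update would be accepted for this status."""
    return status in STATUS_ORDER and status not in BLOCKED_STATUSES


def apply_site_list_update(request_doc, workload_spec, whitelist, blacklist):
    """Update both the JSON-like request document and the spec stand-in."""
    status = request_doc["RequestStatus"]
    if not can_update_site_lists(status):
        raise ValueError("site list update not allowed in status %r" % status)
    # Both representations must change together, otherwise they diverge.
    for target in (request_doc, workload_spec):
        target["SiteWhitelist"] = list(whitelist)
        target["SiteBlacklist"] = list(blacklist)


# Illustrative usage with made-up site names.
doc = {"RequestStatus": "acquired", "SiteWhitelist": ["T2_X"], "SiteBlacklist": []}
spec = {"SiteWhitelist": ["T2_X"], "SiteBlacklist": []}
apply_site_list_update(doc, spec, ["T2_Y"], ["T2_X"])
```

A request in `staged` (or in any status outside the window, such as `new`) would raise `ValueError` here and leave both objects untouched.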
I appreciate any feedback that people might have, especially for functionality/services that I might be missing here.
Hello @amaltaro,
Thank you for providing the list of sub-tickets.
I just want to confirm my understanding of the description. In point #2, you mentioned that ReqMgr will be modified to allow updates to SiteWhitelist and SiteBlacklist between the 'assigned' and 'running-closed' states but not in the 'staged' state. However, from my understanding, a workflow transitions to 'running-closed' once all its Work Queue Elements (WQEs) are picked up by an agent. This implies that changing the SiteWhitelist when the workflow is in the 'running-closed' state would not actually affect where the jobs run. Is that correct?
I understand that this is something that would be tackled in the second part of the issue description, i.e.:
2. when a request is updated, propagate it to local workqueue and to jobs sitting pending in condor.
Thanks a lot @amaltaro and @anpicci! This is much needed. It's reasonable to approach this request in two steps and focus on the first step in Q3. Probably we'll discuss each step in its own issue, but let me make a quick comment for 1.c: You cannot update (update-rule) the rse expression of a rule and keep the same rule id. You can change the rse expression by "moving" (move-rule) a rule, which creates a new rule.
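The practical consequence of the point above is that whatever keeps track of rule ids has to swap in the new id after a move. The toy model below illustrates only that bookkeeping. It does not call Rucio and does not mirror its client API: the class `ToyRuleStore`, its methods, and the data identifier are all hypothetical, and how the real system retires the original rule is not modeled.

```python
import uuid


class ToyRuleStore:
    """In-memory stand-in for a rule catalogue (not the Rucio API)."""

    def __init__(self):
        self.rules = {}

    def add_rule(self, did, rse_expression):
        rule_id = uuid.uuid4().hex
        self.rules[rule_id] = {"did": did, "rse_expression": rse_expression,
                               "child_rule_id": None}
        return rule_id

    def move_rule(self, rule_id, rse_expression):
        """Create a new rule with the new expression and return its id."""
        old = self.rules[rule_id]
        new_id = self.add_rule(old["did"], rse_expression)
        old["child_rule_id"] = new_id  # old entry keeps its original expression
        return new_id


# Illustrative usage: the tracked rule id must be replaced after the move.
store = ToyRuleStore()
tracked = {"rule_id": store.add_rule("hypothetical:dataset", "T2_X")}
old_id = tracked["rule_id"]
tracked["rule_id"] = store.move_rule(old_id, "T2_Y")
```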
Hi @haozturk @hassan11196, thank you for your prompt feedback (and Andrea). Both of your points are valid and they will be considered when we materialize these 5 points into their own GH tickets. Once that is done, I will also update the initial description of this PR, such that it becomes a meta-issue and we can track all of the sub-items to be developed. Thanks again!
Impact of the new feature ReqMgr2, Global WorkQueue, Local WorkQueue
Is your feature request related to a problem? Please describe. This is especially important for long-lived workflows, where sites might come and go and further tweaks to the site lists could be important.
Describe the solution you'd like Support updates to the SiteWhitelist and SiteBlacklist for workflows that have already been assigned (being between assigned and running-closed).
There are two steps that can be taken for this: 1) when a request is updated, only propagate the new site lists to global workqueue elements that are sitting in Available status; 2) when a request is updated, propagate it to local workqueue and to jobs sitting pending in condor.
The following tickets have been materialized for option 1) above:
Describe alternatives you've considered In addition to the steps above, I think we will have to update the relational database as well, including jobs already created but queued in JobSubmitter.
Additional context None