Closed haozturk closed 3 years ago
Hi @haozturk, MSTransferor has similar logic for calculating the available space at the sites. However, it is not an exact copy of the Unified logic, so it can indeed list different sites as out of space. One current difference is that MSTransferor is supposed to use up to 90% of the available quota (while Unified is set to 80%).
About the MSTransferor logic, these are the 3 WMCore APIs to collect the site quotas and calculate the remaining space for DataOps: https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/MicroService/Unified/MSTransferor.py#L125-L127
which are taken from this module: https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/MicroService/Unified/RSEQuotas.py
In plain English, the logic is:
dest_bytes, basically what has already been allocated to DataOps on that site.
Can you please clarify what this information is used for in the workflow assignment?
In my opinion, these numbers do not need to be exactly the same, because they come from different systems that are not meant to mirror each other. In addition, I'd say that reporting storage information in the CompOps meeting should be handled by the Data Management team, not P&R.
BTW, you probably don't know yet, but the WMCore team is working to own all the data management features from Unified, which means Unified will not need to know data location, to know storage status, transfer progress, etc.
@haozturk @amaltaro @todor-ivanov https://github.com/CMSCompOps/WmAgentScripts/issues/623 fyi
@amaltaro if you remember, Unified is the one that provides the SiteWhitelist and the non-custodial site, and therefore it checks the availability of disk space before assigning that site, if that makes sense to you.
For the SiteWhitelist, this information should not make any difference.
For the non-custodial site, I'm in favor of stopping this during assignment. Maybe we could stop right away? Or we could stop once we have Rucio completely in production (such that we can leave that decision to Rucio)? Perhaps Nick @nsmith- has an input here.
In the case that unified does not fill NonCustodialSites, does it instead create a phedex subscription for the output to a specific site? I cannot recall..
AFAIK Unified does a DDM (NonCustodial) output data placement when closing out a workflow (in addition to the tape (PhEDEx?) subscription). If NonCustodialSites is not provided, then the agent will certainly not make any container-level disk data placement.
Can we check what fraction of workflows currently do set NonCustodialSites? If I understand correctly, I expect it to be the majority. In which case we should probably make sure it is always set and then during the transition to rucio we can simply start to ignore this parameter, at least for the purposes of satisfying the "grouping=DATASET" rule (#2 in our list).
@nsmith- Is this something that folks on the Unified side can look into? @haozturk or @z4027163?
@nsmith- @klannon As of now, all the WFs have NonCustodialSites set by Unified.
Given that we have started migrating the DM system to Rucio, and that Unified no longer knows anything about quota, space used and space available, shall we close this issue? If not yet, then could someone please clarify what the action item here is?
Hi Alan, I also do not see any reason to keep this issue after the Rucio transition. We can close it.
I'm closing it then, thanks Hasan.
Impact of the bug
Unified Assignment & Weekly Reporting of Disk Space Information
Describe the bug
Currently, there are two data sources to get a list of out-of-space disk sites:
This information is used by Unified in the assignment of workflows and it is also reported in the weekly CompOps meeting, so it is important to have the correct information.
There are sites that are included in the Unified list but not in the MSTransferor list, and vice versa.
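One way to make the mismatch concrete is to diff the two lists directly. The snippet below is only a sketch: the function name `compare_site_lists` and the site names are hypothetical, and it assumes both lists have already been fetched as plain lists of site names.

```python
def compare_site_lists(unified_sites, mstransferor_sites):
    """Split two out-of-space site lists into common and source-specific entries."""
    unified = set(unified_sites)
    transferor = set(mstransferor_sites)
    return {
        "only_unified": sorted(unified - transferor),
        "only_mstransferor": sorted(transferor - unified),
        "both": sorted(unified & transferor),
    }


# Hypothetical site names, for illustration only.
report = compare_site_lists(
    ["T2_AA_Example", "T2_BB_Example"],
    ["T2_BB_Example", "T2_CC_Example"],
)
for label, sites in report.items():
    print(label, sites)
```

Sites appearing under only_unified or only_mstransferor are the ones where the two sources disagree.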
Here is how unified calculates this information: https://github.com/CMSCompOps/WmAgentScripts/blob/c78be9aea4a374b5b021fc8688a48ceb94ad1eb1/Unified/htmlor.py#L1172-L1190 https://github.com/CMSCompOps/WmAgentScripts/blob/master/utils.py#L2436-L2440
In plain English, it gets the `quota` and `locked/lastCopy` from here, and it gets the `buffer_level` information from the Unified Configuration, which is currently set to `0.8`. If `quota*buffer_level - locked < 0`, it marks the site as not available.
We do not have good knowledge of how MSTransferor gathers this information. Therefore your comments are important here @amaltaro and @todor-ivanov
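The Unified rule described above can be sketched in a few lines of Python. This is not the code from htmlor.py or utils.py: the function name `out_of_space_sites`, the input layout and the numbers are assumptions made for illustration. Only the condition `quota*buffer_level - locked < 0` and the default of 0.8 come from the description.

```python
def out_of_space_sites(site_usage, buffer_level=0.8):
    """Return the sites for which quota * buffer_level - locked < 0.

    site_usage maps a site name to a (quota, locked) pair in the same unit.
    """
    return sorted(
        site
        for site, (quota, locked) in site_usage.items()
        if quota * buffer_level - locked < 0
    )


# Hypothetical numbers in TB, for illustration only.
site_usage = {
    "T2_AA_Example": (1000, 850),
    "T2_BB_Example": (1000, 500),
}

flagged_at_80 = out_of_space_sites(site_usage)        # usable fraction 0.8
flagged_at_90 = out_of_space_sites(site_usage, 0.9)   # usable fraction 0.9
print(flagged_at_80, flagged_at_90)
```

With these made-up numbers, the first site is flagged at a fraction of 0.8 but not at 0.9, which shows how a different usable-quota fraction alone can produce different out-of-space lists, even before any other difference in the calculation.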
How to reproduce it
See the two data sources and compare.
Expected behavior
We need to have a reliable source for Disk space information.