dmwm / WMCore

Core workflow management components for CMS.
Apache License 2.0

Inconsistency between Unified and MSTransferor for the Out of Space Disk Sites Information #9877

Closed haozturk closed 3 years ago

haozturk commented 4 years ago

Impact of the bug: Unified assignment and weekly reporting of disk space information

Describe the bug: Currently, there are two data sources for the list of out-of-space disk sites:

  1. Unified Link which uses http://t3serv001.mit.edu/~cmsprod/IntelROCCS/Detox/SitesInfo.txt
  2. MSTransferor Link

This information is used by Unified in the assignment of workflows, and it is also reported in the weekly CompOps meeting. It is therefore important to have the correct information.

There are sites which are included in the Unified list but not in the MSTransferor list, and vice versa.

Here is how unified calculates this information: https://github.com/CMSCompOps/WmAgentScripts/blob/c78be9aea4a374b5b021fc8688a48ceb94ad1eb1/Unified/htmlor.py#L1172-L1190 https://github.com/CMSCompOps/WmAgentScripts/blob/master/utils.py#L2436-L2440

In plain English, it gets the quota and locked/lastCopy values from here and the buffer_level from the Unified configuration, which is currently set to 0.8. If quota*buffer_level - locked < 0, it marks the site as not available.
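The check described above can be sketched roughly as follows. This is a minimal illustration of the formula only, not the actual Unified code: the function name `is_out_of_space` and its argument names are hypothetical, and the units are assumed to be the same for `quota` and `locked`.

```python
def is_out_of_space(quota, locked, buffer_level=0.8):
    """Return True when a site would be marked as not available.

    Hypothetical helper illustrating the rule quoted in this issue:
    a site is out of space when quota*buffer_level - locked < 0.
    quota and locked must be expressed in the same unit (e.g. TB).
    """
    return quota * buffer_level - locked < 0


# A site with a quota of 100 and 85 locked exceeds the 80% buffer.
print(is_out_of_space(quota=100, locked=85))
# A site with a quota of 100 and 70 locked still has room.
print(is_out_of_space(quota=100, locked=70))
```

With the 0.8 buffer, the first site is flagged and the second is not.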

We do not have a good understanding of how MSTransferor gathers this information, so your comments are important here, @amaltaro and @todor-ivanov.

How to reproduce it: See the two data sources and compare.
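One way to make the comparison concrete is a set difference over the two site lists. The sketch below assumes the two lists have already been extracted by hand from the two data sources; the function name `compare_site_lists` and the site names are placeholders, not real data.

```python
def compare_site_lists(unified_sites, mstransferor_sites):
    """Return the sites that appear in only one of the two lists.

    Hypothetical helper: both arguments are iterables of site names.
    """
    unified = set(unified_sites)
    mstransferor = set(mstransferor_sites)
    only_unified = sorted(unified - mstransferor)
    only_mstransferor = sorted(mstransferor - unified)
    return only_unified, only_mstransferor


# Placeholder site names, for illustration only.
unified_list = ["T2_XX_SiteA", "T2_XX_SiteB"]
mstransferor_list = ["T2_XX_SiteB", "T2_XX_SiteC"]

only_unified, only_mstransferor = compare_site_lists(unified_list, mstransferor_list)
print("Only in Unified:", only_unified)
print("Only in MSTransferor:", only_mstransferor)
```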

Expected behavior: We need to have a reliable source for disk space information.

amaltaro commented 4 years ago

Hi @haozturk, MSTransferor has similar logic for calculating the available space at the sites. However, it is not an exact copy of the Unified logic, so it can indeed list different sites as out-of-space. One current difference is that MSTransferor is supposed to use up to 90% of the available quota (while Unified is set to 80%).
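To illustrate how the 80% versus 90% setting alone can produce different out-of-space lists, here is a small sketch. It assumes, purely for illustration, that both systems apply the same quota*fraction - locked < 0 rule and differ only in the fraction; the real MSTransferor logic lives in RSEQuotas.py and may differ in other ways. The function name, site names, and numbers are all hypothetical.

```python
def out_of_space_sites(sites, usable_fraction):
    """Return the sorted names of sites where locked data exceeds the usable quota.

    sites maps a site name to a (quota, locked) pair in the same unit.
    Hypothetical helper; assumes the rule quota*usable_fraction - locked < 0.
    """
    return sorted(
        name
        for name, (quota, locked) in sites.items()
        if quota * usable_fraction - locked < 0
    )


# Placeholder sites and numbers, for illustration only.
sites = {
    "T2_XX_SiteA": (1000, 700),
    "T2_XX_SiteB": (1000, 850),
    "T2_XX_SiteC": (1000, 950),
}

unified_like = out_of_space_sites(sites, usable_fraction=0.8)
mstransferor_like = out_of_space_sites(sites, usable_fraction=0.9)
print("80% setting:", unified_like)
print("90% setting:", mstransferor_like)
```

In this example the site with 850 locked out of 1000 is flagged only under the 80% setting, which is the kind of mismatch reported in this issue.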

About the MSTransferor logic, these are the 3 WMCore APIs to collect the site quotas and calculate the remaining space for DataOps: https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/MicroService/Unified/MSTransferor.py#L125-L127

which are taken from this module: https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/MicroService/Unified/RSEQuotas.py

In plain English, the logic is:

Can you please clarify what this information is used for in the workflow assignment?

In my opinion, these numbers do not need to be exactly the same because they are different systems and not meant to be a mirror of each other. In addition to that, I'd say that reporting Storage information in the CompOps meeting should be carried by the Data Management team, not P&R.

BTW, you probably don't know yet, but the WMCore team is working to own all the data management features from Unified, which means Unified will not need to know data location, to know storage status, transfer progress, etc.

sharad1126 commented 4 years ago

@haozturk @amaltaro @todor-ivanov https://github.com/CMSCompOps/WmAgentScripts/issues/623 fyi

sharad1126 commented 4 years ago

@amaltaro if you remember, Unified is the one that provides the SiteWhitelist and the non-custodial site, and therefore it does check the availability of disk space before assigning that site, if that makes sense to you.

amaltaro commented 4 years ago

For the SiteWhitelist, this information should not make any difference.

For the non-custodial site, I'm in favor of stopping this during assignment. Maybe we could stop right away? Or we could stop once we have Rucio completely in production (such that we can leave that decision to Rucio)? Perhaps Nick @nsmith- has an input here.

nsmith- commented 4 years ago

In the case that unified does not fill NonCustodialSites, does it instead create a phedex subscription for the output to a specific site? I cannot recall..

amaltaro commented 4 years ago

AFAIK Unified does a DDM (NonCustodial) output data placement when closing out a workflow (in addition to the tape (PhEDEx?) subscription). If NonCustodialSites is not provided, then the agent will certainly not make any container-level disk data placement.

nsmith- commented 4 years ago

Can we check what fraction of workflows currently do set NonCustodialSites? If I understand correctly, I expect it to be the majority. In which case we should probably make sure it is always set and then during the transition to rucio we can simply start to ignore this parameter, at least for the purposes of satisfying the "grouping=DATASET" rule (#2 in our list).

klannon commented 4 years ago

@nsmith- Is this something that folks on the Unified side can look into? @haozturk or @z4027163?

z4027163 commented 4 years ago

@nsmith- @klannon As of now, all the WFs have NonCustodialSites set by Unified.

amaltaro commented 3 years ago

Given that we have started migrating the DM system to Rucio, and that Unified no longer knows anything about quota or about used and available space, shall we close this issue? If not yet, could someone please clarify what the actionable item here is?

haozturk commented 3 years ago

Hi Alan, I also do not see any reason to keep this issue after the Rucio transition. We can close it.

amaltaro commented 3 years ago

I'm closing it then, thanks Hasan.