Closed thongonary closed 3 years ago
After a discussion with CompOps, a safer (but slightly more complicated) solution is to adjust the site selection algorithm in Unified so as to avoid the RAL situation, where Unified does not know whether the output dataset will fit into the remaining disk space.
At the moment the availability is computed from http://t3serv001.mit.edu/~cmsprod/IntelROCCS/Detox/SitesInfo.txt (`Availability = Quota*0.8 - LastCopy`).
To be followed up with @nsmith-
@thongonary you mean to improve the algorithm that selects a site and assigns the parameter AutoApproveSubscriptionSites. https://github.com/CMSCompOps/WmAgentScripts/blob/master/Unified/assignor.py#L515
@sharad1126 Nope https://github.com/CMSCompOps/WmAgentScripts/blob/master/Unified/assignor.py#L514 which eventually depends on this https://github.com/CMSCompOps/WmAgentScripts/blob/master/utils.py#L2617
Looks like it selects a list of available sites independently of the considered workflow, so it would be hard to take each dataset's size into account.
@nsmith- Is the formula `Availability = Quota*0.8 - LastCopy` correct? Based on the current info from http://t3serv001.mit.edu/~cmsprod/IntelROCCS/Detox/SitesInfo.txt, RAL availability would be 5250*0.8 - 2987 = 1213 TB, but in reality it should be 0 TB?
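For reference, the current Detox-based formula with the RAL numbers quoted above (quota and LastCopy in TB, taken from this thread) can be sketched as:

```python
# Minimal sketch of the current availability formula from SitesInfo.txt,
# Availability = Quota*0.8 - LastCopy. Numbers are the RAL values quoted
# in this thread (TB); the function name is illustrative, not from Unified.
def detox_availability(quota_tb, last_copy_tb):
    """Available space as computed from the Detox SitesInfo.txt numbers."""
    return quota_tb * 0.8 - last_copy_tb

ral = detox_availability(5250, 2987)
print(ral)  # 1213.0 TB on paper, even though RAL is effectively full
```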
Hi,
Would it be possible to use `availability = quota*0.8 - subscribed`, where `subscribed` is taken as either `cust_dest_bytes` for tape sites or `noncust_dest_bytes` for disk sites, from https://cmsweb.cern.ch/phedex/datasvc/doc/nodeusage ?
E.g. https://cmsweb.cern.ch/phedex/datasvc/json/prod/nodeusage?node=T1_UK_RAL_Disk would give 5250*0.8 - 4244 < 0, i.e. out of space.
In this way, the metric counts projected usage due to subscriptions that have not finished transferring, rather than current usage.
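A sketch of this proposal, parsing a nodeusage response. The JSON layout used below (`{"phedex": {"node": [...]}}` with `cust_dest_bytes`/`noncust_dest_bytes` fields) is my reading of the data service docs and should be checked against a real response:

```python
# Sketch of the proposed metric, availability = quota*0.8 - subscribed,
# where `subscribed` comes from the PhEDEx nodeusage data service.
# The payload shape is an assumption based on
# https://cmsweb.cern.ch/phedex/datasvc/doc/nodeusage
def proposed_availability(nodeusage_json, quota, tape=False):
    node = nodeusage_json["phedex"]["node"][0]
    # Projected (destined) bytes: custodial for tape sites,
    # non-custodial for disk sites.
    subscribed = node["cust_dest_bytes"] if tape else node["noncust_dest_bytes"]
    return quota * 0.8 - subscribed

# Example with the RAL numbers from this thread (kept in TB for readability).
payload = {"phedex": {"node": [{"name": "T1_UK_RAL_Disk",
                                "cust_dest_bytes": 0,
                                "noncust_dest_bytes": 4244}]}}
print(proposed_availability(payload, 5250))  # -44.0, i.e. out of space
```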
Actually, https://cmsweb.cern.ch/phedex/datasvc/doc/groupusage should be used as well, i.e. `dest_bytes` from https://cmsweb.cern.ch/phedex/datasvc/json/prod/groupusage?node=T1_UK_RAL_Disk&group=DataOps
This way the group usage limit is respected, and in case the node is oversubscribed by other groups, Unified does not exacerbate the issue.
Nick, just to make sure I understand the output of this API: `dest_bytes` says how much data has been subscribed to that node+group, while `node_bytes` says how much data is already stored at that node+group, and their difference is what is left to be transferred. Am I reading that API correctly?
Exactly
Nick, so what should we use? `dest_bytes` from `nodeusage`? `dest_bytes` from `groupusage`? Both? (How?)
I would propose `available = min(quota - node_usage['noncust_dest_bytes'], quota*0.8 - group_usage['dest_bytes'])`
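The combined metric above, taking the more restrictive of the node-level and group-level projections, can be sketched as follows. The field names follow the nodeusage/groupusage services as discussed in this thread; treat them as assumptions to verify against real responses:

```python
# Sketch of the proposed combined metric: the minimum of the node-level
# projection (full quota minus all non-custodial destined bytes) and the
# group-level projection (80% of quota minus the group's destined bytes).
def available_space(quota, node_usage, group_usage):
    return min(quota - node_usage["noncust_dest_bytes"],
               quota * 0.8 - group_usage["dest_bytes"])

# With the RAL numbers quoted in this thread (TB):
print(available_space(5250,
                      {"noncust_dest_bytes": 4244},
                      {"dest_bytes": 4244}))  # min(1006, -44.0) = -44.0
```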
You absolutely need a fudge factor to mitigate the "dest bytes"; the way proposed above is going to set many more sites to available=0 in practice, while there is some dynamically available space. "dest bytes" tells you that in N days there will be that much more data at the site, but nothing tells you how much will also be gone by then, and nothing tells you N.
Taking into account oversubscription from other groups to limit DataOps usage does not seem reasonable to me; if there is a quota for DataOps, and DataOps is not going overboard, nothing else should prevent it from operating.
The 0.80 factor was exactly meant for this: leaving 20% of the DataOps quota for things to fluctuate, and as far as I can tell this has worked pretty well all along. Are you trying to change something because there was a transfer catastrophe at RAL? (That would sound like the issue needs to be solved somewhere else.)
N.B. the choice of sites_out is done at random, using that "available" metric as the weight.
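The weighted random choice mentioned here can be sketched as follows; the function name and the site numbers are made up for illustration, not taken from Unified:

```python
import random

# Sketch of a weighted random site choice with "available" as the weight,
# as described above. Negative availability is clamped to zero so that
# full sites can never be drawn.
def pick_site(availability):
    """availability: dict site -> available TB (may be negative)."""
    sites = list(availability)
    weights = [max(avail, 0) for avail in availability.values()]
    if sum(weights) == 0:
        return None  # nowhere to go
    return random.choices(sites, weights=weights, k=1)[0]

sites = {"T1_UK_RAL_Disk": 0, "T1_US_FNAL_Disk": 1200, "T2_CH_CERN": 800}
print(pick_site(sites))  # never RAL: its weight is zero
```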
To solve THE problem that was raised here, all ingredients are available to estimate the size of the final output dataset and to prevent using a site that effectively does not currently have enough space to accommodate it.
In light of the fudge factor, I would say you could then remove the 0.8 and just say `available = quota - group_usage['dest_bytes']`, since the metric represents the upper limit on the amount of data that will be at the site in the coming days. The benefit is that if the queue starts to have runaway growth for whatever reason, Unified will not exacerbate the issue. Now, indeed, we probably should not consider other groups' space (at the least because the `quota` for Unified is ~1/3 of the total space allocated to a site). So I would revise my suggestion to
`available = quota - group_usage['dest_bytes']`
It would also be very nice if `available - expected_output_size > 0` was checked.
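The suggested per-workflow check could look like the sketch below: keep only sites whose projected availability exceeds the estimated output dataset size. Function and variable names are illustrative, not from Unified:

```python
# Sketch of the proposed per-workflow space check: filter candidate
# sites so that available - expected_output_size > 0 always holds.
def sites_with_room(availability, expected_output_size):
    """availability: dict site -> available TB; expected size in TB."""
    return [site for site, avail in availability.items()
            if avail - expected_output_size > 0]

avail = {"T1_UK_RAL_Disk": -44.0, "T1_US_FNAL_Disk": 1200.0}
print(sites_with_room(avail, 150.0))  # ['T1_US_FNAL_Disk']
```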
using "expected_output_size" is the only reasonable way
Output data placement was removed from Unified and therefore closing this issue.
At the moment NonCustodialSites is chosen at random from the available/allowed T1 sites. We should change the default setting to make it empty unless specified in the campaign configuration (e.g. for premix library production to be sent to FNAL and CERN).