Closed thongonary closed 3 years ago
After a discussion with CompOps, a safer (but slightly more complicated) solution is to adjust the site selection algorithm in Unified so as to avoid the RAL situation, where Unified does not know whether the output dataset will fit into the remaining disk space.
At the moment the availability is computed from http://t3serv001.mit.edu/~cmsprod/IntelROCCS/Detox/SitesInfo.txt (`Availability = Quota*0.8 - LastCopy`).
To be followed up with @nsmith-
@thongonary you mean to improve the algorithm that selects a site and assigns the parameter AutoApproveSubscriptionSites. https://github.com/CMSCompOps/WmAgentScripts/blob/master/Unified/assignor.py#L515
@sharad1126 Nope https://github.com/CMSCompOps/WmAgentScripts/blob/master/Unified/assignor.py#L514 which eventually depends on this https://github.com/CMSCompOps/WmAgentScripts/blob/master/utils.py#L2617
Looks like it selects a list of available sites independently of the considered workflow, so it would be hard to take each dataset's size into account.
@nsmith- Is the formula `Availability = Quota*0.8 - LastCopy` correct? Based on the current info from http://t3serv001.mit.edu/~cmsprod/IntelROCCS/Detox/SitesInfo.txt, RAL availability would be 5250*0.8 - 2987 = 1213 TB, but in reality it should be 0 TB?
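For reference, the current Detox-based formula with the RAL numbers quoted above (quota and LastCopy in TB, taken from this thread) can be sketched as:

```python
# Minimal sketch of the current availability formula from SitesInfo.txt,
# Availability = Quota*0.8 - LastCopy. Numbers are the RAL values quoted
# in this thread (TB); the function name is illustrative, not from Unified.
def detox_availability(quota_tb, last_copy_tb):
    """Available space as computed from the Detox SitesInfo.txt numbers."""
    return quota_tb * 0.8 - last_copy_tb

ral = detox_availability(5250, 2987)
print(ral)  # 1213.0 TB on paper, even though RAL is effectively full
```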
Hi,
Would it be possible to use `availability = quota*0.8 - subscribed`, where `subscribed` is taken as either `cust_dest_bytes` for tape sites or `noncust_dest_bytes` for disk sites, from https://cmsweb.cern.ch/phedex/datasvc/doc/nodeusage ?
E.g. https://cmsweb.cern.ch/phedex/datasvc/json/prod/nodeusage?node=T1_UK_RAL_Disk would give 5250*0.8 - 4244 < 0, i.e. out of space.
In this way, the metric counts projected usage due to subscriptions that have not finished transferring, rather than current usage.
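A sketch of this proposal, parsing a nodeusage response. The JSON layout used below (`{"phedex": {"node": [...]}}` with `cust_dest_bytes`/`noncust_dest_bytes` fields) is my reading of the data service docs and should be checked against a real response:

```python
# Sketch of the proposed metric, availability = quota*0.8 - subscribed,
# where `subscribed` comes from the PhEDEx nodeusage data service.
# The payload shape is an assumption based on
# https://cmsweb.cern.ch/phedex/datasvc/doc/nodeusage
def proposed_availability(nodeusage_json, quota, tape=False):
    node = nodeusage_json["phedex"]["node"][0]
    # Projected (destined) bytes: custodial for tape sites,
    # non-custodial for disk sites.
    subscribed = node["cust_dest_bytes"] if tape else node["noncust_dest_bytes"]
    return quota * 0.8 - subscribed

# Example with the RAL numbers from this thread (kept in TB for readability).
payload = {"phedex": {"node": [{"name": "T1_UK_RAL_Disk",
                                "cust_dest_bytes": 0,
                                "noncust_dest_bytes": 4244}]}}
print(proposed_availability(payload, 5250))  # -44.0, i.e. out of space
```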
Actually, https://cmsweb.cern.ch/phedex/datasvc/doc/groupusage should be used as well, i.e. `dest_bytes` from https://cmsweb.cern.ch/phedex/datasvc/json/prod/groupusage?node=T1_UK_RAL_Disk&group=DataOps
This way the group usage limit is respected, and in case the node is oversubscribed by other groups, Unified does not exacerbate the issue.
Nick, just to make sure I understand the output of this API: `dest_bytes` says how much data has been subscribed to that node+group, while `node_bytes` says how much data is already stored at that node+group, and their difference is what is left to be transferred. Am I reading that API correctly?
Exactly
Nick, so what should we use? `dest_bytes` from `nodeusage`? `dest_bytes` from `groupusage`? Both? (How?)
I would propose `available = min(quota - node_usage['noncust_dest_bytes'], quota*0.8 - group_usage['dest_bytes'])`
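The combined metric above, taking the more restrictive of the node-level and group-level projections, can be sketched as follows. The field names follow the nodeusage/groupusage services as discussed in this thread; treat them as assumptions to verify against real responses:

```python
# Sketch of the proposed combined metric: the minimum of the node-level
# projection (full quota minus all non-custodial destined bytes) and the
# group-level projection (80% of quota minus the group's destined bytes).
def available_space(quota, node_usage, group_usage):
    return min(quota - node_usage["noncust_dest_bytes"],
               quota * 0.8 - group_usage["dest_bytes"])

# With the RAL numbers quoted in this thread (TB):
print(available_space(5250,
                      {"noncust_dest_bytes": 4244},
                      {"dest_bytes": 4244}))  # min(1006, -44.0) = -44.0
```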
You absolutely need a fudge factor to mitigate the "dest bytes"; the way proposed above is going to set many more sites to available=0 in practice, while there is some dynamically available space. "dest bytes" tells you that in N days there will be that much more data at the site, but nothing tells you how much will also be gone by then, and nothing tells you N.
Taking into account oversubscription from other groups to limit DataOps usage does not seem reasonable to me; if there is a quota for DataOps, and DataOps is not going overboard, nothing else should prevent it from operating.
The 0.80 factor was exactly meant for this: leaving 20% of the DataOps quota for things to fluctuate, and as far as I can tell this has worked pretty well all along. Are you trying to change something because there was a transfer catastrophe at RAL? (That would sound like the issue needs to be solved somewhere else.)
N.B. the choice of sites_out is done at random, using that "available" metric as the weight.
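The weighted random choice mentioned here can be sketched as follows; the function name and the site numbers are made up for illustration, not taken from Unified:

```python
import random

# Sketch of a weighted random site choice with "available" as the weight,
# as described above. Negative availability is clamped to zero so that
# full sites can never be drawn.
def pick_site(availability):
    """availability: dict site -> available TB (may be negative)."""
    sites = list(availability)
    weights = [max(avail, 0) for avail in availability.values()]
    if sum(weights) == 0:
        return None  # nowhere to go
    return random.choices(sites, weights=weights, k=1)[0]

sites = {"T1_UK_RAL_Disk": 0, "T1_US_FNAL_Disk": 1200, "T2_CH_CERN": 800}
print(pick_site(sites))  # never RAL: its weight is zero
```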
To solve THE problem that was raised here, all ingredients are available to estimate the size of the final output dataset and to prevent using a site that effectively does not currently have enough space to accommodate it.
In light of the fudge factor, I would say you could then remove the 0.8 and just say `available = quota - group_usage['dest_bytes']`, since the metric represents the upper limit on the amount of data that will be at the site in the coming days. The benefit is that if the queue starts to have runaway growth for whatever reason, Unified will not exacerbate the issue. Now, indeed, we probably should not consider other groups' space (at the least because the `quota` for Unified is ~1/3 of the total space allocated to a site). So I would revise my suggestion to
`available = quota - group_usage['dest_bytes']`
It would also be very nice if `available - expected_output_size > 0` was checked.
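The suggested per-workflow check could look like the sketch below: keep only sites whose projected availability exceeds the estimated output dataset size. Function and variable names are illustrative, not from Unified:

```python
# Sketch of the proposed per-workflow space check: filter candidate
# sites so that available - expected_output_size > 0 always holds.
def sites_with_room(availability, expected_output_size):
    """availability: dict site -> available TB; expected size in TB."""
    return [site for site, avail in availability.items()
            if avail - expected_output_size > 0]

avail = {"T1_UK_RAL_Disk": -44.0, "T1_US_FNAL_Disk": 1200.0}
print(sites_with_room(avail, 150.0))  # ['T1_US_FNAL_Disk']
```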
using "expected_output_size" is the only reasonable way
Output data placement was removed from Unified and therefore closing this issue.
At the moment NonCustodialSites is chosen at random from the available/allowed T1 sites. We should change the default setting to make it empty unless specified in the campaign configuration (e.g. for premix library production to be sent to FNAL and CERN).