dmwm / PHEDEX

CMS data-placement suite

Why pull off tape when many disk replicas exist? #1100

Closed DAMason closed 6 years ago

DAMason commented 6 years ago

I was trying to understand why we currently have a large recall queue at FNAL by looking at what is being asked for (at the moment I am looking at the active encp's on the pools). The list of datasets being requested is a bit non-intuitive -- for example, the top dataset at the moment is:

For dataset: /ZeroBias/Run2016H-PromptReco-v2/AOD
Site: T2_FR_GRIF_IRFU      Complete 100.0
Site: T2_US_UCSD           Complete 100.0
Site: T1_US_FNAL_Buffer    Complete 100.0
Site: T2_UK_SGrid_Bristol  Complete 100.0
Site: T2_US_MIT            Complete 100.0
Site: T1_US_FNAL_MSS       Complete 100.0
Site: T2_IT_Legnaro        Complete 58.8
Site: T1_DE_KIT_Disk       Complete 100.0
Site: T2_FI_HIP            Complete 2.9
Site: T2_BR_UERJ           Complete 99.1
Site: T2_FR_GRIF_LLR       Complete 95.1
Site: T1_US_FNAL_Disk      Complete 100.0

There are many replicas at reliable disk sources (FNAL disk, US T2's, other T1's...). Why should we ever also be trying to pull this off of tape? Is there a link-weighting problem for tape sources somewhere?

sidnarayanan commented 6 years ago

CNAF is also observing the same thing. Not yet clear what's causing it. There haven't been any changes in the router config recently as far as I know...

nataliaratnikova commented 6 years ago

Looking at the routing activity table for this dataset [1], nothing is currently routed from any Tier-1. The only destination for this DS currently is T2_FR_GRIF_LLR, and all blocks are routed from T2 sites.

Could triple-A be causing this?

[1] https://cmsweb.cern.ch/phedex/prod/Activity::Routing?tofilter=.*&fromfilter=.*&priority=any&blockfilter=%2FZeroBias%2FRun2016H-PromptReco-v2%2FAOD&.submit=Update#

DAMason commented 6 years ago

xrootd should certainly know nothing about the tape instance at FNAL.


DAMason commented 6 years ago

These files have a large number of attempts — could the encp’s I’m seeing be from earlier attempts that PhEDEx gave up on? (That would actually be worse — we’d be staging tapes that are doubly unneeded.)


sidnarayanan commented 6 years ago

Yeah, I'm guessing PhEDEx gave up on the ones from that dataset. If I look at what is currently routed from FNAL_Buffer, it's tons of stuff that shouldn't be coming from tape [1]. Lots of 2017 data, (MINI)AOD(SIM), etc. Picking one at random, I see 3 full disk copies, and yet it is routed from FNAL_MSS [2].

[1] https://cmsweb.cern.ch/phedex/prod/Activity::Routing?tofilter=.*&fromfilter=T1_US_FNAL_Buffer&priority=any&showinvalid=on&blockfilter=&.submit=Update#
[2] https://cmsweb.cern.ch/phedex/datasvc/json/prod/subscriptions?dataset=/JetHT/Run2016G-07Aug17-v1/MINIAOD
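For concreteness, here is a rough sketch (not an official tool) of how one could count complete disk replicas of a dataset through the data service. It uses the blockreplicas API rather than subscriptions, and the response field names (phedex/block/replica/node/complete) are my reading of the datasvc documentation, so treat them as assumptions to verify:

```perl
#!/usr/bin/env perl
# Rough helper, not an official tool: count complete replicas of a dataset
# via the PhEDEx data service blockreplicas API and list the disk ones.
# The response field names are assumptions -- verify against the datasvc docs.
use strict;
use warnings;
use LWP::UserAgent;
use JSON;

my $dataset = '/JetHT/Run2016G-07Aug17-v1/MINIAOD';
my $url = 'https://cmsweb.cern.ch/phedex/datasvc/json/prod/blockreplicas'
        . "?dataset=$dataset";

my $resp = LWP::UserAgent->new->get($url);
die "datasvc query failed: " . $resp->status_line unless $resp->is_success;
my $data = decode_json($resp->decoded_content);

my %complete_blocks;                          # node => number of complete block replicas
for my $block (@{ $data->{phedex}{block} || [] }) {
    for my $rep (@{ $block->{replica} || [] }) {
        next unless ($rep->{complete} // '') eq 'y';   # fully resident replicas only
        $complete_blocks{ $rep->{node} }++;
    }
}

# Report disk copies only: skip the tape-backed Buffer/MSS endpoints.
for my $node (sort keys %complete_blocks) {
    next if $node =~ /_(Buffer|MSS)$/;
    printf "%-25s %d complete blocks\n", $node, $complete_blocks{$node};
}
```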

nataliaratnikova commented 6 years ago

Sid, thanks for the example. I do not think the router can decide based on the dataset name (AOD, etc.). I will see if I can figure out the link weights from the router agent log.

Dave, you should be able to see from the local stager agent logs whether and when it tried to re-stage the file. By default the stager will "forget" about staged files after 8 hours; you can adjust this with the -stage-stale option: https://github.com/dmwm/PHEDEX/blob/master/Toolkit/Transfer/FileStager#L49-L50
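As an illustration only (this is not the actual FileStager code), the behaviour described above boils down to something like the following, with the 8-hour default standing in for the -stage-stale setting:

```perl
# Illustration only, not the actual FileStager code: the stager remembers
# that a file is staged and drops ("forgets") that knowledge once it is
# older than the stale window, after which the file can be staged again.
use strict;
use warnings;

my $stage_stale = 8 * 3600;   # default 8 hours; the -stage-stale option tunes this
my %staged_at;                # file => epoch time we last confirmed it staged

sub remember_staged {
    my ($file) = @_;
    $staged_at{$file} = time();
}

sub still_staged_as_far_as_we_know {
    my ($file) = @_;
    return 0 unless exists $staged_at{$file};
    if (time() - $staged_at{$file} > $stage_stale) {
        delete $staged_at{$file};    # stale: forget, so a new request re-stages it
        return 0;
    }
    return 1;
}
```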

sidnarayanan commented 6 years ago

The dataset name should not have anything to do with what the router decides. I was trying to point out that these are data tiers that are already replicated on disk, and therefore should not be recalled from tape.

nataliaratnikova commented 6 years ago

Okay, I got your point. As far as I can tell, the Router considers all available sources, including T1_*_Buffer nodes, and chooses the link with the minimal cost. It simply adds a half-hour penalty for files that need staging: https://github.com/dmwm/PHEDEX/blob/master/perl_lib/PHEDEX/Infrastructure/FileRouter/Agent.pm#L1044-L1051 If you want the disk-only sources to outweigh the Buffer nodes, we could try adjusting this penalty.
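To illustrate why this bites (a toy sketch, not the FileRouter code; the node names and latencies here are invented), with only a 0.5-hour penalty a lightly loaded Buffer node can still come out cheaper than busier disk nodes, so the tape copy gets chosen:

```perl
# Toy sketch, not the FileRouter code; node names and latencies are invented.
# The router picks the cheapest source, adding a fixed penalty to sources
# that must stage from tape first. With only a 0.5 h penalty, the Buffer
# node still wins here despite perfectly good disk copies.
use strict;
use warnings;

my $STAGE_PENALTY_HOURS = 0.5;    # the hard-coded value under discussion

my @candidates = (
    { node => 'T1_US_FNAL_Disk',   latency => 1.2, needs_staging => 0 },
    { node => 'T2_US_UCSD',        latency => 0.9, needs_staging => 0 },
    { node => 'T1_US_FNAL_Buffer', latency => 0.3, needs_staging => 1 },
);

for my $c (@candidates) {
    $c->{cost} = $c->{latency}
               + ($c->{needs_staging} ? $STAGE_PENALTY_HOURS : 0);
}

my ($best) = sort { $a->{cost} <=> $b->{cost} } @candidates;
printf "chosen source: %s (cost %.1f h)\n", $best->{node}, $best->{cost};
# prints: chosen source: T1_US_FNAL_Buffer (cost 0.8 h)
```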

vlimant commented 6 years ago

+1 on making this penalty 10 thousand hours to prevent tape copies from being considered a good source

DAMason commented 6 years ago

Today I'm again seeing encp recalls at FNAL very much dominated by things that also exist on disk, even at FNAL_Disk. I would bet fixing this goes a long way toward settling any tape-recall problems CMS has -- we should set the penalty high enough that all functional disk replicas are tried first, but not so high that a tape replica is excluded when the only disk replicas are at broken or badly backlogged sites. Not knowing the distribution, I won't offer a number :) I would put addressing this at high priority, possibly just behind the secret 4th queue.

DAMason commented 6 years ago

Actually, now that I go and look at the code @nataliaratnikova referenced -- assuming half an hour for unstaged data is ridiculous! Half a day or a day is maybe as low as I would ever have thought there. Maybe the real number is something like longer than 90% of the "from disk" transfer latencies? But I guess I'd need to see what that cost function looks like. Is there a data service query to see what these numbers look like?

nataliaratnikova commented 6 years ago

Hi Dave, https://cmsweb.cern.ch/phedex/datasvc/perl/prod/routerhistory shows the last hour's numbers for the rate and latency used in the cost calculation per link. See https://cmsweb.cern.ch/phedex/datasvc/doc/routerhistory for more filters.

In the last hour, the latency varies from 0 to 7 days.
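In case it helps, here is a hedged sketch of pulling those numbers programmatically. It uses the JSON form of the same routerhistory call and, since I'm not certain of the exact response layout, simply walks the structure and collects every field literally named "latency" (an assumption, as is the unit of seconds; check the doc page above for the real schema):

```perl
# Hedged sketch: fetch the JSON form of routerhistory and summarise the
# per-link latencies. The "latency" field name and seconds unit are
# assumptions -- check the datasvc doc page for the real schema.
use strict;
use warnings;
use LWP::UserAgent;
use JSON;
use List::Util qw(min max);
use Scalar::Util qw(looks_like_number);

my $url  = 'https://cmsweb.cern.ch/phedex/datasvc/json/prod/routerhistory';
my $resp = LWP::UserAgent->new->get($url);
die "datasvc query failed: " . $resp->status_line unless $resp->is_success;
my $data = decode_json($resp->decoded_content);

my (@latencies, @todo);
@todo = ($data);
while (@todo) {
    my $item = shift @todo;
    if (ref $item eq 'HASH') {
        push @latencies, $item->{latency}
            if defined $item->{latency} && looks_like_number($item->{latency});
        push @todo, grep { ref } values %$item;
    }
    elsif (ref $item eq 'ARRAY') {
        push @todo, grep { ref } @$item;
    }
}

die "no latency fields found; check the routerhistory doc for the layout\n"
    unless @latencies;
printf "links: %d   latency min: %.0f s   max: %.0f s (%.1f days)\n",
    scalar(@latencies), min(@latencies), max(@latencies), max(@latencies) / 86400;
```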

I'll see how easy it would be to pass the staging penalty to the Router as an option, instead of a hard-coded value.
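Purely as a sketch of the shape of that change (not the PhEDEx agent framework), reading the penalty from a command-line option could look like this, where --stage-penalty is a hypothetical option name:

```perl
# Sketch only, not the PhEDEx agent framework: read the staging penalty
# from a (hypothetical) --stage-penalty option, falling back to the
# current hard-coded default of 0.5 hours.
use strict;
use warnings;
use Getopt::Long;

my $stage_penalty_hours = 0.5;    # current hard-coded default
GetOptions('stage-penalty=f' => \$stage_penalty_hours)
    or die "usage: $0 [--stage-penalty HOURS]\n";

printf "staging penalty: %.1f hours\n", $stage_penalty_hours;
# The router would then add $stage_penalty_hours, rather than the constant,
# when costing sources that still need staging.
```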

DAMason commented 6 years ago

Just checking in — where are we on this?

nataliaratnikova commented 6 years ago

I'm done with the new priority queue. This one is next on my list. If you have figured out the desired number, I can put it in right away as the new default. Since this is a trivial change, we could also ask the T0 PhEDEx operators to patch the FileRouter in place to put this change into action.