dmwm / PHEDEX

CMS data-placement suite

ID of files without replica #1095

Open sidnarayanan opened 6 years ago

sidnarayanan commented 6 years ago

An initial check identified 1.2M blocks in TMDB that do not have complete replicas [1]. Many of these files are intentionally deleted (deletion campaigns, datasets with finite lifetime, testing, etc). There should be a mechanism by which we can easily identify such files and invalidate them. These missing files lead to stuck transfers, which are very painful to identify and clean up manually. Ideally, the invalidation is not totally automatic (to prevent invalidation of /store/data that might be important).

[1] http://t3serv001.mit.edu/~snarayan/misc/missing_blocks.txt

nataliaratnikova commented 6 years ago

Hi Sid, what procedure did you use to get the list of blocks? And how should the output be interpreted, e.g. what is the difference between None and 0 in the last column:

/SingleElectron/Run2016E-PromptReco-v2/RECO%23631b4b68-50c7-11e6-b709-001e67abf228 9090970 None

and

/SingleElectron/Run2016D-HcalCalIsoTrkFilter-23Sep2016-v1/ALCARECO%23c9d32bc6-beba-11e6-9a38-02163e01820e 6120303 0

Thanks, Natalia.

sidnarayanan commented 6 years ago

I used the dynamo database, which has a full record of block replicas, to look for blocks that have no complete replicas. None in the last column means there are no subscriptions whatsoever, whereas 0 means there is an incomplete subscription. In either case, the block is not present anywhere on disk or tape.
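
To make the None/0 convention concrete, here is a minimal Python sketch that classifies rows of the list. It assumes each row has exactly three whitespace-separated fields, as in the two rows quoted above. The meaning of the middle column is not stated in this thread, so it is simply carried along as an integer. The function names and status labels are hypothetical, and decoding `%23` to `#` is an assumption about how the block names were escaped.

```python
from collections import Counter
from urllib.parse import unquote


def classify(line):
    """Split one row into (block_name, middle_value, status).

    Assumes three whitespace-separated fields; raises ValueError otherwise.
    """
    block, middle, last = line.split()
    if last == "None":
        status = "no_subscription"          # no subscriptions whatsoever
    elif last == "0":
        status = "incomplete_subscription"  # a subscription exists but is incomplete
    else:
        status = "other"                    # anything not described in the thread
    # Assumption: "%23" in the list is a URL-escaped "#".
    return unquote(block), int(middle), status


def summarize(lines):
    """Count rows per status, skipping blank lines."""
    return Counter(classify(line)[2] for line in lines if line.strip())
```

Calling `summarize(open("missing_blocks.txt"))` on a local copy of the list would then give the number of blocks in each category.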

nataliaratnikova commented 6 years ago

Hi Sid, the PhEDEx blockarrive API https://cmsweb.cern.ch/phedex/datasvc/doc/blockarrive provides an easy way to produce a list of blocks that have at least one file with no replica; see the comment on basis=-6 there. I attach a list of blocks obtained with this method, which is also in my public area at CERN:

-bash-4.1$ zcat ~ratnik/public/forSid/missing_blocks_basis-6.gz | wc -l
2207

As you see, it is much shorter than your original list: 2.2k vs 1.2M. I wonder if such a drastic difference is because you are in the process of invalidating the missing files, or because of a difference in the procedures, or in the definition of "no replica/no source" blocks? Can you tell me how my list (taken today, Aug 21st) compares with your current records? Thanks, Natalia.

missing_blocks_basis-6.gz
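
A comparison of the two lists could be sketched as below. This is only an illustration under stated assumptions: that both files are plain text (optionally gzipped) with the block name as the first whitespace-separated field on each line, and that `%23` in one list corresponds to `#` in the other. The helper names `load_blocks` and `compare` are hypothetical.

```python
import gzip
from urllib.parse import unquote


def load_blocks(path):
    """Return the set of block names found in a text or .gz list file.

    Assumption: the block name is the first whitespace-separated field.
    """
    opener = gzip.open if path.endswith(".gz") else open
    blocks = set()
    with opener(path, "rt") as fh:
        for line in fh:
            fields = line.split()
            if fields:
                # Normalise URL-escaped names so both lists use "#".
                blocks.add(unquote(fields[0]))
    return blocks


def compare(a, b):
    """Split two sets of block names into shared and list-specific parts."""
    return {"both": a & b, "only_a": a - b, "only_b": b - a}
```

For example, `compare(load_blocks("missing_blocks.txt"), load_blocks("missing_blocks_basis-6.gz"))` would show how many blocks appear in both lists and how many are specific to each.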

sidnarayanan commented 6 years ago

Hi Natalia,

The blockarrive API only shows blocks that PhEDEx is trying to transfer somewhere. That list will change from day to day. I'm hoping to clean up all missing files from TMDB, instead of having a different list to deal with each week.

-Sid