Open vkuznet opened 1 year ago
We can limp along for some time with a script which does this, e.g. [1]. It has the advantage that I can, for example, filter the site list by which sites are available to CRAB and play with what to look for and what to show. Once we can define some clear format, we can reconsider adding it to DAS.
[1] https://github.com/dmwm/CRABServer/blob/master/scripts/Utils/CheckDiskAvailability.py example:
belforte@lxplus704/~> python3 CheckDiskAvailability.py --dataset /ParkingBPH1/Run2018A-20Jun2021_UL2018-v1/AOD
Checking disk availability of dataset: /ParkingBPH1/Run2018A-20Jun2021_UL2018-v1/AOD
only blocks fully replicated are listed
block: 89
dataset has 89 blocks
0 blocks have 0 disk replicas
12 blocks have 1 disk replicas
63 blocks have 2 disk replicas
14 blocks have 3 disk replicas
Site location
T2_BR_SPRACE hosts 28 blocks
T2_IT_Rome hosts 41 blocks
T1_IT_CNAF_Disk hosts 34 blocks
T2_CH_CSCS hosts 60 blocks
T1_RU_JINR_Disk hosts 5 blocks
T2_PL_Swierk hosts 8 blocks
T2_RU_JINR hosts 1 blocks
T2_CN_Beijing hosts 1 blocks
T2_TR_METU hosts 1 blocks
T2_IT_Pisa hosts 1 blocks
belforte@lxplus704/~>
@belforte, I made a small adjustment to the DAS code, and it can now show the number of blocks and files per site. The UI will look like this
and the CLI output in JSON format will have the corresponding attributes, e.g.
d=/ParkingBPH1/Run2018A-20Jun2021_UL2018-v1/AOD
dasgoclient -query="site dataset=$d" -json
...
"site": [
{
"block_completion": "31.46%",
"block_fraction": "100.00%",
"dataset_fraction": " 0.00%",
"kind": "DISK",
"name": "T2_BR_SPRACE",
"nblocks": 28,
"nfiles": 5021,
"replica_fraction": "100.00%",
"se": "T2_BR_SPRACE"
}
Is this enough to cover the use case? So far I have not put effort into presenting "X blocks have Y disk replicas" as shown in your Python script, since it would require more coding and I do not know whether it is relevant for end users.
Looks good to me, though I'd prefer to show the number of blocks as a fraction of the total, e.g. 89/194, 28/194, etc. The important thing is to make sure that you count and show fully replicated blocks.
Maybe also write "Number of complete blocks". Unfortunately, nblocks leads to ambiguity.
OK, I can do what you ask.
But I need further clarification: what is the definition of a "complete" block vs a "fully replicated" block? To me, "fully replicated" means that all files from that block are at a site, e.g. if a block has X files, all X files are at that site. But what does "complete" mean in this case?
I also assume we are talking about valid files, since a block may have invalid files too. So "fully replicated" actually means that all valid files are replicated to that site, right?
Sorry, I used "complete" to mean "fully replicated" :-( . To be precise, the definition is that the Rucio dataset (aka block here) has state AVAILABLE. From what I know so far, it may contain invalid files as well, i.e. files without replicas (e.g. lost files). I do not know exactly how Rucio behaves when files are invalidated. Good question!
@belforte, the new changes are deployed to the cmsweb-testbed DAS server. Feel free to use it and give me feedback here. Then I can deploy it to production.
Can you make it clear that the number of blocks used in "block presence" is not the same as the one reported in the second row as "number of blocks"? IIUC, in the end you print the same value as "block presence" in the line above, simply as a fraction instead of a percentage.
Side note: the dataset in the original example is not in cmsweb-testbed (int instance of DBS), so I looked up https://cmsweb-testbed.cern.ch/das/request?instance=int/global&input=site+dataset%3D%2FParkingBPH1%2FRun2018D-05May2019promptD-v1%2FAOD and am curious about the report for CERN_Tape: number of files 100943/100955.
Since all blocks are fully replicated there, how can some files be missing? Where do those two numbers come from? Invalid files? Files without a replica? A bug?
Stefano, I'm not sure I understood your first part of the reply, please rephrase it differently, i.e. just show how you will present this info.
For the second part, testbed shows the total number of files in the dataset rather than the valid ones. We need to agree on what to use: should we report the total number of valid files or the total number of files in a dataset?
I'd change
Number of blocks 1/578 number of files 9/100955
to
Fully replicated blocks: 1/578 File replicas (only valid files): 9/100955
What always confuses people is the block presence: "number of blocks at the site" / "number of blocks in the dataset", which is often 100%. This "number of blocks at the site" is not a well-known, well-defined concept.
E.g. look at this (from this DAS page):
if that RSE has only 9 files, how can block presence be 100%?
Well, a dataset may have 578 blocks; one block may have 9 files, and the other blocks hold the rest of the files. If that first block is at a site and all its files are there, the block presence is 100%. Block presence means block presence at this particular site. In this particular case, only one block out of 578 is at the site, and that block has all its files there; all other blocks are not there. Block presence can also reflect the number of blocks using the same logic. But if this block is at a site with only 2 of its 9 files, then its block presence is less than 100%, to be precise 100*2/9 %.
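The arithmetic in this explanation can be written down as a minimal sketch. The function name block_presence and its two arguments are hypothetical, chosen only for illustration; this is not the DAS implementation, just the ratio as described above.

```python
def block_presence(files_at_site: int, files_in_blocks: int) -> float:
    """Percentage of files present at a site, relative to the total number
    of files in the blocks that the site hosts (hypothetical helper)."""
    if files_in_blocks <= 0:
        raise ValueError("files_in_blocks must be positive")
    return 100.0 * files_at_site / files_in_blocks


# One block with 9 files, all 9 at the site: presence is 100%,
# even though the dataset has 577 other blocks elsewhere.
print(f"{block_presence(9, 9):.2f}%")  # 100.00%

# Same block, but only 2 of its 9 files are at the site.
print(f"{block_presence(2, 9):.2f}%")  # 22.22%
```

Note that the dataset-wide block count (578) never enters this ratio, which is the source of the confusion discussed here.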
That said, thanks for your examples; I will try to accommodate them and clarify the wording a little.
thanks Valentin. Of course I do not question your arithmetic. But I suspect that "block presence" may be very clear to you (and maybe me) but it is a word for which everybody may assume something different when reading. Maybe something like
blocks at site: total 43/74, fully replicated 12/74
and if the last number is equal to the total (74), color it green. There are ~infinite ways to write things down; simply make sure that you do not use definitions which are not clearly specified.
But what we really want to know is (fictitious example):
site A: fully replicated 19/30
site B: fully replicated 14/30
So, in the end, are all the 30 blocks on disk? Or not?
We can certainly say that that's too much to ask DAS, but that's what users want, and they do not care for all details.
Stefano, thanks for the suggestion, but your example is still ambiguous. Let's say we have these stats:
site A: fully replicated 2/3
site B: fully replicated 1/3
Does this mean that all 3 blocks are replicated? The answer is not obvious, because you must know which blocks are replicated to site A and which to site B. With 3 blocks, it may be that blocks 1 and 2 are replicated to site A and block 1 to site B. In this example the counts sum to 3, but block 1 appears at both sites while block 3 is not replicated at all. What we need is an unambiguous explanation of blocks at site, and unless we know the block intersection we do not know whether all of them are available. I think we need to show the "blocks at site" ratio for each site, and the "fully replicated" ratio across all sites. That way we will know whether or not all blocks are fully available across all sites.
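To make the ambiguity concrete, here is a small sketch of the 3-block example above. The block names and the helper uncovered_blocks are made up for illustration; the point is only that per-site counts cannot answer the question, while a set union over the per-site block lists can.

```python
from typing import Dict, Set


def uncovered_blocks(all_blocks: Set[str],
                     complete_at_site: Dict[str, Set[str]]) -> Set[str]:
    """Return the blocks that are not fully replicated at any site."""
    covered = set().union(*complete_at_site.values())
    return all_blocks - covered


all_blocks = {"block1", "block2", "block3"}
complete_at_site = {
    "site A": {"block1", "block2"},  # shown as 2/3
    "site B": {"block1"},            # shown as 1/3
}

for site, blocks in sorted(complete_at_site.items()):
    print(f"{site}: fully replicated {len(blocks)}/{len(all_blocks)}")

# The per-site counts add up to 3, yet block3 is available nowhere.
print("not available anywhere:", sorted(uncovered_blocks(all_blocks, complete_at_site)))
```

The per-site ratios alone would look identical if site B hosted block3 instead of block1, in which case the whole dataset would be available.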
I do not see any way around having a map of blocks to sites. And I surely agree that it does not fit cleanly into the DAS design, and possibly does not fit at all. Somehow one needs to get the full info and then parse it. But as you point out, there are a lot of ambiguities otherwise. I suspect that your last suggestion will not do either. Another way would be to call rucio.list-dataset-replicas and count the number of AVAILABLE ones.
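A rough sketch of the counting step follows. It assumes each replica record is a dict with an "rse" key and a "state" key holding the string "AVAILABLE" for a complete replica; that record shape is an assumption about what the Rucio call returns and should be checked against a real response. The Rucio call itself is left out, since it needs a configured client, so the records below are invented sample data.

```python
from collections import Counter
from typing import Dict, Iterable


def count_available(replicas: Iterable[dict]) -> Dict[str, int]:
    """Count, per RSE, the block replicas whose state is AVAILABLE.

    `replicas` is any iterable of dicts with "rse" and "state" keys
    (assumed record shape, see the note above).
    """
    counts: Counter = Counter()
    for rep in replicas:
        if rep.get("state") == "AVAILABLE":
            counts[rep["rse"]] += 1
    return dict(counts)


# Invented sample records, one per (block, RSE) pair.
sample = [
    {"rse": "T2_CH_CSCS", "state": "AVAILABLE"},
    {"rse": "T2_CH_CSCS", "state": "AVAILABLE"},
    {"rse": "T2_IT_Rome", "state": "UNAVAILABLE"},
]
print(count_available(sample))  # {'T2_CH_CSCS': 2}
```

Since a CMS block is a Rucio dataset, the records for a whole CMS dataset would be gathered by looping over its blocks and feeding all of them to this function.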
@belforte provided useful feedback in https://its.cern.ch/jira/browse/CMSTRANSF-532
My 2c is to revisit the "sites" button in DAS whose output is almost useless when a dataset is hosted across more than one site, and add in there the information about the rule which is keeping those files on each site. A bit of work, but very useful. Now, if you want to know if a dataset is available on disk, you need to submit a CRAB task !
e.g. https://cmsweb.cern.ch/das/request?instance=prod/global&input=site+dataset%3D%2FParkingBPH1%2FRun2018A-20Jun2021_UL2018-v1%2FAOD is any block on disk ? where ? how many ? those are "old questions". Now we can indeed add "until when".
This is "so much needed" that I am considering writing a script myself around the Rucio API. What I'd like is a table like:
site | # of fully hosted blocks there | # of additional partially hosted blocks
and then a table of the number of blocks with 0, 1, 2, ... sites which host a complete replica.
In the first table we can add the rule ID and expiration (but there can be multiple rules, which makes it a bit annoying to define the details).
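As a sketch of what such a script could compute once a block-to-site map is in hand: the input format below (block name mapped to a dict of site name to a "complete replica" flag) and both function names are hypothetical, and fetching the map from Rucio is not shown. Rule IDs and expiration are left out.

```python
from collections import Counter
from typing import Dict, Tuple

# block name -> {site name: True if the site hosts a complete replica}
BlockMap = Dict[str, Dict[str, bool]]


def site_table(block_map: BlockMap) -> Dict[str, Tuple[int, int]]:
    """Per site: (fully hosted blocks, additional partially hosted blocks)."""
    full: Counter = Counter()
    partial: Counter = Counter()
    for sites in block_map.values():
        for site, complete in sites.items():
            (full if complete else partial)[site] += 1
    return {s: (full[s], partial[s]) for s in sorted(set(full) | set(partial))}


def replica_histogram(block_map: BlockMap) -> Dict[int, int]:
    """Number of blocks having 0, 1, 2, ... sites with a complete replica."""
    hist = Counter(sum(sites.values()) for sites in block_map.values())
    return dict(sorted(hist.items()))


# Invented example: three blocks, two sites.
block_map = {
    "b1": {"T2_CH_CSCS": True, "T2_IT_Rome": True},
    "b2": {"T2_CH_CSCS": True, "T2_IT_Rome": False},
    "b3": {},
}

print("site | fully hosted | partially hosted")
for site, (n_full, n_partial) in site_table(block_map).items():
    print(f"{site} | {n_full} | {n_partial}")

for n_sites, n_blocks in replica_histogram(block_map).items():
    print(f"{n_blocks} blocks have {n_sites} complete disk replicas")
```

The second table matches the "N blocks have M disk replicas" lines printed by CheckDiskAvailability.py earlier in this thread.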