dmwm / das2go

Go implementation of Data Aggregation System (DAS) for CMS experiment
MIT License

Provide rucio rules information in DAS #54

Open vkuznet opened 1 year ago

vkuznet commented 1 year ago

@belforte provided useful feedback in https://its.cern.ch/jira/browse/CMSTRANSF-532:

My 2c is to revisit the "sites" button in DAS whose output is almost useless when a dataset is hosted across more than one site, and add in there the information about the rule which is keeping those files on each site. A bit of work, but very useful. Now, if you want to know if a dataset is available on disk, you need to submit a CRAB task !

e.g. for https://cmsweb.cern.ch/das/request?instance=prod/global&input=site+dataset%3D%2FParkingBPH1%2FRun2018A-20Jun2021_UL2018-v1%2FAOD : is any block on disk ? where ? how many ? Those are "old questions". Now we can indeed add "until when".

This is "so much needed" that I am considering writing a script myself around the Rucio API. What I'd like is a table like: site | # of fully hosted blocks there | number of additional partially hosted blocks, and then a table of the number of blocks with 0, 1, 2, ... sites which host a complete replica

in the first table we can add ruleid and expiration (but there can be multiple rules making it a bit annoying to define the details)
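For illustration, the first table described above could be built along these lines. This is only a sketch: the input tuples `(block, site, files_at_site, files_in_block)`, the function name `site_table`, and the sample data are all hypothetical, and how that information is actually fetched from Rucio is left out.

```python
from collections import defaultdict


def site_table(replicas):
    """Count fully and partially hosted blocks per site.

    `replicas` is an iterable of hypothetical tuples:
    (block, site, files_at_site, files_in_block).
    """
    full = defaultdict(int)
    partial = defaultdict(int)
    for _block, site, n_at_site, n_total in replicas:
        if n_total > 0 and n_at_site >= n_total:
            full[site] += 1      # every file of the block is at this site
        elif n_at_site > 0:
            partial[site] += 1   # only some files of the block are there
    sites = sorted(set(full) | set(partial))
    return [(s, full[s], partial[s]) for s in sites]


# Made-up sample data, not taken from any real dataset.
sample = [
    ("block1", "T2_A", 9, 9),
    ("block2", "T2_A", 4, 10),
    ("block1", "T2_B", 9, 9),
    ("block3", "T2_B", 5, 5),
]

print("site     | fully hosted | partially hosted")
for site, n_full, n_partial in site_table(sample):
    print(f"{site:<8} | {n_full:>12} | {n_partial:>16}")
```

Rule id and expiration could be carried as extra columns, but since several rules may protect the same replica, that part is deliberately not sketched here.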

belforte commented 1 year ago

We can limp along for some time with a script which does this, e.g. [1] It has the advantage that I can e.g. filter site list by which sites are available to CRAB and play with what-to-look-for-and-to-show. Once we can define some clear format, we can reconsider adding to DAS.

[1] https://github.com/dmwm/CRABServer/blob/master/scripts/Utils/CheckDiskAvailability.py example:

belforte@lxplus704/~> python3 CheckDiskAvailability.py --dataset /ParkingBPH1/Run2018A-20Jun2021_UL2018-v1/AOD
Checking disk availability of dataset: /ParkingBPH1/Run2018A-20Jun2021_UL2018-v1/AOD
 only blocks fully replicated are listed 
  block: 89
dataset has 89 blocks

   0 blocks have  0 disk replicas
  12 blocks have  1 disk replicas
  63 blocks have  2 disk replicas
  14 blocks have  3 disk replicas

 Site location
 T2_BR_SPRACE    hosts  28 blocks
 T2_IT_Rome      hosts  41 blocks
 T1_IT_CNAF_Disk hosts  34 blocks
 T2_CH_CSCS      hosts  60 blocks
 T1_RU_JINR_Disk hosts   5 blocks
 T2_PL_Swierk    hosts   8 blocks
 T2_RU_JINR      hosts   1 blocks
 T2_CN_Beijing   hosts   1 blocks
 T2_TR_METU      hosts   1 blocks
 T2_IT_Pisa      hosts   1 blocks

belforte@lxplus704/~> 

vkuznet commented 1 year ago

@belforte, I made a small adjustment to the DAS code and now it can show the number of blocks and files per site. The UI will look like this (screenshot from 2023-03-22 not reproduced here).

The CLI output in JSON format will have the corresponding attributes, e.g.

d=/ParkingBPH1/Run2018A-20Jun2021_UL2018-v1/AOD
dasgoclient -query="site dataset=$d" -json
...
    "site": [
      {
        "block_completion": "31.46%",
        "block_fraction": "100.00%",
        "dataset_fraction": " 0.00%",
        "kind": "DISK",
        "name": "T2_BR_SPRACE",
        "nblocks": 28,
        "nfiles": 5021,
        "replica_fraction": "100.00%",
        "se": "T2_BR_SPRACE"
      }

Is this enough to cover the use case? So far I have not put effort into presenting "X blocks have Y disk replicas" as shown in your python script, since it will require more coding and I do not know if it is relevant for end-users.

belforte commented 1 year ago

looks good to me, though I'd prefer to show the number of blocks as a fraction of the total, e.g. 89/194, 28/194 etc. The important thing is to make sure that you count and show fully replicated blocks. Maybe also write "Number of complete blocks". Unfortunately nblocks leads to ambiguity.

vkuznet commented 1 year ago

ok, I can do what you ask (screenshot from 2023-03-22 not reproduced here).

But I need further clarification: what is the definition of a complete block vs a fully replicated block? To me, fully replicated means that all files from that block are at a site, e.g. if a block has X files, all X files are at that site. But what does complete mean in this case?

And I also assume we are talking about valid files, since a block may have invalid files too. So fully replicated actually means that all valid files are replicated to that site, right?

belforte commented 1 year ago

sorry, I used complete to mean fully replicated :-( . To be precise, the definition is that the Rucio dataset (aka block here) has state AVAILABLE. From what I know so far, that may contain invalid files as well, i.e. w/o replicas (lost files e.g.). I do not know how exactly Rucio behaves when files are invalidated. Good question !

vkuznet commented 1 year ago

@belforte, the new changes are deployed to the cmsweb-testbed DAS server. Feel free to use it and give me feedback here. Then I can deploy it to production.

belforte commented 1 year ago

Can you make it clear that the number of blocks used in "block presence" is not the same as the one reported in the second row as "number of blocks" ? IIUC in the end you print the same as "block presence" in the line above, simply as a fraction instead of a percentage.

Side note: the dataset in the original example is not in cmsweb-testbed (int instance of DBS), so I looked up https://cmsweb-testbed.cern.ch/das/request?instance=int/global&input=site+dataset%3D%2FParkingBPH1%2FRun2018D-05May2019promptD-v1%2FAOD and am curious about the report for CERN_Tape: number of files 100943/100955. Since all blocks are fully replicated there, how can some files be missing ? Where do those two numbers come from ? Invalid files ? Files w/o a replica ? A bug ?

vkuznet commented 1 year ago

Stefano, I'm not sure I understood the first part of your reply. Please rephrase it, i.e. just show how you would present this info.

For the second part, in testbed it shows the total number of files in the dataset rather than the valid ones. We need to agree on what to use: should we report the total number of valid files or the total number of files in a dataset?

belforte commented 1 year ago

I'd change

Number of blocks 1/578 number of files 9/100955 

to

Fully replicated blocks: 1/578  File replicas (only valid files): 9/100955 

What always confuses people is the block presence: number of blocks at the site / number of blocks in the dataset, which is often 100%. This number of blocks at the site is not a well-known, well-defined concept. E.g. look at this (screenshot from the DAS page, not reproduced here).

If that RSE has only 9 files, how can Block presence be 100 % ??

vkuznet commented 1 year ago

Well, a dataset may have 578 blocks, where 1 block has 9 files and the other blocks hold the rest of the files. If the first block is at a site and all its files are there, the block presence is 100%. The block presence means block presence at this particular site. In this particular case, it is only one block out of 578, and only this block has all its files at that site; all the other blocks are not there. Block presence can also reflect the number of blocks using the same logic. But if this block is at a site with only 2 of its files (out of 9), then its block presence is less than 100%, to be precise 100*2/9 %.
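As a quick check of the arithmetic in this comment, here is a minimal sketch. The function name `block_presence` is made up for illustration, and the formula is just one reading of the description above, not the actual DAS code.

```python
def block_presence(files_at_site, files_in_block):
    """Percentage of a block's files that are present at one site.

    This follows the description in the comment above; it is an
    interpretation, not the DAS implementation.
    """
    if files_in_block <= 0:
        raise ValueError("block must contain at least one file")
    return 100.0 * files_at_site / files_in_block


# The one block (out of 578) that is at the site, with all 9 of its files.
print(f"{block_presence(9, 9):.2f}%")   # 100.00%

# The same block if only 2 of its 9 files were at the site.
print(f"{block_presence(2, 9):.2f}%")   # 22.22%
```

Note that neither number says anything about the other 577 blocks, which is the source of the confusion discussed in this thread.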

That said, thanks for your examples. I will try to accommodate them and clarify the wording a little.

belforte commented 1 year ago

thanks Valentin. Of course I do not question your arithmetic. But I suspect that "block presence" may be very clear to you (and maybe me) but it is a word for which everybody may assume something different when reading. Maybe something like

blocks at site: total 43/74, fully replicated 12/74

and if the last number is equal to the total (74), color it green. There are ~infinite ways to write things down; simply make sure that you do not use definitions which are not clearly specified.

But what we really want to know is (fictitious example):

site A: fully replicated 19/30
site B: fully replicated 14/30

So, in the end, are all the 30 blocks on disk ? Or not ?

We can certainly say that that's too much to ask DAS, but that's what users want, and they do not care for all details.

vkuznet commented 1 year ago

Stefano, thanks for the suggestion, but your example is still ambiguous. Let's say we have these stats:

site A: fully replicated 2/3
site B: fully replicated 1/3

Does it mean that all 3 blocks are replicated? The answer is not obvious, because you must know which blocks are replicated to site A and which to site B. If we have 3 blocks, it may be that blocks 1 and 2 are replicated to site A and block 1 to site B. In this example the counts sum to 3, but block 1 appears on both sites while block 3 is not replicated anywhere. What we need is an unambiguous explanation of blocks at a site, and unless we know the block intersection we do not know if all of them are available. I think we need to show the blocks-at-site ratio for each site, and the fully replicated ratio across all sites. That way we will know whether or not all blocks are fully available across all sites.
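The ambiguity in this 2/3 + 1/3 example can be made concrete with a small sketch. It assumes a per-site set of fully replicated block names is available; the names `blocks_by_site` and `covered_blocks` and the sample data are hypothetical.

```python
def covered_blocks(blocks_by_site):
    """Union of the blocks that are fully replicated at one site or more."""
    covered = set()
    for blocks in blocks_by_site.values():
        covered |= set(blocks)
    return covered


all_blocks = {"block1", "block2", "block3"}

# Per-site counts are 2/3 and 1/3, which sum to 3 ...
blocks_by_site = {
    "site A": {"block1", "block2"},
    "site B": {"block1"},
}

# ... but block1 is counted twice and block3 is nowhere.
covered = covered_blocks(blocks_by_site)
missing = all_blocks - covered
print(f"{len(covered)}/{len(all_blocks)} blocks fully replicated somewhere")
print("missing:", sorted(missing))
```

Only the union over sites, not the per-site ratios, answers the question of whether every block is available somewhere.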

belforte commented 1 year ago

I do not see any way around having a map of blocks to sites. And I surely agree that it does not fit cleanly into the DAS design, and possibly does not fit at all. Somehow one needs to get the full info and then parse it. But as you point out, there are a lot of ambiguities otherwise. I suspect that your last suggestion will not do either. Another way would be to call rucio.list-dataset-replicas and count the number of AVAILABLE ones.
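A rough sketch of that counting step, working on replica records that have already been fetched. It assumes each record is a dict with `name` (the block), `rse` and `state` keys, which is an assumption about the shape of the Rucio output rather than something confirmed in this thread; the Rucio call itself is left out, and the function names and sample records are made up.

```python
from collections import Counter


def available_sites_per_block(block_names, replica_records):
    """Map each block to the set of RSEs where its state is AVAILABLE.

    `replica_records` is assumed to be an iterable of dicts with
    'name', 'rse' and 'state' keys (assumed record shape).
    """
    sites = {name: set() for name in block_names}
    for rec in replica_records:
        if rec["state"] == "AVAILABLE" and rec["name"] in sites:
            sites[rec["name"]].add(rec["rse"])
    return sites


def replica_histogram(sites_per_block):
    """Count how many blocks have 0, 1, 2, ... complete replicas."""
    return Counter(len(rses) for rses in sites_per_block.values())


# Made-up records for a three-block dataset.
blocks = ["block1", "block2", "block3"]
records = [
    {"name": "block1", "rse": "T2_A", "state": "AVAILABLE"},
    {"name": "block1", "rse": "T2_B", "state": "AVAILABLE"},
    {"name": "block2", "rse": "T2_A", "state": "AVAILABLE"},
    {"name": "block3", "rse": "T2_B", "state": "UNAVAILABLE"},
]

sites_per_block = available_sites_per_block(blocks, records)
hist = replica_histogram(sites_per_block)
for n_replicas in range(max(hist) + 1):
    print(f"{hist[n_replicas]:4d} blocks have {n_replicas:2d} complete replicas")
```

This keeps the full block-to-site map in memory, which is the part that may not fit the DAS design. A real script would also need to decide whether tape RSEs count, which this sketch does not address.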