DUNE / data-mgmt-ops

Request for instructions on finding out what datasets reside at a given RSE #311

Open hschellman opened 1 year ago

hschellman commented 1 year ago

How do I get the big picture of what datasets are at an RSE?

What I want/need is a full example of how a novice user would make that query, possibly placed as a markdown file in this repository that we can use.

Things to include:

Where can you run the commands? What do you need to set up?

What environment variables need to be set for authentication/access?

What commands need to be run to do the query?

How do I find out what a dataset means - is there a mapping to metacat datasets?

In the long run we will likely write some simple tools to do this.

dougbenjamin commented 1 year ago

I presume that you will be using both Rucio and Metacat. Is this correct? Since all files in Rucio must be in Metacat, does that mean that all datasets in Rucio must be in Metacat?

dougbenjamin commented 1 year ago

Are the users site administrators or DUNE analyzers? Given an RSE name, the rucio client commands can list the Rucio datasets at that RSE, and from the results you can get the number of files. For a single dataset, the per-RSE file counts look like this:

[dunepro@dunesl7gpvm01 ~]$ rucio list-dataset-replicas --deep dc4-vd-coldbox-bottom:dc4-vd-coldbox-bottom_307151901

DATASET: dc4-vd-coldbox-bottom:dc4-vd-coldbox-bottom_307151901
+-------------------------+---------+---------+
| RSE                     |   FOUND |   TOTAL |
|-------------------------+---------+---------|
| PRAGUE                  |      60 |      60 |
| DUNE_ES_PIC             |      60 |      60 |
| DUNE_CERN_EOS           |      60 |      60 |
| DUNE_US_FNAL_DISK_STAGE |      60 |      60 |
| MANCHESTER              |      60 |      60 |
| DUNE_US_BNL_SDCC        |      60 |      60 |
+-------------------------+---------+---------+

hschellman commented 1 year ago

I'm thinking the data sample manager, a physicist who uses the system to place samples as directed by physics groups.

Think of me as the first person. I need to have instructions to log in to the right machine, set up rucio and metacat, and then the commands to issue to figure out what datasets are in play and what datasets are at a given site.

Assume a novice rucio user with no prior knowledge.

dougbenjamin commented 1 year ago

Will this person move files around? If you will write the metacat commands to get the dataset name, then I will write the Rucio part. I don’t use metacat much.

hschellman commented 1 year ago

In the long run, yes. In the short run, I need help with basic rucio use! That is why I asked for very detailed instructions.

dougbenjamin commented 1 year ago

Try these commands. Are they detailed enough?

-bash-4.2$ source /cvmfs/dune.opensciencegrid.org/products/dune/setup_dune.sh

Setting up larsoft UPS area... /cvmfs/larsoft.opensciencegrid.org

Setting up DUNE UPS area... /cvmfs/dune.opensciencegrid.org/products/dune/

-bash-4.2$ setup rucio

-bash-4.2$ kx509

-bash-4.2$ voms-proxy-info -all
subject : /DC=org/DC=cilogon/C=US/O=Fermi National Accelerator Laboratory/OU=People/CN=Doug Benjamin/CN=UID:benjamin
issuer : /DC=org/DC=cilogon/C=US/O=CILogon/CN=CILogon Silver CA 1
identity : /DC=org/DC=cilogon/C=US/O=CILogon/CN=CILogon Silver CA 1
type : unknown
strength : 2048 bits
path : /tmp/x509up_u1284
timeleft : 167:59:52
key usage : Digital Signature, Key Encipherment, Data Encipherment

-bash-4.2$ export ROLE=Analysis

-bash-4.2$ voms-proxy-init -rfc -noregen -bits 2048 -voms=dune:/dune/Role=$ROLE --valid 120:00
Your identity: /DC=org/DC=cilogon/C=US/O=Fermi National Accelerator Laboratory/OU=People/CN=Doug Benjamin/CN=UID:benjamin
Contacting voms1.fnal.gov:15042 [/DC=org/DC=incommon/C=US/ST=Illinois/O=Fermi Research Alliance/CN=voms1.fnal.gov] "dune" Done
Creating proxy ....................................... Done

Your proxy is valid until Wed May 17 06:40:51 2023

-bash-4.2$ voms-proxy-info -all
subject : /DC=org/DC=cilogon/C=US/O=Fermi National Accelerator Laboratory/OU=People/CN=Doug Benjamin/CN=UID:benjamin/CN=3643050385
issuer : /DC=org/DC=cilogon/C=US/O=Fermi National Accelerator Laboratory/OU=People/CN=Doug Benjamin/CN=UID:benjamin
identity : /DC=org/DC=cilogon/C=US/O=Fermi National Accelerator Laboratory/OU=People/CN=Doug Benjamin/CN=UID:benjamin
type : RFC compliant proxy
strength : 2048 bits
path : /tmp/x509up_u1284
timeleft : 119:59:55
key usage : Digital Signature, Key Encipherment, Data Encipherment
=== VO dune extension information ===
VO : dune
subject : /DC=org/DC=cilogon/C=US/O=Fermi National Accelerator Laboratory/OU=People/CN=Doug Benjamin/CN=UID:benjamin
issuer : /DC=org/DC=incommon/C=US/ST=Illinois/O=Fermi Research Alliance/CN=voms1.fnal.gov
attribute : /dune/Role=Analysis/Capability=NULL
attribute : /dune/Role=NULL/Capability=NULL
timeleft : 119:59:55
uri : voms1.fnal.gov:15042

-bash-4.2$ export RUCIO_ACCOUNT=benjamin

-- Check to see if you can authenticate against the Rucio server

-bash-4.2$ rucio whoami
created_at : 2021-10-19T20:01:12
account : benjamin
status : ACTIVE
email : None
deleted_at : None
updated_at : 2021-10-19T20:01:12
account_type : USER
suspended_at : None
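If `rucio whoami` fails, the two things set up in the steps above are the usual first suspects: the `RUCIO_ACCOUNT` variable and the proxy file. The following is only a minimal sketch of a pre-flight check in Python. The function names are hypothetical, and it assumes the proxy lives at the conventional `/tmp/x509up_u<uid>` location (as in the `voms-proxy-info` output above) unless `X509_USER_PROXY` points elsewhere. It does not check that the proxy is still valid or carries the DUNE VO extension.

```python
import os


def default_proxy_path(environ, uid):
    """Return the proxy path: X509_USER_PROXY if set, else /tmp/x509up_u<uid>."""
    return environ.get("X509_USER_PROXY", f"/tmp/x509up_u{uid}")


def missing_prerequisites(environ, uid, path_exists=os.path.exists):
    """List human-readable problems; an empty list means the basics look fine.

    `path_exists` is injectable so the check can be exercised without a real proxy.
    """
    problems = []
    if not environ.get("RUCIO_ACCOUNT"):
        problems.append("RUCIO_ACCOUNT is not set")
    proxy = default_proxy_path(environ, uid)
    if not path_exists(proxy):
        problems.append(f"no proxy file at {proxy}")
    return problems
```

On a login node this could be called as `missing_prerequisites(os.environ, os.getuid())` before running any rucio command.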

-- Check which RSEs hold the dataset, using the scope and dataset name discovered from a Metacat query, for example dc4-vd-coldbox-bottom:dc4-vd-coldbox-bottom_307151901

-bash-4.2$ rucio list-dataset-replicas --deep dc4-vd-coldbox-bottom:dc4-vd-coldbox-bottom_307151901

DATASET: dc4-vd-coldbox-bottom:dc4-vd-coldbox-bottom_307151901
+-------------------------+---------+---------+
| RSE                     |   FOUND |   TOTAL |
|-------------------------+---------+---------|
| PRAGUE                  |      60 |      60 |
| DUNE_ES_PIC             |      60 |      60 |
| DUNE_CERN_EOS           |      60 |      60 |
| DUNE_US_FNAL_DISK_STAGE |      60 |      60 |
| MANCHESTER              |      60 |      60 |
| DUNE_US_BNL_SDCC        |      60 |      60 |
+-------------------------+---------+---------+
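For anyone who wants to post-process this output rather than read it by eye, here is a minimal Python sketch that turns the table printed by `rucio list-dataset-replicas --deep` into per-RSE (found, total) counts. The function names are hypothetical, and the parser assumes the table layout shown in this thread (one `| RSE | FOUND | TOTAL |` row per line); other Rucio versions may format the output differently.

```python
import re
from typing import Dict, List, Tuple

# Matches a data row such as "| PRAGUE | 60 | 60 |"; header and border rows do not match.
ROW = re.compile(r"^\|\s*(\S+)\s*\|\s*(\d+)\s*\|\s*(\d+)\s*\|$")


def parse_replica_table(text: str) -> Dict[str, Tuple[int, int]]:
    """Map each RSE name to its (found, total) file counts."""
    counts = {}
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match:
            rse, found, total = match.groups()
            counts[rse] = (int(found), int(total))
    return counts


def complete_rses(counts: Dict[str, Tuple[int, int]]) -> List[str]:
    """Return, sorted by name, the RSEs that hold every file of the dataset."""
    return sorted(rse for rse, (found, total) in counts.items() if found == total)


# Output copied from the command above.
SAMPLE = """\
DATASET: dc4-vd-coldbox-bottom:dc4-vd-coldbox-bottom_307151901
+-------------------------+---------+---------+
| RSE                     |   FOUND |   TOTAL |
|-------------------------+---------+---------|
| PRAGUE                  |      60 |      60 |
| DUNE_ES_PIC             |      60 |      60 |
| DUNE_CERN_EOS           |      60 |      60 |
| DUNE_US_FNAL_DISK_STAGE |      60 |      60 |
| MANCHESTER              |      60 |      60 |
| DUNE_US_BNL_SDCC        |      60 |      60 |
+-------------------------+---------+---------+
"""
```

Calling `complete_rses(parse_replica_table(SAMPLE))` lists the sites that hold a full copy; a site with FOUND below TOTAL would be left out.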

hschellman commented 1 year ago

I will make a page with those instructions.

So the last question is how to see what rucio datasets are stored at a given RSE. I.e., what do we have at PIC?
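One way this could be scripted is sketched below in Python. It assumes the Rucio Python client offers `ReplicaClient.list_datasets_per_rse(rse)` and that each returned record carries `scope`, `name`, `available_length` and `length` keys; those names should be checked against the installed Rucio version. The helper name `summarize_datasets_at_rse` is hypothetical, and the client is passed in so the function can be tried with a stand-in object.

```python
from typing import Any, Dict, List


def summarize_datasets_at_rse(client: Any, rse: str) -> List[Dict[str, Any]]:
    """Return one row per dataset at `rse`, sorted by scope:name.

    `client` is any object with a list_datasets_per_rse(rse) method
    (assumed to match Rucio's ReplicaClient).
    """
    rows = []
    for rec in client.list_datasets_per_rse(rse):
        rows.append(
            {
                "did": f"{rec['scope']}:{rec['name']}",
                # Files available at this RSE vs. files in the dataset (assumed keys).
                "available": rec.get("available_length"),
                "total": rec.get("length"),
            }
        )
    return sorted(rows, key=lambda row: row["did"])


# Possible use on a node where rucio is set up and a valid proxy exists (untested assumption):
#   from rucio.client.replicaclient import ReplicaClient
#   for row in summarize_datasets_at_rse(ReplicaClient(), "DUNE_ES_PIC"):
#       print(row["did"], row["available"], row["total"])
```

The RSE name `DUNE_ES_PIC` is taken from the replica table earlier in this thread.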

hschellman commented 1 year ago

https://github.com/DUNE/data-mgmt-ops/wiki/Rucio-for-Beginners is what you sent. So can we add the "what is at site X" part and then close?

StevenCTimm commented 1 year ago

OK, so I did not have time to get into this thread while I was at CHEP... there are several key issues not yet mentioned above.

(0) I propose that we refer to physics-level completed results not as "datasets" but as "data collections". Rucio, Metacat, and SAM all define "dataset" differently, and if we don't have a different unified term we will tie ourselves up in circles talking about this.

(1) We (in the person of summer student Nicole Avila) already wrote a monitoring utility to do this, although it has to be updated to python3.

(2) The formally released data collections, such as would appear on the dune-data.fnal.gov page, were called "datasets" in the past, but in Rucio those data collections will map to containers of many datasets. Most of the rules that we make to fan out data collections from Fermilab to elsewhere are made at container level, not at the level of Rucio datasets, which are organized run by run.

(3) It is likely that we have a lot of container-based rules, which makes rucio list-datasets-rse not give quite the expected output, although it does give enough output to make the monitoring plot above.

(4) At the moment the plan has been that not every single run-level dataset in Rucio would have an entry in Metacat. There are none that do at the moment. The "data collections", however, should each be declared as a dataset in Metacat, with metadata as appropriate, and be linked to a container name in Rucio.
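To make the container/dataset relationship in points (2) to (4) concrete, here is a small Python sketch of the roll-up it implies: a data collection is a container of run-level datasets, each dataset sits at some set of RSEs, and "what is at site X" is the per-container count of datasets present there. This is an illustration only. All container, dataset and function names are hypothetical, and the input dictionaries would have to be filled from real Rucio queries.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def collections_by_rse(
    containers: Dict[str, List[str]],
    dataset_rses: Dict[str, Iterable[str]],
) -> Dict[str, Dict[str, Tuple[int, int]]]:
    """Roll run-level datasets up to containers, per RSE.

    containers:   container name -> run-level dataset names it holds
    dataset_rses: dataset name   -> RSEs holding that dataset
    Returns {rse: {container: (datasets_present, datasets_total)}}.
    """
    result: Dict[str, Dict[str, Tuple[int, int]]] = defaultdict(dict)
    for container, datasets in containers.items():
        present: Dict[str, int] = defaultdict(int)
        for dataset in datasets:
            for rse in set(dataset_rses.get(dataset, ())):
                present[rse] += 1
        for rse, count in present.items():
            result[rse][container] = (count, len(datasets))
    return dict(result)


# Hypothetical example: two data collections, three run-level datasets.
containers = {
    "demo:collection_a": ["demo:run_1", "demo:run_2"],
    "demo:collection_b": ["demo:run_3"],
}
dataset_rses: Dict[str, Set[str]] = {
    "demo:run_1": {"DUNE_ES_PIC", "PRAGUE"},
    "demo:run_2": {"PRAGUE"},
    "demo:run_3": {"DUNE_ES_PIC"},
}
summary = collections_by_rse(containers, dataset_rses)
```

In this made-up example `summary["DUNE_ES_PIC"]` shows one of the two datasets of `demo:collection_a` and the only dataset of `demo:collection_b`, which is the kind of site-level view asked for at the start of the thread.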

hschellman commented 1 year ago

Can someone come up with a diagram that shows the relation between containers and datasets in rucio?

The main questions I was trying to answer were: what data is at site X, and which sites contain which data collections? The "what is at site X" question came up because one of the site managers noted that we had filled the disk but not used it, and I was curious to see whether we had ancient data there that we should replace.

StevenCTimm commented 1 year ago

Yes, I can come up with a diagram, and when my student gets here in a couple of weeks I can have him revive Nicole's plot.