Implement file dataset=/a/b/c site=XXX run=123 query using Rucio APIs

vkuznet commented 3 years ago

Originally the support for

file dataset=/a/b/c site=XXX run=123

query was implemented through the DBS and Phedex APIs. First, we resolved the list of blocks for the given dataset. Then we found the files for that set of blocks and the run number, and finally we filtered those files with the Phedex fileReplicas API to select the ones on the given site.
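For reference, the three steps above can be sketched as a small Python function. This is only an illustration of the logic, not the DAS implementation: `list_blocks`, `list_files` and `files_on_site` are hypothetical callables standing in for the DBS and replica lookups, and the toy data is made up.

```python
from typing import Callable, Iterable, List, Set


def files_at_site(
    dataset: str,
    run: int,
    site: str,
    list_blocks: Callable[[str], Iterable[str]],
    list_files: Callable[[str, int], Iterable[str]],
    files_on_site: Callable[[str, str], Set[str]],
) -> List[str]:
    """Return the sorted files of `dataset` for `run` that have a replica at `site`.

    The three callables are hypothetical stand-ins for the real services:
      list_blocks(dataset)       -> blocks of the dataset
      list_files(block, run)     -> files of the block that contain the run
      files_on_site(block, site) -> files of the block present at the site
    """
    result: Set[str] = set()
    for block in list_blocks(dataset):
        candidates = set(list_files(block, run))
        if not candidates:
            continue  # no file of this block matches the run, skip the site lookup
        result |= candidates & files_on_site(block, site)
    return sorted(result)


# Toy in-memory data in place of real service calls (assumed values).
toy_blocks = {"/a/b/c": ["/a/b/c#1", "/a/b/c#2"]}
toy_files = {("/a/b/c#1", 123): ["f1.root", "f2.root"], ("/a/b/c#2", 123): ["f3.root"]}
toy_replicas = {("/a/b/c#1", "XXX"): {"f2.root"}, ("/a/b/c#2", "XXX"): {"f3.root"}}

print(files_at_site(
    "/a/b/c", 123, "XXX",
    lambda d: toy_blocks.get(d, []),
    lambda b, r: toy_files.get((b, r), []),
    lambda b, s: toy_replicas.get((b, s), set()),
))
```

Passing the lookups in as callables keeps the selection logic independent of which service answers the "files on site" question, which is the part that changes when moving from Phedex to Rucio.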

Now we need to implement the same logic using the DBS and Rucio APIs. The question is whether Rucio has an API similar to fileReplicas that selects only the files on a given site, or whether we need to find another route in Rucio to accommodate this workflow.

@ericvaandering could you please comment on this?

ericvaandering commented 3 years ago

Almost. I think what you want to do is what this does: https://github.com/rucio/rucio/blob/0246888ceeb8cc12387c6aaffd398921b31da10e/lib/rucio/client/replicaclient.py#L117

You can pass either a container or a block and get all the file replicas, or if you pass an RSE it will give just data at that RSE.

Then you probably need to filter what Rucio gives you down to the files that match the run in your example. Of course, you could query file by file or provide a list of files, but that may be less efficient or involve transferring more data.

The code shows you how to build the REST query.

vkuznet commented 3 years ago

Eric, I still need your assistance with this, as I'm getting different errors from the Rucio server. If I read the replicaclient.py code you pointed to correctly, it corresponds to the following plain curl call:

#!/bin/bash
opt="-s -L -k --key $HOME/.globus/userkey.pem --cert $HOME/.globus/usercert.pem"
token=`curl $opt -v https://cms-rucio-auth.cern.ch/auth/x509 2>&1 | grep "X-Rucio-Auth-Token:" | sed -e "s,< X-Rucio-Auth-Token: ,,g"`
echo "$token"
dataset=/JetHT/Run2018A-TkAlMinBias-12Nov2019_UL2018-v2/ALCARECO
curl $opt -H "X-Rucio-Auth-Token: $token" -X POST -d '{"dids": ["scope":"cms", "name":"$dataset"], "domain": "all"}' "http://cms-rucio.cern.ch/replicas/cms/list"

Here I tried two URLs. The first, http://cms-rucio.cern.ch/replicas/cms/list, returns an internal server error, but I'm not sure whether /cms should be part of the URL, since that does not appear to be the case in the replicaclient.py code. So I tried without it, i.e. http://cms-rucio.cern.ch/replicas/list, which gives me a different error: {"ExceptionMessage": "Cannot decode json parameter list", "ExceptionClass": "ValueError"}.

So, as you know, I really need a plain URL example in order to proceed with this request. Please guide me as necessary.

ericvaandering commented 3 years ago

It’s replicas/list, not replicas/cms/list.

I don’t know why exactly, but your JSON is throwing an error here:

https://github.com/rucio/rucio/blob/2ff6f17c7fda45524be8e644cb85c1ed568b0bcd/lib/rucio/web/rest/webpy/v1/replica.py#L345

In fact if I check your JSON, I get:

>>> a = loads('{"dids": ["scope":"cms", "name":"/JetHT/Run2018A-TkAlMinBias-12Nov2019_UL2018-v2/ALCARECO"], "domain": "all"}')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib64/python2.7/json/__init__.py", line 338, in loads
    return _default_decoder.decode(s)
  File "/usr/lib64/python2.7/json/decoder.py", line 366, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib64/python2.7/json/decoder.py", line 382, in raw_decode
    obj, end = self.scan_once(s, idx)
ValueError: Expecting , delimiter: line 1 column 18 (char 17)
>>> a = loads('{"dids": ["scope": "cms", "name": "/JetHT/Run2018A-TkAlMinBias-12Nov2019_UL2018-v2/ALCARECO"], "domain": "all"}')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib64/python2.7/json/__init__.py", line 338, in loads
    return _default_decoder.decode(s)
  File "/usr/lib64/python2.7/json/decoder.py", line 366, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib64/python2.7/json/decoder.py", line 382, in raw_decode
    obj, end = self.scan_once(s, idx)
ValueError: Expecting , delimiter: line 1 column 18 (char 17)
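The decoder stops at character 17 because the "dids" array contains bare key/value pairs: in JSON, `"scope": "cms"` is only valid inside an object, so each DID has to be wrapped in braces. One way to avoid hand-writing the body is to build it from Python structures and serialize it. The sketch below assumes only the field names that appear in this thread ("dids", "domain", "rse_expression"); the helper name `build_replicas_payload` is made up for illustration and is not part of Rucio.

```python
import json


def build_replicas_payload(scope, name, rse_expression=None, domain="all"):
    """Serialize a request body for the replicas/list call discussed in this thread.

    Each DID is an object inside the "dids" array; rse_expression is optional
    and restricts the answer to matching RSEs.
    """
    payload = {"dids": [{"scope": scope, "name": name}], "domain": domain}
    if rse_expression is not None:
        payload["rse_expression"] = rse_expression
    return json.dumps(payload)


# The hand-written body from above: key/value pairs directly inside an array.
broken = '{"dids": ["scope":"cms", "name":"/a/b/c"], "domain": "all"}'
try:
    json.loads(broken)
except ValueError as err:  # json.JSONDecodeError is a subclass of ValueError
    print("rejected:", err)

# The generated body always parses, because json.dumps emits valid JSON.
body = build_replicas_payload("cms", "/a/b/c", rse_expression="T2_DE_DESY")
print(body)
```

The same string can then be passed to curl with -d, or sent by any HTTP client.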


vkuznet commented 3 years ago

Eric, thanks for spotting the JSON problem. I managed to get the output with the following sequence of steps:

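#!/bin/bash
# curl options: silent, follow redirects, skip server verification (-k), use the grid user key and certificate
opt="-s -L -k --key $HOME/.globus/userkey.pem --cert $HOME/.globus/usercert.pem"
# obtain a Rucio token via X.509 authentication; it is returned in the X-Rucio-Auth-Token response header
token=$(curl $opt -v https://cms-rucio-auth.cern.ch/auth/x509 2>&1 | grep "X-Rucio-Auth-Token:" | sed -e "s,< X-Rucio-Auth-Token: ,,g")
echo "$token"
dataset=/JetHT/Run2018A-TkAlMinBias-12Nov2019_UL2018-v2/ALCARECO
# list the file replicas of the dataset, restricted to one RSE via rse_expression
curl $opt -H "X-Rucio-Auth-Token: $token" -X POST -d '{"dids": [{"scope":"cms", "name":"/JetHT/Run2018A-TkAlMinBias-12Nov2019_UL2018-v2/ALCARECO"}], "domain": "all", "rse_expression": "T2_DE_DESY"}' "http://cms-rucio.cern.ch/replicas/list"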

The output looks like this now:

{"adler32": "df6675e0", "name": "/store/data/Run2018A/JetHT/ALCARECO/TkAlMinBias-12Nov2019_UL2018-v2/270001/BF52B44F-51A0-3248-B13A-9052DF7B03CA.root", "rses": {"T2_DE_DESY": []}, "bytes": 3736739040, "states": {"T2_DE_DESY": "AVAILABLE"}, "pfns": {}, "scope": "cms", "md5": null}
{"adler32": "07531d4b", "name": "/store/data/Run2018A/JetHT/ALCARECO/TkAlMinBias-12Nov2019_UL2018-v2/270001/BFBAA739-795D-AF49-ACFB-1B53033E7121.root", "rses": {"T2_DE_DESY": []}, "bytes": 3755039930, "states": {"T2_DE_DESY": "AVAILABLE"}, "pfns": {}, "scope": "cms", "md5": null}
...

which I hope would be sufficient for this use-case. I'll proceed with implementing necessary bits in DAS codebase.
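As a rough illustration of how this output could be consumed, the sketch below parses one JSON record per line and keeps the files whose replica at the requested site is in the AVAILABLE state. It is written in Python for brevity and is not the DAS code; the function name and the shortened sample records are made up, and only the "name" and "states" fields shown above are assumed.

```python
import json


def available_files(ndjson_text, site):
    """Return the names of files whose replica at `site` is AVAILABLE.

    `ndjson_text` is expected to hold one JSON object per line, as in the
    output above; blank lines are skipped.
    """
    names = []
    for line in ndjson_text.splitlines():
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if record.get("states", {}).get(site) == "AVAILABLE":
            names.append(record["name"])
    return names


# Shortened, made-up records with the same fields as the real output.
sample = "\n".join([
    '{"name": "/store/data/x/A.root", "states": {"T2_DE_DESY": "AVAILABLE"}, "scope": "cms"}',
    '{"name": "/store/data/x/B.root", "states": {"T2_DE_DESY": "UNAVAILABLE"}, "scope": "cms"}',
    '{"name": "/store/data/x/C.root", "states": {}, "scope": "cms"}',
])
print(available_files(sample, "T2_DE_DESY"))
```

The resulting list still covers every run of the dataset, so it would then be intersected with the files DBS reports for the requested run.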

vkuznet commented 3 years ago

Done. The new release has been deployed on cmsweb, and the new dasgoclient PR is here: https://github.com/cms-sw/cmsdist/pull/6584

If you need a binary version of dasgoclient before it is updated on cvmfs, please take it from here: /afs/cern.ch/user/v/valya/public/dasgoclient/dasgoclient

The new version is

Build: git=v02.04.23 go=go1.15.6 date=2021-01-21 21:15:20.46625747 +0100 CET m=+0.006210747

and your query looks like this:

./dasgoclient -query="file dataset=/JetHT/Run2018A-TkAlMinBias-12Nov2019_UL2018-v2/ALCARECO site=T2_DE_DESY run=316723"
/store/data/Run2018A/JetHT/ALCARECO/TkAlMinBias-12Nov2019_UL2018-v2/280000/25B4C3D5-03C1-F24E-9D35-E08860CBC145.root
/store/data/Run2018A/JetHT/ALCARECO/TkAlMinBias-12Nov2019_UL2018-v2/280000/4E29E31D-AA0E-8744-B558-98B35D8320E3.root
/store/data/Run2018A/JetHT/ALCARECO/TkAlMinBias-12Nov2019_UL2018-v2/280000/BAE93AF7-30F2-FC49-95FF-E584E4BE6773.root
/store/data/Run2018A/JetHT/ALCARECO/TkAlMinBias-12Nov2019_UL2018-v2/280000/FAF43300-D19D-E24E-9175-B800DBD5083C.root
/store/data/Run2018A/JetHT/ALCARECO/TkAlMinBias-12Nov2019_UL2018-v2/70001/67CF1160-478F-5E4E-9F7D-57E8E09C1E25.root

Closing the issue.

dmwm / das2go

Implement file dataset=/a/b/c site=XXX run=123 query using Rucio APIs #30
