We have the `filesummaries` DBS API, e.g.

https://cmsweb.cern.ch:8443/dbs/prod/global/DBSReader/filesummaries/?dataset=/ZMM_14TeV_TuneCUETP8M1_Pythia8/PhaseIITDRFall17GS-93X_upgrade2023_realistic_v2-v1/GEN-SIM&validFileOnly=1

```json
[{"file_size":23103835881,"nblocks":5,"nevents":33000,"nfiles":7,"nlumis":271,"num_block":5,"num_event":33000,"num_file":7,"num_lumi":271}]
```
which partially implements this. Probably we can extend it to include other attributes, like the DBS McM id. What is unclear is what `date_dbs_max`, `date_dbs_median`, and `lastupdate` are. I think the request came from @drkovalskyi. Dima, was it you? If so, could you please clarify the attributes I mentioned above. Once we know their meaning we can see if we can extend the `filesummaries` API or provide another one.
Yes, that's what I'm extracting from DBS via the `files` API. These are:
@drkovalskyi, I added the additional fields to the `filesummaries` DBS API. Here is the change, where I used my dev DBS instance:
```
# this is the output of the current server
d=/ZMM/Summer11-DESIGN42_V11_428_SLHC1-v1/GEN-SIM
scurl -s "https://cmsweb-testbed.cern.ch/dbs2go/filesummaries?dataset=$d"
[
{"file_size":7840499449,"num_block":1,"num_event":10250,"num_file":7,"num_lumi":22}
]
```
Here `scurl` is an alias for `curl -L -k --key ~/.globus/userkey.pem --cert ~/.globus/usercert.pem`. And once I applied the change https://github.com/dmwm/dbs2go/commit/cfaca2587f7d9389f993db8f35680964ae471d33, we have the following information:
```
scurl -s "https://cmsweb-testbed.cern.ch/dbs2go/filesummaries?dataset=$d"
[
{"file_size":7840499449,"max_ldate":1434361439,"median_cdate":1325266825,"median_ldate":1325267813,"num_block":1,"num_event":10250,"num_file":7,"num_lumi":22}
]
```
Could you please try out this API on https://cmsweb-testbed.cern.ch/dbs2go for a few datasets/blocks and tell me if it is sufficient.
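For reference, here is a minimal sketch of how one might try the testbed endpoint from Python; the use of the `requests` library with a grid certificate and the epoch-to-date conversion are my assumptions, not part of the original instructions:

```python
# Minimal sketch (assumes a grid certificate in ~/.globus and the
# requests library; this is illustrative, not the official recipe).
import datetime
import os
import requests

url = "https://cmsweb-testbed.cern.ch/dbs2go/filesummaries"
dataset = "/ZMM/Summer11-DESIGN42_V11_428_SLHC1-v1/GEN-SIM"
cert = os.path.expanduser("~/.globus/usercert.pem")
key = os.path.expanduser("~/.globus/userkey.pem")

# verify=False mirrors the -k flag of the scurl alias above
resp = requests.get(url, params={"dataset": dataset},
                    cert=(cert, key), verify=False)
for rec in resp.json():
    # the new fields are epoch seconds; convert them for readability
    for field in ("max_ldate", "median_cdate", "median_ldate"):
        if field in rec:
            print(field, datetime.datetime.utcfromtimestamp(rec[field]))
```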
As you can see from my commit, I adjusted the underlying DBS queries to include

```sql
max(f.last_modification_date)
median(f.creation_date)
median(f.last_modification_date)
```

where `f` represents the `FILES` DBS table.

Sorry for the delay. I will try to find time this week to try it out.
Dima, did you get time to look at this?
I'm using the DBS API Python client and I don't see the new additional information. Does it require an update? Direct HTTP access works. One question though: is the aggregated information recomputed when files get invalidated? One issue I had in the past was that the summary information was inconsistent with what I was getting when checking all entries (it was for blocks though).
Dima, I need more information. Which DBSClient version? Which DBS API did you use? Most likely, you have an outdated version or are using it incorrectly. What you need is the following:
```
# install the new DBSClient from pypi
# https://pypi.org/project/dbs3-client/
pip install dbs3-client
```

```python
# in your code you should use something similar
from dbs.apis.dbsClient import DbsApi

url = "https://.../dbs/int/global/DBSReader"
api = DbsApi(url=url,
             useGzip=True,
             accept="application/ndjson",
             aggregate=False)

# example of how to use an API
data = {"block": "/a/b/c#123", ...}  # payload elided in the original
res = api.insertBulkBlock(data)
```
You need the new client. You may use or skip the `accept` flag, and depending on the API you may or may not need to aggregate records, i.e. you can easily play with the above flags in the `DbsApi` object initialization. Since you (T0) often fetch lots of data, usage of gzip is advised too. Finally, remember that DBSClient is just a wrapper around the HTTP calls which I showed you before. Therefore, it is not required per se, but I understand it may be easier to use from existing Python scripts.
And I provided the aggregation on the ORACLE side, therefore it will be computed at run time.
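As a side note, here is a hedged sketch of how the server-side aggregates could be cross-checked against per-file records, which is one way to address the invalidation concern above; the `listFileSummaries`/`listFiles` calls follow the DBSClient API, while the exact field names in the comparison are assumptions based on the outputs shown earlier:

```python
# Hedged sketch: cross-check the Oracle-side aggregates against a
# client-side aggregation over per-file records (field names assumed
# from the outputs shown above).
from statistics import median

from dbs.apis.dbsClient import DbsApi

url = "https://cmsweb-testbed.cern.ch/dbs2go"
api = DbsApi(url=url)

dataset = "/ZMM/Summer11-DESIGN42_V11_428_SLHC1-v1/GEN-SIM"
summary = api.listFileSummaries(dataset=dataset)[0]

files = api.listFiles(dataset=dataset, detail=True)
ldates = [f["last_modification_date"] for f in files]

# compare the server-side aggregation with a client-side one
print("max_ldate   :", summary.get("max_ldate"), "vs", max(ldates))
print("median_ldate:", summary.get("median_ldate"), "vs", median(ldates))
```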
I have an old client, but I was expecting that it just returns whatever the server provides. Anyway, I will update the client tomorrow and let you know.
I'm having a hard time using the latest pypi DBS client. I use it with virtualenv. pip installs dbs-pycurl, which doesn't work with openssl:

```
pycurl.error: (35, 'Peer does not recognize and trust the CA that issued your certificate.')
```
Anyway, since this is a separate issue and at the HTTP level everything works, the aggregation is sufficient for our use case. It will take me some time to sort out this authentication issue before I start using it though.
Dima, most likely you are not using a proper version of Python. You need python3, and the best way to get it is from a CMSSW environment. Here is what I did on lxplus:
```
lxplus724(08:53:44) CMSSW > cd CMSSW_11_1_9/
lxplus724(08:53:49) CMSSW_11_1_9 > cmsenv
lxplus724(08:53:53) CMSSW_11_1_9 > pwd
/afs/cern.ch/user/v/valya/workspace/CMSSW/CMSSW_11_1_9
lxplus724(08:53:56) CMSSW_11_1_9 > type python
python is /cvmfs/cms.cern.ch/slc7_amd64_gcc820/cms/cmssw/CMSSW_11_1_9/external/slc7_amd64_gcc820/bin/python
lxplus724(08:54:01) CMSSW_11_1_9 > python -V
Python 2.7.15+
lxplus724(08:54:05) CMSSW_11_1_9 > python3 -V
Python 3.8.2
lxplus724(08:54:09) CMSSW_11_1_9 > python3 -m venv venv3
lxplus724(08:55:09) CMSSW_11_1_9 > source venv3/bin/activate
(venv3) lxplus724(08:55:33) CMSSW_11_1_9 > pip install dbs3-client
...
python3
Python 3.8.2 (default, May 7 2020, 20:12:14)
[GCC 8.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import dbs
>>>
>>> from dbs.apis.dbsClient import DbsApi
>>> url="https://cmsweb.cern.ch/dbs/prod/global/DBSReader"
>>> api = DbsApi(url=url, useGzip=True)
>>> res = api.listDataTiers()
...
```
As you can see, everything works just fine, but you must use python3; all certs are properly installed if you use any CMSSW release environment.
Anyway, should I put this change into production DBS so that you can start using this info? I understand that it may take a while for you to adapt to the new output from this aggregated info, but I want to understand if anything else needs to be done from my side. Please let me know and I can apply these changes.
Thanks Valentin. Indeed, CMSSW gives a good environment with a proper version of pycurl. We are using the system default python3 and avoiding the CMSSW environment due to some incompatibilities. Will sort it out.
With the CMSSW environment I get proper output and everything seems to be right. Thanks.
@drkovalskyi, I'm awaiting confirmation from the WMCore team on testbed; if they give me the green light I'll update DBS on production next week. @amaltaro, @todor-ivanov, did you experience any misbehavior on the DBS testbed? If not, are you ok with me proceeding with the DBS upgrade on the production nodes next week?
Now the code is deployed on the production k8s clusters and should be fully available. I'm closing this issue as resolved. @drkovalskyi, please start the migration process to the new API output; I hope we can significantly reduce the number of queries you usually make against DBS to fetch the information needed to build aggregated stats.
After observing high usage of the DBS APIs, specifically `datasets` and `files`, a discussion with the end user was conducted. Through the discussion, it was discovered that the user would use the DBSClient to query `datasets` to check if a dataset exists. The user would also query `files` in order to aggregate `is_file_valid`, `file_size`, `event_count`, and `last_modification_date` and store them in their own database.

**User DBSClient Call Examples**
```python
dbs_api.listDatasets(dataset=dataset_name, detail=1, dataset_access_type="*")
dbs_api.listFiles(dataset=name, detail=True)
```
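For illustration, here is a hedged sketch of the aggregation flow the user describes, built from the two calls above; the sqlite storage, the `summarize` helper, and the dataset name are hypothetical, not taken from the user's actual code:

```python
# Hypothetical sketch of the user's aggregation flow (the sqlite table
# and this exact loop are illustrative assumptions, not the user's code).
import sqlite3

from dbs.apis.dbsClient import DbsApi

dbs_api = DbsApi(url="https://cmsweb.cern.ch/dbs/prod/global/DBSReader")

def summarize(dataset_name):
    # existence check via the datasets API
    if not dbs_api.listDatasets(dataset=dataset_name, detail=1,
                                dataset_access_type="*"):
        return None
    # per-file aggregation via the files API
    files = dbs_api.listFiles(dataset=dataset_name, detail=True)
    valid = [f for f in files if f["is_file_valid"]]
    return {
        "dataset": dataset_name,
        "nfiles": len(valid),
        "file_size": sum(f["file_size"] for f in valid),
        "event_count": sum(f["event_count"] for f in valid),
        "last_modified": max((f["last_modification_date"] for f in valid),
                             default=None),
    }

# store the aggregate in the user's own database (illustrative schema)
conn = sqlite3.connect("dataset_stats.db")
conn.execute("""CREATE TABLE IF NOT EXISTS dataset_stats
                (dataset TEXT PRIMARY KEY, nfiles INT, file_size INT,
                 event_count INT, last_modified INT)""")
row = summarize("/a/b/GEN-SIM")  # hypothetical dataset name
if row:
    conn.execute("INSERT OR REPLACE INTO dataset_stats VALUES (?,?,?,?,?)",
                 (row["dataset"], row["nfiles"], row["file_size"],
                  row["event_count"], row["last_modified"]))
    conn.commit()
```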
**User Database Schema for Aggregated Data**
The user describes using the `files` API to do this instead of `blocks`, since block-level information was inconsistent during development.
**To be discussed:** Should such information be aggregated server-side into a separate table?