DBS Dataset Aggregation

d-ylee commented 2 years ago

After observing high usage of the DBS APIs, specifically datasets and files, we discussed it with the end user. It turned out that the user queries datasets through the DBSClient to check whether a dataset exists. The user also queries files in order to aggregate is_file_valid, file_size, event_count, and last_modification_date and store the results in their own database.

User DBSClient Call Examples

dbs_api.listDatasets(dataset=dataset_name,detail=1,dataset_access_type="*")
dbs_api.listFiles(dataset=name,detail=True)

User Database Schema for Aggregated Data

date_dbs_max
date_dbs_median
nevents
size
status
dbs_campaign
dbs_mcm_id
lastupdate

The user describes using files API to do this instead of blocks since block level information was inconsistent during development.

To be discussed

Should such information be aggregated server side into a separate table?

vkuznet commented 2 years ago

We have filesummaries DBS API, e.g.

https://cmsweb.cern.ch:8443/dbs/prod/global/DBSReader/filesummaries/?dataset=/ZMM_14TeV_TuneCUETP8M1_Pythia8/PhaseIITDRFall17GS-93X_upgrade2023_realistic_v2-v1/GEN-SIM&validFileOnly=1

[{"file_size":23103835881,"nblocks":5,"nevents":33000,"nfiles":7,"nlumis":271,"num_block":5,"num_event":33000,"num_file":7,"num_lumi":271}]

which partially implements this. We can probably extend it to include other attributes, like dbs_mcm_id. What is unclear is the meaning of date_dbs_max, date_dbs_median and lastupdate. I think the request came from @drkovalskyi. Dima, was it you? If so, could you please clarify the attributes I mentioned above? Once we know their meaning we can see if we can extend the filesummaries API or provide another one.
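For scripts that call this endpoint directly, here is a minimal Python sketch of building the same filesummaries URL and fetching it with a client certificate. The helper names build_filesummaries_url and fetch_filesummaries are hypothetical, and the sketch assumes the server certificate is trusted by the default CA store, which may not hold on every machine.

```python
import json
import ssl
import urllib.request
from urllib.parse import urlencode


def build_filesummaries_url(base_url, dataset, valid_file_only=False):
    """Return the filesummaries URL for one dataset (hypothetical helper)."""
    params = {"dataset": dataset}
    if valid_file_only:
        params["validFileOnly"] = 1
    # Keep "/" unescaped so the dataset path stays readable in the URL.
    query = urlencode(params, safe="/")
    return base_url.rstrip("/") + "/filesummaries?" + query


def fetch_filesummaries(url, cert, key):
    """Fetch and decode the JSON list returned by the server.

    cert and key are paths to the user's grid certificate and key.
    Assumes the server's CA is present in the default trust store.
    """
    ctx = ssl.create_default_context()
    ctx.load_cert_chain(certfile=cert, keyfile=key)
    with urllib.request.urlopen(url, context=ctx) as resp:
        return json.load(resp)
```

Only the URL builder is free of side effects; fetch_filesummaries needs network access and valid credentials.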

drkovalskyi commented 2 years ago

Yes, that's what I'm extracting from DBS via file API. These are:

date_dbs_max - max of file modification time. it's basically time when last chunk of data was added
date_dbs_median - median of file modification or creation time. Needed to assess when half of the files were created to assess processing time dynamics.
lastupdate - it's used for database synchronization. If you implement the fields via views or some other automatic procedure it's not really needed.
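A minimal sketch of this client-side aggregation, assuming each file record is a dict carrying the is_file_valid, file_size, event_count and last_modification_date fields named at the top of this issue. The function name aggregate_files and the choice to skip invalid files are assumptions for illustration, not the actual user code.

```python
from statistics import median


def aggregate_files(files, valid_only=True):
    """Aggregate per-file records into dataset-level values.

    files: list of dicts shaped like the detailed file listing.
    Returns None when no file is left after filtering.
    """
    if valid_only:
        # Assumption: invalidated files are excluded from the totals.
        files = [f for f in files if f["is_file_valid"] == 1]
    if not files:
        return None
    mtimes = [f["last_modification_date"] for f in files]
    return {
        "nevents": sum(f["event_count"] for f in files),
        "size": sum(f["file_size"] for f in files),
        "date_dbs_max": max(mtimes),
        "date_dbs_median": median(mtimes),
    }
```

Doing this per dataset requires fetching every file record, which is the load a server-side summary would avoid.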

vkuznet commented 2 years ago

@drkovalskyi , I added additional fields to filesummaries DBS API. Here is the change where I used my dev DBS instance:

# this is the output of current server
d=/ZMM/Summer11-DESIGN42_V11_428_SLHC1-v1/GEN-SIM
scurl -s "https://cmsweb-testbed.cern.ch/dbs2go/filesummaries?dataset=$d"
[
{"file_size":7840499449,"num_block":1,"num_event":10250,"num_file":7,"num_lumi":22}
]

here scurl is alias to curl -L -k --key ~/.globus/userkey.pem --cert ~/.globus/usercert.pem. And, once I applied the change https://github.com/dmwm/dbs2go/commit/cfaca2587f7d9389f993db8f35680964ae471d33 we have the following information:

scurl -s "https://cmsweb-testbed.cern.ch/dbs2go/filesummaries?dataset=$d"
[
{"file_size":7840499449,"max_ldate":1434361439,"median_cdate":1325266825,"median_ldate":1325267813,"num_block":1,"num_event":10250,"num_file":7,"num_lumi":22}
]

Could you please try out this API on https://cmsweb-testbed.cern.ch/dbs2go for a few datasets/blocks and tell me if it is sufficient?

As you can see from my commit I adjusted underlying DBS queries to include

max(f.last_modification_date)
median(f.creation_date)
median(f.last_modification_date)

where f represents the FILES DBS table.
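To illustrate how a client could consume the extended output, here is a short Python sketch that maps one filesummaries record onto the user's column names. The mapping itself is an assumption: in particular, date_dbs_median is taken from median_ldate here, although median_cdate may be the better match depending on the use case. The names FIELD_MAP and summary_to_row are hypothetical.

```python
import json

# Assumed mapping from filesummaries keys to the user's database columns.
FIELD_MAP = {
    "num_event": "nevents",
    "file_size": "size",
    "max_ldate": "date_dbs_max",
    "median_ldate": "date_dbs_median",
}


def summary_to_row(payload):
    """Convert the JSON text returned by filesummaries into one row dict.

    Returns None if the server returned an empty list.
    """
    records = json.loads(payload)
    if not records:
        return None
    summary = records[0]
    # Missing keys (e.g. from an older server) become None.
    return {column: summary.get(key) for key, column in FIELD_MAP.items()}
```

With the extended output shown above, the row carries the event count, total size and both timestamps without any per-file query.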

drkovalskyi commented 2 years ago

Sorry for the delay. I will try to find time this week to try it out.

vkuznet commented 2 years ago

Dima, did you get time to look at this?

drkovalskyi commented 2 years ago

I'm using the DBS API python client and I don't see the new additional information. Does it require an update? Direct HTTP access works. One question though: is the aggregated information recomputed when files get invalidated? One issue I had in the past was that the summary information was inconsistent with what I got by checking all entries (it was for blocks though).

vkuznet commented 2 years ago

Dima, I need more information. Which DBSClient version? Which DBS API did you use? Most likely, you have an outdated version or use it incorrectly. What you need is the following:

# install new DBSClient from pypi
# https://pypi.org/project/dbs3-client/
pip install dbs3-client

# in your code you should use something similar
from dbs.apis.dbsClient import DbsApi
url="https://.../dbs/int/global/DBSReader"
api = DbsApi(url=url,
             useGzip=True,
             accept="application/ndjson",
             aggregate=False)
# example of how to use an API
data = {"block": "/a/b/c#123", … }
res = api.insertBulkBlock(data)

You need the new client. You may use or skip the accept flag, and depending on the API you may or may not need to aggregate records, i.e. you can easily play with the above flags in the DbsApi object initialization. Since you (T0) often fetch lots of data, usage of gzip is advised too. Finally, remember that the DBSClient is just a wrapper around the HTTP calls which I showed you before. Therefore, it is not required per se, but I understand it may be easier to use from existing python scripts.

And, I provided the aggregation on ORACLE side, therefore it will be computed at run-time.

drkovalskyi commented 2 years ago

I have an old client, but I was expecting that it just returns whatever server provides. Anyway, will update the client tomorrow and let you know.

drkovalskyi commented 2 years ago

I'm having a hard time using the latest dbs client from pypi. I use it with virtualenv. pip installs dbs-pycurl, which doesn't work with openssl:

pycurl.error: (35, 'Peer does not recognize and trust the CA that issued your certificate.')

Anyway, since this is a separate issue and at the HTTP level everything works, the aggregation is sufficient for our use case. It will take me some time to sort out this authentication issue before I can start using it though.

vkuznet commented 2 years ago

Dima, most likely you are not using the proper version of python. You need python3, and the best way to get it is from the CMSSW environment. Here is what I did on lxplus:

lxplus724(08:53:44) CMSSW > cd CMSSW_11_1_9/
lxplus724(08:53:49) CMSSW_11_1_9 > cmsenv
lxplus724(08:53:53) CMSSW_11_1_9 > pwd
/afs/cern.ch/user/v/valya/workspace/CMSSW/CMSSW_11_1_9
lxplus724(08:53:56) CMSSW_11_1_9 > type python
python is /cvmfs/cms.cern.ch/slc7_amd64_gcc820/cms/cmssw/CMSSW_11_1_9/external/slc7_amd64_gcc820/bin/python
lxplus724(08:54:01) CMSSW_11_1_9 > python -V
Python 2.7.15+
lxplus724(08:54:05) CMSSW_11_1_9 > python3 -V
Python 3.8.2
lxplus724(08:54:09) CMSSW_11_1_9 > python3 -m venv venv3
lxplus724(08:55:09) CMSSW_11_1_9 > source venv3/bin/activate
(venv3) lxplus724(08:55:33) CMSSW_11_1_9 > pip install dbs3-client
...

python3
Python 3.8.2 (default, May  7 2020, 20:12:14)
[GCC 8.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import dbs
>>>
>>> from dbs.apis.dbsClient import DbsApi
>>> url="https://cmsweb.cern.ch/dbs/prod/global/DBSReader"
>>> api = DbsApi(url=url, useGzip=True)
>>> res = api.listDataTiers()
...

As you can see everything works just fine, but you must use python3. All certs are properly installed if you use any CMSSW release environment.

Anyway, should I put this change into production DBS such that you'll start using this info? I understand that it may take a while for you to adapt to new output from this aggregated info but I want to understand if anything else needs to be done from my side. Please let me know and I can apply these changes.

drkovalskyi commented 2 years ago

Thanks Valentin. Indeed, CMSSW gives a good environment with a proper version of pycurl. We are using the system default python3 and avoiding CMSSW environment due to some incompatibilities. Will sort it out.

With CMSSW environment I get proper output and everything seems to be right. Thanks

vkuznet commented 2 years ago

@drkovalskyi , I'm awaiting confirmation from the WMCore team from testbed; if they give me the green light, I'll update DBS on production next week. @amaltaro, @todor-ivanov did you experience any misbehavior on DBS testbed? If not, are you ok with me proceeding with the DBS upgrade on production nodes next week?

vkuznet commented 2 years ago

Now, the code is deployed on production k8s clusters and should be fully available. I'm closing this issue as resolved. @drkovalskyi please start the migration process to the new API output. I hope we can significantly reduce the number of queries you usually make against DBS to fetch the information needed to build aggregated stats.

dmwm / dbs2go

DBS Dataset Aggregation #40