avoid retrieving FMD for a task in one go from client

belforte commented 2 years ago

need to take care of FMD retrieval by CRAB Client during crab report https://github.com/dmwm/CRABServer/blob/1a823b40b15bda1a84b5a9d002c6278bc58c85eb/src/python/CRABInterface/HTCondorDataWorkflow.py#L151-L155

belforte commented 2 years ago

I do not see any easy solution.

One way would be to recode https://github.com/dmwm/CRABServer/blob/1a823b40b15bda1a84b5a9d002c6278bc58c85eb/src/python/CRABInterface/HTCondorDataWorkflow.py#L125 so that it does not make this internal API call, and rather leave it to CRABClient to get FileMetaData a bit at a time. This requires a Client change first (e.g. check if FMD is already in the report and, if not, pull 10 files at a time; for large tasks that will take forever, so we should find a way to retrieve via a POST, so that we can send a longer list every time, like 100).
Another way: if we somehow learn how to return a zipped object in the JSON (so that we do not have to recode the basic API as well), that may be good enough, since inside HTCondorDataWorkflow the full metadata is retrieved anyhow before passing it to the client.
Or simply suffer through this since "crab report" is seldom used.

No idea looks really appealing.
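For the first option above (the client pulls FMD a bit at a time), a minimal sketch of the batching logic could look like this. Everything here is hypothetical: `fetch_batch` stands in for whatever call would POST a list of job ids to the server, and `batch_size=100` is just the number floated above. No such endpoint exists today.

```python
from typing import Callable, Iterator, List, Sequence


def batched(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` holding at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def fetch_fmd_in_batches(
    job_ids: Sequence[str],
    fetch_batch: Callable[[List[str]], List[dict]],
    batch_size: int = 100,
) -> List[dict]:
    """Collect FMD records by asking for `batch_size` jobs per request.

    `fetch_batch` is an assumed callable (e.g. a wrapper around a POST
    request); it receives a list of job ids and returns their records.
    """
    records: List[dict] = []
    for chunk in batched(job_ids, batch_size):
        records.extend(fetch_batch(chunk))
    return records
```

With 250 jobs and the default batch size this would make three requests (100, 100, 50 job ids), so no single reply has to carry the whole task.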

Maybe put a brute force cut at the root: if a file has more lumis than events, do not pass around lumi info, so that FileMetaData never gets too large. Maybe only do that for (user) MC?
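A rough sketch of what that cut could look like on a single FMD record. It assumes the `runlumi` layout seen in the filemetadata replies (run -> lumi -> number of events, as strings); the function name and the choice to replace `runlumi` with an empty dict are made up for illustration.

```python
def drop_lumis_if_sparse(record: dict) -> dict:
    """Return a copy of `record` without lumi info when lumis outnumber events.

    Assumes record["runlumi"] maps run -> {lumi: events}, with the event
    counts stored as strings, e.g. {"1": {"1": "10"}}.
    """
    runlumi = record.get("runlumi", {})
    n_lumis = sum(len(lumis) for lumis in runlumi.values())
    n_events = sum(int(events) for lumis in runlumi.values() for events in lumis.values())
    if n_lumis > n_events:
        trimmed = dict(record)   # shallow copy, the input is left untouched
        trimmed["runlumi"] = {}  # hypothetical convention for "lumi info dropped"
        return trimmed
    return record
```

A file with one lumi and 10 events is passed through unchanged, while a file with three lumis and a single event loses its `runlumi` content.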

belforte commented 2 years ago

another possibility:

recode https://github.com/dmwm/CRABServer/blob/1a823b40b15bda1a84b5a9d002c6278bc58c85eb/src/python/CRABInterface/HTCondorDataWorkflow.py#L125 to still fetch everything from FMD, but compress the JSON object before sending it to the client. E.g. as suggested in https://medium.com/@busybus/zipjson-3ed15f8ea85d

This of course requires a change on the client side as well, and could be the chance to get back to having ?subresource=report again instead of ?subresource=report2 in the URL. See https://mattermost.web.cern.ch/cms-o-and-c/pl/4phkf68sxbf9zyipzb3i9399xo for an initial discussion.
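To make the idea concrete, here is a small sketch of compressing a JSON object so that it can still travel inside a JSON reply, using only the standard library (zlib plus base64). The envelope key `"zipped"` and the function names are invented for this example and are not what the server or the linked article necessarily use.

```python
import base64
import json
import zlib


def pack_json(obj) -> dict:
    """Compress `obj` and wrap it in a small JSON-serializable envelope."""
    raw = json.dumps(obj).encode("utf-8")
    packed = base64.b64encode(zlib.compress(raw)).decode("ascii")
    return {"zipped": packed}  # "zipped" is a hypothetical key name


def unpack_json(envelope: dict):
    """Inverse of pack_json: decode, decompress and parse."""
    raw = zlib.decompress(base64.b64decode(envelope["zipped"]))
    return json.loads(raw.decode("utf-8"))
```

The server side would call something like `pack_json` just before returning, and the client would call `unpack_json` on what it receives. Note that base64 adds about one third on top of the compressed size, so this only pays off when the payload compresses well.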

mapellidario commented 2 years ago

Actually, while browsing the WMCore code, I discovered that WMCore/REST already supports compression, but only zlib with level 9 [1], which is not the best [2].

How to use it: just add the header "Accept-Encoding: deflate" to the request.
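On the client side this would mean sending that header and inflating the body when the reply carries `Content-Encoding: deflate`. A minimal sketch with the standard library follows; the helper names are made up, and the fallback to raw deflate is only a precaution, since some servers send the stream without the zlib wrapper.

```python
import zlib
from typing import Dict, Optional


def request_headers() -> Dict[str, str]:
    """Headers asking the server for a deflate-compressed reply."""
    return {"Accept-Encoding": "deflate"}


def decode_body(body: bytes, content_encoding: Optional[str]) -> bytes:
    """Return the plain body, inflating it if the server compressed it."""
    if (content_encoding or "").strip().lower() != "deflate":
        return body  # nothing to do, the reply was sent uncompressed
    try:
        return zlib.decompress(body)  # zlib-wrapped stream
    except zlib.error:
        return zlib.decompress(body, -zlib.MAX_WBITS)  # raw deflate stream
```

With curl the same is obtained by adding `--compressed`, which both sends an Accept-Encoding header and decodes the reply.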

Example with a simple request [3] and with a long request [4], which shows that the current compression does not really help us. So, if we want to pursue the compression route, we may need to push some changes upstream. We would not need to implement it from scratch, since we could add zstd in parallel to the current zlib, but it will take some time nonetheless.

This is not a suggestion, but just something that I found that may be worth keeping in mind.

[1] https://github.com/dmwm/WMCore/blob/5e654f59a28cb95eb0759272f49ad4c7a33d731c/src/python/WMCore/REST/Server.py#L454

[3]:

220523_101151%3Acmsbot_crab_Jenkins_CMSSW_12_4_X_2022-05-23-1100_el8_amd64_gcc10_94: task with 1 job

```plaintext
> time curl -i --cert $X509_USER_PROXY --key $X509_USER_PROXY "https://cmsweb-testbed.cern.ch/crabserver/prod/filemetadata?taskname=220523_101151%3Acmsbot_crab_Jenkins_CMSSW_12_4_X_2022-05-23-1100_el8_amd64_gcc10_94&filetype=EDM"
HTTP/1.1 200 OK
Date: Mon, 23 May 2022 18:06:30 GMT
Server: CherryPy/17.4.0
Set-Cookie: cms-auth=afc255ee3015de4696eb881f180927e659bcdf38942556aa3da974d7c349a9f707d938f1d5f3933f;path=/;secure;httponly;expires=Thu, 01-Jan-1970 00:00:01 GMT
Content-Type: application/json
Vary: Accept
Cache-Control: max-age=3600
X-Rest-Status: 100
Etag: "414c8b8f9405761f3797ccad5f84d814d55913ac"
Content-Length: 1081
X-Rest-Time: 5401.134 us
CMS-Server-Time: D=46307 t=1653329190132496

{"result": [
 "{\"taskname\": \"220523_101151:cmsbot_crab_Jenkins_CMSSW_12_4_X_2022-05-23-1100_el8_amd64_gcc10_94\", \"filetype\": \"EDM\", \"jobid\": \"1\", \"outdataset\": \"/CRAB_PrivateMC/cmsbot-crab_Jenkins_CMSSW_12_4_X_2022-05-23-1100_el8_amd64_gcc10_94-ebf25d05439aee92ffdc57c55c2bc7fa/USER\", \"acquisitionera\": \"null\", \"swversion\": \"CMSSW_12_4_X_2022-05-23-1100\", \"inevents\": 10, \"globaltag\": \"None\", \"publishname\": \"crab_Jenkins_CMSSW_12_4_X_2022-05-23-1100_el8_amd64_gcc10_94-ebf25d05439aee92ffdc57c55c2bc7fa\", \"location\": \"T2_CH_CERN\", \"tmplocation\": \"T2_CH_CERN\", \"runlumi\": {\"1\": {\"1\": \"10\"}}, \"adler32\": \"a9123644\", \"cksum\": 3177001619, \"md5\": \"asda\", \"lfn\": \"/store/user/cmsbot/CRAB_PrivateMC/crab_Jenkins_CMSSW_12_4_X_2022-05-23-1100_el8_amd64_gcc10_94/220523_101151/0000/minbias_1.root\", \"filesize\": 1848757, \"parents\": [], \"state\": null, \"created\": \"[]\", \"tmplfn\": \"/store/user/cmsbot/CRAB_PrivateMC/crab_Jenkins_CMSSW_12_4_X_2022-05-23-1100_el8_amd64_gcc10_94/220523_101151/0000/minbias_1.root\"}"
]}
curl -i --cert $X509_USER_PROXY --key $X509_USER_PROXY   0.02s user 0.01s system 33% cpu 0.083 total
>
```

```plaintext
> time curl -i --cert $X509_USER_PROXY --key $X509_USER_PROXY -H "Accept-Encoding: deflate" --compressed "https://cmsweb-testbed.cern.ch/crabserver/prod/filemetadata?taskname=220523_101151%3Acmsbot_crab_Jenkins_CMSSW_12_4_X_2022-05-23-1100_el8_amd64_gcc10_94&filetype=EDM"
HTTP/1.1 200 OK
Date: Mon, 23 May 2022 18:07:18 GMT
Server: CherryPy/17.4.0
Set-Cookie: cms-auth=b7624ebbbdec1073f0da7092da9b31f4283cf9cdb04a3e815ef6e3a8e92f24af266b72136de3f183;path=/;secure;httponly;expires=Thu, 01-Jan-1970 00:00:01 GMT
Content-Type: application/json
Vary: Accept,Accept-Encoding
Cache-Control: max-age=3600
X-Rest-Status: 100
Content-Encoding: deflate
Etag: "414c8b8f9405761f3797ccad5f84d814d55913ac"
Content-Length: 429
X-Rest-Time: 6492.138 us
CMS-Server-Time: D=52472 t=1653329238293634

{"result": [
 "{\"taskname\": \"220523_101151:cmsbot_crab_Jenkins_CMSSW_12_4_X_2022-05-23-1100_el8_amd64_gcc10_94\", \"filetype\": \"EDM\", \"jobid\": \"1\", \"outdataset\": \"/CRAB_PrivateMC/cmsbot-crab_Jenkins_CMSSW_12_4_X_2022-05-23-1100_el8_amd64_gcc10_94-ebf25d05439aee92ffdc57c55c2bc7fa/USER\", \"acquisitionera\": \"null\", \"swversion\": \"CMSSW_12_4_X_2022-05-23-1100\", \"inevents\": 10, \"globaltag\": \"None\", \"publishname\": \"crab_Jenkins_CMSSW_12_4_X_2022-05-23-1100_el8_amd64_gcc10_94-ebf25d05439aee92ffdc57c55c2bc7fa\", \"location\": \"T2_CH_CERN\", \"tmplocation\": \"T2_CH_CERN\", \"runlumi\": {\"1\": {\"1\": \"10\"}}, \"adler32\": \"a9123644\", \"cksum\": 3177001619, \"md5\": \"asda\", \"lfn\": \"/store/user/cmsbot/CRAB_PrivateMC/crab_Jenkins_CMSSW_12_4_X_2022-05-23-1100_el8_amd64_gcc10_94/220523_101151/0000/minbias_1.root\", \"filesize\": 1848757, \"parents\": [], \"state\": null, \"created\": \"[]\", \"tmplfn\": \"/store/user/cmsbot/CRAB_PrivateMC/crab_Jenkins_CMSSW_12_4_X_2022-05-23-1100_el8_amd64_gcc10_94/220523_101151/0000/minbias_1.root\"}"
]}
curl -i --cert $X509_USER_PROXY --key $X509_USER_PROXY -H  --compressed  0.01s user 0.02s system 30% cpu 0.102 total
>
```

[4]:

220506_133045%3Amabarros_crab_GS_Jpsi_20to40_Dstar_DPS_2016posVFP_13TeV_06-05-2022: task with many jobs, 168MB filemetadata.

```plaintext
> date ; curl --cert $X509_USER_PROXY --key $X509_USER_PROXY "https://cmsweb-testbed.cern.ch/crabserver/prod/filemetadata?taskname=220506_133045%3Amabarros_crab_GS_Jpsi_20to40_Dstar_DPS_2016posVFP_13TeV_06-05-2022&filetype=EDM" >> /dev/null ; date
Mon May 23 19:45:20 CEST 2022
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  168M    0  168M    0     0  2779k      0 --:--:--  0:01:02 --:--:-- 2796k
Mon May 23 19:46:22 CEST 2022
>
```

Notice that the compressed output is roughly 13% of the original (21.6M vs 168M), but it took ~40s to compress (in the first 40s curl did not report any incoming data), and that curl reports the speed as averaged over the whole request time (215k*100s = 21MB, even though only 60s were required for the transfer itself). Both the compression ratio and the time needed for compression may be improved by using a different compression algorithm.

```plaintext
> date ; curl --cert $X509_USER_PROXY --key $X509_USER_PROXY -H "Accept-Encoding: deflate" "https://cmsweb-testbed.cern.ch/crabserver/prod/filemetadata?taskname=220506_133045%3Amabarros_crab_GS_Jpsi_20to40_Dstar_DPS_2016posVFP_13TeV_06-05-2022&filetype=EDM" >> /dev/null ; date
Mon May 23 19:40:33 CEST 2022
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 21.6M    0 21.6M    0     0   214k      0 --:--:--  0:01:43 --:--:--  215k
Mon May 23 19:42:16 CEST 2022
```
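To get a feeling for the size vs. time trade-off without touching the server, one could measure a few zlib levels offline on an FMD-like payload. The sketch below does that on synthetic data; the record layout only imitates the real replies, the sizes are arbitrary, and zlib levels are used merely as a stand-in for comparing against another algorithm such as zstd, which would need an extra package.

```python
import json
import time
import zlib
from typing import Tuple


def synthetic_fmd(n_files: int = 200, lumis_per_file: int = 50) -> bytes:
    """Build a fake filemetadata reply: a list of JSON-encoded records."""
    records = []
    for job in range(1, n_files + 1):
        first = job * lumis_per_file
        runlumi = {"1": {str(first + i): "1" for i in range(lumis_per_file)}}
        records.append(json.dumps({"jobid": str(job), "filetype": "EDM", "runlumi": runlumi}))
    return json.dumps({"result": records}).encode("utf-8")


def measure(payload: bytes, level: int) -> Tuple[float, float]:
    """Return (compressed size / original size, seconds spent compressing)."""
    start = time.perf_counter()
    compressed = zlib.compress(payload, level)
    elapsed = time.perf_counter() - start
    return len(compressed) / len(payload), elapsed


if __name__ == "__main__":
    data = synthetic_fmd()
    for lvl in (1, 6, 9):
        ratio, seconds = measure(data, lvl)
        print(f"level {lvl}: ratio {ratio:.3f}, {seconds * 1000:.1f} ms")
```

Numbers from synthetic data will not match the 168M task above, so this is only a way to compare settings relative to each other before asking for changes upstream.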

belforte commented 2 years ago

Maybe put a brute force cut at the root: if a file has more lumis than events, do not pass around lumi info, so that FileMetaData never gets too large.

amaltaro commented 2 years ago

Without reading this issue, let me be bold enough and ask/suggest something: do lumi ranges help in this context? I know that there are cases where random runs/lumis could potentially be worse in the format of ranges than as a flat list. But if they are mostly sequential, then it might save a lot.
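As an illustration of the saving, here is a minimal sketch that collapses a flat list of lumi numbers into inclusive `[first, last]` ranges. The function name is made up; it is not an existing WMCore or CRAB helper.

```python
from typing import Iterable, List


def lumis_to_ranges(lumis: Iterable[int]) -> List[List[int]]:
    """Collapse lumi numbers into sorted, inclusive [first, last] ranges."""
    ranges: List[List[int]] = []
    for lumi in sorted(set(lumis)):
        if ranges and lumi == ranges[-1][1] + 1:
            ranges[-1][1] = lumi          # extends the current range
        else:
            ranges.append([lumi, lumi])   # starts a new range
    return ranges
```

A sequential block of 1000 lumis becomes a single pair, while fully scattered lumis turn into one pair each, which is the worse-than-a-flat-list case mentioned above.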

belforte commented 2 years ago

hmm... thanks @amaltaro , that's a good point to enlarge our horizon. But I am not sure we can put lumi ranges in DBS, we need to list the number of events in each. We could compress somehow and then expand when there are N sequential lumis with same number of events, it is a new format.
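The compress-then-expand idea could be sketched as follows: sequential lumis with the same number of events are stored as `[first, last, events]` and expanded back when the per-lumi list is needed. This is only an illustration of a possible new format, with invented names, and says nothing about what DBS would accept.

```python
from typing import Dict, List


def compress_lumis(lumi_events: Dict[int, int]) -> List[List[int]]:
    """Group sequential lumis with equal event counts into [first, last, events]."""
    runs: List[List[int]] = []
    for lumi in sorted(lumi_events):
        events = lumi_events[lumi]
        if runs and lumi == runs[-1][1] + 1 and events == runs[-1][2]:
            runs[-1][1] = lumi                  # same count, next lumi: extend
        else:
            runs.append([lumi, lumi, events])   # gap or different count: new group
    return runs


def expand_lumis(runs: List[List[int]]) -> Dict[int, int]:
    """Inverse of compress_lumis: rebuild the lumi -> events mapping."""
    return {
        lumi: events
        for first, last, events in runs
        for lumi in range(first, last + 1)
    }
```

For MC where every lumi holds the same number of events, a whole file collapses to one triplet, and `expand_lumis(compress_lumis(x))` gives back `x`.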

I still like the idea of forbidding this early in the game, I am not convinced that lumi info in DBS is useful for MC. At most we could store number of lumis per file, to allow processing files in multiple jobs. But.. will anybody care to find lumi #45237 in a simulated dataset ?

belforte commented 2 years ago

anyhow, this is clearly not a source of operational problems at this point. Reducing priority

dmwm / CRABServer

avoid retrieving FMD for a task in one go from client #7238