dmwm / CRABServer

avoid retrieving FMD for a task in one go from client #7238

Open belforte opened 2 years ago

belforte commented 2 years ago

need to take care of FMD retrieval by CRAB Client during crab report https://github.com/dmwm/CRABServer/blob/1a823b40b15bda1a84b5a9d002c6278bc58c85eb/src/python/CRABInterface/HTCondorDataWorkflow.py#L151-L155

belforte commented 2 years ago

I do not see any easy solution.

No idea looks really appealing.

belforte commented 2 years ago

another possibility:

This of course requires changes on the client side as well, and could be the chance to go back to having ?subresource=report instead of ?subresource=report2 in the URL. See https://mattermost.web.cern.ch/cms-o-and-c/pl/4phkf68sxbf9zyipzb3i9399xo for an initial discussion.

mapellidario commented 2 years ago

Actually, while browsing the WMCore code, I discovered that WMCore/REST already supports compression, but only zlib with level 9 [1], which is not the best option [2].

How to use it: just add the header "Accept-Encoding: deflate" to the request.
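A minimal client-side sketch of the same idea in Python, using only the standard library. It assumes that a `Content-Encoding: deflate` body is a zlib-wrapped stream (what `zlib.compress` produces). The helper name `decode_body` and the sample payload are made up for illustration and are not part of CRABClient or WMCore.

```python
import json
import zlib


def decode_body(raw, content_encoding):
    """Return the decoded response body as bytes.

    Assumption: a "deflate" body is a zlib-wrapped stream (RFC 1950).
    Any other encoding value is passed through untouched.
    """
    if content_encoding == "deflate":
        return zlib.decompress(raw)
    return raw


# Stand-in for what a server could send when asked with
# "Accept-Encoding: deflate": a JSON document compressed at zlib level 9.
payload = json.dumps(
    {"result": [{"jobid": str(i), "filetype": "EDM"} for i in range(1000)]}
).encode()
wire = zlib.compress(payload, 9)

decoded = decode_body(wire, "deflate")
assert decoded == payload
print(f"{len(payload)} bytes uncompressed, {len(wire)} bytes on the wire")
```

With curl, the `--compressed` flag does the equivalent decoding step on the client.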

Example with a simple request [3] and with a long request [4], which shows that the current compression does not really help us. So, if we want to pursue the compression route, we may need to push some changes upstream. We will not need to implement it from scratch, we can add zstd in parallel to the current zlib, but it will take some time nonetheless.

This is not a suggestion, but just something that I found that may be worth keeping in mind.

[1] https://github.com/dmwm/WMCore/blob/5e654f59a28cb95eb0759272f49ad4c7a33d731c/src/python/WMCore/REST/Server.py#L454

[3]:

220523_101151%3Acmsbot_crab_Jenkins_CMSSW_12_4_X_2022-05-23-1100_el8_amd64_gcc10_94: task with 1 job

```plaintext
> time curl -i --cert $X509_USER_PROXY --key $X509_USER_PROXY "https://cmsweb-testbed.cern.ch/crabserver/prod/filemetadata?taskname=220523_101151%3Acmsbot_crab_Jenkins_CMSSW_12_4_X_2022-05-23-1100_el8_amd64_gcc10_94&filetype=EDM"
HTTP/1.1 200 OK
Date: Mon, 23 May 2022 18:06:30 GMT
Server: CherryPy/17.4.0
Set-Cookie: cms-auth=afc255ee3015de4696eb881f180927e659bcdf38942556aa3da974d7c349a9f707d938f1d5f3933f;path=/;secure;httponly;expires=Thu, 01-Jan-1970 00:00:01 GMT
Content-Type: application/json
Vary: Accept
Cache-Control: max-age=3600
X-Rest-Status: 100
Etag: "414c8b8f9405761f3797ccad5f84d814d55913ac"
Content-Length: 1081
X-Rest-Time: 5401.134 us
CMS-Server-Time: D=46307 t=1653329190132496

{"result": [
 "{\"taskname\": \"220523_101151:cmsbot_crab_Jenkins_CMSSW_12_4_X_2022-05-23-1100_el8_amd64_gcc10_94\", \"filetype\": \"EDM\", \"jobid\": \"1\", \"outdataset\": \"/CRAB_PrivateMC/cmsbot-crab_Jenkins_CMSSW_12_4_X_2022-05-23-1100_el8_amd64_gcc10_94-ebf25d05439aee92ffdc57c55c2bc7fa/USER\", \"acquisitionera\": \"null\", \"swversion\": \"CMSSW_12_4_X_2022-05-23-1100\", \"inevents\": 10, \"globaltag\": \"None\", \"publishname\": \"crab_Jenkins_CMSSW_12_4_X_2022-05-23-1100_el8_amd64_gcc10_94-ebf25d05439aee92ffdc57c55c2bc7fa\", \"location\": \"T2_CH_CERN\", \"tmplocation\": \"T2_CH_CERN\", \"runlumi\": {\"1\": {\"1\": \"10\"}}, \"adler32\": \"a9123644\", \"cksum\": 3177001619, \"md5\": \"asda\", \"lfn\": \"/store/user/cmsbot/CRAB_PrivateMC/crab_Jenkins_CMSSW_12_4_X_2022-05-23-1100_el8_amd64_gcc10_94/220523_101151/0000/minbias_1.root\", \"filesize\": 1848757, \"parents\": [], \"state\": null, \"created\": \"[]\", \"tmplfn\": \"/store/user/cmsbot/CRAB_PrivateMC/crab_Jenkins_CMSSW_12_4_X_2022-05-23-1100_el8_amd64_gcc10_94/220523_101151/0000/minbias_1.root\"}"
]}
curl -i --cert $X509_USER_PROXY --key $X509_USER_PROXY   0.02s user 0.01s system 33% cpu 0.083 total
>
```

```plaintext
> time curl -i --cert $X509_USER_PROXY --key $X509_USER_PROXY -H "Accept-Encoding: deflate" --compressed "https://cmsweb-testbed.cern.ch/crabserver/prod/filemetadata?taskname=220523_101151%3Acmsbot_crab_Jenkins_CMSSW_12_4_X_2022-05-23-1100_el8_amd64_gcc10_94&filetype=EDM"
HTTP/1.1 200 OK
Date: Mon, 23 May 2022 18:07:18 GMT
Server: CherryPy/17.4.0
Set-Cookie: cms-auth=b7624ebbbdec1073f0da7092da9b31f4283cf9cdb04a3e815ef6e3a8e92f24af266b72136de3f183;path=/;secure;httponly;expires=Thu, 01-Jan-1970 00:00:01 GMT
Content-Type: application/json
Vary: Accept,Accept-Encoding
Cache-Control: max-age=3600
X-Rest-Status: 100
Content-Encoding: deflate
Etag: "414c8b8f9405761f3797ccad5f84d814d55913ac"
Content-Length: 429
X-Rest-Time: 6492.138 us
CMS-Server-Time: D=52472 t=1653329238293634

{"result": [
 "{\"taskname\": \"220523_101151:cmsbot_crab_Jenkins_CMSSW_12_4_X_2022-05-23-1100_el8_amd64_gcc10_94\", \"filetype\": \"EDM\", \"jobid\": \"1\", \"outdataset\": \"/CRAB_PrivateMC/cmsbot-crab_Jenkins_CMSSW_12_4_X_2022-05-23-1100_el8_amd64_gcc10_94-ebf25d05439aee92ffdc57c55c2bc7fa/USER\", \"acquisitionera\": \"null\", \"swversion\": \"CMSSW_12_4_X_2022-05-23-1100\", \"inevents\": 10, \"globaltag\": \"None\", \"publishname\": \"crab_Jenkins_CMSSW_12_4_X_2022-05-23-1100_el8_amd64_gcc10_94-ebf25d05439aee92ffdc57c55c2bc7fa\", \"location\": \"T2_CH_CERN\", \"tmplocation\": \"T2_CH_CERN\", \"runlumi\": {\"1\": {\"1\": \"10\"}}, \"adler32\": \"a9123644\", \"cksum\": 3177001619, \"md5\": \"asda\", \"lfn\": \"/store/user/cmsbot/CRAB_PrivateMC/crab_Jenkins_CMSSW_12_4_X_2022-05-23-1100_el8_amd64_gcc10_94/220523_101151/0000/minbias_1.root\", \"filesize\": 1848757, \"parents\": [], \"state\": null, \"created\": \"[]\", \"tmplfn\": \"/store/user/cmsbot/CRAB_PrivateMC/crab_Jenkins_CMSSW_12_4_X_2022-05-23-1100_el8_amd64_gcc10_94/220523_101151/0000/minbias_1.root\"}"
]}
curl -i --cert $X509_USER_PROXY --key $X509_USER_PROXY -H  --compressed   0.01s user 0.02s system 30% cpu 0.102 total
>
```

[4]:

220506_133045%3Amabarros_crab_GS_Jpsi_20to40_Dstar_DPS_2016posVFP_13TeV_06-05-2022: task with many jobs, 168MB filemetadata.

```plaintext
> date ; curl --cert $X509_USER_PROXY --key $X509_USER_PROXY "https://cmsweb-testbed.cern.ch/crabserver/prod/filemetadata?taskname=220506_133045%3Amabarros_crab_GS_Jpsi_20to40_Dstar_DPS_2016posVFP_13TeV_06-05-2022&filetype=EDM" >> /dev/null ; date
Mon May 23 19:45:20 CEST 2022
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  168M    0  168M    0     0  2779k      0 --:--:--  0:01:02 --:--:-- 2796k
Mon May 23 19:46:22 CEST 2022
>
```

```plaintext
> date ; curl --cert $X509_USER_PROXY --key $X509_USER_PROXY -H "Accept-Encoding: deflate" "https://cmsweb-testbed.cern.ch/crabserver/prod/filemetadata?taskname=220506_133045%3Amabarros_crab_GS_Jpsi_20to40_Dstar_DPS_2016posVFP_13TeV_06-05-2022&filetype=EDM" >> /dev/null ; date
Mon May 23 19:40:33 CEST 2022
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 21.6M    0 21.6M    0     0   214k      0 --:--:--  0:01:43 --:--:--  215k
Mon May 23 19:42:16 CEST 2022
```

Notice that in the second (compressed) request the output is 12% of the original size, but it took ~40s to compress (for the first 40s curl did not report any incoming data), and that curl reports the speed averaged over the whole request time (215k * 100s = 21MB, even though only ~60s were required for the transfer itself). Both the compression ratio and the time needed for compression may be improved by using a different compression algorithm.

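The trade-off between compression ratio and compression time noted above can be explored offline before touching WMCore. The sketch below compares a few zlib levels on a synthetic payload. `make_fake_fmd` and `measure` are hypothetical helpers, the payload only imitates the shape of the `runlumi` field, and zstd is left out because it is not in the Python standard library, so the numbers say nothing about the real 168MB task.

```python
import json
import time
import zlib


def make_fake_fmd(n_files, lumis_per_file):
    """Build a synthetic FMD-like JSON payload (made-up data, not a real task)."""
    records = []
    for i in range(n_files):
        first = i * lumis_per_file
        runlumi = {"1": {str(first + k): "10" for k in range(lumis_per_file)}}
        records.append({"jobid": str(i), "filetype": "EDM", "runlumi": runlumi})
    return json.dumps({"result": records}).encode()


def measure(data, level):
    """Return (compressed size / original size, seconds spent) for one zlib level."""
    start = time.perf_counter()
    compressed = zlib.compress(data, level)
    elapsed = time.perf_counter() - start
    return len(compressed) / len(data), elapsed


data = make_fake_fmd(n_files=200, lumis_per_file=500)
for level in (1, 6, 9):
    ratio, seconds = measure(data, level)
    print(f"level {level}: ratio {ratio:.3f}, {seconds:.4f} s")
```

Running the same loop on a dump of a real large FMD would be the more meaningful test.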
belforte commented 2 years ago

Maybe put a brute-force cut at the root: if a file has more lumis than events, do not pass around lumi info. That way FileMetaData never gets too large.
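A rough sketch of what such a cut could look like, assuming the `runlumi` layout seen in the filemetadata output quoted earlier in this thread (run -> lumi -> events). The function name `drop_excess_lumis` and the choice of returning an empty dict are assumptions made for illustration, not existing CRAB code.

```python
def drop_excess_lumis(runlumi, inevents):
    """Return runlumi unchanged, or an empty dict when lumis outnumber events.

    runlumi maps run -> {lumi: events}; inevents is the file's event count.
    """
    n_lumis = sum(len(lumis) for lumis in runlumi.values())
    if n_lumis > inevents:
        # Assumption: "do not pass around lumi info" means dropping it entirely.
        return {}
    return runlumi


# One lumi, ten events: the lumi info is kept.
print(drop_excess_lumis({"1": {"1": "10"}}, inevents=10))
# Three lumis, one event: the lumi info is dropped.
print(drop_excess_lumis({"1": {"1": "0", "2": "1", "3": "0"}}, inevents=1))
```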

amaltaro commented 2 years ago

Without reading this issue, let me be bold enough to ask/suggest something: do lumi ranges help in this context? I know that there are cases where random runs/lumis could potentially be worse in the format of ranges than as a flat list. But if it's mostly sequential, then it might save a lot.

belforte commented 2 years ago

hmm... thanks @amaltaro, that's a good point to broaden our horizon. But I am not sure we can put lumi ranges in DBS: we need to list the number of events in each lumi. We could somehow compress when there are N sequential lumis with the same number of events and expand later, but that would be a new format.
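To make the "compress then expand" idea concrete, here is a small sketch of one possible encoding: consecutive lumis with the same event count collapse into `[first_lumi, last_lumi, events]` triples. The triple format and the names `compress_lumis` / `expand_lumis` are invented for this example. Nothing like this exists in DBS or CRAB today.

```python
def compress_lumis(lumis):
    """Collapse {lumi: events} into [first_lumi, last_lumi, events] triples.

    A new triple starts whenever the lumi number is not consecutive or the
    event count changes.
    """
    triples = []
    for lumi in sorted(lumis, key=int):
        n, events = int(lumi), lumis[lumi]
        if triples and triples[-1][1] == n - 1 and triples[-1][2] == events:
            triples[-1][1] = n  # extend the current range
        else:
            triples.append([n, n, events])
    return triples


def expand_lumis(triples):
    """Inverse of compress_lumis: rebuild the flat {lumi: events} mapping."""
    return {
        str(n): events
        for first, last, events in triples
        for n in range(first, last + 1)
    }


flat = {"1": "10", "2": "10", "3": "10", "4": "7", "6": "7"}
packed = compress_lumis(flat)
print(packed)  # [[1, 3, '10'], [4, 4, '7'], [6, 6, '7']]
assert expand_lumis(packed) == flat
```

As noted in the previous comment, this only pays off when lumis are mostly sequential. For scattered lumis each triple is larger than the flat entry it replaces.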

I still like the idea of forbidding this early in the game, I am not convinced that lumi info in DBS is useful for MC. At most we could store number of lumis per file, to allow processing files in multiple jobs. But.. will anybody care to find lumi #45237 in a simulated dataset ?

belforte commented 2 years ago

anyhow, this is clearly not a source of operational problems at this point. Reducing priority