dmwm / WMCore

Core workflow management components for CMS.
Apache License 2.0
46 stars 107 forks source link

MSUnmerged: Assign dedicated ports to every service instance && Deploy the service. #10682

Closed todor-ivanov closed 3 years ago

todor-ivanov commented 3 years ago

Impact of the new feature MSUnmerged

Is your feature request related to a problem? Please describe. Following the decisions taken about how to scale the service, we decided that we will deploy three different instances taking care of three different subsets of RSEs. What we have missed in the previous step was that they would also need to have deployment names changed and also having a dedicated port assigned for each one of them. More or less it is a triple deployment.

Describe the solution you'd like

Describe alternatives you've considered None

Additional context None

todor-ivanov commented 3 years ago

While dealing with this we figured out there was a misalignment of some PRs and at some point we have lost the changes providing the capabilities to read parameters from the run.sh file. Here is the PR fixing this:

https://github.com/dmwm/CMSKubernetes/pull/665

todor-ivanov commented 3 years ago

And here are the relevant PRs to the deployment and K8 repositories fixing the above issue:

https://github.com/dmwm/deployment/pull/1067

https://github.com/dmwm/CMSKubernetes/pull/666

todor-ivanov commented 3 years ago

Yet another obstacle discovered. Since we were creating a structure of docker images inheriting from one another, and in our case the reqmgr2ms-unmerged is inheriting from reqmgr2ms, we now need to create a separate directory for the image to be uploaded. This was not the case before because every microservice was using one and the same image.

Here is the PR to resolve this one: https://github.com/cms-sw/cms-docker/pull/142

todor-ivanov commented 3 years ago

And another one: Pod names should not contain capital letters and should be no longer than 15 chars. https://github.com/dmwm/CMSKubernetes/pull/667

Also the parameters to run.sh file were not propagated correctly and the gfal2 was not included in the service running path: https://github.com/dmwm/CMSKubernetes/pull/668/

todor-ivanov commented 3 years ago

We also found out that the difference in the RSE expression between the integration and production Rucio servers gives us problems while deploying. So we had to fix that and make it work for bot type of deployments. https://github.com/dmwm/CMSKubernetes/pull/670

todor-ivanov commented 3 years ago

Finally, we did manage to deploy in testbed, and bumped into the next problem on the row, now it seems the /data/status interface of the service is not functioning [1] and the pods are constantly restarting and the service is iterating continuously without respecting its own poling cycle. @muhammadimranfarooqi had to disable livenessprobe in order for the pods to stop retarting. (I'll let him explain if he wants to). I am going to search for the reason what is failing.

[1]

[root@ms-um-t1-5bc9bccd8d-l4gxf data]# cmsweb-ping --url=http://localhost:8360/ms-unmerged-t1/data/status --authz=/etc/hmac/hmac -verbose 0
Unable to get response from http://localhost:8360/ms-unmerged-t1/data/status, error: Get "http://localhost:8360/ms-unmerged-t1/data/status": dial tcp [::1]:8360: connect: connection refused[root@ms-um-t1-5bc9bccd8d-l4gxf data]# 
todor-ivanov commented 3 years ago

@muhammadimranfarooqi just provided the instructions on how this was done. All that is needed is to have this [1] area in the service .yaml file commented out.

[1] https://github.com/dmwm/CMSKubernetes/blob/f32c8b923b99f5e16689d1c423b24cd888f5566a/kubernetes/cmsweb/services/reqmgr2ms-unmerged-T1.yaml#L69-L78

todor-ivanov commented 3 years ago

The service is now running in DryRun mode in production.

[1]

2021-07-23 13:32:05,841:INFO:MSUnmerged: RSE: T1_FR_CCIN2P3_Disk With old consistency record in Rucio Consistency Monitor. Skipping it in the current run.
2021-07-23 13:32:05,842:ERROR:MSUnmerged: plineUnmerged: Run on RSE: T1_FR_CCIN2P3_Disk was interrupted due to: MSUnmergedException: MSUnmergedPlineExit: RSE: %s With old consistency record in Rucio Consistency Monitor. Skipping it in th
e current run.. 
Will retry again in the next cycle.

[2]

----------------------------------------------------------
MSUnmergedRSE: 
    counters: 
        deletedFail: 0
        deletedSuccess: 0
        dirsToDelete: 131
        dirsToDeleteAll: 131
        filesToDelete: 3148
        totalNumFiles: 69753
    dirs: 
        allUnmerged: 
        empty: 
        protected: 
            /store/unmerged/RunIISummer20UL16HLT/ST_t-channel_antitop_4f_InclusiveDecays_wtop1p15_TuneCP5_13TeV-powheg-madspin-pythia8/GEN-SIM-RAW/80X_mcRun2_asymptotic_2016_TrancheIV_v6-v3
            /store/unmerged/RunIISummer20UL16wmLHEGEN/GluGluZH_HToWWTo2L2Nu_M-125_TuneCP5_13TeV-powheg-pythia8/GEN/106X_mcRun2_asymptotic_v13-v1
            /store/unmerged/RunIISummer16DR80Premix/Radion_hh_hVVhbb_M2000_W0p1_TuneCP5_13TeV-madgraph-pythia8/AODSIM/PUMoriond17_80X_mcRun2_asymptotic_2016_TrancheIV_v6-v1
            /store/unmerged/RunIISummer20UL18MiniAODv2/NMSSM_XtoHYto4b_MX_2600_MY_500_TuneCP5_13TeV-madgraph-pythia8/MINIAODSIM/106X_upgrade2018_realistic_v16_L1v1-v1
            /store/unmerged/RunIISummer20UL16NanoAODv9/TTToHadronic_hdampDOWN_TuneCP5_13TeV-powheg-pythia8/NANOAODSIM/106X_mcRun2_asymptotic_v17-v1
            /store/unmerged/RunIISummer20UL16MiniAOD/QCD_Pt-15To20_MuEnrichedPt5_TuneCP5_13TeV-pythia8/MINIAODSIM/106X_mcRun2_asymptotic_v13-v1
            ...
        toDelete: 
            /store/unmerged/RunIISummer20UL17wmLHEGEN/QCD_HT200to300_TuneCP5_PSWeights_13TeV-madgraph-pythia8/LHE/106X_mc2017_realistic_v6-v1
            /store/unmerged/RunIISummer20UL16wmLHEGENAPV/GJets_DR-0p4_HT-200To400_TuneCP5_13TeV-madgraphMLM-pythia8/LHE/106X_mcRun2_asymptotic_preVFP_v8-v1
            /store/unmerged/RunIISummer20UL16wmLHEGEN/GJets_DR-0p4_HT-100To200_TuneCP5_13TeV-madgraphMLM-pythia8/GEN/106X_mcRun2_asymptotic_v13-v1
            /store/unmerged/RunIISummer20UL16RECO/ZprimeToBB_narrow_M-2000_TuneCP5_13TeV-madgraph-pythia8/AODSIM/106X_mcRun2_asymptotic_v13-v2
            /store/unmerged/RunIISummer20UL16wmLHEGEN/ZHToTauTau_M125_CP5_13TeV-powheg-pythia8_ext1/LHE/106X_mcRun2_asymptotic_v13-v1
            /store/unmerged/RunIISummer20UL16wmLHEGENAPV/ZJetsToNuNu_HT-400To600_TuneCP5_13TeV-madgraphMLM-pythia8/GEN/106X_mcRun2_asymptotic_preVFP_v8-v1
            ...
    files: 
        allUnmerged: 
            /store/unmerged/Run2016D/DisplacedJet/MINIAOD/HIPM_UL2016_MiniAODv2-v1/120000/19130AC0-8972-5246-AF9A-B2ECB3D8EDBF.root
            /store/unmerged/Run2016D/DisplacedJet/MINIAOD/HIPM_UL2016_MiniAODv2-v1/120000/61512FF5-107A-E344-B174-A76CB1FEE0F2.root
            /store/unmerged/Run2016D/DisplacedJet/MINIAOD/HIPM_UL2016_MiniAODv2-v1/120000/9BF15D39-DDFE-014E-A0C6-4D516E3829BF.root
            /store/unmerged/Run2016D/DisplacedJet/MINIAOD/HIPM_UL2016_MiniAODv2-v1/120000/CBE3C955-9C44-B946-8F57-A89F29C792EE.root
            /store/unmerged/Run2016D/DisplacedJet/MINIAOD/HIPM_UL2016_MiniAODv2-v1/70000/0C2E18D8-5318-2249-ADA6-265820024D43.root
            /store/unmerged/Run2016D/DisplacedJet/MINIAOD/HIPM_UL2016_MiniAODv2-v1/70000/36E582B7-2429-7E4D-8752-07407DE11681.root
            ...
        deletedFail: 
        deletedSuccess: 
        protected: 
        toDelete: 
            /store/unmerged/Run2016H/DoubleMuonLowMass/AOD/21Feb2020_UL2016-v1: <generator object MSUnmerged.filterUnmergedFiles.<locals>.genFunc at 0x7f6a540a2eb0>
            /store/unmerged/Run2016H/DoubleMuonLowMass/MINIAOD/21Feb2020_UL2016-v1: <generator object MSUnmerged.filterUnmergedFiles.<locals>.genFunc at 0x7f6a5141f3c0>
            /store/unmerged/RunIIAutumn18MiniAOD/ggXToJPsiJPsi_JPsiToMuMu_M9p0_JPCZeroMinusPlus_TuneCP5_13TeV-pythia8-JHUGen/MINIAODSIM/102X_upgrade2018_realistic_v15-v4: <generator object MSUnmerged.filterUnmergedFiles.<locals>.genFunc at 0x7f6a514244a0>
            /store/unmerged/RunIIAutumn18NanoAODv7/ggXToJPsiJPsi_JPsiToMuMu_M9p0_JPCZeroMinusPlus_TuneCP5_13TeV-pythia8-JHUGen/NANOAODSIM/Nano02Apr2020_102X_upgrade2018_realistic_v21-v4: <generator object MSUnmerged.filterUnmergedFiles.<locals>.genFunc at 0x7f6a514243c0>
            /store/unmerged/RunIIFall17MiniAODv2/ZpAnomalonHZ_UFO-Zp3500-ND600-NS200_TuneCP5_13TeV-madgraph-pythia8/MINIAODSIM/PU2017_12Apr2018_94X_mc2017_realistic_v14-v1: <generator object MSUnmerged.filterUnmergedFiles.<locals>.genFunc at 0x7f6a51429b30>
            /store/unmerged/RunIIFall17wmLHEGS/Suu_Diquark_S4000_chi760_TuneCP2_13TeV-madgraph-pythia8/GEN-SIM/93X_mc2017_realistic_v3-v1: <generator object MSUnmerged.filterUnmergedFiles.<locals>.genFunc at 0x7f6a540a2dd0>
            ...
    isClean: False
    name: T1_ES_PIC_Disk
    pfnPrefix: srm://srmcms.pic.es:8443/srm/managerv2?SFN=/pnfs/pic.es/data/cms/disk

----------------------------------------------------------
todor-ivanov commented 3 years ago

After the service has been enabled in RealRun mode we experienced a somehow not surprising problem - The voms proxy we use for the service is not being able to delete files at every site. Here is a brief summary of the partial success with what we have so far:

The full list of RSEs fetched is as followes:

And the reason for that is now known. Even though we have setup the proper roles for the certificate at the voms server and CRIC too: [1], we do not issue the voms proxy with the correct voms extensions and role from inside the container. Here is the proxy renewal script that is used in the standard docker containers for the CMSKubernetes setup: [2]. It is clearly seen that the line:

sudo voms-proxy-init -voms cms -rfc -key /data/certs/robotkey.pem -cert /data/certs/robotcert.pem

should look like:

sudo voms-proxy-init -rfc -key /data/certs/robotkey.pem -cert /data/certs/robotcert.pem --voms cms:/cms/Role=production --valid 240:00

So in order to fix that we will need to create another script jsut for our setup, substitute the current one. Setup a cron jobs that runs ours instead of the standard one [3].

And since are going to do all this we may also consider changing this certificate with a dedicated one as explained in this issue: https://github.com/dmwm/WMCore/issues/10680

[1] https://cms-cric.cern.ch/api/accounts/user/query/?json&preset=roles

  {
    "DN": "/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=cmswebmg/CN=779320/CN=Robot: Service account for CMSWeb migration", 
    "ID": null, 
    "LOGIN": "cmswebmg@service.cern.ch", 
    "NAME": "cmswebmg Service", 
    "ROLES": {
      "admin": [
        "group:phedex"
      ], 
      "operator": [
        "group:dbs"
      ], 
      "production-operator": [
        "group:dataops"
      ], 
      "web-service": [
        "group:facops"
      ]
    }
  }

[2] https://github.com/dmwm/CMSKubernetes/blob/master/docker/cmsweb/proxy.sh#L21

[3] https://github.com/dmwm/CMSKubernetes/blob/master/kubernetes/cmsweb/crons/cron-proxy.yaml

todor-ivanov commented 3 years ago

For the above to be solved what @muhammadimranfarooqi had to do is to substitute those two files: https://github.com/dmwm/CMSKubernetes/blob/master/docker/proxy/proxy.sh#L37-L39 https://github.com/dmwm/CMSKubernetes/blob/master/kubernetes/cmsweb/crons/cron-proxy.yaml

I have no idea why the above proxy.sh (the general one for all services) instead of the one we were previously discussing to be changed: https://github.com/dmwm/CMSKubernetes/blob/master/docker/cmsweb/proxy.sh#L21

In order to achieve that he had to do these steps:

kubectl get secrets -n dmwm | grep proxy-se
proxy-secrets                   Opaque                                1      136d
proxy-secrets-ms-unmerged       Opaque                                1      33m
todor-ivanov commented 3 years ago

And here are the commits provided by @muhammadimranfarooqi in order to make the above change permanent: https://github.com/dmwm/CMSKubernetes/pull/682 https://github.com/cms-sw/cms-docker/pull/143

For the new proxy to be created a new image in docker hub is needed: https://hub.docker.com/r/cmssw/proxy-ms-unmerged Which is then mounted in the pod under the proper path... Intentionally not going into too deep details in this explanation here.

todor-ivanov commented 3 years ago

Yet another obstacle. Now we are breaking dependencies for every other microservice, with the following error: [1]. The reason is because gfal2 was installed only in the reqmgr2ms-unmerged docker container, with these lines: [2]. But on the other hand we have the MSManager trying to import the MSUnmerged [3] [4] no matter which is the micro service to be executed and hence the chain of dependencies which brings it for all of the others, which on the other side does not have the gfal2 package installed.

So the solution should be to move the gfal2 dependency one level up. since now the reqmgr2ms-unmerged docker image inherits from reqmgr2ms then simply those lines from [2] need to be moved in [5]

[1]

ERROR initializing MicroService REST module. - Podname=ms-monitor-6d8d994dd-bptd4
Traceback (most recent call last): - Podname=ms-monitor-6d8d994dd-bptd4
  File "/data/srv/HG2108e/sw/slc7_amd64_gcc630/cms/reqmgr2ms/0.5.1.pre3/lib/python3.8/site-packages/WMCore/MicroService/Service/Data.py", line 57, in __init__ - Podname=ms-monitor-6d8d994dd-bptd4
    module = importlib.import_module('.'.join(arr[:-1])) - Podname=ms-monitor-6d8d994dd-bptd4
  File "/data/srv/HG2108e/sw/slc7_amd64_gcc630/external/python3/3.8.2-comp/lib/python3.8/importlib/__init__.py", line 127, in import_module - Podname=ms-monitor-6d8d994dd-bptd4
    return _bootstrap._gcd_import(name[level:], package, level) - Podname=ms-monitor-6d8d994dd-bptd4
  File "<frozen importlib._bootstrap>", line 1014, in _gcd_import - Podname=ms-monitor-6d8d994dd-bptd4
  File "<frozen importlib._bootstrap>", line 991, in _find_and_load - Podname=ms-monitor-6d8d994dd-bptd4
  File "<frozen importlib._bootstrap>", line 975, in _find_and_load_unlocked - Podname=ms-monitor-6d8d994dd-bptd4
  File "<frozen importlib._bootstrap>", line 671, in _load_unlocked - Podname=ms-monitor-6d8d994dd-bptd4
  File "<frozen importlib._bootstrap_external>", line 783, in exec_module - Podname=ms-monitor-6d8d994dd-bptd4
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed - Podname=ms-monitor-6d8d994dd-bptd4
  File "/data/srv/HG2108e/sw/slc7_amd64_gcc630/cms/reqmgr2ms/0.5.1.pre3/lib/python3.8/site-packages/WMCore/MicroService/MSManager.py", line 39, in <module> - Podname=ms-monitor-6d8d994dd-bptd4
    from WMCore.MicroService.MSUnmerged.MSUnmerged import MSUnmerged - Podname=ms-monitor-6d8d994dd-bptd4
  File "/data/srv/HG2108e/sw/slc7_amd64_gcc630/cms/reqmgr2ms/0.5.1.pre3/lib/python3.8/site-packages/WMCore/MicroService/MSUnmerged/MSUnmerged.py", line 24, in <module> - Podname=ms-monitor-6d8d994dd-bptd4
    import gfal2 - Podname=ms-monitor-6d8d994dd-bptd4
ModuleNotFoundError: No module named 'gfal2' - Podname=ms-monitor-6d8d994dd-bptd4
[28/Jul/2021:08:45:36]  INFO: loading ui into /ms-monitor - Podname=ms-monitor-6d8d994dd-bptd4
FrontPage static content: /data/srv/HG2108e/sw/slc7_amd64_gcc630/cms/reqmgr2ms/0.5.1.pre3/lib/python3.8/data/ - Podname=ms-monitor-6d8d994dd-bptd4
[28/Jul/2021:08:45:36]  INFO: starting server in /data/srv/state/reqmgr2ms - Podname=ms-monitor-6d8d994dd-bptd4
[28/Jul/2021:08:45:36] ENGINE Bus STARTING - Podname=ms-monitor-6d8d994dd-bptd4
[28/Jul/2021:08:45:36] ENGINE Serving on http://0.0.0.0:8248 - Podname=ms-monitor-6d8d994dd-bptd4
[28/Jul/2021:08:45:36] ENGINE Bus STARTED - Podname=ms-monitor-6d8d994dd-bptd4

[2] https://github.com/dmwm/CMSKubernetes/blob/5d61d7c429984ed69075e274e745d0d6b350aaf5/docker/reqmgr2ms-unmerged/Dockerfile#L13-L22

[3] https://github.com/dmwm/WMCore/blob/21b48cf8e54cf319e2e260138ff5432eb716a07d/src/python/WMCore/MicroService/MSManager.py#L39

[4] https://github.com/dmwm/WMCore/blob/21b48cf8e54cf319e2e260138ff5432eb716a07d/src/python/WMCore/MicroService/MSUnmerged/MSUnmerged.py#L24

[5] https://github.com/dmwm/CMSKubernetes/blob/master/docker/reqmgr2ms/Dockerfile

amaltaro commented 3 years ago

@todor-ivanov Todor, I do not like imports in the middle of the code, neither it's accepted by the coding style tools. However, for this case I think it would make sense to only import the microservice that we are meant to execute, thus, adding those imports to the __init__ method, here: https://github.com/dmwm/WMCore/blob/21b48cf8e54cf319e2e260138ff5432eb716a07d/src/python/WMCore/MicroService/MSManager.py#L69

example:

        if 'transferor' in self.services:
            from WMCore.MicroService.MSTransferor.MSTransferor import MSTransferor
            self.msTransferor = MSTransferor(self.msConfig, logger=self.logger)
etc etc

What do you think?

todor-ivanov commented 3 years ago

Thanks @amaltaro, I do not like them either. I think @muhammadimranfarooqi just deployed in testbed with the so proposed solution may fly. Lets give it a chance. Even though it is breaking the idea of having this dependency isolated from the docker images of the rest of the micro services.

amaltaro commented 3 years ago

Honestly, mixing system and local python libraries is dangerous and might cause us problems in the future. Our WMCore services are meant to be self-contained (at least in terms of python packages), and we should make an effort to keep it like that.

Having said that, I'd highly recommend to isolate the PYTHONPATH override to the MSUnmerged image only. That means, we keep deploying gfal2 python bindings only in the MSUnmerged Dockerfile, opposed to the PR merged today, here: https://github.com/dmwm/CMSKubernetes/pull/692

by isolating it to the MSUnmerged, we will likely need to have the WMCore changes I mentioned in my previous message though. Still, it's less risky than the current model IMO.

@todor-ivanov can you please coordinate this with Imran? Otherwise, I can take care of this tomorrow morning.

amaltaro commented 3 years ago

As discussed with @todor-ivanov yesterday morning, since the service has already been deployed to CMSWEB production and all the necessary deployment/credentials has been sorted, we can now close this issue.

I was on vacation when most of this happen, but here is a summary of everything that went in to be able to deploy and properly run it in production:

Thank you very much for all the work done here, Todor. Feel free to complement and/or reopen it if I missed anything.