Closed todor-ivanov closed 3 years ago
While dealing with this we figured out there was a misalignment between some PRs, and at some point we lost the changes providing the capability to read parameters from the run.sh file. Here is the PR fixing this:
And here are the relevant PRs to the deployment and K8s repositories fixing the above issue:
Yet another obstacle discovered.
Since we are building a structure of Docker images inheriting from one another, and in our case reqmgr2ms-unmerged inherits from reqmgr2ms, we now need to create a separate directory for the image to be uploaded. This was not the case before, because every microservice was using one and the same image.
Here is the PR to resolve this one: https://github.com/cms-sw/cms-docker/pull/142
And another one: Pod names should not contain capital letters and should be no longer than 15 chars. https://github.com/dmwm/CMSKubernetes/pull/667
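The naming constraint mentioned above can be sketched as a tiny check (a hypothetical helper, not part of any of the linked PRs; the 15-character limit is the one stated here for this deployment, while Kubernetes itself allows longer DNS-style names):

```python
import re

# Lowercase alphanumerics and dashes only, starting and ending with an
# alphanumeric character; length capped at the limit mentioned above.
NAME_RE = re.compile(r'^[a-z0-9]([a-z0-9-]*[a-z0-9])?$')

def is_valid_pod_name(name, max_len=15):
    """Return True if `name` satisfies the constraints described above."""
    return len(name) <= max_len and bool(NAME_RE.match(name))

print(is_valid_pod_name("ms-unmerged-t1"))        # True: 14 chars, lowercase
print(is_valid_pod_name("MSUnmergedT1"))          # False: capital letters
print(is_valid_pod_name("ms-unmerged-tier2plus")) # False: longer than 15 chars
```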
Also, the parameters to the run.sh file were not propagated correctly, and gfal2 was not included in the service running path:
https://github.com/dmwm/CMSKubernetes/pull/668/
We also found out that the difference in the RSE expression between the integration and production Rucio servers gives us problems while deploying. So we had to fix that and make it work for both types of deployments. https://github.com/dmwm/CMSKubernetes/pull/670
Finally, we did manage to deploy in testbed, and bumped into the next problem in the row: it now seems the /data/status interface of the service is not functioning [1], the pods are constantly restarting, and the service is iterating continuously without respecting its own polling cycle. @muhammadimranfarooqi had to disable the livenessProbe in order for the pods to stop restarting (I'll let him explain if he wants to). I am going to search for the reason why it is failing.
[1]
[root@ms-um-t1-5bc9bccd8d-l4gxf data]# cmsweb-ping --url=http://localhost:8360/ms-unmerged-t1/data/status --authz=/etc/hmac/hmac -verbose 0
Unable to get response from http://localhost:8360/ms-unmerged-t1/data/status, error: Get "http://localhost:8360/ms-unmerged-t1/data/status": dial tcp [::1]:8360: connect: connection refused
[root@ms-um-t1-5bc9bccd8d-l4gxf data]#
@muhammadimranfarooqi just provided the instructions on how this was done. All that is needed is to have this area [1] in the service's .yaml file commented out.
The service is now running in DryRun mode in production.
[1]
2021-07-23 13:32:05,841:INFO:MSUnmerged: RSE: T1_FR_CCIN2P3_Disk With old consistency record in Rucio Consistency Monitor. Skipping it in the current run.
2021-07-23 13:32:05,842:ERROR:MSUnmerged: plineUnmerged: Run on RSE: T1_FR_CCIN2P3_Disk was interrupted due to: MSUnmergedException: MSUnmergedPlineExit: RSE: %s With old consistency record in Rucio Consistency Monitor. Skipping it in the current run..
Will retry again in the next cycle.
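As an aside, the literal `%s` in the ERROR line above suggests the exception message was built without interpolating the RSE name. A hypothetical minimal sketch of that kind of slip (not the actual MSUnmerged code):

```python
rseName = "T1_FR_CCIN2P3_Disk"
template = "RSE: %s With old consistency record in Rucio Consistency Monitor."

broken = template            # template passed along as-is: the "%s" survives
fixed = template % rseName   # what the message was presumably meant to say

print(broken)
print(fixed)
```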
[2]
----------------------------------------------------------
MSUnmergedRSE:
counters:
deletedFail: 0
deletedSuccess: 0
dirsToDelete: 131
dirsToDeleteAll: 131
filesToDelete: 3148
totalNumFiles: 69753
dirs:
allUnmerged:
empty:
protected:
/store/unmerged/RunIISummer20UL16HLT/ST_t-channel_antitop_4f_InclusiveDecays_wtop1p15_TuneCP5_13TeV-powheg-madspin-pythia8/GEN-SIM-RAW/80X_mcRun2_asymptotic_2016_TrancheIV_v6-v3
/store/unmerged/RunIISummer20UL16wmLHEGEN/GluGluZH_HToWWTo2L2Nu_M-125_TuneCP5_13TeV-powheg-pythia8/GEN/106X_mcRun2_asymptotic_v13-v1
/store/unmerged/RunIISummer16DR80Premix/Radion_hh_hVVhbb_M2000_W0p1_TuneCP5_13TeV-madgraph-pythia8/AODSIM/PUMoriond17_80X_mcRun2_asymptotic_2016_TrancheIV_v6-v1
/store/unmerged/RunIISummer20UL18MiniAODv2/NMSSM_XtoHYto4b_MX_2600_MY_500_TuneCP5_13TeV-madgraph-pythia8/MINIAODSIM/106X_upgrade2018_realistic_v16_L1v1-v1
/store/unmerged/RunIISummer20UL16NanoAODv9/TTToHadronic_hdampDOWN_TuneCP5_13TeV-powheg-pythia8/NANOAODSIM/106X_mcRun2_asymptotic_v17-v1
/store/unmerged/RunIISummer20UL16MiniAOD/QCD_Pt-15To20_MuEnrichedPt5_TuneCP5_13TeV-pythia8/MINIAODSIM/106X_mcRun2_asymptotic_v13-v1
...
toDelete:
/store/unmerged/RunIISummer20UL17wmLHEGEN/QCD_HT200to300_TuneCP5_PSWeights_13TeV-madgraph-pythia8/LHE/106X_mc2017_realistic_v6-v1
/store/unmerged/RunIISummer20UL16wmLHEGENAPV/GJets_DR-0p4_HT-200To400_TuneCP5_13TeV-madgraphMLM-pythia8/LHE/106X_mcRun2_asymptotic_preVFP_v8-v1
/store/unmerged/RunIISummer20UL16wmLHEGEN/GJets_DR-0p4_HT-100To200_TuneCP5_13TeV-madgraphMLM-pythia8/GEN/106X_mcRun2_asymptotic_v13-v1
/store/unmerged/RunIISummer20UL16RECO/ZprimeToBB_narrow_M-2000_TuneCP5_13TeV-madgraph-pythia8/AODSIM/106X_mcRun2_asymptotic_v13-v2
/store/unmerged/RunIISummer20UL16wmLHEGEN/ZHToTauTau_M125_CP5_13TeV-powheg-pythia8_ext1/LHE/106X_mcRun2_asymptotic_v13-v1
/store/unmerged/RunIISummer20UL16wmLHEGENAPV/ZJetsToNuNu_HT-400To600_TuneCP5_13TeV-madgraphMLM-pythia8/GEN/106X_mcRun2_asymptotic_preVFP_v8-v1
...
files:
allUnmerged:
/store/unmerged/Run2016D/DisplacedJet/MINIAOD/HIPM_UL2016_MiniAODv2-v1/120000/19130AC0-8972-5246-AF9A-B2ECB3D8EDBF.root
/store/unmerged/Run2016D/DisplacedJet/MINIAOD/HIPM_UL2016_MiniAODv2-v1/120000/61512FF5-107A-E344-B174-A76CB1FEE0F2.root
/store/unmerged/Run2016D/DisplacedJet/MINIAOD/HIPM_UL2016_MiniAODv2-v1/120000/9BF15D39-DDFE-014E-A0C6-4D516E3829BF.root
/store/unmerged/Run2016D/DisplacedJet/MINIAOD/HIPM_UL2016_MiniAODv2-v1/120000/CBE3C955-9C44-B946-8F57-A89F29C792EE.root
/store/unmerged/Run2016D/DisplacedJet/MINIAOD/HIPM_UL2016_MiniAODv2-v1/70000/0C2E18D8-5318-2249-ADA6-265820024D43.root
/store/unmerged/Run2016D/DisplacedJet/MINIAOD/HIPM_UL2016_MiniAODv2-v1/70000/36E582B7-2429-7E4D-8752-07407DE11681.root
...
deletedFail:
deletedSuccess:
protected:
toDelete:
/store/unmerged/Run2016H/DoubleMuonLowMass/AOD/21Feb2020_UL2016-v1: <generator object MSUnmerged.filterUnmergedFiles.<locals>.genFunc at 0x7f6a540a2eb0>
/store/unmerged/Run2016H/DoubleMuonLowMass/MINIAOD/21Feb2020_UL2016-v1: <generator object MSUnmerged.filterUnmergedFiles.<locals>.genFunc at 0x7f6a5141f3c0>
/store/unmerged/RunIIAutumn18MiniAOD/ggXToJPsiJPsi_JPsiToMuMu_M9p0_JPCZeroMinusPlus_TuneCP5_13TeV-pythia8-JHUGen/MINIAODSIM/102X_upgrade2018_realistic_v15-v4: <generator object MSUnmerged.filterUnmergedFiles.<locals>.genFunc at 0x7f6a514244a0>
/store/unmerged/RunIIAutumn18NanoAODv7/ggXToJPsiJPsi_JPsiToMuMu_M9p0_JPCZeroMinusPlus_TuneCP5_13TeV-pythia8-JHUGen/NANOAODSIM/Nano02Apr2020_102X_upgrade2018_realistic_v21-v4: <generator object MSUnmerged.filterUnmergedFiles.<locals>.genFunc at 0x7f6a514243c0>
/store/unmerged/RunIIFall17MiniAODv2/ZpAnomalonHZ_UFO-Zp3500-ND600-NS200_TuneCP5_13TeV-madgraph-pythia8/MINIAODSIM/PU2017_12Apr2018_94X_mc2017_realistic_v14-v1: <generator object MSUnmerged.filterUnmergedFiles.<locals>.genFunc at 0x7f6a51429b30>
/store/unmerged/RunIIFall17wmLHEGS/Suu_Diquark_S4000_chi760_TuneCP2_13TeV-madgraph-pythia8/GEN-SIM/93X_mc2017_realistic_v3-v1: <generator object MSUnmerged.filterUnmergedFiles.<locals>.genFunc at 0x7f6a540a2dd0>
...
isClean: False
name: T1_ES_PIC_Disk
pfnPrefix: srm://srmcms.pic.es:8443/srm/managerv2?SFN=/pnfs/pic.es/data/cms/disk
----------------------------------------------------------
After the service has been enabled in RealRun mode, we experienced a somewhat unsurprising problem: the VOMS proxy we use for the service is not able to delete files at every site. Here is a brief summary of the partial success with what we have so far:
The full list of RSEs fetched is as follows:
And the reason for that is now known. Even though we have set up the proper roles for the certificate at the VOMS server, and in CRIC too [1], we do not issue the VOMS proxy with the correct VOMS extensions and role from inside the container. Here is the proxy renewal script that is used in the standard Docker containers for the CMSKubernetes setup: [2]. It is clearly seen that the line:
sudo voms-proxy-init -voms cms -rfc -key /data/certs/robotkey.pem -cert /data/certs/robotcert.pem
should look like:
sudo voms-proxy-init -rfc -key /data/certs/robotkey.pem -cert /data/certs/robotcert.pem --voms cms:/cms/Role=production --valid 240:00
So in order to fix that, we will need to create another script just for our setup, substitute the current one, and set up a cron job that runs ours instead of the standard one [3].
And since we are going to do all this, we may also consider replacing this certificate with a dedicated one, as explained in this issue: https://github.com/dmwm/WMCore/issues/10680
[1] https://cms-cric.cern.ch/api/accounts/user/query/?json&preset=roles
{
"DN": "/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=cmswebmg/CN=779320/CN=Robot: Service account for CMSWeb migration",
"ID": null,
"LOGIN": "cmswebmg@service.cern.ch",
"NAME": "cmswebmg Service",
"ROLES": {
"admin": [
"group:phedex"
],
"operator": [
"group:dbs"
],
"production-operator": [
"group:dataops"
],
"web-service": [
"group:facops"
]
}
}
[2] https://github.com/dmwm/CMSKubernetes/blob/master/docker/cmsweb/proxy.sh#L21
[3] https://github.com/dmwm/CMSKubernetes/blob/master/kubernetes/cmsweb/crons/cron-proxy.yaml
To solve the above, what @muhammadimranfarooqi had to do is substitute those two files: https://github.com/dmwm/CMSKubernetes/blob/master/docker/proxy/proxy.sh#L37-L39 https://github.com/dmwm/CMSKubernetes/blob/master/kubernetes/cmsweb/crons/cron-proxy.yaml
I have no idea why the above proxy.sh (the general one for all services) was changed instead of the one we were previously discussing:
https://github.com/dmwm/CMSKubernetes/blob/master/docker/cmsweb/proxy.sh#L21
In order to achieve that, he had to do these steps:
Create a new cron job for msUnmerged only:
https://github.com/dmwm/CMSKubernetes/pull/672
which uses this command to issue the certificate (how and where the change of the scripts executing this command is actually done is a total mystery to me):
voms-proxy-init -rfc \
-key /tmp/robotkey.pem \
-cert /tmp/robotcert.pem \
--voms cms:/cms/Role=production --valid 192:00 \
-out /tmp/proxy
Refer to this cronjob from the yaml files of the MSUnmerged containers only: https://github.com/dmwm/CMSKubernetes/pull/671
Rebuild secrets:
kubectl create secret generic proxy-secrets-ms-unmerged \
--from-file=/tmp/proxy --dry-run=client -o yaml | \
kubectl apply --validate=false -f -
kubectl get secrets -n dmwm | grep proxy-se
proxy-secrets Opaque 1 136d
proxy-secrets-ms-unmerged Opaque 1 33m
And here are the commits provided by @muhammadimranfarooqi in order to make the above change permanent: https://github.com/dmwm/CMSKubernetes/pull/682 https://github.com/cms-sw/cms-docker/pull/143
For the new proxy to be created, a new image in Docker Hub is needed: https://hub.docker.com/r/cmssw/proxy-ms-unmerged which is then mounted in the pod under the proper path. I am intentionally not going into too much detail in this explanation here.
Yet another obstacle. Now we are breaking dependencies for every other microservice, with the following error: [1]. The reason is that gfal2 was installed only in the reqmgr2ms-unmerged Docker container, with these lines: [2]. But on the other hand, we have the MSManager trying to import MSUnmerged [3] [4] no matter which microservice is to be executed, and hence the chain of dependencies brings it in for all of the others, which on the other side do not have the gfal2 package installed.
So the solution should be to move the gfal2 dependency one level up: since the reqmgr2ms-unmerged Docker image inherits from reqmgr2ms, those lines from [2] simply need to be moved into [5].
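Another way this import-chain breakage could be contained (a sketch only, not the fix that was adopted above) is to guard the top-level `gfal2` import so that services which never run the unmerged pipeline can still import the module chain:

```python
# Guard the optional dependency at import time; class name and the runtime
# check below are hypothetical stand-ins, not the real WMCore code.
try:
    import gfal2
except ImportError:
    gfal2 = None  # only the MSUnmerged pipeline actually needs it


class MSUnmergedSketch:
    """Hypothetical stand-in for MSUnmerged, illustrating the guard."""

    def __init__(self):
        # Fail loudly only when this particular service is actually used.
        if gfal2 is None:
            raise RuntimeError("gfal2 python bindings are required for MSUnmerged")
```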
[1]
ERROR initializing MicroService REST module. - Podname=ms-monitor-6d8d994dd-bptd4
Traceback (most recent call last): - Podname=ms-monitor-6d8d994dd-bptd4
File "/data/srv/HG2108e/sw/slc7_amd64_gcc630/cms/reqmgr2ms/0.5.1.pre3/lib/python3.8/site-packages/WMCore/MicroService/Service/Data.py", line 57, in __init__ - Podname=ms-monitor-6d8d994dd-bptd4
module = importlib.import_module('.'.join(arr[:-1])) - Podname=ms-monitor-6d8d994dd-bptd4
File "/data/srv/HG2108e/sw/slc7_amd64_gcc630/external/python3/3.8.2-comp/lib/python3.8/importlib/__init__.py", line 127, in import_module - Podname=ms-monitor-6d8d994dd-bptd4
return _bootstrap._gcd_import(name[level:], package, level) - Podname=ms-monitor-6d8d994dd-bptd4
File "<frozen importlib._bootstrap>", line 1014, in _gcd_import - Podname=ms-monitor-6d8d994dd-bptd4
File "<frozen importlib._bootstrap>", line 991, in _find_and_load - Podname=ms-monitor-6d8d994dd-bptd4
File "<frozen importlib._bootstrap>", line 975, in _find_and_load_unlocked - Podname=ms-monitor-6d8d994dd-bptd4
File "<frozen importlib._bootstrap>", line 671, in _load_unlocked - Podname=ms-monitor-6d8d994dd-bptd4
File "<frozen importlib._bootstrap_external>", line 783, in exec_module - Podname=ms-monitor-6d8d994dd-bptd4
File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed - Podname=ms-monitor-6d8d994dd-bptd4
File "/data/srv/HG2108e/sw/slc7_amd64_gcc630/cms/reqmgr2ms/0.5.1.pre3/lib/python3.8/site-packages/WMCore/MicroService/MSManager.py", line 39, in <module> - Podname=ms-monitor-6d8d994dd-bptd4
from WMCore.MicroService.MSUnmerged.MSUnmerged import MSUnmerged - Podname=ms-monitor-6d8d994dd-bptd4
File "/data/srv/HG2108e/sw/slc7_amd64_gcc630/cms/reqmgr2ms/0.5.1.pre3/lib/python3.8/site-packages/WMCore/MicroService/MSUnmerged/MSUnmerged.py", line 24, in <module> - Podname=ms-monitor-6d8d994dd-bptd4
import gfal2 - Podname=ms-monitor-6d8d994dd-bptd4
ModuleNotFoundError: No module named 'gfal2' - Podname=ms-monitor-6d8d994dd-bptd4
[28/Jul/2021:08:45:36] INFO: loading ui into /ms-monitor - Podname=ms-monitor-6d8d994dd-bptd4
FrontPage static content: /data/srv/HG2108e/sw/slc7_amd64_gcc630/cms/reqmgr2ms/0.5.1.pre3/lib/python3.8/data/ - Podname=ms-monitor-6d8d994dd-bptd4
[28/Jul/2021:08:45:36] INFO: starting server in /data/srv/state/reqmgr2ms - Podname=ms-monitor-6d8d994dd-bptd4
[28/Jul/2021:08:45:36] ENGINE Bus STARTING - Podname=ms-monitor-6d8d994dd-bptd4
[28/Jul/2021:08:45:36] ENGINE Serving on http://0.0.0.0:8248 - Podname=ms-monitor-6d8d994dd-bptd4
[28/Jul/2021:08:45:36] ENGINE Bus STARTED - Podname=ms-monitor-6d8d994dd-bptd4
[5] https://github.com/dmwm/CMSKubernetes/blob/master/docker/reqmgr2ms/Dockerfile
@todor-ivanov Todor, I do not like imports in the middle of the code, nor is it accepted by the coding style tools.
However, for this case I think it would make sense to only import the microservice that we are meant to execute, thus adding those imports to the __init__ method, here:
https://github.com/dmwm/WMCore/blob/21b48cf8e54cf319e2e260138ff5432eb716a07d/src/python/WMCore/MicroService/MSManager.py#L69
example:
if 'transferor' in self.services:
from WMCore.MicroService.MSTransferor.MSTransferor import MSTransferor
self.msTransferor = MSTransferor(self.msConfig, logger=self.logger)
etc etc
What do you think?
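A minimal runnable sketch of that conditional-import idea (the class shape, service keys, and constructor signature here are illustrative; the real WMCore MSManager differs):

```python
class MSManagerSketch:
    """Illustrative skeleton of MSManager, not the real WMCore class."""

    def __init__(self, msConfig, logger=None):
        self.msConfig = msConfig
        self.logger = logger
        self.services = msConfig.get('services', [])
        # Import each microservice module only if it is actually enabled,
        # so heavy/optional dependencies (e.g. gfal2, pulled in via
        # MSUnmerged) are loaded only where they are needed.
        if 'transferor' in self.services:
            from WMCore.MicroService.MSTransferor.MSTransferor import MSTransferor
            self.msTransferor = MSTransferor(self.msConfig, logger=self.logger)
        if 'unmerged' in self.services:
            from WMCore.MicroService.MSUnmerged.MSUnmerged import MSUnmerged
            self.msUnmerged = MSUnmerged(self.msConfig, logger=self.logger)
```

With an empty service list, no microservice module (and therefore no gfal2) is ever imported.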
Thanks @amaltaro, I do not like them either. I think @muhammadimranfarooqi just deployed in testbed with it, so the proposed solution may fly. Let's give it a chance, even though it breaks the idea of having this dependency isolated from the Docker images of the rest of the microservices.
Honestly, mixing system and local python libraries is dangerous and might cause us problems in the future. Our WMCore services are meant to be self-contained (at least in terms of python packages), and we should make an effort to keep it like that.
Having said that, I'd highly recommend isolating the PYTHONPATH override to the MSUnmerged image only. That means we keep deploying the gfal2 python bindings only in the MSUnmerged Dockerfile, as opposed to the PR merged today, here:
https://github.com/dmwm/CMSKubernetes/pull/692
By isolating it to MSUnmerged, we will likely need to have the WMCore changes I mentioned in my previous message, though. Still, it's less risky than the current model, IMO.
@todor-ivanov can you please coordinate this with Imran? Otherwise, I can take care of this tomorrow morning.
As discussed with @todor-ivanov yesterday morning, since the service has already been deployed to CMSWEB production and all the necessary deployment/credentials have been sorted out, we can now close this issue.
I was on vacation when most of this happened, but here is a summary of everything that went in to be able to deploy and properly run it in production:
Thank you very much for all the work done here, Todor. Feel free to complement and/or reopen it if I missed anything.
Impact of the new feature MSUnmerged
Is your feature request related to a problem? Please describe. Following the decisions taken about how to scale the service, we decided that we will deploy three different instances taking care of three different subsets of RSEs. What we missed in the previous step was that they would also need to have their deployment names changed, and a dedicated port assigned to each one of them. More or less, it is a triple deployment.
Describe the solution you'd like
Describe alternatives you've considered None
Additional context None
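To make the triple deployment concrete, a tiny sketch of the kind of name/port fan-out the three instances need (the subset labels and port offsets are hypothetical; only the name "ms-unmerged-t1" and port 8360 appear earlier in this thread):

```python
# Hypothetical mapping: each RSE subset gets its own deployment name and port.
RSE_SUBSETS = ["t1", "t2t3", "t2t3us"]  # illustrative subset labels
BASE_PORT = 8360  # the port used by the ms-unmerged-t1 instance above

deployments = {"ms-unmerged-%s" % subset: BASE_PORT + idx
               for idx, subset in enumerate(RSE_SUBSETS)}

for name, port in sorted(deployments.items()):
    print(name, port)
```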