amaltaro opened 3 years ago
@todor-ivanov when you have time, can you please investigate this again? I have marked it as high prio, but feel free to discuss it with Valentin and Imran and lower the priority if possible.
I have just tested a workaround and I'm going to create a WMAgent-branch-specific fix for it (instead of rolling back the AgentStatusPoller changes).
Ok @amaltaro I will work on that ASAP.
It would be good to mention also the PR with the workaround you were talking about: https://github.com/dmwm/WMCore/pull/10453
FYI @vkuznet
@todor-ivanov Todor, after seeing this HN announcement from a few minutes ago: https://hypernews.cern.ch/HyperNews/CMS/get/webInterfaces/1781.html
we should either try to resolve this issue right away; or raise this point with Imran/Valentin in that thread. Can you please follow this up?
Hi @amaltaro Working on it!
I created a new label called `Tokens`, to be used for all the activity related to commissioning and support of tokens in WMCore services in general.
Also removed the BUG label from this issue, since it's actually not a bug, but a feature change.
IMO, this should be pushed to Q4, btw. However, if we do so, it becomes a blocker for the token-related activities that Valentin/Imran have been doing with CMSWEB...
Once again, my question is: is it worth all this effort to deal with auth within CouchDB itself? As I proposed before, it would be much easier to move to an APS+CouchDB setup, where APS handles the auth and CouchDB does its job of replication and so on. I understand that you would need to adopt a new CouchDB data format, but once that is done I really see no need to spend any time fixing the auth layer in CouchDB and dealing with this kind of error, since I demonstrated that APS+CouchDB just works (and it supports both x509 and tokens). My suggestion still remains, and I suggest that you evaluate its benefits.
@amaltaro , I reviewed this issue and came to the conclusion that replication on port 8443 will not work on the newly deployed CouchDB images on k8s, by definition. The replication process is issued by CouchDB itself, so in order to start replication it needs to authenticate with our FE. The replication document will use an https://cmsweb...
URL, which will require authentication on the FE. On port 443 (APS) we can do that by supplying a token in the replication document, which CouchDB will pass along in the HTTP replication request. On port 8443 (XPS) we cannot, since stock CouchDB does not support the x509 authentication protocol. For more technical details on how to use the replication document and APS/XPS, please refer to this gist.
I do not know if anything can be done in this ticket. Please let me know your opinion on this.
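To make the port-443 (APS) path concrete: CouchDB replication documents accept a per-endpoint `headers` field, so a bearer token can be passed along with each replication request. The sketch below is illustrative only; the URLs, database names, and token are placeholders, not the actual documents from this deployment.

```python
import json

# Hypothetical _replicator document for token-based (APS, port 443)
# replication. CouchDB forwards the Authorization header from the
# "source" endpoint object on every HTTP request it issues.
replication_doc = {
    "_id": "wmstats_rep",
    "source": {
        "url": "https://cmsweb-testbed.cern.ch/couchdb/wmstats",
        "headers": {"Authorization": "Bearer <SOME_TOKEN>"},
    },
    "target": "http://localhost:5984/wmstats",
    "continuous": True,
}

# This JSON body would be PUT into the _replicator database.
print(json.dumps(replication_doc, indent=2))
```

The XPS (port 8443) case has no equivalent, since there is no field in the replication document through which stock CouchDB can present a client x509 certificate per request.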
@vkuznet in addition to my comment here: https://github.com/dmwm/WMCore/issues/11068#issuecomment-1253706596
I'd suggest you have a look at a CouchDB configuration that actually does replication with x509 certificates; please look at this file (especially the `ssl` and `replicator` sections):
cmst1@vocms0263:/data/srv/wmagent/current $ vim config/couchdb/local.ini
CMSWEB-based CouchDB does not start any database replication, that's why it does not define those configuration sections.
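For reference, the two sections mentioned above typically look like the sketch below. The option names come from the stock CouchDB configuration; the certificate paths are placeholders, not the actual values from the agent's `config/couchdb/local.ini`.

```ini
; Sketch only: server-side TLS plus client certs for outgoing replication
[ssl]
cert_file = /data/certs/servicecert.pem
key_file = /data/certs/servicekey.pem
cacert_file = /etc/grid-security/certificates/ca-bundle.crt

[replicator]
; client certificate presented by the replicator's outgoing connections
ssl_certificate_file = /data/certs/servicecert.pem
ssl_certificate_key_file = /data/certs/servicekey.pem
ssl_trusted_certificates_file = /etc/grid-security/certificates/ca-bundle.crt
verify_ssl_certificates = true
```

A CouchDB instance that never initiates replication (like the CMSWEB-based one) can omit the `replicator` client-certificate options entirely.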
@amaltaro , you are pointing to the RPM-based CMS version of CouchDB. My point is: does the stock CouchDB (from the CouchDB docker image) support x509? I do not know, but I can apply the configuration and see how it goes.
Ok, happy to report that x509 replication now works; see the update in my gist, section "Replication using x509". To get the proxy cert we need to automate its creation on testX clusters. @muhammadimranfarooqi could you please add the `cron-proxy` and `cron-token` cronjobs into the couchdb namespace on all testX clusters?
This will take care of proxy creation and also create the appropriate tokens.

@amaltaro , @goughes , @todor-ivanov Here is the corresponding commit to `local.ini` in the test branch, https://gitlab.cern.ch/cmsweb-k8s/services_config/-/commit/33020595ee4695a8243f0043954d22fcb62efe5a , which is required to make x509 replication work in testX clusters. Please check it and make the necessary changes.
I put a corresponding ticket (CMSKUBERNETES-183) to the CMSWEB group to install the corresponding crons for proxy/token access in the couchdb namespace. Meanwhile, I added the `Referer` header to the auth-proxy-server codebase via https://github.com/dmwm/auth-proxy-server/commit/64507d8e6a00e1c90995f4abed65b8d3494442d5 Once I get the new APS image and test it, it will be applied to all testX clusters.
@vkuznet I just updated the CMSKubernetes JIRA ticket. Given that we do not trigger replications from central CouchDB, I'd rather keep things as simple as we can and not have extra pods/crons performing actions that we don't really need. That means, we should revert the services_config change that you made.
Now that you tested it between central CouchDB instances, I see no reason why it wouldn't work between WMAgent and central CouchDB. This is still to be verified though.
Alan, this has nothing to do with central CouchDB. I concentrated on verifying whether replication works in the k8s setup with APS/XPS, which is what we'll end up with anyway. As such, the work will be completed once we have proxy/tokens in place for all namespaces, including couchdb, plus the proper changes to APS/XPS. There is no overhead or complication, since testX clusters are set up for dev-groups with everything they need to do the work; as such, proxy/tokens should be available in all namespaces. That said, if you do not need the ssl/replicator sections in local.ini in the test branch, I can easily remove them, but at least now I know that everything works in the k8s setup, and it allows the WMCore team to move forward with the APS/XPS migration for the FE. Whether you will use replication in testX clusters is totally up to the team/users of that cluster.
Exactly my point. Now that you managed to verify it, we no longer need any of the special tweaks in the dev clusters. Honestly speaking, starting a dev cluster from scratch is a bit annoying, because there are a few details that need to be considered:
If we don't need this functionality, we'd better not even require it to be dealt with when we are working in the dev environment. My previous comment explains why we do not need it.
I think you misunderstood what I was saying. When the CMSWEB operator creates testX clusters, we have default namespaces and a set of tasks which the CMSWEB scripts perform. These include:
Therefore, we do not require the dev-team to create the namespace or create the proxy/token. This will be part of the default setup of the clusters, so it will be less work on your side. What the WMCore team will do is only
Meanwhile, I fixed/tested APS/XPS/SPS to set up the `Referer` header, and requested to update them in the k8s clusters via this ticket: https://its.cern.ch/jira/browse/CMSKUBERNETES-184 Once we deploy the new version of APS/XPS/SPS, they will have everything the dev-team needs to do their work with token and x509 authentication, including CouchDB replication (if it is ever required).
**Impact of the bug**
CouchDB vs WMAgent

**Describe the bug**
With the changes provided in this PR: https://github.com/dmwm/WMCore/pull/10330
we explicitly set the CouchDB replication to go through the optional 8443 SSL port. Here [1] is a short version of the 3 replication documents in the CouchDB `_replicator` database. It turns out this configuration makes database replication between WMAgent and cmsweb{-testbed} not work. There are no errors whatsoever; it's just that no documents are transferred and the replication `progress` is always 0. While testing the same changes against my private VM (e.g. alancc7-cloud1), it works just fine. The reason is that we do not change the default port unless it matches the cmsweb regex.
**How to reproduce it**
Deploy an agent pointing to cmsweb-testbed and assign a test workflow without any input data. The GQ elements should get stuck in the `Negotiating` status.

**Expected behavior**
The ideal and expected behaviour is: database replications should work just fine AND we should be able to move that traffic to port 8443.
As a very short-term workaround, it's fine if we simply recover the database replication and let it go through port 443.
**Additional context and error message**
Data retrieved using the `_active_tasks` CouchDB API (careful with sensitive data):
[1] (user name replaced below)
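To show what "stuck" looks like in that API output, here is a minimal sketch that filters `_active_tasks`-shaped data for replication tasks whose `progress` stays at 0. The payload below is made up for illustration (placeholder doc id, hosts, and user name); it only mimics the shape of the real response.

```python
import json

# Made-up _active_tasks payload, shaped like the real CouchDB API output:
# one replication task stuck at progress 0 and one unrelated indexer task.
sample = json.loads("""
[
  {"type": "replication", "doc_id": "wmstats_rep",
   "source": "https://cmsweb-testbed.cern.ch/couchdb/wmstats/",
   "target": "http://USERNAME:*****@localhost:5984/wmstats/",
   "progress": 0, "docs_written": 0},
  {"type": "indexer", "database": "workqueue", "progress": 87}
]
""")

# The symptom described above: replications report no errors,
# but their progress never moves past 0.
stuck = [t for t in sample
         if t.get("type") == "replication" and t.get("progress", 0) == 0]

for task in stuck:
    print(f"stuck replication: {task['doc_id']} -> {task['target']}")
```

Against a live agent, the same check would run over the JSON returned by `GET /_active_tasks` on the local CouchDB instance.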