dmwm / WMCore

Core workflow management components for CMS.
Apache License 2.0

CouchDB replication not working on port 8443 #10452

Open amaltaro opened 3 years ago

amaltaro commented 3 years ago

Impact of the bug: CouchDB vs WMAgent

Describe the bug: With the changes provided in this PR: https://github.com/dmwm/WMCore/pull/10330

we explicitly set the CouchDB replication to go through the optional 8443 SSL port. Here [1] is a short version of the 3 replication documents in CouchDB _replicator database.

It turns out that this configuration breaks database replication between WMAgent and cmsweb{-testbed}. No errors are reported at all; no documents are transferred and the replication progress stays at 0.

While testing the same changes against my private VM (e.g. alancc7-cloud1), it works just fine. The reason is that we do not change the default port unless the URL matches the cmsweb regex.
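To illustrate the behaviour described above, here is a minimal sketch of a port rewrite that only applies to cmsweb hosts. The regex and the function name `use_port_8443` are assumptions made for this example; the actual pattern and code in WMCore are not shown in this issue.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Hypothetical pattern: the real regex used by WMCore is not quoted in this issue.
CMSWEB_HOST = re.compile(r"^cmsweb(-[a-z0-9]+)?\.cern\.ch$")


def use_port_8443(url):
    """Return url with port 8443 if the host looks like cmsweb, else unchanged."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    if not CMSWEB_HOST.match(host):
        # Private VMs and any other host keep their default port.
        return url
    # Rebuild the network location, preserving credentials if present.
    userinfo = ""
    if parts.username:
        userinfo = parts.username
        if parts.password:
            userinfo += ":" + parts.password
        userinfo += "@"
    netloc = "%s%s:8443" % (userinfo, host)
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))


print(use_port_8443("https://cmsweb-testbed.cern.ch/couchdb/wmstats"))
print(use_port_8443("https://alancc7-cloud1.cern.ch/couchdb/wmstats"))
```

With this sketch the cmsweb-testbed URL gains `:8443`, while the private VM URL is returned untouched, which matches why the problem did not show up on the VM.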

How to reproduce it: Deploy an agent pointing to cmsweb-testbed and assign a test workflow without any input data. The GQ elements should get stuck in Negotiating status.

Expected behavior: Database replication should work AND we should be able to move that traffic to port 8443.

As a very short-term workaround, it's fine if we simply recover the database replication and let it go through 443.

Additional context and error message: Data retrieved using the _active_tasks CouchDB API (careful with sensitive data):

[1] (user name replaced below)

$ curl http://USER:*****@localhost:5984/_active_tasks
[{u'checkpoint_interval': 600000,
  u'checkpointed_source_seq': 0,
  u'continuous': True,
  u'doc_id': u'10ca19d25569b94b5d26fb98b4002195',
  u'doc_write_failures': 0,
  u'docs_read': 0,
  u'docs_written': 0,
  u'missing_revisions_found': 0,
  u'pid': u'<0.511.0>',
  u'progress': 0,
  u'replication_id': u'4e1e89ad05fc11fab62a08741c5a1af8+continuous',
  u'revisions_checked': 0,
  u'source': u'http://USER:*****@localhost:5984/wmagent_summary/',
  u'source_seq': 1,
  u'started_on': 1618559813,
  u'target': u'https://cmsweb-testbed.cern.ch:8443/couchdb/wmstats/',
  u'type': u'replication',
  u'updated_on': 1618559813},
 {u'checkpoint_interval': 600000,
  u'checkpointed_source_seq': 0,
  u'continuous': True,
  u'doc_id': u'10ca19d25569b94b5d26fb98b4002371',
  u'doc_write_failures': 0,
  u'docs_read': 0,
  u'docs_written': 0,
  u'missing_revisions_found': 0,
  u'pid': u'<0.552.0>',
  u'progress': 0,
  u'replication_id': u'd8c742b9323ce61f4b156fa1537d634a+continuous',
  u'revisions_checked': 0,
  u'source': u'https://cmsweb-testbed.cern.ch:8443/couchdb/workqueue/',
  u'source_seq': 249477,
  u'started_on': 1618559814,
  u'target': u'http://USER:*****@localhost:5984/workqueue_inbox/',
  u'type': u'replication',
  u'updated_on': 1618559814},
 {u'checkpoint_interval': 600000,
  u'checkpointed_source_seq': 0,
  u'continuous': True,
  u'doc_id': u'10ca19d25569b94b5d26fb98b400316f',
  u'doc_write_failures': 0,
  u'docs_read': 0,
  u'docs_written': 0,
  u'missing_revisions_found': 0,
  u'pid': u'<0.588.0>',
  u'progress': 0,
  u'replication_id': u'f52f36c998cc9a83f9a88504dea37236+continuous',
  u'revisions_checked': 0,
  u'source': u'http://localhost:5984/workqueue_inbox/',
  u'source_seq': 1,
  u'started_on': 1618559814,
  u'target': u'https://cmsweb-testbed.cern.ch:8443/couchdb/workqueue/',
  u'type': u'replication',
  u'updated_on': 1618559814}]
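
For anyone who wants to spot this condition automatically, below is a rough sketch that flags replication tasks which have pending changes on the source but have not read or written any document. The function name `stalled_replications` is made up for this example, and it assumes integer sequence numbers as in the output above (newer CouchDB versions may report opaque string sequences, which this comparison does not handle). It is only a heuristic: a replication that has just started looks the same.

```python
def stalled_replications(tasks):
    """Return replication tasks that have pending changes but show no progress."""
    stalled = []
    for task in tasks:
        if task.get("type") != "replication":
            continue
        # The source has changes beyond the last checkpoint...
        behind = task.get("source_seq", 0) > task.get("checkpointed_source_seq", 0)
        # ...but nothing has been read or written so far.
        idle = task.get("docs_read", 0) == 0 and task.get("docs_written", 0) == 0
        if behind and idle:
            stalled.append(task)
    return stalled


# Trimmed-down entry modelled on the second task in the output above.
tasks = [
    {
        "type": "replication",
        "source": "https://cmsweb-testbed.cern.ch:8443/couchdb/workqueue/",
        "source_seq": 249477,
        "checkpointed_source_seq": 0,
        "docs_read": 0,
        "docs_written": 0,
    }
]
for task in stalled_replications(tasks):
    print("no progress:", task["source"])
```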
amaltaro commented 3 years ago

@todor-ivanov when you have time, can you please investigate this again? I have marked it as high prio, but feel free to discuss it with Valentin and Imran and lower the priority if possible.

I have just tested a workaround and I'm going to create a wmagent-branch-specific fix for that (because it is a rollback of the AgentStatusPoller changes).

todor-ivanov commented 3 years ago

Ok @amaltaro I will work on that ASAP.

It would be good to mention also the PR with the workaround you were talking about: https://github.com/dmwm/WMCore/pull/10453

FYI @vkuznet

amaltaro commented 3 years ago

@todor-ivanov Todor, after seeing this HN announcement from a few minutes ago: https://hypernews.cern.ch/HyperNews/CMS/get/webInterfaces/1781.html

we should either try to resolve this issue right away; or raise this point with Imran/Valentin in that thread. Can you please follow this up?

todor-ivanov commented 3 years ago

Hi @amaltaro Working on it!

amaltaro commented 3 years ago

I created a new label called Tokens, to be used for all the activity related to commissioning and support of tokens in WMCore services in general.

Also removed the BUG label from this issue, since it's actually not a bug, but a feature change.

amaltaro commented 3 years ago

IMO, this should be pushed to Q4 btw. However, if we do so, it becomes a blocker for the token-related activities that Valentin/Imran have been doing with CMSWEB...

vkuznet commented 2 years ago

Once again, my question is: is it worth all this effort to deal with auth within couch itself? As I proposed before, it would be much easier to move to an aps+couchdb setup, where aps handles the auth and couch does its job for replication and so on. I understand that you need to adopt the new CouchDB data format, but once that is done I don't really see any need to spend time on fixing the auth layer in couch and dealing with this kind of error, since I demonstrated that aps+couch will just work (and it will support both x509 and tokens). My suggestion still stands and I suggest that you evaluate its benefits.

vkuznet commented 1 year ago

@amaltaro, I reviewed this issue and came to the conclusion that replication on port 8443 will not work on newly deployed CouchDB images on k8s, by definition. The replication process is started by CouchDB itself, so in order to start replication it needs to authenticate with our FE. The replication document will use an https://cmsweb... URL, which requires authentication on the FE. On port 443 (APS) we can do that by supplying a token in the replication document, and CouchDB will pass it in the HTTP replication request. On port 8443 (XPS) we cannot do it, since stock CouchDB does not support the x509 authentication protocol. For more technical details on how to use the replication document with APS/XPS, please refer to this gist.
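
The gist is not reproduced in this thread. As a rough sketch of the token approach on port 443, a replication document can carry custom HTTP headers for an endpoint when that endpoint is given as an object with `url` and `headers` fields. The helper names, the bearer scheme, and the `<TOKEN>` placeholder below are assumptions made for this example, not the content of the gist.

```python
import json


def endpoint(url, token=None):
    """Return a replication endpoint; attach a bearer token when one is given."""
    if token is None:
        return url
    return {"url": url, "headers": {"Authorization": "Bearer " + token}}


def replication_doc(source, target, continuous=True):
    """Build the body of a document for the _replicator database."""
    return {"source": source, "target": target, "continuous": continuous}


doc = replication_doc(
    source="http://localhost:5984/wmagent_summary",
    # No explicit port: the token-based path goes through 443 (APS).
    target=endpoint("https://cmsweb-testbed.cern.ch/couchdb/wmstats", token="<TOKEN>"),
)
print(json.dumps(doc, indent=2))
```

The local source would normally need its own credentials as well; they are left out here to keep the example short.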

I do not know if anything can be done in this ticket. Please let me know your opinion on this.

amaltaro commented 1 year ago

@vkuznet in addition to my comment here: https://github.com/dmwm/WMCore/issues/11068#issuecomment-1253706596

I'd suggest that you have a look at a CouchDB configuration that actually does replication with x509 certificates; please look at this file (especially the ssl and replicator sections):

cmst1@vocms0263:/data/srv/wmagent/current $ vim config/couchdb/local.ini

CMSWEB-based CouchDB does not start any database replication, that's why it does not define those configuration sections.
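
The content of that local.ini is not quoted here. Purely as an illustration of what ssl and replicator sections can look like, the sketch below renders a fragment with Python's `configparser`. The option names follow the CouchDB configuration documentation as far as I recall them, and every path is a placeholder; treat both as assumptions and check them against the real file and the CouchDB version in use.

```python
import configparser
import io


def render_local_ini(cert, key, ca_file):
    """Render hypothetical [ssl] and [replicator] sections as ini text."""
    cfg = configparser.ConfigParser()
    # Certificate material for CouchDB's own SSL listener.
    cfg["ssl"] = {"cert_file": cert, "key_file": key, "cacert_file": ca_file}
    # Client certificate presented by the replicator to the remote end.
    cfg["replicator"] = {
        "cert_file": cert,
        "key_file": key,
        "verify_ssl_certificates": "true",
        "ssl_trusted_certificates_file": ca_file,
    }
    out = io.StringIO()
    cfg.write(out)
    return out.getvalue()


# Placeholder paths, not the ones used on the agent node.
print(render_local_ini("/path/to/cert.pem", "/path/to/key.pem", "/path/to/ca-bundle.pem"))
```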

vkuznet commented 1 year ago

@amaltaro, you are pointing to the RPM-based CMS version of CouchDB. My question is: does the stock CouchDB (from the CouchDB docker image) support x509? I do not know, but I can apply the configuration and see how it goes.

vkuznet commented 1 year ago

OK, happy to report that x509 replication now works; see the update in my gist, section "Replication using x509". To get a proxy cert, we need to automate its creation on testX clusters. @muhammadimranfarooqi, could you please do the following:

vkuznet commented 1 year ago

@amaltaro, @goughes, @todor-ivanov Here is the corresponding commit to local.ini in the test branch, https://gitlab.cern.ch/cmsweb-k8s/services_config/-/commit/33020595ee4695a8243f0043954d22fcb62efe5a, which is required to make x509 replication work in testX clusters. Please check it and make any necessary changes.

vkuznet commented 1 year ago

I opened the corresponding ticket (CMSKUBERNETES-183) with the CMSWEB group to install the corresponding crons for proxy/token access in the couchdb namespace. Meanwhile, I added Referer to the auth-proxy-server codebase via the following commit: https://github.com/dmwm/auth-proxy-server/commit/64507d8e6a00e1c90995f4abed65b8d3494442d5 Once I get the new aps image and test it, it will be applied to all testX clusters.

amaltaro commented 1 year ago

@vkuznet I just updated the CMSKubernetes JIRA ticket. Given that we do not trigger replications from central CouchDB, I'd rather keep things as simple as we can and not have extra pods/crons performing actions that we don't really need. That means, we should revert the services_config change that you made.

Now that you tested it between central CouchDB instances, I see no reason why it wouldn't work between WMAgent and central CouchDB. This is still to be verified though.

vkuznet commented 1 year ago

Alan, this has nothing to do with central CouchDB. I concentrated on verifying that replication works in the k8s setup with APS/XPS, which is what we'll end up with anyway. As such, the work will be completed once we have proxy/tokens in place for all namespaces, including couchdb, and the proper changes to APS/XPS. There is no overhead or complication, since testX clusters are set up for dev groups with everything they need to do their work; as such, proxy/tokens should be available in all namespaces. That said, if you do not need the ssl/replicator sections in local.ini in the test branch, I can easily remove them, but at least now I know that everything works in the k8s setup, and it allows the WMCore team to move forward with the APS/XPS migration for the FE. Whether you use replication in testX clusters is totally up to the team/users of that cluster.

amaltaro commented 1 year ago

Exactly my point. Now that you managed to verify it, we no longer need to have any of the special tweaks in the dev clusters. Honestly speaking, starting a dev cluster from the very beginning is a bit annoying because there are a few details that need to be considered:

If we don't need to use this functionality, we had better not even require it to be dealt with when we are working in the dev environment. My previous comment explains why we do not need it.

vkuznet commented 1 year ago

I think you misunderstand what I was saying. When the CMSWEB operator creates testX clusters, we have default namespaces and a set of tasks which the CMSWEB scripts perform. These include:

Therefore, we do not require the dev team to create the namespace or the proxy/token. This will be part of the default setup of the clusters. As such, it will be less work on your side. What the WMCore team will do is only

vkuznet commented 1 year ago

Meanwhile, I fixed/tested APS/XPS/SPS to set Referer, and requested that they be updated in the k8s clusters via this ticket: https://its.cern.ch/jira/browse/CMSKUBERNETES-184 Once we deploy the new version of APS/XPS/SPS, they will have everything the dev team needs to do their work with token and x509 authentication, including CouchDB replication (if it is ever required).