Test latest MongoDBAsAService cluster for preprod and prod

todor-ivanov commented 1 year ago

Impact of the new feature Any WMcore service that is using MongoDB

Is your feature request related to a problem? Please describe. After the MongoDBAsAService failure we had in production last month, CSWWEB Team approached the hard task of properly distributing all replicasets on different pods and assigning every pod to a separate node in the cluster for both prod and preprod clsuters. This way we have the best possible redundancy and we can benefit from MongoDB's internals for improving service availability.

Here is the Jira ticket which we use to discuss the cluster setup and operations https://its.cern.ch/jira/browse/CMSKUBERNETES-175

Describe the solution you'd like The current WMCore issue is to track our tests and to provide final results and conclusion from our side. Plus, the change in the services configuration for the new entry points.

Describe alternatives you've considered No alternatives. This is a MUST DO issue.

Additional context None

arooshap commented 1 year ago

Hi @todor-ivanov and @amaltaro,

do you have any updates regarding this ticket?

Thanks.

amaltaro commented 1 year ago

@arooshap Hi Aroosha, Todor is on vacation this week and he said he would resume these tests in the beginning of the next week.

todor-ivanov commented 1 year ago

Hi @amaltaro @arooshap,

I did resume those tests. I can see there are 3 pods running on those clusters (supposedly each of them serving a separate replica instance from the replicaset):

kubectl get pods -o wide

NAME                         READY   STATUS    RESTARTS   AGE   IP               NODE                                  NOMINATED NODE   READINESS GATES
mongodb-0-5c8b9f9c99-rm2vk   1/1     Running   3          18d   10.100.118.167   mongodb-preprod-ayi2iem2z5l3-node-0   <none>           <none>
mongodb-1-56bcfccfdb-jzxt7   1/1     Running   0          18d   10.100.123.226   mongodb-preprod-ayi2iem2z5l3-node-1   <none>           <none>
mongodb-2-7d7c4b49f9-cvggq   1/1     Running   0          18d   10.100.89.65     mongodb-preprod-ayi2iem2z5l3-node-2   <none>           <none>
mongosh                      1/1     Running   0          22d   10.100.118.159   mongodb-preprod-ayi2iem2z5l3-node-0   <none>           <none>

But I am currently facing some issues figuring out the correct connection string to these new clusters. Here is what I get:

(WMCore.venv3) [user@unit02 WMCore]$ ipython -i -- bin/adhoc-scripts/mongoInit.py  -c $WMCORE_SERVICE_CONFIG/reqmgr2ms-output/config-output.py
Python 3.6.8 (default, Nov 16 2020, 16:55:22) 
Type 'copyright', 'credits' or 'license' for more information
IPython 7.16.3 -- An enhanced Interactive Python. Type '?' for help.
2023-02-08 13:32:26,169:INFO:mongoInit:<module>(): Loading configFile: /data/tmp/WMCore.venv3/srv/current/config/reqmgr2ms-output/config-output.py
2023-02-08 13:32:26,174:INFO:mongoInit:<module>(): Connecting to MongoDB using the following mongoDBConfig:
{'connect': True,
 'create': False,
 'database': 'msOutputDBPreProd',
 'directConnection': False,
 'logger': <Logger __main__ (INFO)>,
 'mockMongoDB': False,
 'password': '****',
 'port': None,
 'replicaSet': 'cmsweb-test',
 'server': ['cms-mongo-preprod.cern.ch:32001',
            'cms-mongo-preprod.cern.ch:32002',
            'cms-mongo-preprod.cern.ch:32003'],
 'username': '****'}

....
ServerSelectionTimeoutError: No replica set members available for replica set name "cmsweb-test", Timeout: 30s, Topology Description: <TopologyDescription id: 63e3965a7f9ffb7bcd125b44, topology_type: ReplicaSetNoPrimary, servers: []>

Maybe @arooshap may take a look at the connection parameters I've used from above, and correct me at some place if I guessed them wrong.

arooshap commented 1 year ago

Hello @todor-ivanov, can you please try with'replicaSet': 'rsName' for MongoDB preprod. And for MongoDB prod, it is 'replicaSet': 'mongodb-prod'

todor-ivanov commented 1 year ago

Hi @arooshap, As widely discussed in another channel, I did test the current configuration and it works by changing the replicaset name, but we have some additional minor changes to make in this setup. lets repeat the test once we are ready with those too. I will post the results here as well. Thanks once again.

todor-ivanov commented 1 year ago

Hi @arooshap, There is still something not quite clear with this setup. Here is the output from my latest test:

(WMCore.venv3) [user@unit02 config]$ ipython -i -- $WMCORE_SERVICE_SRC/bin/adhoc-scripts/mongoInit.py  -c $WMCORE_SERVICE_CONFIG/reqmgr2ms-output/config-output.py
Python 3.6.8 (default, Nov 16 2020, 16:55:22) 
Type 'copyright', 'credits' or 'license' for more information
IPython 7.16.3 -- An enhanced Interactive Python. Type '?' for help.
2023-02-21 11:33:13,146:INFO:mongoInit:<module>(): Loading configFile: /data/tmp/WMCore.venv3/srv/current/config/reqmgr2ms-output/config-output.py
2023-02-21 11:33:13,150:INFO:mongoInit:<module>(): Connecting to MongoDB using the following mongoDBConfig:
{'connect': True,
 'create': False,
 'database': 'msOutputDBPreProd',
 'directConnection': False,
 'logger': <Logger __main__ (INFO)>,
 'mockMongoDB': False,
 'password': '****',
 'port': None,
 'replicaSet': 'rsName',
 'server': ['cms-mongo-preprod-node-0.cern.ch:32001',
            'cms-mongo-preprod-node-1.cern.ch:32002',
            'cms-mongo-preprod-node-2.cern.ch:32003'],
 'username': '****'}
2023-02-21 11:33:13,186:ERROR:MongoDB:_dbTest(): Missing MongoDB databases: msOutputDBPreProd
2023-02-21 11:33:13,186:ERROR:MongoDB:_dbConnect(): Could not connect to a missing MongoDB databases: msOutputDBPreProd

In [1]: 

In [1]: 

In [1]:  mongoClt.address
Out[1]: ('mongodb-preprod-ayi2iem2z5l3-node-0.cern.ch', 32001)

In [2]: mongoClt.nodes
Out[2]: 
frozenset({('mongodb-preprod-ayi2iem2z5l3-node-0.cern.ch', 32001),
           ('mongodb-preprod-ayi2iem2z5l3-node-0.cern.ch', 32002),
           ('mongodb-preprod-ayi2iem2z5l3-node-0.cern.ch', 32003)})

Now it seems the proper DNS aliases has been set. But it seems there is only one node (mongodb-preprod-ayi2iem2z5l3-node-0.cern.ch) participating in this replicaSet, even though our configuration says we do try to connect to the 3 of them:

data.mongoDBServer = ['cms-mongo-preprod-node-0.cern.ch:32001','cms-mongo-preprod-node-1.cern.ch:32002', 'cms-mongo-preprod-node-2.cern.ch:32003']

What I'd expect to see in this setup would be 3 different nodes participating in the replicaset. i.e:

...
In [2]: mongoClt.nodes
Out[2]: 
frozenset({('mongodb-preprod-ayi2iem2z5l3-node-0.cern.ch', 32001),
           ('mongodb-preprod-ayi2iem2z5l3-node-1.cern.ch', 32002),
           ('mongodb-preprod-ayi2iem2z5l3-node-2.cern.ch', 32003)})

FYI @amaltaro

arooshap commented 1 year ago

Hi @todor-ivanov, I have updated the rsName to mongodb-preprod, as you requested. It required reconfiguring the whole setup from scratch, but I kept the credentials the same.

Coming to the current issue, I am not very familiar with how the MongoDB configuration takes place, but it might be due to the fact that only the primary replica set is getting selected out of the three (because only the primary replicaset receives all read-write operation). We can test this by purposefully making the primary node fail somehow.

What do you think?

todor-ivanov commented 1 year ago

Hi @arooshap, This:

but it might be due to the fact that only the primary replica set is getting selected out of the three (because only the primary replicaset receives all read-write operation)

does not sound to be the case here. And here is what this replicaset is built from:

In [16]: mongoClt.topology_description.readable_servers
Out[16]: 
[<ServerDescription ('mongodb-prod-l2toodvtflw2-node-0.cern.ch', 32003) server_type: RSSecondary, rtt: 0.0016034678318615225>,
 <ServerDescription ('mongodb-prod-l2toodvtflw2-node-0.cern.ch', 32001) server_type: RSPrimary, rtt: 0.00900564936684487>,
 <ServerDescription ('mongodb-prod-l2toodvtflw2-node-0.cern.ch', 32002) server_type: RSSecondary, rtt: 0.001402569168591702>]

It looks like... again the K8 over flexibility is playing a role here - one pod exposing one and the same service 3 times based on port. Which would not do good for service redundancy in this case. I thought we have overcome this already.

About:

We can test this by purposefully making the primary node fail somehow.

Trying to forcefully reconnect to a second replicaset member does not sound to be the best testing path here. I am actually afraid of ending on a completely independent replicaset instance here instead of another replicaset member (because to me this replicaset does not include any member outside this pod: mongodb-prod-l2toodvtflw2-node-0.cern.ch. Such configuration in the long term would cause data integrity issues.

I'd suggest to completely stop the NodePort Service at K8 instead, and stop redirecting those 3200* ports and let all replicaset members participate with the default MongoDB port of 27017. This would be the only way to be sure that not only the K8 cluster is configured to have 3 different instances of MongoDB, but also the replicaset is configured properly and all those 3 pods do participate in one and the same replicaset properly.

arooshap commented 1 year ago

@todor-ivanov I will test this setup, and keep you posted about it.

todor-ivanov commented 1 year ago

thanks @arooshap

vkuznet commented 1 year ago

@todor-ivanov , you need to understand the k8s constrains. The NodePort is a mechanism to expose a service on a specific port, but k8s only allow ports in 30000-32000 range. Therefore, if you want to use direct port I doubt it is possible on k8s via NodePort. The NodePort mechanism will map the chosen NodePort to service port, they may be identical but it will still be a mapping and internal routing by k8s.

I already lost in this entire discussion, but we only have few options:

use ingress controller to specific node:port in k8s namespace, but for that we need explicit end-point, e.g.
- /replica1 will point to pod:port of replica one
- /replica2 will point to pod:port of replica two
or, we should define redirect rules in FE
or, we can use NodePort which will provide internal redirect.

Everything depends on how MongDB will be setup, if you setup it via operator, then there is no much leverage here is operator defines its setup. If you wan to set it up as individual pods, then we'll need individual manifest files where we'll specify --replica option somehow. But again, the access to them will be defined by rules I outlined above.

Bottom line, please define which rule you want to follow, if you want explicit host:port access then I think we should still use NodePort approach but we need to change how we deploy MongoDB to have three separate pods running on separate nodes (which can be accommodated via label) and use specific flags at start-up to define replica behavior.

todor-ivanov commented 1 year ago

Hi @vkuznet we've been through this with @arooshap and @muhammadimranfarooqi already. The communication was long and was happening through Mattermost. The path has been chosen already and Arosha did set labels for those pods so that each of those 3 are always assigned to a separate node (and IIUC the node should always be the same).

The last bit that we had to solve were the long pod names for which she did set 3 separate DNS aliases (which was the only purpose of this last test). But I can see now the replicaset itself is not properly configured. It seems to me though, it was properly set in the past - I think I've seen those three pods participating in the replicaset properly during one of those tests, but somehow this configuration got lost on the way maybe.

Actually what we need to have as an end result is nothing that K8 cannot achieve. Here are some guidelines already used by Panos as well for the initial setup: [1].

What I was advocating for in my previous message, was to completely get rid of the NodePort service here, it is obsolete now because we have already attached one replica per pod and assigned one pod per node in th k8 cluster. Same for FE rules - We need not ingress policies since we should not use any HTTP based redirects etc. ...

Instead, what we'd need for exposing the service, to me seems to be, just using the node's external ip directly, and since we have already a pod per node statically allocated (thanks @arooshap for solving that) that should suffice. According to K8 documentation this is absolutely possible [2].

[1] https://medium.com/swlh/how-to-setup-mongodb-replica-set-on-kubernetes-in-minutes-5c1e7fd5b5f3

[2] https://kubernetes.io/docs/concepts/services-networking/service/#external-ips

todor-ivanov commented 1 year ago

hi @vkuznet ,

Just wanted to ask something yesterday but I forgot:

k8s only allow ports in 30000-32000 range

Is this a constraint enforced by Kubernetes itself, or is it something which stems from our setup (the way how we have chosen to distribute service ports in CMS)?

vkuznet commented 1 year ago

It is k8s constrain and we can't change it.

Sent from Proton Mail mobile

-------- Original Message -------- On Feb 22, 2023, 7:43 AM, Todor Ivanov wrote:

hi @.***(https://github.com/vkuznet) ,

Just wanted to ask something yesterday but I forgot:

k8s only allow ports in 30000-32000 range

Is this a constraint enforced by Kubernetes itself, or is it something which stems from our setup (the way how we have chosen to distribute service ports in CMS)?

— Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you were mentioned.Message ID: @.***>

arooshap commented 1 year ago

(This is just a copy paste of the discussion that we had on mattermost. )

@todor-ivanov, yes I have had the time to look at your comments from yesterday. Our rs.config() looked like this:

mongodb-preprod:PRIMARY> rs.config()
{
        "_id" : "mongodb-preprod",
        "version" : 1,
        "term" : 1,
        "members" : [
                {
                        "_id" : 0,
                        "host" : "mongodb-preprod-ayi2iem2z5l3-node-0.cern.ch:32001",
                        "arbiterOnly" : false,
                        "buildIndexes" : true,
                        "hidden" : false,
                        "priority" : 1,
                        "tags" : {

                        },
                        "secondaryDelaySecs" : NumberLong(0),
                        "votes" : 1
                },
                {
                        "_id" : 1,
                        "host" : "mongodb-preprod-ayi2iem2z5l3-node-0.cern.ch:32002",
                        "arbiterOnly" : false,
                        "buildIndexes" : true,
                        "hidden" : false,
                        "priority" : 0.9,
                        "tags" : {

                        },
                        "secondaryDelaySecs" : NumberLong(0),
                        "votes" : 1
                },
                {
                        "_id" : 2,
                        "host" : "mongodb-preprod-ayi2iem2z5l3-node-0.cern.ch:32003",
                        "arbiterOnly" : false,
                        "buildIndexes" : true,
                        "hidden" : false,
                        "priority" : 0.5,
                        "tags" : {

                        },
                        "secondaryDelaySecs" : NumberLong(0),
                        "votes" : 1
                }
        ],
        "protocolVersion" : NumberLong(1),
        "writeConcernMajorityJournalDefault" : true,
        "settings" : {
                "chainingAllowed" : true,
                "heartbeatIntervalMillis" : 2000,
                "heartbeatTimeoutSecs" : 10,
                "electionTimeoutMillis" : 10000,
                "catchUpTimeoutMillis" : -1,
                "catchUpTakeoverDelayMillis" : 30000,
                "getLastErrorModes" : {

                },
                "getLastErrorDefaults" : {
                        "w" : 1,
                        "wtimeout" : 0
                },
                "replicaSetId" : ObjectId("63f4a37aac99ab9f0fd66480")
        }
}

From the article you provided, they expose the service to only one port, which is 27017. I can try to replicate the setup and let you know if I am able to replicate it. And this also highlights another issue, which is the fact that only node-0 is being reflected in the hostname. This implies that we will have to make adjustments in our helm chart to accommodate the other nodes as well. At the moment, I have changed this configuration using rs.reconfig to accommodate different node names.

You had multiple suggestions in this regard, so I think it is a better idea to outline them in action steps that we can take.

arooshap commented 1 year ago

@todor-ivanov I would also like to comment on some of the points that you highlighted:

is there a way for me to take a look how this whole configuration is created
last time I realised I cannot look at any of the configuration files for this cluster

We are not applying any additional configurations at the runtime. That link that I shared only had the KUBECONFIG and passwords. Here is the link to the docker image that is currently deployed: https://registry.cern.ch/harbor/projects/1771/repositories/cmsmongo/artifacts-tab/artifacts/sha256:092b0d39fc382c2a57d2b4c416042a0118d1c82777beaf0cdf475535125b2662. So the configuration is in fact getting applied at the deployment time, not separately.

we should try to create /etc/mongod.conf file in advance as advised here: https://www.mongodb.com/docs/manual/tutorial/deploy-replica-set/#configuration

When we deploy the helm chart, it is supposed to do the following:

Schedule Pods to Node
Start Persistent Volume and attach Pod to Persistent Volume via Persistent Volume Claim
Pull image from DockerHub
Start Containers
Deploy prometheus, prometheus-adapter, kube-eagle and mongodb-exporter tools for monitoring
Start Mongod and initiate ReplicaSet
Create usersAdmin and clusterAdmin users

And the command that you mentioned gets applied for each pod, like this:

root@mongodb-0-84d6cfb866-zhmrr:~# cat startup-mongo-0.sh
#!/bin/bash

mkdir -p /data/db/rs-0
export POD_IP_ADDRESS=$(hostname -i)

/root/reconfig-mongo-rs.sh &

mongod --replSet $RS_NAME --port 27017 --bind_ip localhost,$POD_IP_ADDRESS --dbpath /data/db/rs-0 --oplogSize 128 --keyFile /etc/secrets/mongokeyfile
root@mongodb-0-84d6cfb866-zhmrr:~# cat startup-mongo-1.sh
#!/bin/bash

mkdir -p /data/db/rs-1
# /root/initialize-users.sh &
export POD_IP_ADDRESS=$(hostname -i)
mongod --replSet $RS_NAME --port 27017 --bind_ip localhost,$POD_IP_ADDRESS --dbpath /data/db/rs-1 --oplogSize 128 --keyFile /etc/secrets/mongokeyfile

root@mongodb-0-84d6cfb866-zhmrr:~# cat startup-mongo-2.sh
#!/bin/bash

mkdir -p /data/db/rs-2
# /root/initialize-users.sh &
export POD_IP_ADDRESS=$(hostname -i)
mongod --replSet $RS_NAME --port 27017 --bind_ip localhost,$POD_IP_ADDRESS --dbpath /data/db/rs-2 --oplogSize 128 --keyFile /etc/secrets/mongokeyfile
root@mongodb-0-84d6cfb866-zhmrr:~#

and we should put the proper replicaset configuration there taking into account the static configuration for the machines participating in this cluster (names ports etc.) (preferably only on 27017 as suggested on the document itself then in the pods we should follow the initialisation procedure as described in this bullet: https://www.mongodb.com/docs/manual/tutorial/deploy-replica-set/#procedure

Apart from the port number, correct me if I am wrong, but this is exactly what happens once the pods are deployed in the MongoDB primary instance.

so the run script you were pointing to needs to do only rs.initiate

I am not sure exactly what you mean by this, but when the rs.initiate() command gets executed, it doesn't have a static configuration. It fetches the values given at the deployment time, no?

and pay attention to the following: https://www.mongodb.com/docs/manual/tutorial/deploy-replica-set/#initiate-the-replica-set

there is a note stating that the replicaset init should happen from one and only one replica member

(otherwise I beleive we risk to set 2 competing primary members in the replicaset - have not checked the detailes yet.. but won't be surprised this to be the case) which means the run script should check if this has not been already done by any other of the pods

Yes, this restriction is there, and only the primary replicaset can actually do that. If the replicaset is not primary, it simply says that I am not the master, therefore, I cannot execute this script.

arooshap commented 1 year ago

I am pinging @Panos512 to share his insight in this regard as well.

todor-ivanov commented 1 year ago

Hi @arooshap In our earlier conversation you did not provide those startup-mongo-*.sh scripts, but rather the following snippet:

#!/bin/bash

echo "Executing initialize-mongo-rs.sh"

mongo --eval "mongodb = ['$NODE_HOSTNAME:32001', '$NODE_HOSTNAME:32002', '$NODE_HOSTNAME:32003'], rsname = '$RS_NAME'" --shell << EOL
cfg = {
        _id: rsname,
        members:
            [
                {_id : 0, host : mongodb[0], priority : 1},
                {_id : 1, host : mongodb[1], priority : 0.9},
                {_id : 2, host : mongodb[2], priority : 0.5}
            ]
        }
rs.initiate(cfg)
EOL

Which leads exactly to the wrong replica set configuration you pasted in your comment here https://github.com/dmwm/WMCore/issues/11450#issuecomment-1440088365 And which is exactly the source of the problem.

During our conversation, you referred to this script as the initialize-mongo-rs.sh at each pod. So my understanding here is that this is the script which, once mongod has been started at the pod, constructs and initiates the replicaset. I have no way to check what is the run sequence of those scripts...., but if that is the case sot then what happens at runtime is:

`startup-mongo-*.sh`

the pod starts mongod
specifies the replica set name each instance/pod to be associated with
binds this to port 27017 on both localhost and the pod's ip address (as expected it is the internal ip not the external one associated with the node)

initialize-mongo-rs.sh

The pod assembles a replica set by running the the mongo command locally at the pod and trying to refer to the replica set members through the node hostname where the pods are supposed to be running.

But distributing this initialize-mongo-rs.sh script in the form as written here does something wrong - it actually assembles a replicaset out of 3 members all of them pointing to one and the same node but exposed 3 times on 3 different ports (notice the variable $NODE_HOSTNAME which refers to the node where only the current pod is running). So regardless of how one have constructed the Kubernetes cluster (like assigned pods to nodes .... exposed services to ports etc. ...) you end up with 3 separate instance of 3 completely independent replica sets, each of them constructed out of 3 members, each member running on one and the same node and exposed 3 times on 3 different ports at the same node. And this is exactly what we observe at the end of the day when you run the command rs.config(). This I believe is the error here.

Apart from the port number, correct me if I am wrong, but this is exactly what happens once the pods are deployed in the MongoDB primary instance.

I do not know what is happening, I can observe only the final result. And from this, what I can conclude/guess is the sequence I just wrote above.

If the replicaset is not primary, it simply says that I am not the master, therefore, I cannot execute this script.

Where is that happening?

todor-ivanov commented 1 year ago

thanks @vkuznet

It is k8s constrain and we can't change it.

Well ok, even if we decide to continue using NodePort (in case tying the cluster configuration to the pods' ip addresses is considered too bad), the proper action would be to use only one port to map to from all pods (e.g. 32017), that would be an equivalent of just exposing port 27017 from every instance.

todor-ivanov commented 1 year ago

Hi @arooshap ,

let me repeat portions of our later observations here, just for the log. It was confirmed this https://github.com/dmwm/WMCore/issues/11450#issuecomment-1440307941 is indeed what is happening, and in addition, we now know the initialize-mongo.sh script in the current setup is to be run manually. So most of our findings and observations from the past also show that such a replica set could not survive a pod restart. So we need to take actions on making this setup permanent. Here are the things I was suggesting in our MM conversation to be checked and fixed: Basically 3 things are left to be done in my opinion:

Move all pods to expose the service on one and the same port (could be 32017 for convinience)
Move the replica set and (possibly users) configuration in a /etc/mongo.conf file, which needs to be used when mongod is started - this file needs to be properly uploaded and maintained through the config repositories we have
Fixing the script that creates the replicaset to do only rs.initiate(cfg) and we eventually get rid of any step that needs to be run manually for those clusters' deployment

There are few additional/preventive steps that we should also consider:

Update of the documentation which lists all steps for deploying such a cluster [here]
We MUST double check if every pod (instance of mongod/replica member) mounts and uses it's won volume for the database replica - because we do need to have 3 independent data copies for this setup and we must force pods to use its own copy and prevent the 3 of them pointing to the same volume at any moment in time - that would be a disaster.

And while looking on how to achieve such a permanent setup, I stumbled on a very well prepared document/blog post from MongoDB users on how MongoDB should be run as a Service on a Kubernetes cluster: [1] Need to mention, it overlaps at 99.9% on what we try to achieve here.

[1] https://www.mongodb.com/blog/post/running-mongodb-as-a-microservice-with-docker-and-kubernetes

p.s. Just to confirm here - once Aroosha manually set all the corrections to the scripts and variables we did manage to achieve the desired result of building the replica set out of 3 members each living on a separate pod, attached to a separate node, with it's own public ip address:

2023-02-24 16:22:02,632:INFO:mongoInit:<module>(): Connecting to MongoDB using the following mongoDBConfig:
{'connect': True,
 'create': False,
 'database': 'msOutputDBPreProd',
 'directConnection': False,
 'logger': <Logger __main__ (INFO)>,
 'mockMongoDB': False,
 'password': '****',
 'port': None,
 'replicaSet': 'mongodb-preprod',
 'server': ['cms-mongo-preprod-node-0.cern.ch:32001',
            'cms-mongo-preprod-node-1.cern.ch:32002',
            'cms-mongo-preprod-node-2.cern.ch:32003'],
 'username': '****'}
2023-02-24 16:22:02,669:ERROR:MongoDB:_dbTest(): Missing MongoDB databases: msOutputDBPreProd
2023-02-24 16:22:02,669:ERROR:MongoDB:_dbConnect(): Could not connect to a missing MongoDB databases: msOutputDBPreProd

In [1]: mongoClt.nodes
Out[1]: 
frozenset({('mongodb-preprod-ayi2iem2z5l3-node-0.cern.ch', 32001),
           ('mongodb-preprod-ayi2iem2z5l3-node-1.cern.ch', 32002),
           ('mongodb-preprod-ayi2iem2z5l3-node-2.cern.ch', 32003)})

So what is left for making things permanent is indeed the plan outlined above.

arooshap commented 1 year ago

Hi @todor-ivanov, thanks for such a detailed review. I will try to make changes to the deployment with the points that you mentioned.

Also, following the blog post that you mentioned above, I tested the whole setup in the new mongodb-prod cluster. If you want to have a look at how it is configured, you can have a look at the helm chart here, and you can also see the deployed setup(in the new mongodb-prod) cluster. I still need to make a few changes to have the replica set fully configured.

The deployment currently looks like this:

[apervaiz@lxplus8s16 docker]$ k get all
NAME                  READY   STATUS    RESTARTS   AGE
pod/mongo-rc0-mb7qg   1/1     Running   0          71m
pod/mongo-rc1-lpjdr   1/1     Running   0          71m
pod/mongo-rc2-vhjkn   1/1     Running   0          71m
pod/mongosh           1/1     Running   0          38d

NAME                              DESIRED   CURRENT   READY   AGE
replicationcontroller/mongo-rc0   1         1         1       71m
replicationcontroller/mongo-rc1   1         1         1       71m
replicationcontroller/mongo-rc2   1         1         1       71m

NAME                        TYPE           CLUSTER-IP       EXTERNAL-IP      PORT(S)           AGE
service/headless-svc        ClusterIP      None             <none>           27017/TCP         71m
service/kubernetes          ClusterIP      10.254.0.1       <none>           443/TCP           45d
service/mongodb-0-service   LoadBalancer   10.254.159.138   137.138.226.49   27017:30062/TCP   71m
service/mongodb-1-service   LoadBalancer   10.254.194.134   137.138.226.73   27017:32526/TCP   71m
service/mongodb-2-service   LoadBalancer   10.254.189.116   137.138.226.72   27017:30870/TCP   71m

arooshap commented 1 year ago

Hi @todor-ivanov, I wanted to test your proposed setup before making any comments, but unfortunately, we cannot have multiple pods exposed using the same NodePort. I replicated the whole setup that you suggested, and when I installed the helm chart, I get this error:

Error: INSTALLATION FAILED: failed to create resource: Service "mongodb-1-service" is invalid: spec.ports[0].nodePort: Invalid value: 32107: provided port is already allocated

The only way to go about your request is to have a load balancer for every service -- a setup just like the one you shared.

What are your suggestions now?

Panos512 commented 1 year ago

Hi all, Sorry, it took me a while to catch up with the thread :)

I understand the problem with replicas ending up at the same node has been fixed by changing the initialize-mongo-rs.sh script to be aware of the different nodes.

How this works is that by default each pod on either creation or restart runs the corresponding startup-mongo-* script. Pod 0 runs startup-mongo-0, pod1 startup-mongo-1 etc.

This looks like that:

#!/bin/bash

mkdir -p /data/db/rs-0
export POD_IP_ADDRESS=$(ip -o -4 addr list eth0 | awk '{print $4}' | cut -d/ -f1)
/root/reconfig-mongo-rs.sh &
mongod --replSet $RS_NAME --port 27017 --bind_ip localhost,$POD_IP_ADDRESS --dbpath /data/db/rs-0 --oplogSize 128 --keyFile /etc/secrets/mongokeyfile

It starts the mongo service in the pod and runs the reconfig-mongo-rs.sh script. https://github.com/Panos512/MongoDB-ReplicaSet-on-K8s/blob/master/source/startup-script-mongo/reconfig-mongo-rs.sh This script starts, waiting for mongo to be setup in the pod and then runs the initialize-mongo-rs.sh script which is responsible for configuring mongo in the node.

https://github.com/Panos512/MongoDB-ReplicaSet-on-K8s/blob/master/source/startup-script-mongo/initialize-mongo-rs.sh

#!/bin/bash

echo "Executing initialize-mongo-rs.sh"

mongo --eval "mongodb = ['$NODE_HOSTNAME:32001', '$NODE_HOSTNAME:32002', '$NODE_HOSTNAME:32003'], rsname = '$RS_NAME'" --shell << EOL
cfg = {
        _id: rsname,
        members:
            [
                {_id : 0, host : mongodb[0], priority : 1},
                {_id : 1, host : mongodb[1], priority : 0.9},
                {_id : 2, host : mongodb[2], priority : 0.5}
            ]
        }
rs.initiate(cfg)
EOL

/root/initialize-users.sh &

This script is passing the config to the node and then runs the initialize-users script: https://github.com/Panos512/MongoDB-ReplicaSet-on-K8s/blob/master/source/startup-script-mongo/initialize-users.sh which creates the users if they are not there.

In principle all this should be automatic, we shouldn't run those manually, unless some pod is stuck for some reason. If a pod restarts it should again run mongo and load the default configuration automtically. We have seen this not happening a few times in the past, so I agree we should investigate this a bit further.

I hope this cleans up the chain of commands.

@todor-ivanov regarding your point about making sure that each pod writes to a different storage area. This is foreseen in the helm chart. Each pod has it's own unique PVC (Persistent Volume ClaimO which means that each pod ends up mounting it's own volume. We should be fine in this regard :)

Panos512 commented 1 year ago

Sorry I submitted my answer mid-writing it :D

In terms of connectivity: For NodePorts unfortunately the way this works is that it opens a port in all the nodes for communication with a specific pod. This means that we can't use the same port for multiple pods.

Isn't it OK to expose 3 ports, as we do up to now? 32001 for pod-1 32002 for pod-2 and 32003 for pod-3. We can then put a loadbalancer in front of the three nodes, with a fixed endpoint (let's say cms-mongo.cern.ch) and then we should have the nice connection string of nodes=['cms-mongo.cern.ch:32001', 'cms-mongo.cern.ch:32002', 'cms-mongo.cern.ch:32003']

This should point to the 3 replica endpoints, behind the scenes. In case we add a fourth node, or one dies we are still covered. what do you think?

todor-ivanov commented 1 year ago

hi @Panos512 ,

Getting rid of the Loadbalancer was exactly the starting point of re-configuring these clusters.

todor-ivanov commented 1 year ago

Hi Aroosha,

During the O&C week we quickly talked with Panos, and I asked him directly about how feasible it is to use the option of direct External IP redirection for exposing the service from inside every pod [1], provided we are binding all the 3 different pods to 3 different nodes, hence external ip addresses. He promised to have a quick chat with CERN IT experts on the topic and then we decide if we should take this path, or stick to the NodePort configuration.

In the meantime could you please make a PR to https://github.com/dmwm/CMSKubernetes with the latest changes related to this cluster and link them in the current issue. Do not bother if they are not in final state, we can work out the details. But we do need to know what is left to be worked on, because otherwise we cannot see where the current cluster configuration stands. What I am mostly interested to see is:

We do have the pods to nodes mapping statically configured and preserved in the configuration repositories somewhere
We have the issue with the script for creating all replicaset members inside the same node fixed
We do preserve all the cluster configurations statically in our repositories and we rely on no extra/manual operator's actions during cluster initiation.

We really do hope, we can solve this issue in the next day or two.

Panos512 commented 1 year ago

Hi @todor-ivanov thanks for clearing out the requirements for me :) I had a chat with Spyros from the k8s team today and I think we have a pretty good solution.

We could use externalIPs as we initially discussed but Spyros suggested an even more basic and straight forward solution:

We can use the host network to access the pods directly without the usage of k8s services. This way we will have direct access to the pods through the 3 nodes, without k8s handling any of the routing.

We should have each of the 3 pods forced to run in a different node and configure them to use the host network:

hostNetwork: true
dnsPolicy: ClusterFirstWithHostNet

We can then attach an alias to each node so that it has a constant name and use a connection string like: cms-mongo-replica-1.cern.ch:27017,cms-mongo-replica-2.cern.ch:27017,cms-mongo-replica-3.cern.ch:27017

If a node dies we just have to provision a new one and attach the alias to it, client's won't suffer since mongo will keep working with two nodes handling the routing/election as it should do.

How does this sound?

arooshap commented 1 year ago

Thanks @Panos512. This is indeed a really good alternative. If @todor-ivanov agrees with this setup, I will try to implement it today or tomorrow.

todor-ivanov commented 1 year ago

Hi @Panos512 this is actually exactly what is needed. Sounds perfect! Thanks a lot!

@arooshap I see a new PR created for updating the K8 configuration at: https://github.com/dmwm/CMSKubernetes/pull/1336#issuecomment-1479380189 Fixing the issue with replicaset recreation on the same node. Thanks for that! The PR does not seem to include the configuration changes suggested by Panos and adressingo only the proper distribution/assignment of pods to nodes (which is indeed great). Are you you planning to add the rest in a separate Pr or to use the same one? And did you deploy the so suggested configuration in testbed so we can try it?

arooshap commented 1 year ago

Hi @todor-ivanov, you can see these changes in the mongodb-preprod cluster that we have.

For the changes that Panos suggested, I will open a separate PR tomorrow if that is okay for you after I am done with the setup.

amaltaro commented 1 year ago

Hi everyone, we had many discussions on this over the past few days and I wanted to write down what the requirements for this service are:

distribute replica members in different nodes, thus ensuring that replicas are not shared by the same kubernetes minion;
use different ceph volumes in each replica member, thus ensuring X replicas of the data
upon restart/reboot/redeployment, preserve the configuration of the replica set. In other words, don't recreate the replica set every time a new POD is started.
client requests (client talking to the MongoDB cluster) should always talk to the primary replica, otherwise write operations to the secondary replica would fail and we would have to setup a retry mechanism on the client side. (regardless whether it uses load balance, nodePort, etc)

Sorry if these requirements are already known by everyone but me, I thought it would be worth it writing it here to set our expectations for this MongoDB cluster.

Please do let me know if I got anything wrong though. Are we confident that these 4 requirements can be addressed in the cluster that we are currently working on?

Panos512 commented 1 year ago

Thanks @amaltaro it's very useful to put all of them in one place. I'm pretty confident all the requirements will be met once we implement what we discussed above. I'll let Aroosha confirm.

Just a side question, at some point I was also working on backups of the volumes, I had something that can be used in place but it needed some small improvements. Are backups also a requirement? I could dig up what I've done back then if neeede :)

arooshap commented 1 year ago

Hi @Panos512, yes, all the above requirements will be fulfilled with the new configuration, but I am not sure about the 3rd port which Alan mentioned. That is, upon redeployment, we can ensure that the configuration will be preserved or not. Because when we uninstall a helm chart, it wipes out the whole deployment. @amaltaro when you say redeployment, do you mean the whole setup or just a single pod?

For the backup, we can also leverage the use of velero. I tested velero to take a backup of the persistent volumes in the mongodb-test cluster and upload the data to AWS s3 bucket, but Alan said that it is not needed since we already have 3 copies of data in the 3 different instances.

Panos512 commented 1 year ago

I think we should cover both cases:

In case the nodes don't change it should be fine, pod dies, a new pod comes up and the old storage is attached to it. The pod should then just connect to the replica set, using the configuring on the permanent storage. No need to reconfigure things.
For cluster change, we might need to manually do a rs.reconfig to pass the new hostnames.

I think we should do an exercise with both scenarios to make sure things work as expected :)

todor-ivanov commented 1 year ago

OK, just an update from the latest tests.

Still on the configuration using NodePort service to expose the database, but so far we may prefer to move faster here, rather than experimenting with yet another setup. The details on how we reconfigure on runtime I've left in the parallel PR from @arooshap at https://github.com/dmwm/CMSKubernetes/pull/1336#issuecomment-1481162129 But as said there we can live with the current setup.

I did test few cases of node failures as explained bellow:

Generic connection to the database but by completely (intentionally) shuffling the ports to nodes in the configuration string:


$ ipython -i -- $WMCORE_SERVICE_SRC/bin/adhoc-scripts/mongoInit.py  -c $WMCORE_SERVICE_CONFIG/reqmgr2ms-output/config-output.py
Python 3.6.8 (default, Nov 16 2020, 16:55:22) 
Type 'copyright', 'credits' or 'license' for more information
IPython 7.16.3 -- An enhanced Interactive Python. Type '?' for help.
%reload_ext autoreload
%autoreload 2
2023-03-23 15:11:00,619:INFO:mongoInit:<module>(): Loading configFile: /data/tmp/WMCore.venv3/srv/current/config/reqmgr2ms-output/config-output.py
2023-03-23 15:11:00,623:INFO:mongoInit:<module>(): Connecting to MongoDB using the following mongoDBConfig:
{'connect': True,
'create': False,
'database': 'msOutputDBPreProd',
'directConnection': False,
'logger': <Logger __main__ (INFO)>,
'mockMongoDB': False,
'password': '****',
'port': None,
'replicaSet': 'mongodb-preprod',
'server': ['cms-mongo-preprod-node-0.cern.ch:32002',
        'cms-mongo-preprod-node-1.cern.ch:32003',
        'cms-mongo-preprod-node-2.cern.ch:32001'],
'username': '****'}
2023-03-23 15:11:00,662:ERROR:MongoDB:_dbTest(): Missing MongoDB databases: msOutputDBPreProd
2023-03-23 15:11:00,662:ERROR:MongoDB:_dbConnect(): Could not connect to a missing MongoDB databases: msOutputDBPreProd

In [1]: mongoClt.address Out[1]: ('cms-mongo-preprod-node-0.cern.ch', 32001)

In [2]: mongoClt.topology_description.readable_servers Out[2]: [<ServerDescription ('cms-mongo-preprod-node-2.cern.ch', 32003) server_type: RSSecondary, rtt: 0.0010132044553756714>, <ServerDescription ('cms-mongo-preprod-node-0.cern.ch', 32001) server_type: RSPrimary, rtt: 0.0011596716940402985>, <ServerDescription ('cms-mongo-preprod-node-1.cern.ch', 32002) server_type: RSSecondary, rtt: 0.0011308901011943817>]


No matter how hard I was trying to fail the connection K8 always managed to redirect me to a node to connect to. And scanning the nodes from outside it is obvious the 3 ports are opened at the 3 hosts used for this setup:

$ nmap -n -p32001,32002,32003 cms-mongo-preprod-node-0.cern.ch Starting Nmap 7.70 ( https://nmap.org ) at 2023-03-23 16:20 CET Nmap scan report for cms-mongo-preprod-node-0.cern.ch (188.185.101.106) Host is up (0.00074s latency).

PORT STATE SERVICE 32001/tcp open unknown 32002/tcp open unknown 32003/tcp open unknown

... $ nmap -n -p32001,32002,32003 cms-mongo-preprod-node-1.cern.ch Starting Nmap 7.70 ( https://nmap.org ) at 2023-03-23 16:20 CET Nmap scan report for cms-mongo-preprod-node-1.cern.ch (188.185.89.240) Host is up (0.00064s latency).

PORT STATE SERVICE 32001/tcp open unknown 32002/tcp open unknown 32003/tcp open unknown

... $ nmap -n -p32001,32002,32003 cms-mongo-preprod-node-2.cern.ch Starting Nmap 7.70 ( https://nmap.org ) at 2023-03-23 16:20 CET Nmap scan report for cms-mongo-preprod-node-2.cern.ch (188.185.126.4) Host is up (0.00067s latency).

PORT STATE SERVICE 32001/tcp open unknown 32002/tcp open unknown 32003/tcp open unknown


* Simulating a service misbehavior or complete hang (e.g. due to resource depletion etc.) by killing `mongod` directly at the primary node: 
   * The pod was immediately restarted. 
   * Meanwhile a new one was elected as primary for the replicaset, and the clients were redirected automatically and the replica set size shrunk accordingly: 

**At the pod directly:**

root@mongodb-1-66976b8986-jj7rt:/root# mongo --quiet --eval "db.isMaster()" { "topologyVersion" : { "processId" : ObjectId("641b40a051e81a7c0a847077"), "counter" : NumberLong(38) }, "hosts" : [ "cms-mongo-preprod-node-0.cern.ch:32001", "cms-mongo-preprod-node-1.cern.ch:32002", "cms-mongo-preprod-node-2.cern.ch:32003" ], "setName" : "mongodb-preprod", "setVersion" : 1, "ismaster" : true, "secondary" : false, "primary" : "cms-mongo-preprod-node-1.cern.ch:32002", "me" : "cms-mongo-preprod-node-1.cern.ch:32002", ...


**At the client:**

In [1]: mongoClt.address Out[1]: ('cms-mongo-preprod-node-2.cern.ch', 32003)

In [2]: mongoClt.topology_description.readable_servers Out[2]: [<ServerDescription ('cms-mongo-preprod-node-2.cern.ch', 32003) server_type: RSPrimary, rtt: 0.001121286302804947>, <ServerDescription ('cms-mongo-preprod-node-1.cern.ch', 32002) server_type: RSSecondary, rtt: 0.0015640966594219208>]

In [3]: mongoClt.is_primary Out[3]: True


* Once the failed pod came back and was reelected as primary again the client  redirected itself to it immediately (completely in the background - no actions taken from my side). And the replica set size was restored back to 3:

In [4]: mongoClt.address Out[4]: ('cms-mongo-preprod-node-0.cern.ch', 32001)

In [5]: mongoClt.is_primary Out[5]: True

In [6]: mongoClt.topology_description.readable_servers Out[6]: [<ServerDescription ('cms-mongo-preprod-node-0.cern.ch', 32001) server_type: RSPrimary, rtt: 0.0065242555654048935>, <ServerDescription ('cms-mongo-preprod-node-2.cern.ch', 32003) server_type: RSSecondary, rtt: 0.0009278316179752353>, <ServerDescription ('cms-mongo-preprod-node-1.cern.ch', 32002) server_type: RSSecondary, rtt: 0.0010554936665248872>]



* I then directly deleted one pod - resulted in exactly the same behavior.  

So all in all we are good to go here. The last bit left to do is during the migration to copy the old database to the new destination before re configuring and restarting the   WMCore services. This is to be coordinated with @arooshap 

FYI: @amaltaro @vkuznet

amaltaro commented 1 year ago

@todor-ivanov thank you for testing it thoroughly.

As a deliverable of this GH, could you please provide the relevant services_config PR with the new connection string, both for preprod and prod (and all the existent WM services that depend on MongoDB).

todor-ivanov commented 1 year ago

hi @amaltaro

As a deliverable of this GH, could you please provide the relevant services_config PR with the new connection string

And here they are: https://gitlab.cern.ch/cmsweb-k8s/services_config/-/merge_requests/197 https://gitlab.cern.ch/cmsweb-k8s/services_config/-/merge_requests/198 https://gitlab.cern.ch/cmsweb-k8s/services_config/-/merge_requests/199

todor-ivanov commented 1 year ago

The database migration to the new clusters will be done before the next production deployment in coordination with @arooshap. I am closing the current issue.

FYI: @amaltaro @vkuznet @khurtado

amaltaro commented 1 year ago

@todor-ivanov is there any database migration that needs to be done (for either test, preprod or prod)?

todor-ivanov commented 1 year ago

Hi @amaltaro

is there any database migration that needs to be done (for either test, preprod or prod)?

Yep, for both of them.

dmwm / WMCore

Test latest MongoDBAsAService cluster for preprod and prod #11450

`startup-mongo-*.sh`