confluence.cfg.xml is in the home directory (which is also on the shared-home PV). No text editors are installed in the container, so you'll need to copy the file out, edit it, and then cat it back in, like:

```sh
cat > confluence.cfg.xml << EOF
<the modified file>
EOF
```
See https://confluence.atlassian.com/conf85/adding-and-removing-data-center-nodes-1283361360.html
If you scale up the nodes with the default chart settings, you'll notice that every connection can land on a different node. This makes getting anything done nigh impossible, because the chart creates a standard service instead of a headless one.
See StatefulSet#limitations, where it's made clear:

> You are responsible for creating this Service.

The default service has a ClusterIP that is not None, and a service cannot be changed to headless after it has been created.
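For reference, a minimal headless Service would look roughly like this (the name, selector, and port 8090 are my assumptions here, not what the chart renders):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: confluence-headless              # illustrative name
spec:
  clusterIP: None                        # headless: DNS resolves to the individual pod IPs
  selector:
    app.kubernetes.io/name: confluence   # assumed pod label from the chart
  ports:
    - name: http
      port: 8090
      targetPort: 8090
```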
It'd seem most direct/simple to use TCP/IP-based clustering rather than multicast or the third option I'm forgetting. However, when you scale the StatefulSet down and back up, the IPs of the old pods are not preserved, so this is unlikely to be a winning strategy. I may keep trying it, though, to get a working cluster.
The chart lets you configure session affinity, so that for any given client IP, traffic is always sent to the same pod (helm chart setting | k8s docs on session affinity). This would likely work most of the time in an on-prem setting where everyone is on a single LAN, but once we get any sort of network segmentation it is going to break.
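Under the hood this is just the Service's session affinity, something like the following (illustrative, not the chart's exact output):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: confluence                       # illustrative name
spec:
  selector:
    app.kubernetes.io/name: confluence   # assumed pod label
  sessionAffinity: ClientIP              # pin each client IP to one backend pod
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 10800              # k8s default: 3 hours
  ports:
    - name: http
      port: 80
      targetPort: 8090
```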
What should be allowed is to set the service's ClusterIP to None per the docs, but that is not exposed in the helm chart.
So, the next step would be to manually create the correct service, see if it works, and then submit a PR upstream.
So the headless service is in use now with a DestinationRule in Istio. Traffic is sticking to the same pod.
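For the record, the DestinationRule is roughly this shape (the host and names are specific to my deployment, so treat them as placeholders):

```yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: confluence-sticky                          # illustrative name
spec:
  host: confluence.confluence.svc.cluster.local    # assumed service/namespace
  trafficPolicy:
    loadBalancer:
      consistentHash:
        useSourceIp: true    # hash on the client source IP so it keeps hitting the same pod
```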
However, any time the pods restart they lose their IPs, so TCP/IP-based clustering is difficult to manage. I can get a cluster of one, then add a second node and edit the cfg file to include its IP.
However, node-0 didn't seem to notice the updated config file, so that was useless. Restarting the node to pick up the changes meant the IPs changed, making the file contents OBE (overtaken by events).
There seemed to be a way to set up Confluence to automatically pick up changes to the config file. I will try that now, as the headless service and DestinationRule seem to have resolved the routing problem.
Correction: the headless service was not in use. If I've understood the DestinationRule properly, it re-routes after k8s has routed the traffic, so the service can be left alone.
I suspect this is why it was headful to begin with: either operators used client-IP session affinity to get what I'm looking for, or they used something outside of k8s to get session-aware load balancing.
This is the code we're running with as committed: b5d57cbfb9062bd32cae522d5052f6444c8aba28
So the setting I thought produced an auto-refresh on config changes doesn't do that, which means Confluence will not pick up constant IP address updates.
That said, I could write a script to:
Exec into node-0:
The problem with that approach is that killing the service in a pod should kill the pod. Going to test that.
We could try to support multicast but that has CNI restrictions.
Confirmed that killing the service kills the pod. You can kill it with the command: /shutdown-wait.sh
Incidentally, you can also kill the service in /opt/.... and restart it without killing the pod, but this was all the wrong approach.
The right approach is to change a few more values.yaml settings and use Hazelcast to manage the clustering, with no click-ops. See https://github.com/atlassian/data-center-helm-charts/issues/555#issuecomment-1653350774
```yaml
confluence:
  hazelcastService:
    enabled: true # Required for clustering
  clustering:
    enabled: true
```
Also, we can get the licensing handled via IaC:

```yaml
confluence:
  license: # Convenience to reduce clickops
    secretName: "confluence-license"
    secretKey: "license-key"
```
```sh
kubectl create namespace confluence
kubectl create secret generic confluence-license -n confluence --from-file=license-key=confluence_license.txt
```
Now I'm getting Hazelcast errors and failing the readiness probe, but the logs did register node-0 as a cluster node. So there is some clustering capability; I just need to add some missing jars (?) and give Hazelcast the needed K8s API access.
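If the chart doesn't already create it, the K8s API access for Hazelcast's Kubernetes discovery should be something along these lines (a sketch; check the Hazelcast discovery plugin docs for the exact resources, and note the names and service account are my assumptions):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: hazelcast-discovery          # illustrative name
  namespace: confluence
rules:
  - apiGroups: [""]
    resources: ["endpoints", "pods", "services"]   # what Hazelcast queries to find peer nodes
    verbs: ["get", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: hazelcast-discovery
  namespace: confluence
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: hazelcast-discovery
subjects:
  - kind: ServiceAccount
    name: confluence                 # assumed: the service account the Confluence pods run as
    namespace: confluence
```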
Confluence currently deploys with a single node. Customers will likely require clustered deployments (both customers I know of are deploying to, at minimum, 1000 daily users). Therefore, a clustered deployment should be the default we test against.
Some of the steps required to go from one node to a clustered deployment: