defenseunicorns / uds-package-confluence

🏪 UDS Confluence Zarf Package
GNU Affero General Public License v3.0

Enable Confluence Clustering by default #26

Closed JoeHCQ1 closed 6 days ago

JoeHCQ1 commented 3 months ago

Confluence currently deploys with a single node. Customers will likely require clustered deployments (both customers I know of are deploying to at least 1,000 daily users). Therefore, a clustered deployment should be the default we're testing against.

Some of the steps required to go from one node to a clustered deployment:

  1. The storage directory needs to be changed from local to the shared home
  2. The underlying PVC must be read-write-many
  3. If Synchrony is enabled, and if Synchrony runs as an added service in each Confluence pod (likely the default behavior when enabled), then Synchrony may also require additional configuration changes
  4. The server setup process is different: each node is added manually through the UI (starting with node 0, then node 1, etc.), and the nodes are discovered either through multicast (requires network changes) or via hard-coded IPs entered in the UI.
  5. The network policies that allow an active cluster to communicate should exist thanks to #25, but there is no guarantee they work until validated.
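Steps 1 and 2 roughly map onto helm values in Atlassian's data-center-helm-charts. A hedged sketch (value paths should be verified against the chart version in use; the storage class name is a placeholder):

```yaml
# Sketch of values.yaml changes for a clustered deployment.
# Value paths follow Atlassian's data-center-helm-charts; verify them
# against your chart version. "my-rwx-storageclass" is a placeholder.
replicaCount: 2            # more than one Confluence node
volumes:
  sharedHome:
    persistentVolumeClaim:
      create: true                            # shared home for all nodes (step 1)
      storageClassName: my-rwx-storageclass   # must support ReadWriteMany (step 2)
```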
JoeHCQ1 commented 3 months ago

Dev notes Aug 2 2024

Adding Nodes in TCP/IP Configuration

cat > confluence.cfg.xml << EOF
<the modified file>
EOF

See https://confluence.atlassian.com/conf85/adding-and-removing-data-center-nodes-1283361360.html
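For reference, the TCP/IP clustering fragment of confluence.cfg.xml looks roughly like this (property names per the Atlassian docs linked above; the IPs are examples, and as discussed later they are not stable across pod restarts):

```xml
<!-- Fragment of confluence.cfg.xml for TCP/IP cluster discovery.
     Property names per the Atlassian docs linked above; IPs and the
     cluster name are examples only. -->
<property name="confluence.cluster">true</property>
<property name="confluence.cluster.name">confluence-cluster</property>
<property name="confluence.cluster.home">/var/atlassian/application-data/shared-home</property>
<property name="confluence.cluster.join.type">tcp_ip</property>
<property name="confluence.cluster.peers">10.42.0.10,10.42.0.11</property>
```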

Statefulset connection not stateful

If you scale up the nodes with the default chart settings, you'll notice that every connection can return a different node. This makes getting things done nigh impossible. This is because the chart creates a standard service instead of a headless one.

See StatefulSet limitations in the Kubernetes docs, where it's made clear:

You are responsible for creating this Service.

The default service has a ClusterIP that is not None, and a service cannot be changed to headless after it's been created.

JoeHCQ1 commented 3 months ago

On Scale-up Scale-down, IPs are lost

It'd seem most direct/simple to use TCP/IP-based clustering rather than multicast or the third option I'm forgetting. However, when you scale the statefulset down and then back up, the IPs of the old pods are not preserved. So this is unlikely to be a winning strategy, though I may keep trying it to get a working cluster.

JoeHCQ1 commented 3 months ago

Confluence Does Not Expose Enough Service Settings

The Confluence helm chart allows you to say that, for any client IP, traffic is always sent to the same pod (helm chart setting | k8s docs on session affinity). This would likely work most of the time in an on-prem setting where everyone is on a single LAN. However, once there is any sort of network segmentation, this is going to break.
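For reference, session affinity is a plain Kubernetes Service field that the chart surfaces as a value; on a raw Service it looks like this:

```yaml
# Kubernetes Service fragment: pin each client IP to one backend pod.
# This only works while the client's source IP is visible to the
# service, which is what network segmentation (NAT, proxies) breaks.
spec:
  sessionAffinity: ClientIP
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 10800  # default affinity window (3 hours)
```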

What should be possible is setting the service's ClusterIP to None per the docs, but that is not exposed in the helm chart.

So, the next step would be to manually create the correct service, see if it works, and then submit a PR upstream.
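A minimal manually-created headless service might look like the following. The name, selector labels, and port are assumptions and must match what the chart actually puts on the Confluence StatefulSet pods:

```yaml
# Hypothetical headless Service for the Confluence StatefulSet.
# clusterIP: None gives each pod a stable DNS name
# (e.g. confluence-0.confluence-headless.<ns>.svc) instead of one
# virtual IP that load-balances across pods.
apiVersion: v1
kind: Service
metadata:
  name: confluence-headless
spec:
  clusterIP: None          # headless: DNS resolves to individual pod IPs
  selector:
    app.kubernetes.io/name: confluence   # assumption: match the chart's pod labels
  ports:
    - name: http
      port: 8090           # Confluence's default HTTP port
```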

JoeHCQ1 commented 3 months ago

So the headless service is in use now with a DestinationRule in Istio. Traffic is sticking to the same pod.
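For the record, the DestinationRule in question is along these lines. The host value is an assumption and must match the Service traffic is actually routed through; consistent hashing on the source IP is what pins a client to one pod:

```yaml
# Istio DestinationRule: hash on the client source IP so each client
# keeps hitting the same Confluence pod.
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: confluence-sticky
spec:
  host: confluence.confluence.svc.cluster.local  # assumption: the chart's Service
  trafficPolicy:
    loadBalancer:
      consistentHash:
        useSourceIp: true
```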

However, any time the pods restart they lose their IPs, so TCP/IP-based clustering is difficult to manage. I can get a cluster of one, then add a second node and edit the cfg file to include that IP.

However, node-0 didn't seem to notice the updated config file, so that was useless. Restarting the node to pick up the changes meant the IPs changed, making the file contents OBE (overtaken by events).

There seemed to be a way to set up Confluence to automatically re-read the config file on change - I will try that now, since the headless service and destination rule seem to have resolved the routing problem.

JoeHCQ1 commented 3 months ago

Correction, the headless service was not in use. If I've understood the DestinationRule properly, it will re-route after k8s has routed the traffic, so the service can be left alone.

I suspect this is why it was headful to begin with: either operators used session affinity based on client IP to get what I'm looking for, or they used something outside of k8s to get session-aware load balancing.

JoeHCQ1 commented 3 months ago

This is the code we're running with as committed: b5d57cbfb9062bd32cae522d5052f6444c8aba28

JoeHCQ1 commented 3 months ago

So the setting I thought produced an auto-refresh on config changes doesn't do that. Confluence will not pick up continual updates to the IP addresses.

That said, I could write a script to:

JoeHCQ1 commented 3 months ago

Confirmed that killing the service kills the pod. You can kill it with the command: /shutdown-wait.sh

JoeHCQ1 commented 3 months ago

Incidentally, you can also kill the service in /opt/.... and restart it without killing the pod, but this was all the wrong approach.

The right approach is to change more values.yaml file settings and use Hazelcast to manage the clustering, no click-ops. See https://github.com/atlassian/data-center-helm-charts/issues/555#issuecomment-1653350774

confluence:
  hazelcastService:
    enabled: true  # Required for clustering
  clustering:
    enabled: true 

Also, we can get the licensing handled via IaC

confluence:
  license:  # Convenience to reduce clickops
    secretName: "confluence-license"
    secretKey: "license-key"

kubectl create namespace confluence
kubectl create secret generic confluence-license -n confluence --from-file=license-key=confluence_license.txt

Now I'm getting Hazelcast errors and so failing my readiness probe, but the logs did register node-0 as a cluster node. So there is some clustering capability; I just need to add some missing jars (?) and get Hazelcast the needed K8s API access.
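The "needed K8s API access" for Hazelcast's Kubernetes discovery is typically read access to endpoints, pods, and services in the namespace. A hedged RBAC sketch (names are placeholders; the subject must be whatever service account the Confluence pods actually run as):

```yaml
# Role + RoleBinding granting Hazelcast's Kubernetes discovery read
# access to the objects it inspects. All names here are placeholders.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: hazelcast-discovery
  namespace: confluence
rules:
  - apiGroups: [""]
    resources: ["endpoints", "pods", "services"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: hazelcast-discovery
  namespace: confluence
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: hazelcast-discovery
subjects:
  - kind: ServiceAccount
    name: confluence   # assumption: the Confluence pods' service account
    namespace: confluence
```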