Open jsmythsci opened 4 years ago
You're right. The meaning of "breaking" with 7.0 was that the base folders will change behavior, mostly with respect to replicas. Variants have been experimental for a while, and I think it's best to run master. I should get round to releasing again, because we've been running dev variants reliably, as well as non-root, in production for a few months now.
Thank you for the clarification.
Do you have any suggestions or guidance for moving from the "regular" containers to nonroot?
We are using local storage for persistence and I can see that the files there are all owned by root. Would changing ownership to match the userid used in the nonroot containers be sufficient or is there more to it?
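A common pattern for this kind of migration is a one-off root `initContainer` that chowns the volume before the main container starts. This is only a sketch: the StatefulSet name, mount path, and the uid `65534` (`nobody`/`nogroup`) are assumptions (the uid is guessed from the `nogroup` ownership observed later in this thread), so verify the actual uid in the nonroot image first.

```yaml
# Hypothetical strategic-merge patch: a root initContainer hands the
# data volume over to the nonroot uid before the broker starts.
# uid/gid 65534 and the paths below are assumptions -- verify against
# the actual nonroot image before applying.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: kafka
spec:
  template:
    spec:
      initContainers:
      - name: chown-data
        image: busybox
        command: ["sh", "-c", "chown -R 65534:65534 /var/lib/kafka/data"]
        volumeMounts:
        - name: data
          mountPath: /var/lib/kafka/data
```

Once all files are owned by the nonroot uid, the initContainer can be removed again.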
As an additional data point, I can see that the volumes for the 3 containers that tried to start after merging v6.0.4 have all had their group ownership changed to `nogroup`, and most of the files have group write permission added. The only files in the Kafka persistence directory that do not have group write are snapshot and `leader-epoch-checkpoint` files that have been written since the container was reverted back to v6.0.3. On the ZooKeeper side it seems to be the same, except that `./data/myid` retains group write permission even after having been updated after the rollback.
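For anyone wanting to audit permissions the same way, a `find` one-liner like this can list the files that lack the group-write bit; `DATA_DIR` is a placeholder, so point it at your actual persistence path:

```shell
# List files under the persistence directory that lack group write.
# DATA_DIR is a placeholder (defaults to the current directory here);
# set it to your kafka/zookeeper volume mount, e.g. /var/lib/kafka/data.
DATA_DIR="${DATA_DIR:-.}"
find "$DATA_DIR" -type f ! -perm -g+w
```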
In case anyone else is following and/or interested in this, it seems that the update failures may not be related to non-root containers after all.
I think the ZK containers are failing to come online because of a known ZK upgrade issue related to missing snapshots (https://issues.apache.org/jira/browse/ZOOKEEPER-3513 and https://issues.apache.org/jira/browse/ZOOKEEPER-3056). I think the updated Kafka container is not coming online because it is timing out while trying to connect to the new ZK containers.
I will post back when I have more information.
I confirmed that switching to non-root containers was actually pretty straightforward.
I found out that we had introduced an issue in our Kustomization that caused DNS lookups to fail for the `zoo` containers. This didn't seem to cause our original deployment to fail overall because the 3 `pzoo` instances all resolved just fine.
After fixing that issue I had to apply the following transformation to `zookeeper-config` to get around the issue of the missing snapshots:

`kustomization.yaml`:

```yaml
bases:
- scale-3-5
patchesStrategicMerge:
- zk-trust-empty-snapshot.yaml
```
`zk-trust-empty-snapshot.yaml`:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: zookeeper-config
data:
  zookeeper.properties: |
    4lw.commands.whitelist=ruok
    tickTime=2000
    dataDir=/var/lib/zookeeper/data
    dataLogDir=/var/lib/zookeeper/log
    clientPort=2181
    maxClientCnxns=2
    initLimit=5
    syncLimit=2
    server.1=pzoo-0.pzoo:2888:3888:participant
    server.2=pzoo-1.pzoo:2888:3888:participant
    server.3=pzoo-2.pzoo:2888:3888:participant
    server.4=zoo-0.zoo:2888:3888:participant
    server.5=zoo-1.zoo:2888:3888:participant
    snapshot.trust.empty=true
```
This file just appends `snapshot.trust.empty=true` to the existing `zookeeper.properties` defined in `10zookeeper-config.yml`.
This change seems to have been all that was required to deploy release 6.0.4 and have it start successfully.
We still have an outstanding issue in that our customization to the broker container command gets lost when using non-root containers, but as of now all 3 StatefulSets seem to be working normally in our test environment.
Well, that's strange. Our Kafka broker startup command `--override`s were consistently not getting applied for days. I tried to fix it by switching from `patchesJson6902` to `patchesStrategicMerge` and redefining the whole command block, which worked as expected. Then I tried to recreate the issue by backing out that change and I couldn't reproduce the original issue any more. I checked our test environment and confirmed that the expected `--override`s were in place, so I guess it was just an issue with my local `kustomize build` being wonky.
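For anyone hitting something similar, a minimal `patchesJson6902` entry for appending broker `--override`s might look like the following. This is a sketch only: the StatefulSet name, container index, the use of `command` rather than `args`, and the override key are all assumptions about this particular manifest layout, so adjust them to your own rendered output.

```yaml
# kustomization.yaml (sketch; target name, container index, and the
# example override value are assumptions -- check your own manifests)
patchesJson6902:
- target:
    group: apps
    version: v1
    kind: StatefulSet
    name: kafka
  patch: |-
    - op: add
      path: /spec/template/spec/containers/0/command/-
      value: --override=log.retention.hours=48
```

Running `kustomize build` locally and grepping the output for the expected `--override` flags is a quick way to confirm the patch actually landed before deploying.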
@solsson I see you assigned this to yourself, so I will leave it open in case you believe there is work to be done, but it seems like my own issues with upgrading to v6.0.4 came down to:
1. According to README.md, release 7.0 will be a breaking release "with nonroot and native bases", but it seems like nonroot was made the default with c212ea6, which is included in v6.0.4.
2. When I tried to deploy the non-root containers into a Kubernetes cluster that is currently running a deployment based on v6.0.3, all 3 of the first containers (kafka, zoo, pzoo) failed to start and I was forced to roll back the changes.

I could work around this issue by basing my variants off of `../../rbac-namespace-default`, `../../kafka` and `../../zookeeper` instead of `../scale-3-5`, but wanted to confirm that the nonroot changes were intentionally included in this release.
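The workaround of basing variants on the individual bases instead of `../scale-3-5` might look roughly like this in the variant's `kustomization.yaml` (a sketch only; the relative paths depend on where the variant directory lives in the repo):

```yaml
# Sketch: base the variant on the individual component bases rather
# than the scale-3-5 aggregate; paths assume the variant sits two
# levels below the repo root.
bases:
- ../../rbac-namespace-default
- ../../kafka
- ../../zookeeper
```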