OT-CONTAINER-KIT / redis-operator

A Golang-based Redis operator that creates and oversees Redis standalone/cluster/replication/sentinel mode setups on top of Kubernetes.
https://ot-redis-operator.netlify.app/
Apache License 2.0

quay.io/opstree/redis-operator images built from wrong commits #527

Open hrobertson opened 1 year ago

hrobertson commented 1 year ago

What version of redis operator are you using?

redis-operator version: 0.14.0

What did you do?

OLM updated redis-operator to 0.14.0 (https://github.com/k8s-operatorhub/community-operators/blob/main/operators/redis-operator/0.14.0/manifests/redis-operator.v0.14.0.clusterserviceversion.yaml#L257)

What did you expect to see?

Image quay.io/opstree/redis-operator:v0.14.0 should have been built from https://github.com/OT-CONTAINER-KIT/redis-operator/tree/v0.14.0 (e86884ead1005484bdb10fb30caf8f8acac2f89b) (February 13th)

What did you see instead?

In the v0.14.0 image manifest, the label com.azure.dev.image.build.sourceversion shows source SHA 5e8ac25180a309ccd1f55b379af545479fedeba4 (April 14th), which is not tagged.

This commit includes ff6980f6bd8c1191778cc065b1d18f11f58383a7, which broke updates and has since been reverted. That caused the issue reported at https://github.com/OT-CONTAINER-KIT/redis-operator/issues/526#issuecomment-1597598386.

Also:

What is the release process of this operator? Is none of this automated?! If releases are being cut manually, what process is in place to ensure it is done correctly?

Thanks

shubham-cmyk commented 1 year ago

This would be closed by the v0.15.0 release, which is coming in a week.

hrobertson commented 1 year ago

Thanks @shubham-cmyk. v0.15.0 should resolve the various bugs that were introduced by the v0.14.0 release, but my intent with this issue was to gain a little more confidence that future releases won't be similarly broken, not because of bugs in the code, but because the wrong commit was released.

Could you please provide some information about the release process, and how it will be ensured that future releases are built from the correct commit and that the documentation, OperatorHub, etc. are updated?

Thanks

shubham-cmyk commented 1 year ago

We do have an auto-release mechanism. The volume name change commit was merged into v0.14.0 by mistake; since it was a breaking change, it caused a problem. We didn't put a new label on the image after releasing v0.14.0 because we wanted to be able to merge fixes for any critical bugs that arose within a week or so and ship them immediately.

From this mistake we have learned that we should bump the tag and move to v0.15.x after releasing v0.15.0, so that critical bugs can be addressed immediately and a merged breaking change won't be a big issue.

A stable tag will also be added to the image from now on, to make sure users don't run into issues. A bi-weekly release of the image with specific v0.15.x tags might be a good approach.

Sorry for the inconvenience this has caused, @hrobertson. If you feel any other change or addition is needed, feel free to drop a comment.

hrobertson commented 1 year ago

I am still not clear on what happened, because the 0.14.0 tag does not appear to have been pushed to Quay until April. But that is not important if a clear release policy is in place for the future.

Since 0.15.0, it looks like you are continuously updating the v0.15.0 image tag to point to new images. This is not normally how a major.minor.patch image tag behaves. Additionally, there is no 0.15.0 tag in this git repo.

What I would expect, based on what I see as common practice in many other repos, is as follows:

A v0.15.0 tag is created in git. An automated CI process (GitHub Actions or something else) then builds an image from that commit and pushes it to Quay with the image tags v0.15.0, v0.15, and latest.

Then a new commit is pushed to master. No release is made yet. Then the commit is tagged in git as v0.15.1. Now the CI process builds an image from that commit and pushes it to Quay with the image tags v0.15.1, v0.15, and latest. Note that the image tag v0.15.0 does not get updated, nor does the v0.15.0 git tag. They should be immutable.
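
For illustration only, a minimal GitHub Actions workflow along these lines might look like the sketch below. The registry path, secret names, and build command are assumptions, not the project's actual CI:

name: release-image
on:
  push:
    tags:
      - "v*"                        # runs only when a version tag is pushed

jobs:
  build-and-push:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3   # checks out exactly the tagged commit
      - name: Log in to Quay
        # QUAY_USERNAME / QUAY_PASSWORD are assumed repository secrets
        run: echo "${{ secrets.QUAY_PASSWORD }}" | docker login quay.io -u "${{ secrets.QUAY_USERNAME }}" --password-stdin
      - name: Build and push immutable and floating tags
        run: |
          IMAGE=quay.io/opstree/redis-operator
          VERSION="${GITHUB_REF_NAME}"   # e.g. v0.15.1 (the git tag that triggered the run)
          MINOR="${VERSION%.*}"          # e.g. v0.15
          docker build -t "${IMAGE}:${VERSION}" .
          docker tag "${IMAGE}:${VERSION}" "${IMAGE}:${MINOR}"
          docker tag "${IMAGE}:${VERSION}" "${IMAGE}:latest"
          docker push "${IMAGE}:${VERSION}"
          docker push "${IMAGE}:${MINOR}"
          docker push "${IMAGE}:latest"

With a setup like this, the v0.15.1 image tag is written once and never reused, while v0.15 and latest are the only tags that move.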

Additionally, if 0.15 is not yet "ready" or is a pre-release, keep it on its own branch rather than master and don't tag it as latest.

Finally, a pull request should be automatically raised at https://github.com/k8s-operatorhub/community-operators for stable releases, and https://ot-redis-operator.netlify.app/docs/release-history/ should be updated automatically for all releases.

Thanks

nanderson94 commented 1 year ago

Also note that fresh installations from OperatorHub appear to be non-functional due to this issue. I have redis-operator.v0.14.0 stuck in the installing state because the redis-operator pod is crash looping:

W0713 16:23:49.036232       1 reflector.go:324] pkg/mod/k8s.io/client-go@v0.23.0/tools/cache/reflector.go:167: failed to list *v1beta1.RedisSentinel: redissentinels.redis.redis.opstreelabs.in is forbidden: User "system:serviceaccount:openshift-operators:redis-operator" cannot list resource "redissentinels" in API group "redis.redis.opstreelabs.in" at the cluster scope
E0713 16:23:49.036291       1 reflector.go:138] pkg/mod/k8s.io/client-go@v0.23.0/tools/cache/reflector.go:167: Failed to watch *v1beta1.RedisSentinel: failed to list *v1beta1.RedisSentinel: redissentinels.redis.redis.opstreelabs.in is forbidden: User "system:serviceaccount:openshift-operators:redis-operator" cannot list resource "redissentinels" in API group "redis.redis.opstreelabs.in" at the cluster scope
W0713 16:23:49.266930       1 reflector.go:324] pkg/mod/k8s.io/client-go@v0.23.0/tools/cache/reflector.go:167: failed to list *v1beta1.RedisReplication: redisreplications.redis.redis.opstreelabs.in is forbidden: User "system:serviceaccount:openshift-operators:redis-operator" cannot list resource "redisreplications" in API group "redis.redis.opstreelabs.in" at the cluster scope
E0713 16:23:49.267228       1 reflector.go:138] pkg/mod/k8s.io/client-go@v0.23.0/tools/cache/reflector.go:167: Failed to watch *v1beta1.RedisReplication: failed to list *v1beta1.RedisReplication: redisreplications.redis.redis.opstreelabs.in is forbidden: User "system:serviceaccount:openshift-operators:redis-operator" cannot list resource "redisreplications" in API group "redis.redis.opstreelabs.in" at the cluster scope

Those appear to be new resources in v0.15, and are not granted as part of the v0.14 installation process.


I was able to get past the above errors by defining an extra set of permissions:

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: redis-operator-fix
rules:
- apiGroups:
  - redis.redis.opstreelabs.in
  resources:
  - redissentinels
  - redisreplications
  verbs:
  - get
  - watch
  - list
  - update
  - patch
  - create
  - delete
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: redis-operator-fix
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: redis-operator-fix
subjects:
- kind: ServiceAccount
  name: redis-operator
  namespace: openshift-operators

shubham-cmyk commented 1 year ago

I think your ClusterRole was not updated. You should update it to the one shipped with the latest Helm chart version of the specific operator you are using.

shubham-cmyk commented 1 year ago

@hrobertson Your ideas seem good. I would follow them; to do that I have to change the current CI, which would be done before any further release. The latest release we have is v0.15.0.

GroverChouT commented 1 year ago

Hi all, someone has to update https://github.com/k8s-operatorhub/community-operators/blob/main/operators/redis-operator/0.15.0/manifests/redis-operator.v0.15.0.clusterserviceversion.yaml#L195-L199 to add redissentinels and redisreplications; otherwise a fresh install by OLM will fail indefinitely.
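
For reference, the missing entries would be a rule along the lines of the sketch below, mirroring the workaround ClusterRole posted above; the exact verbs the operator needs and the placement inside the CSV's cluster permissions may differ:

# sketch of the additional rule for the CSV's cluster permissions (not the full manifest)
- apiGroups:
  - redis.redis.opstreelabs.in
  resources:
  - redisreplications
  - redissentinels
  verbs:
  - get
  - watch
  - list
  - update
  - patch
  - create
  - delete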

shubham-cmyk commented 1 year ago

@iamabhishek-dubey Please check this.

Elyytscha commented 1 year ago

Also here because a fresh install via OLM fails. Please update https://github.com/k8s-operatorhub/community-operators/blob/main/operators/redis-operator/0.15.0/manifests/redis-operator.v0.15.0.clusterserviceversion.yaml#L195-L199.

github-actions[bot] commented 2 weeks ago

This issue has been automatically marked as stale. If this issue is still affecting you, please leave any comment (for example, "bump"), and we'll keep it open. We are sorry that we haven't been able to prioritize it yet. If you have any new additional information, please include it with your comment!