apache / rocketmq-operator

Apache RocketMQ Operator
https://rocketmq.apache.org/
Apache License 2.0

Metrics and Log Printing #50

Closed: miaolinjie closed this issue 2 years ago

miaolinjie commented 4 years ago

BUG REPORT

1) First deploy the service:

    $ kubectl apply -f rocketmq_v1alpha1_rocketmq_cluster.yaml -n dev

2) After a while, check the pod status:

    $ kubectl get pods -n dev -owide
    NAME                   READY   STATUS    RESTARTS   AGE     IP              NODE          NOMINATED NODE   READINESS GATES
    broker-0-master-0      1/1     Running   0          35s     10.244.14.94    azure-k8s-5
    broker-0-replica-1-0   1/1     Running   0          35s     10.244.15.169   azure-k8s-6
    name-service-0         1/1     Running   0          35s     10.0.0.9        azure-k8s-5

[bug1] There is no guarantee that the name-server cluster starts before the broker cluster, which may cause the broker to fail to start or to restart.

3) At this point, we modify the spec.size of the broker from 1 to 2, and then view the cluster status:

    $ kubectl get pods -n dev -owide
    NAME                   READY   STATUS    RESTARTS   AGE     IP              NODE          NOMINATED NODE   READINESS GATES
    broker-0-master-0      1/1     Running   0          3m43s   10.244.14.94    azure-k8s-5
    broker-0-replica-1-0   1/1     Running   0          3m43s   10.244.15.169   azure-k8s-6
    broker-1-master-0      1/1     Running   0          15s     10.244.11.106   azure-k8s-1
    broker-1-replica-1-0   1/1     Running   0          15s     10.244.13.192   azure-k8s-4
    name-service-0         1/1     Running   0          3m43s   10.0.0.9        azure-k8s-5

4) Then we modify the spec.size of the broker from 2 to 1, and then check the cluster status:

    $ kubectl get pods -n dev -owide
    NAME                   READY   STATUS    RESTARTS   AGE     IP              NODE          NOMINATED NODE   READINESS GATES
    broker-0-master-0      1/1     Running   0          4m35s   10.244.14.94    azure-k8s-5
    broker-0-replica-1-0   1/1     Running   0          4m35s   10.244.15.169   azure-k8s-6
    broker-1-master-0      1/1     Running   0          67s     10.244.11.106   azure-k8s-1
    broker-1-replica-1-0   1/1     Running   0          67s     10.244.13.192   azure-k8s-4
    name-service-0         1/1     Running   0          4m35s   10.0.0.9        azure-k8s-5

[bug2] We found that the cluster did not delete the broker-1 related resources. At this point we check the status of the broker:

    $ kubectl get broker -n dev broker -oyaml
    apiVersion: rocketmq.apache.org/v1alpha1
    kind: Broker
    metadata:
      ...
    spec:
      allowRestart: true
      brokerImage: apacherocketmq/rocketmq-broker:4.5.0-alpine-operator-0.3.0
      env:

[bug3] At this point, we find that broker.spec.size has become 1 and the size in status has also become 1, even though the broker-1 pods are still running.

5) At this point, we work around the broker problem by hand: we manually delete the related StatefulSets.

6) When we then need to scale the cluster up from 1 to 2, we check the cluster status again. This time the broker sometimes fails to start, and the operator reports an error:

[bug4]

    {"level":"error","ts":1600223697.3956034,"logger":"controller_broker","msg":"Failed to update Broker Size status.","Request.Namespace":"dev","Request.Name":"broker","error":"Operation cannot be fulfilled on brokers.rocketmq.apache.org \"broker\": the object has been modified; please apply your changes to the latest version and try again","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\trocketmq-operator/vendor/github.com/go-logr/zapr/zapr.go:128\ngithub.com/apache/rocketmq-operator/pkg/controller/broker.(*ReconcileBroker).Reconcile\n\trocketmq-operator/pkg/controller/broker/broker_controller.go:286\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\trocketmq-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:215\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1\n\trocketmq-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:158\nk8s.io/apimachinery/pkg/util/wait.JitterUntil.func1\n\trocketmq-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\trocketmq-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:134\nk8s.io/apimachinery/pkg/util/wait.Until\n\trocketmq-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88"}

Then we can see that RocketMQ's /root/store/config/subscriptionGroup.json and subscriptionGroup.json.bak both fail to load. The corresponding error log can be checked through the directory on the host machine:

    com.alibaba.fastjson.JSONException: syntax error, expect {, actual error, pos 0, fastjson-version 1.2.51
        at com.alibaba.fastjson.parser.deserializer.JavaBeanDeserializer.deserialze(JavaBeanDeserializer.java:474) ~[fastjson-1.2.51.jar:na]
        at com.alibaba.fastjson.parser.deserializer.JavaBeanDeserializer.deserialze(JavaBeanDeserializer.java:273) ~[fastjson-1.2.51.jar:na]
        at com.alibaba.fastjson.parser.DefaultJSONParser.parseObject(DefaultJSONParser.java:669) ~[fastjson-1.2.51.jar:na]
        at com.alibaba.fastjson.JSON.parseObject(JSON.java:368) ~[fastjson-1.2.51.jar:na]
        at com.alibaba.fastjson.JSON.parseObject(JSON.java:272) ~[fastjson-1.2.51.jar:na]
        at com.alibaba.fastjson.JSON.parseObject(JSON.java:491) ~[fastjson-1.2.51.jar:na]
        at org.apache.rocketmq.remoting.protocol.RemotingSerializable.fromJson(RemotingSerializable.java:43) ~[rocketmq-remoting-4.5.0.jar:4.5.0]
        at org.apache.rocketmq.broker.subscription.SubscriptionGroupManager.decode(SubscriptionGroupManager.java:152) ~[rocketmq-broker-4.5.0.jar:4.5.0]
        at org.apache.rocketmq.common.ConfigManager.load(ConfigManager.java:38) ~[rocketmq-common-4.5.0.jar:4.5.0]
        at org.apache.rocketmq.broker.BrokerController.initialize(BrokerController.java:233) [rocketmq-broker-4.5.0.jar:4.5.0]
        at org.apache.rocketmq.broker.BrokerStartup.createBrokerController(BrokerStartup.java:218) [rocketmq-broker-4.5.0.jar:4.5.0]
        at org.apache.rocketmq.broker.BrokerStartup.main(BrokerStartup.java:58) [rocketmq-broker-4.5.0.jar:4.5.0]
    2020-09-15 13:43:10 ERROR main - load /root/store/config/subscriptionGroup.json Failed
    com.alibaba.fastjson.JSONException: syntax error, expect {, actual error, pos 0, fastjson-version 1.2.51
        at com.alibaba.fastjson.parser.deserializer.JavaBeanDeserializer.deserialze(JavaBeanDeserializer.java:474) ~[fastjson-1.2.51.jar:na]
        at com.alibaba.fastjson.parser.deserializer.JavaBeanDeserializer.deserialze(JavaBeanDeserializer.java:273) ~[fastjson-1.2.51.jar:na]
        at com.alibaba.fastjson.parser.DefaultJSONParser.parseObject(DefaultJSONParser.java:669) ~[fastjson-1.2.51.jar:na]
        at com.alibaba.fastjson.JSON.parseObject(JSON.java:368) ~[fastjson-1.2.51.jar:na]
        at com.alibaba.fastjson.JSON.parseObject(JSON.java:272) ~[fastjson-1.2.51.jar:na]
        at com.alibaba.fastjson.JSON.parseObject(JSON.java:491) ~[fastjson-1.2.51.jar:na]
        at org.apache.rocketmq.remoting.protocol.RemotingSerializable.fromJson(RemotingSerializable.java:43) ~[rocketmq-remoting-4.5.0.jar:4.5.0]
        at org.apache.rocketmq.broker.subscription.SubscriptionGroupManager.decode(SubscriptionGroupManager.java:152) ~[rocketmq-broker-4.5.0.jar:4.5.0]
        at org.apache.rocketmq.common.ConfigManager.loadBak(ConfigManager.java:56) [rocketmq-common-4.5.0.jar:4.5.0]
        at org.apache.rocketmq.common.ConfigManager.load(ConfigManager.java:44) [rocketmq-common-4.5.0.jar:4.5.0]
        at org.apache.rocketmq.broker.BrokerController.initialize(BrokerController.java:233) [rocketmq-broker-4.5.0.jar:4.5.0]
        at org.apache.rocketmq.broker.BrokerStartup.createBrokerController(BrokerStartup.java:218) [rocketmq-broker-4.5.0.jar:4.5.0]
        at org.apache.rocketmq.broker.BrokerStartup.main(BrokerStartup.java:58) [rocketmq-broker-4.5.0.jar:4.5.0]
    2020-09-15 13:43:10 INFO main - Try to shutdown service thread:PullRequestHoldService started:false lastThread:null

FEATURE REQUEST

  1. The RocketMQ log needs to be printed to the console as well as to the log directory, so that logs are easy to collect and errors are easy to locate. For example, when RocketMQ failed to initialize, kubectl logs -f podname did not display any error message.

  2. Metrics support for RocketMQ.

  3. We are currently trying to use this operator to deploy RocketMQ. For the two features and the bugs above, we are willing to submit code.

  4. When will affinity/anti-affinity/taint support be released?

liuruiyiyang commented 4 years ago

Hi, thanks for your detailed feedback!

  1. bug1, the question "We cannot guarantee that the name-server cluster must be started before the broker clusters":

In fact, the broker controller does have a basic mechanism that waits for the name-server cluster to start, using a shared variable IsNameServersStrInitialized. You can check PR #31. However, the current solution is not perfect, and you are welcome to propose a PR to improve it.

  2. bug2, the down-scaling question:

Currently the operator does not support down-scaling (shrinking). Therefore, if you change the size of the cluster from 2 to 1, the controller won't do anything. Down-scaling is a very complicated process that involves rearranging data and many other problems, so for now we simply suggest that users do not apply a reduced size.

  3. For the feature request: we are glad to see more contributors join the project. You are welcome to submit your code for metrics support and logs!

As you may see, the affinity/anti-affinity/taint feature is partly supported and pending on PR #42. The community's response may be slow, but we will continue to make progress on this feature and release it soon.

caigy commented 2 years ago

[bug 1] has been fixed.

[bug 2] & [bug 3]: deleting resources is a dangerous operation, especially in a production environment, so it may be more appropriate for users to delete the resources manually.

[bug 4] has been fixed.
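The behaviour discussed for [bug 2] and [bug 3] (create missing broker groups on scale-up, never delete on scale-down) can be sketched as below. This is an illustration under assumptions, not the operator's implementation: `plan` and `planBrokerGroups` are hypothetical names, and the `broker-N` naming follows the pod names shown earlier in this thread.

```go
package main

import "fmt"

// plan lists what a reconcile pass would do for a desired broker-group count.
type plan struct {
	create  []string // broker groups to create
	orphans []string // groups beyond the desired size, left for manual cleanup
}

// planBrokerGroups compares the desired and existing group counts.
// It only ever schedules creations; surplus groups are reported, not deleted.
func planBrokerGroups(desired, existing int) plan {
	var p plan
	for i := existing; i < desired; i++ {
		p.create = append(p.create, fmt.Sprintf("broker-%d", i))
	}
	for i := desired; i < existing; i++ {
		p.orphans = append(p.orphans, fmt.Sprintf("broker-%d", i))
	}
	return p
}

func main() {
	fmt.Printf("scale 1 -> 2: %+v\n", planBrokerGroups(2, 1))
	fmt.Printf("scale 2 -> 1: %+v\n", planBrokerGroups(1, 2))
}
```

Reporting the surplus groups, for example in the resource status or the operator log, would at least make it visible that manual cleanup is pending.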