cetic / helm-nifi

Helm Chart for Apache Nifi
Apache License 2.0

[cetic/nifi] How to automatically scale up/scale down #31

Closed aluneau closed 4 years ago

aluneau commented 4 years ago

Hi,

I'm having trouble scaling my Apache NiFi instance up and down on Kubernetes. On the Kubernetes side everything looks fine, and scaling up works, but scaling down leaves a ghost node behind that I have to remove manually in the NiFi UI. Do you know a way to do this properly?

Thanks in advance!

devopsdymyr commented 4 years ago

Same issue for me as well. I increased the cluster to 6 nodes, and scaling down does not work; all of the PVCs are still present in my cluster.

aluneau commented 4 years ago

I tried modifying the preStop hook, without success. I don't know how to deal with this problem.

Do you have any ideas on where to investigate?
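For context, this is roughly where such a preStop hook would hang in the chart's StatefulSet. A minimal sketch only, with placeholder container name, image, and command (not the chart's actual values):

```yaml
# Sketch only: a preStop lifecycle hook on the NiFi container (all names are placeholders).
containers:
  - name: server
    image: apache/nifi:latest
    lifecycle:
      preStop:
        exec:
          command:
            - bash
            - -c
            - |
              # Placeholder: disconnect this node from the cluster before the pod stops.
              echo "preStop: decommissioning $(hostname)" >> /tmp/prestop.log
```

As discussed later in this thread, a hook on the server container alone turned out not to be enough.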

devopsdymyr commented 4 years ago

No idea, but I need to look into this to remove them.

devopsdymyr commented 4 years ago

@bloudman did you find any solution for this issue?

ceastman-ibm commented 4 years ago

@bloudman are your NiFi cluster nodes set up with TLS? I'm having a problem scaling past one node. As soon as the second node is running I get errors like this:

[apache-nifi-1 app-log] 2019-12-04 12:49:53,846 WARN [main] o.a.nifi.controller.StandardFlowService Failed to connect to cluster due to: org.apache.nifi.cluster.protocol.ProtocolException: Failed marshalling 'CONNECTION_REQUEST' protocol message due to: javax.net.ssl.SSLHandshakeException: sun.security.validator.ValidatorException: PKIX path validation failed: java.security.cert.CertPathValidatorException: Path does not chain with any of the trust anchors.

My multi-node cluster with TLS enabled is now working correctly. I do have an issue related to the OP's: when we patch our Kubernetes cluster, one of the NiFi nodes is shut down during the patching, which causes the flow to be locked until that node is started back up and reconnected to the cluster.

devopsdymyr commented 4 years ago

Is there any solution for this bug?

erdrix commented 4 years ago

Hi, to perform a clean scale down, I suggest the following steps:

  1. Create a dummy pod (to perform some actions from):
$ cat << EOF | kubectl apply -n default -f -
kind: Pod
apiVersion: v1
metadata:
  name: marks-dummy-pod
spec:
  containers:
    - name: marks-dummy-pod
      image: ubuntu
      command: ["/bin/bash", "-ec", "while :; do echo '.'; sleep 5 ; done"]
  restartPolicy: Never
EOF
  2. Once the pod is started, open a shell in it and fetch the client (git and python are not in the base ubuntu image):
$ kubectl -n default exec -it marks-dummy-pod -- /bin/bash
(pod)$ apt update && apt install -y git python
(pod)$ git clone https://github.com/erdrix/nifi-api-client-python.git
(pod)$ cd nifi-api-client-python
  3. Get the URL of your NiFi LoadBalancer service (outside the pod):
$ kubectl get service -n <nifi_namespace>
  4. Get the name of the last NiFi node in the StatefulSet (during a scale down the StatefulSet removes the node with the highest ordinal first):
(pod)$ python nifi-client-python.py --url http://<nifi-service-external-ip>:<nifi-service-port>/nifi-api --action cluster

Note: the node name should have the following pattern: <pod_name>.nifi-headless.<namespace_name>.svc.cluster.local.

  5. We are now ready to perform the scale down:
(pod)$ python nifi-client-python.py --url http://<nifi-service-external-ip>:<nifi-service-port>/nifi-api --action decommission --node <node_name> --nodePort <pod_port | by default : 8080>
(pod)$ python nifi-client-python.py --url http://<nifi-service-external-ip>:<nifi-service-port>/nifi-api --action remove --node <node_name>
$ kubectl scale -n <nifi_namespace> sts <nifi_statefulset_name> --replicas=<current_replicas_minus_1>

The first call disconnects the NiFi node from the cluster, stops all input processors, and waits until all queues are drained. The second call removes the node from the NiFi cluster configuration (which is stored in ZooKeeper). The last command performs the scale down itself.
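For reference, the same sequence can also be driven directly against the NiFi REST API (`/controller/cluster` and `/controller/cluster/nodes/{id}`). Below is a minimal sketch, not part of the chart: the service address and node name are placeholders, `jq` is assumed to be installed, and TLS/authentication are not handled.

```bash
#!/usr/bin/env bash
# Minimal sketch of the decommission sequence via the NiFi REST API (placeholders throughout).
NIFI="http://<nifi-service-external-ip>:<nifi-service-port>/nifi-api"
NODE_FQDN="<pod_name>.nifi-headless.<namespace_name>.svc.cluster.local"

# Look up the node ID of the node we want to remove.
NODE_ID=$(curl -s "$NIFI/controller/cluster" \
  | jq -r --arg addr "$NODE_FQDN" '.cluster.nodes[] | select(.address == $addr) | .nodeId')

# Disconnect the node from the cluster.
curl -s -X PUT -H 'Content-Type: application/json' \
  -d "{\"node\":{\"nodeId\":\"$NODE_ID\",\"status\":\"DISCONNECTING\"}}" \
  "$NIFI/controller/cluster/nodes/$NODE_ID"

# Offload it so its flow files are redistributed
# (in practice, wait for the node to report DISCONNECTED before offloading).
curl -s -X PUT -H 'Content-Type: application/json' \
  -d "{\"node\":{\"nodeId\":\"$NODE_ID\",\"status\":\"OFFLOADING\"}}" \
  "$NIFI/controller/cluster/nodes/$NODE_ID"

# Once offloading has finished, remove the node from the cluster configuration.
curl -s -X DELETE "$NIFI/controller/cluster/nodes/$NODE_ID"
```

Whichever route is used, the node should report OFFLOADED before it is deleted and the StatefulSet is scaled down.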

Note: this should really be performed by an operator (Cloudera has one in closed beta).

stoetti commented 4 years ago

@alexnuttinck is there a reason the commit with the fix for this has not yet been merged into master?

Asking because I need a solution for this issue and need to know if that is the way to go.

edit: I figured it out and posted a new comment with my idea for a solution.

banzo commented 4 years ago

He is currently on personal leave, but will be back in the office very soon.

stoetti commented 4 years ago

I tried to come up with a clean decommissioning of a node upon shutdown, as proposed in the NiFi documentation (https://nifi.apache.org/docs/nifi-docs/html/administration-guide.html#decommission-nodes).

To give a better understanding of why my solution looks the way it does, here is a short list of the steps that led to it:

- as soon as NiFi is shut down, the main process of the "server" container is stopped, and therefore a preStop hook for the server container itself does not work
- the output of the lifecycle hooks is not available through the k8s API, so I redirected the output to a temporary file and "tail" that file
- for easier development I extended the main template to support "extraContainers" declared via the values file
- the terminationGracePeriod of the pod is configurable via values

I created pull request #57 to start a discussion about the solution; any suggestions that make it better and more robust are welcome.
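For illustration only (this is not the actual content of PR #57), a sidecar wired in through the values file along the lines described above might look roughly like this; the container name, image, file path, and grace period are all assumptions:

```yaml
# values.yaml sketch (hypothetical): a sidecar whose preStop hook decommissions the
# node and whose main process tails the hook's log so it shows up in `kubectl logs`.
terminationGracePeriodSeconds: 300
extraContainers:
  - name: decommission-sidecar
    image: curlimages/curl:latest
    command: ["sh", "-c", "touch /tmp/decommission.log && tail -F /tmp/decommission.log"]
    lifecycle:
      preStop:
        exec:
          command:
            - sh
            - -c
            - |
              # Placeholder: call the NiFi REST API to disconnect/offload this node,
              # writing progress to the file tailed above.
              echo "decommissioning $(hostname)" >> /tmp/decommission.log
```

The real implementation and its trade-offs are discussed in PR #57.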

alexnuttinck commented 4 years ago

Hello @stoetti,

> @alexnuttinck is there a reason the commit with the fix for this has not yet been merged into master?

I think I still had some issues to correct before merging.

> I tried to come up with a clean decommissioning of a node upon shutdown, as proposed in the NiFi documentation (https://nifi.apache.org/docs/nifi-docs/html/administration-guide.html#decommission-nodes).

Good, it's the documentation I followed too.

> as soon as NiFi is shut down, the main process of the "server" container is stopped, and therefore a preStop hook for the server container itself does not work

Yes, it's the issue I had, thanks for the report!

> the output of the lifecycle hooks is not available through the k8s API, so I redirected the output to a temporary file and "tail" that file

It seems to be an option.

> for easier development I extended the main template to support "extraContainers" declared via the values file

Interesting. I think this extra container should live directly in the StatefulSet definition rather than in values.yaml, but proposing extraContainers in values.yaml is a good idea.
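For reference, the usual Helm pattern for such a passthrough is a small block in the StatefulSet template that renders whatever the user declares under `extraContainers` in values.yaml. A minimal sketch of that pattern (not this chart's actual template; the image value keys are assumptions):

```yaml
# templates/statefulset.yaml (sketch of the common Helm "extraContainers" pattern)
      containers:
        - name: server
          image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"
        {{- with .Values.extraContainers }}
        {{- toYaml . | nindent 8 }}
        {{- end }}
```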

> the terminationGracePeriod of the pod is configurable via values

Great.

> I created pull request #57 to start a discussion about the solution; any suggestions that make it better and more robust are welcome.

OK, I will have a look ASAP. Thanks for your investigation, @stoetti!

alexnuttinck commented 4 years ago

Thanks to @stoetti, this bug is now fixed. Please reopen this issue if you still have a problem.

alexnuttinck commented 4 years ago

The fix is merged in v0.4.2 of the Helm chart.
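To pick up the fix, upgrading the release to that chart version should be enough. A hedged example, assuming the chart is consumed from the cetic chart repository and the release is named `nifi` (adjust both to your setup):

```bash
# Assumes the cetic chart repository URL and a release named "nifi".
helm repo add cetic https://cetic.github.io/helm-charts
helm repo update
helm upgrade nifi cetic/nifi --version 0.4.2
```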

stoetti commented 4 years ago

The fix still has an issue: when the pod nifi-0 needs to restart, the decommissioning of that node does not work properly.

alexnuttinck commented 4 years ago

OK @stoetti, thanks for the report. I propose to reopen this issue to keep that in mind.