Orange-OpenSource / nifikop

The NiFiKop NiFi Kubernetes operator makes it easy to run Apache NiFi on Kubernetes. Apache NiFi is a free, open-source solution that supports powerful and scalable directed graphs of data routing, transformation, and system mediation logic.
https://orange-opensource.github.io/nifikop/
Apache License 2.0

Nifi scale up #139

Open iordaniordanov opened 2 years ago

iordaniordanov commented 2 years ago

Type of question

About general context and help around nifikop

Question

What did you do? Increased the number of nodes in the NifiCluster CR from 3 to 6 (by extending spec.nodes, as sketched after the config below).

What did you expect to see? The 3 new nodes to be created simultaneously and to join the cluster.

What did you see instead? Under which circumstances? The 3 new nodes were created simultaneously and joined the cluster, but after that they were re-created one by one, and only then was the cluster fully functional. This leads to a linear increase in the time needed to scale the cluster up: if adding one node takes 5 minutes, adding 2 nodes takes ~10 minutes, and so on. Is this the expected behavior, or is it an issue with our configuration/environment?

Environment

Additional context: NiFi cluster config

apiVersion: nifi.orange.com/v1alpha1
kind: NifiCluster
metadata:
  name: <name>
  namespace: <namespace>
spec:
  clusterImage: <image> # Nifi 1.12.1 image
  externalServices:
  - name: clusterip
    spec:
      portConfigs:
      - internalListenerName: http
        port: 8080
      type: ClusterIP
  initContainerImage: <busybox image>
  listenersConfig:
    internalListeners:
    - containerPort: 8080
      name: http
      type: http
    - containerPort: 6007
      name: cluster
      type: cluster
    - containerPort: 10000
      name: s2s
      type: s2s
    - containerPort: 9090
      name: prometheus
      type: prometheus
  nifiClusterTaskSpec:
    retryDurationMinutes: 10
  nodeConfigGroups:
    default_group:
      isNode: true
      resourcesRequirements:
        limits:
          cpu: "2"
          memory: 6Gi
        requests:
          cpu: "2"
          memory: 6Gi
      serviceAccountName: default
      storageConfigs:
      - mountPath: /opt/nifi/data
        name: data
        pvcSpec:
          accessModes:
          - ReadWriteOnce
          resources:
            requests:
              storage: 30Gi
          storageClassName: general
      - mountPath: /opt/nifi/content_repository
        name: content-repository
        pvcSpec:
          accessModes:
          - ReadWriteOnce
          resources:
            requests:
              storage: 2Gi
          storageClassName: general
      - mountPath: /opt/nifi/flowfile_repository
        name: flowfile-repository
        pvcSpec:
          accessModes:
          - ReadWriteOnce
          resources:
            requests:
              storage: 2Gi
          storageClassName: general
      - mountPath: /opt/nifi/provenance_repository
        name: provenance-repository
        pvcSpec:
          accessModes:
          - ReadWriteOnce
          resources:
            requests:
              storage: 2Gi
          storageClassName: general
      - mountPath: /opt/nifi/nifi-current/work
        name: work
        pvcSpec:
          accessModes:
          - ReadWriteOnce
          resources:
            requests:
              storage: 5Gi
          storageClassName: general
  nodes:
  - id: 0
    nodeConfigGroup: default_group
  - id: 1
    nodeConfigGroup: default_group
  - id: 2
    nodeConfigGroup: default_group
  oneNifiNodePerNode: true
  propagateLabels: true
  readOnlyConfig:
    bootstrapProperties:
      nifiJvmMemory: 2g
      overrideConfigs: |
        java.arg.debug=-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=8000
        conf.dir=./conf
    nifiProperties:
      overrideConfigs: |
        nifi.nar.library.autoload.directory=./extensions
        nifi.web.http.network.interface.default=eth0
        nifi.web.http.network.interface.lo=lo
        nifi.web.proxy.context.path=<proxy_path>
        nifi.database.directory=/opt/nifi/data/database_repository
        nifi.flow.configuration.archive.dir=/opt/nifi/data/archive
        nifi.flow.configuration.file=/opt/nifi/data/flow.xml.gz
        nifi.templates.directory=/opt/nifi/data/templates
        nifi.provenance.repository.max.storage.size=2GB
        nifi.provenance.repository.indexed.attributes=te$containerId,te$id
      webProxyHosts:
      - <proxy_host>
    zookeeperProperties: {}
  service:
    headlessEnabled: true
  zkAddress: <zk_addr>
  zkPath: <zk_path>
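
For reference, scaling from 3 to 6 was done by appending entries like these to spec.nodes in the config above (the ids shown here are illustrative):

spec:
  nodes:
  # existing nodes 0-2 unchanged
  - id: 3
    nodeConfigGroup: default_group
  - id: 4
    nodeConfigGroup: default_group
  - id: 5
    nodeConfigGroup: default_group
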
iordaniordanov commented 2 years ago

Hello, any info here?

erdrix commented 2 years ago

Hello, yes, this is the expected behaviour. We are forced to do it this way: if all of the initial cluster nodes were down, a newly joining node could decide that it is the reference, and in that case all information would be erased from the other nodes once they rejoin ... So we have an init script specific to new nodes, and once the node has explicitly joined the cluster, we need to restart the pod with a "non-joining" script: https://github.com/Orange-OpenSource/nifikop/blob/master/pkg/resources/nifi/pod.go#L392
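
Very roughly, the idea is the following (hypothetical names and paths, heavily simplified; the actual logic is in the pod.go line linked above):

package main

import "fmt"

// Illustrative sketch only. A node that has not yet completed its first join
// runs a one-off "join" bootstrap; once it has explicitly joined, the pod is
// re-created with the regular, non-joining start script so that a restarting
// node can never elect itself as the flow reference and wipe the existing flow.
const (
	joinScript    = "/opt/nifi/scripts/join-cluster.sh" // hypothetical path
	defaultScript = "/opt/nifi/scripts/start.sh"        // hypothetical path
)

// startupScript picks the bootstrap command for a node based on whether it
// has already joined the cluster once.
func startupScript(hasJoined bool) string {
	if !hasJoined {
		return joinScript
	}
	return defaultScript
}

func main() {
	fmt.Println("freshly added node:", startupScript(false)) // one-off join path
	fmt.Println("after pod re-creation:", startupScript(true))
}

This is also why each added node costs one extra pod re-creation.
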

iordaniordanov commented 2 years ago

Okay, thanks for the clarification :)

iordaniordanov commented 2 years ago

I'm sure you thought it through, but just a suggestion: maybe before scaling up you could check whether the cluster reports itself as healthy and, if it does not, abort the scale operation. Otherwise, if someone wants to add, let's say, 50 nodes because of a spike in usage, they would have to wait multiple hours before all nodes successfully join the cluster ...
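
A rough sketch of what I mean, assuming plain-HTTP access as in the config above and NiFi's GET /nifi-api/controller/cluster endpoint (the service URL below is made up):

package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// clusterEntity holds just the fields of the cluster summary we care about.
type clusterEntity struct {
	Cluster struct {
		Nodes []struct {
			Address string `json:"address"`
			Status  string `json:"status"`
		} `json:"nodes"`
	} `json:"cluster"`
}

// clusterHealthy returns true only if every node reports CONNECTED, so a
// controller could refuse (or postpone) a scale-up on an unhealthy cluster.
func clusterHealthy(baseURL string) (bool, error) {
	resp, err := http.Get(baseURL + "/nifi-api/controller/cluster")
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return false, fmt.Errorf("unexpected status: %s", resp.Status)
	}
	var entity clusterEntity
	if err := json.NewDecoder(resp.Body).Decode(&entity); err != nil {
		return false, err
	}
	for _, n := range entity.Cluster.Nodes {
		if n.Status != "CONNECTED" {
			return false, nil
		}
	}
	return len(entity.Cluster.Nodes) > 0, nil
}

func main() {
	ok, err := clusterHealthy("http://<name>-cluster-ip:8080") // hypothetical service name
	fmt.Println("cluster healthy:", ok, "err:", err)
}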