apache / incubator-heron

Apache Heron (Incubating) is a realtime, distributed, fault-tolerant stream processing engine from Twitter
https://heron.apache.org/
Apache License 2.0

[Heron-3724] Separate the Manager and Executors. #3741

Closed: surahman closed this pull request 2 years ago

surahman commented 2 years ago

Feature #3724: the Manager/driver pod should not use the same resource configuration as the other topology pods; it should be created as a separate deployment/service.

This PR builds on #3725 in order to test the Volume Claim configuration functionality in the Manager. Once #3725 is merged into master, I will hard-reset this PR's feature branch onto master (this may temporarily close the PR). After that, I will merge the dev branch into the feature branch again and resolve any merge conflicts that arise.

The following are the current features; I am soliciting input on all areas:

Usage

The command pattern is as follows: `heron.kubernetes.manager.[limits | requests].[OPTION]=[VALUE]`

The currently supported CLI options are `cpu` and `memory`:

`cpu` must be a natural number, and `memory` must be a positive decimal indicating a value in Gigabytes.
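For example, `limits.cpu=2` with `limits.memory=3`, and `requests.cpu=1` with `requests.memory=2`, yield the following `resources` block on the Manager container (this is the same block that appears in the generated StatefulSet dump further down):

```yaml
resources:
  limits:
    cpu: "2"
    memory: 3Gi
  requests:
    cpu: "1"
    memory: 2Gi
```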

Example:

```shell
~/bin/heron submit kubernetes ~/.heron/examples/heron-api-examples.jar \
org.apache.heron.examples.api.AckingTopology acking \
--verbose \
--config-property heron.kubernetes.pod.template.configmap.name=pod-templ-cf-map.pod-template.yaml \
--config-property heron.kubernetes.manager.limits.cpu=2 \
--config-property heron.kubernetes.manager.limits.memory=3 \
--config-property heron.kubernetes.manager.requests.cpu=1 \
--config-property heron.kubernetes.manager.requests.memory=2 \
--config-property heron.kubernetes.volumes.persistentVolumeClaim.dynamicvolume.claimName=OnDemand \
--config-property heron.kubernetes.volumes.persistentVolumeClaim.dynamicvolume.accessModes=ReadWriteOnce,ReadOnlyMany \
--config-property heron.kubernetes.volumes.persistentVolumeClaim.dynamicvolume.sizeLimit=256Gi \
--config-property heron.kubernetes.volumes.persistentVolumeClaim.dynamicvolume.volumeMode=Block \
--config-property heron.kubernetes.volumes.persistentVolumeClaim.dynamicvolume.path=path/to/mount/dynamic/volume \
--config-property heron.kubernetes.volumes.persistentVolumeClaim.dynamicvolume.subPath=sub/path/to/mount/dynamic/volume \
--config-property heron.kubernetes.volumes.persistentVolumeClaim.staticvolume.claimName=OnDemand \
--config-property heron.kubernetes.volumes.persistentVolumeClaim.staticvolume.storageClassName=storage-class-name \
--config-property heron.kubernetes.volumes.persistentVolumeClaim.staticvolume.accessModes=ReadWriteOnce,ReadOnlyMany \
--config-property heron.kubernetes.volumes.persistentVolumeClaim.staticvolume.sizeLimit=512Gi \
--config-property heron.kubernetes.volumes.persistentVolumeClaim.staticvolume.volumeMode=Block \
--config-property heron.kubernetes.volumes.persistentVolumeClaim.staticvolume.path=path/to/mount/static/volume \
--config-property heron.kubernetes.volumes.persistentVolumeClaim.staticvolume.subPath=sub/path/to/mount/static/volume \
--config-property heron.kubernetes.volumes.persistentVolumeClaim.sharedvolume.claimName=requested-claim-by-user \
--config-property heron.kubernetes.volumes.persistentVolumeClaim.sharedvolume.path=path/to/mount/shared/volume \
--config-property heron.kubernetes.volumes.persistentVolumeClaim.sharedvolume.subPath=sub/path/to/mount/shared/volume
```
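All of the PVC options above share the key pattern `heron.kubernetes.volumes.persistentVolumeClaim.[VOLUME NAME].[OPTION]`. As a rough illustration of that structure (plain POSIX parameter expansion on an example key, not Heron's actual parsing code):

```shell
# Hypothetical sketch of the PVC config key layout; Heron's real parser
# is not reproduced here.
key="heron.kubernetes.volumes.persistentVolumeClaim.dynamicvolume.claimName"

rest=${key#heron.kubernetes.volumes.persistentVolumeClaim.}  # strip the fixed prefix
volume=${rest%%.*}   # text up to the first remaining dot -> volume name
option=${rest#*.}    # text after the first remaining dot -> option name

echo "volume=${volume} option=${option}"  # volume=dynamicvolume option=claimName
```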
**Manager StatefulSet**

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  creationTimestamp: "2021-11-23T23:56:06Z"
  generation: 1
  labels:
    app: heron
    topology: acking
  name: acking-manager
  namespace: default
  resourceVersion: "1706"
  uid: 2117b2e9-248e-4d2c-a4cc-7ff1be45375f
spec:
  podManagementPolicy: Parallel
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: heron
      topology: acking
  serviceName: acking
  template:
    metadata:
      annotations:
        prometheus.io/port: "8080"
        prometheus.io/scrape: "true"
      creationTimestamp: null
      labels:
        app: heron
        topology: acking
    spec:
      containers:
      - command:
        - sh
        - -c
        - './heron-core/bin/heron-downloader-config kubernetes && ./heron-core/bin/heron-downloader distributedlog://zookeeper:2181/heronbkdl/acking-saad-tag-0-6322466307195919246.tar.gz . && SHARD_ID=${POD_NAME##*-} && echo shardId=${SHARD_ID} && ./heron-core/bin/heron-executor --topology-name=acking --topology-id=acking242db601-6bb5-4703-b2f3-f38f0a3f8a0c --topology-defn-file=acking.defn --state-manager-connection=zookeeper:2181 --state-manager-root=/heron --state-manager-config-file=./heron-conf/statemgr.yaml --tmanager-binary=./heron-core/bin/heron-tmanager --stmgr-binary=./heron-core/bin/heron-stmgr --metrics-manager-classpath=./heron-core/lib/metricsmgr/* --instance-jvm-opts="LVhYOitIZWFwRHVtcE9uT3V0T2ZNZW1vcnlFcnJvcg(61)(61)" --classpath=heron-api-examples.jar --heron-internals-config-file=./heron-conf/heron_internals.yaml --override-config-file=./heron-conf/override.yaml --component-ram-map=exclaim1:1073741824,word:1073741824 --component-jvm-opts="" --pkg-type=jar --topology-binary-file=heron-api-examples.jar --heron-java-home=$JAVA_HOME --heron-shell-binary=./heron-core/bin/heron-shell --cluster=kubernetes --role=saad --environment=default --instance-classpath=./heron-core/lib/instance/* --metrics-sinks-config-file=./heron-conf/metrics_sinks.yaml --scheduler-classpath=./heron-core/lib/scheduler/*:./heron-core/lib/packing/*:./heron-core/lib/statemgr/* --python-instance-binary=./heron-core/bin/heron-python-instance --cpp-instance-binary=./heron-core/bin/heron-cpp-instance --metricscache-manager-classpath=./heron-core/lib/metricscachemgr/* --metricscache-manager-mode=disabled --is-stateful=false --checkpoint-manager-classpath=./heron-core/lib/ckptmgr/*:./heron-core/lib/statefulstorage/*: --stateful-config-file=./heron-conf/stateful.yaml --checkpoint-manager-ram=1073741824 --health-manager-mode=disabled --health-manager-classpath=./heron-core/lib/healthmgr/* --shard=$SHARD_ID --server-port=6001 --tmanager-controller-port=6002 --tmanager-stats-port=6003 --shell-port=6004 --metrics-manager-port=6005 --scheduler-port=6006 --metricscache-manager-server-port=6007 --metricscache-manager-stats-port=6008 --checkpoint-manager-port=6009'
        env:
        - name: HOST
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: status.podIP
        - name: POD_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.name
        - name: var_one
          value: variable one
        - name: var_three
          value: variable three
        - name: var_two
          value: variable two
        image: apache/heron:testbuild
        imagePullPolicy: IfNotPresent
        name: manager
        ports:
        - containerPort: 5555
          name: tcp-port-kept
          protocol: TCP
        - containerPort: 5556
          name: udp-port-kept
          protocol: UDP
        - containerPort: 6001
          name: server
          protocol: TCP
        - containerPort: 6002
          name: tmanager-ctl
          protocol: TCP
        - containerPort: 6003
          name: tmanager-stats
          protocol: TCP
        - containerPort: 6004
          name: shell-port
          protocol: TCP
        - containerPort: 6005
          name: metrics-mgr
          protocol: TCP
        - containerPort: 6006
          name: scheduler
          protocol: TCP
        - containerPort: 6007
          name: metrics-cache-m
          protocol: TCP
        - containerPort: 6008
          name: metrics-cache-s
          protocol: TCP
        - containerPort: 6009
          name: ckptmgr
          protocol: TCP
        resources:
          limits:
            cpu: "2"
            memory: 3Gi
          requests:
            cpu: "1"
            memory: 2Gi
        securityContext:
          allowPrivilegeEscalation: false
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: path/to/mount/dynamic/volume
          name: dynamicvolume
          subPath: sub/path/to/mount/dynamic/volume
        - mountPath: /shared_volume
          name: shared-volume
        - mountPath: path/to/mount/shared/volume
          name: sharedvolume
          subPath: sub/path/to/mount/shared/volume
        - mountPath: path/to/mount/static/volume
          name: staticvolume
          subPath: sub/path/to/mount/static/volume
      - image: alpine
        imagePullPolicy: Always
        name: sidecar-container
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /shared_volume
          name: shared-volume
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 0
      tolerations:
      - effect: NoExecute
        key: node.kubernetes.io/not-ready
        operator: Exists
        tolerationSeconds: 10
      - effect: NoExecute
        key: node.kubernetes.io/unreachable
        operator: Exists
        tolerationSeconds: 10
      volumes:
      - emptyDir: {}
        name: shared-volume
      - name: sharedvolume
        persistentVolumeClaim:
          claimName: requested-claim-by-user
  updateStrategy:
    rollingUpdate:
      partition: 0
    type: RollingUpdate
  volumeClaimTemplates:
  - apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      creationTimestamp: null
      labels:
        onDemand: "true"
        topology: acking
      name: dynamicvolume
    spec:
      accessModes:
      - ReadWriteOnce
      - ReadOnlyMany
      resources:
        requests:
          storage: 256Gi
      volumeMode: Block
    status:
      phase: Pending
  - apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      creationTimestamp: null
      labels:
        onDemand: "true"
        topology: acking
      name: staticvolume
    spec:
      accessModes:
      - ReadWriteOnce
      - ReadOnlyMany
      resources:
        requests:
          storage: 512Gi
      storageClassName: storage-class-name
      volumeMode: Block
    status:
      phase: Pending
status:
  collisionCount: 0
  currentReplicas: 1
  currentRevision: acking-manager-7464c5697
  observedGeneration: 1
  replicas: 1
  updateRevision: acking-manager-7464c5697
  updatedReplicas: 1
```
**Executor StatefulSet**

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  creationTimestamp: "2021-11-23T23:56:06Z"
  generation: 1
  labels:
    app: heron
    topology: acking
  name: acking-executors
  namespace: default
  resourceVersion: "1704"
  uid: 73c3dfcf-2810-4060-8963-138a41d0d4c0
spec:
  podManagementPolicy: Parallel
  replicas: 2
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: heron
      topology: acking
  serviceName: acking
  template:
    metadata:
      annotations:
        prometheus.io/port: "8080"
        prometheus.io/scrape: "true"
      creationTimestamp: null
      labels:
        app: heron
        topology: acking
    spec:
      containers:
      - command:
        - sh
        - -c
        - './heron-core/bin/heron-downloader-config kubernetes && ./heron-core/bin/heron-downloader distributedlog://zookeeper:2181/heronbkdl/acking-saad-tag-0-6322466307195919246.tar.gz . && SHARD_ID=$((${POD_NAME##*-} + 1)) && echo shardId=${SHARD_ID} && ./heron-core/bin/heron-executor --topology-name=acking --topology-id=acking242db601-6bb5-4703-b2f3-f38f0a3f8a0c --topology-defn-file=acking.defn --state-manager-connection=zookeeper:2181 --state-manager-root=/heron --state-manager-config-file=./heron-conf/statemgr.yaml --tmanager-binary=./heron-core/bin/heron-tmanager --stmgr-binary=./heron-core/bin/heron-stmgr --metrics-manager-classpath=./heron-core/lib/metricsmgr/* --instance-jvm-opts="LVhYOitIZWFwRHVtcE9uT3V0T2ZNZW1vcnlFcnJvcg(61)(61)" --classpath=heron-api-examples.jar --heron-internals-config-file=./heron-conf/heron_internals.yaml --override-config-file=./heron-conf/override.yaml --component-ram-map=exclaim1:1073741824,word:1073741824 --component-jvm-opts="" --pkg-type=jar --topology-binary-file=heron-api-examples.jar --heron-java-home=$JAVA_HOME --heron-shell-binary=./heron-core/bin/heron-shell --cluster=kubernetes --role=saad --environment=default --instance-classpath=./heron-core/lib/instance/* --metrics-sinks-config-file=./heron-conf/metrics_sinks.yaml --scheduler-classpath=./heron-core/lib/scheduler/*:./heron-core/lib/packing/*:./heron-core/lib/statemgr/* --python-instance-binary=./heron-core/bin/heron-python-instance --cpp-instance-binary=./heron-core/bin/heron-cpp-instance --metricscache-manager-classpath=./heron-core/lib/metricscachemgr/* --metricscache-manager-mode=disabled --is-stateful=false --checkpoint-manager-classpath=./heron-core/lib/ckptmgr/*:./heron-core/lib/statefulstorage/*: --stateful-config-file=./heron-conf/stateful.yaml --checkpoint-manager-ram=1073741824 --health-manager-mode=disabled --health-manager-classpath=./heron-core/lib/healthmgr/* --shard=$SHARD_ID --server-port=6001 --tmanager-controller-port=6002 --tmanager-stats-port=6003 --shell-port=6004 --metrics-manager-port=6005 --scheduler-port=6006 --metricscache-manager-server-port=6007 --metricscache-manager-stats-port=6008 --checkpoint-manager-port=6009'
        env:
        - name: HOST
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: status.podIP
        - name: POD_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.name
        - name: var_one
          value: variable one
        - name: var_three
          value: variable three
        - name: var_two
          value: variable two
        image: apache/heron:testbuild
        imagePullPolicy: IfNotPresent
        name: executor
        ports:
        - containerPort: 5555
          name: tcp-port-kept
          protocol: TCP
        - containerPort: 5556
          name: udp-port-kept
          protocol: UDP
        - containerPort: 6001
          name: server
          protocol: TCP
        - containerPort: 6002
          name: tmanager-ctl
          protocol: TCP
        - containerPort: 6003
          name: tmanager-stats
          protocol: TCP
        - containerPort: 6004
          name: shell-port
          protocol: TCP
        - containerPort: 6005
          name: metrics-mgr
          protocol: TCP
        - containerPort: 6006
          name: scheduler
          protocol: TCP
        - containerPort: 6007
          name: metrics-cache-m
          protocol: TCP
        - containerPort: 6008
          name: metrics-cache-s
          protocol: TCP
        - containerPort: 6009
          name: ckptmgr
          protocol: TCP
        resources:
          limits:
            cpu: "3"
            memory: 4Gi
          requests:
            cpu: "3"
            memory: 4Gi
        securityContext:
          allowPrivilegeEscalation: false
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: path/to/mount/dynamic/volume
          name: dynamicvolume
          subPath: sub/path/to/mount/dynamic/volume
        - mountPath: /shared_volume
          name: shared-volume
        - mountPath: path/to/mount/shared/volume
          name: sharedvolume
          subPath: sub/path/to/mount/shared/volume
        - mountPath: path/to/mount/static/volume
          name: staticvolume
          subPath: sub/path/to/mount/static/volume
      - image: alpine
        imagePullPolicy: Always
        name: sidecar-container
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /shared_volume
          name: shared-volume
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 0
      tolerations:
      - effect: NoExecute
        key: node.kubernetes.io/not-ready
        operator: Exists
        tolerationSeconds: 10
      - effect: NoExecute
        key: node.kubernetes.io/unreachable
        operator: Exists
        tolerationSeconds: 10
      volumes:
      - emptyDir: {}
        name: shared-volume
      - name: sharedvolume
        persistentVolumeClaim:
          claimName: requested-claim-by-user
  updateStrategy:
    rollingUpdate:
      partition: 0
    type: RollingUpdate
  volumeClaimTemplates:
  - apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      creationTimestamp: null
      labels:
        onDemand: "true"
        topology: acking
      name: dynamicvolume
    spec:
      accessModes:
      - ReadWriteOnce
      - ReadOnlyMany
      resources:
        requests:
          storage: 256Gi
      volumeMode: Block
    status:
      phase: Pending
  - apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      creationTimestamp: null
      labels:
        onDemand: "true"
        topology: acking
      name: staticvolume
    spec:
      accessModes:
      - ReadWriteOnce
      - ReadOnlyMany
      resources:
        requests:
          storage: 512Gi
      storageClassName: storage-class-name
      volumeMode: Block
    status:
      phase: Pending
status:
  collisionCount: 0
  currentReplicas: 2
  currentRevision: acking-executors-6467c98557
  observedGeneration: 1
  replicas: 2
  updateRevision: acking-executors-6467c98557
  updatedReplicas: 2
```
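A detail worth noting in the two StatefulSets above: the Manager takes its shard ID directly from its pod ordinal, while the Executors offset their ordinal by one so that shard 0 stays reserved for the Manager. The relevant shell fragment from the generated container commands, in isolation:

```shell
# Extracted from the generated container commands above: the pod ordinal
# is the suffix of the pod name after the last dash.
MANAGER_POD=acking-manager-0
EXECUTOR_POD=acking-executors-0

MANAGER_SHARD=${MANAGER_POD##*-}                # -> 0
EXECUTOR_SHARD=$(( ${EXECUTOR_POD##*-} + 1 ))   # -> 1

echo "manager=${MANAGER_SHARD} executor=${EXECUTOR_SHARD}"
```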
surahman commented 2 years ago

Phase 2

Individual Pod Template loading for the Executors and Manager is now complete. Please note that the commands for loading Pod Templates have changed, so the old Pod Template docs are now mostly obsolete. Still on the TODO list are individual PVC and resource CLI commands.
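The per-component Pod Template values in the command below follow the `[CONFIGMAP NAME].[POD TEMPLATE NAME]` convention, with only the first dot separating the two parts (the template name itself may contain dots). A small sketch of that split, assuming this convention (not taken from the Heron source):

```shell
# Hypothetical sketch: split a pod template reference into its ConfigMap
# name and template key at the first dot.
ref="pod-templ-executor.pod-template-executor.yaml"

configmap=${ref%%.*}      # -> pod-templ-executor
template_key=${ref#*.}    # -> pod-template-executor.yaml

echo "${configmap} / ${template_key}"
```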

Commands:

```shell
~/bin/heron submit kubernetes ~/.heron/examples/heron-api-examples.jar \
org.apache.heron.examples.api.AckingTopology acking \
--verbose \
--deploy-deactivated \
--config-property heron.kubernetes.executor.pod.template=pod-templ-executor.pod-template-executor.yaml \
--config-property heron.kubernetes.manager.pod.template=pod-templ-manager.pod-template-manager.yaml \
--config-property heron.kubernetes.manager.limits.cpu=2 \
--config-property heron.kubernetes.manager.limits.memory=3 \
--config-property heron.kubernetes.manager.requests.cpu=1 \
--config-property heron.kubernetes.manager.requests.memory=2 \
--config-property heron.kubernetes.volumes.persistentVolumeClaim.dynamicvolume.claimName=OnDemand \
--config-property heron.kubernetes.volumes.persistentVolumeClaim.dynamicvolume.accessModes=ReadWriteOnce,ReadOnlyMany \
--config-property heron.kubernetes.volumes.persistentVolumeClaim.dynamicvolume.sizeLimit=256Gi \
--config-property heron.kubernetes.volumes.persistentVolumeClaim.dynamicvolume.volumeMode=Block \
--config-property heron.kubernetes.volumes.persistentVolumeClaim.dynamicvolume.path=path/to/mount/dynamic/volume \
--config-property heron.kubernetes.volumes.persistentVolumeClaim.dynamicvolume.subPath=sub/path/to/mount/dynamic/volume \
--config-property heron.kubernetes.volumes.persistentVolumeClaim.staticvolume.claimName=OnDemand \
--config-property heron.kubernetes.volumes.persistentVolumeClaim.staticvolume.storageClassName=storage-class-name \
--config-property heron.kubernetes.volumes.persistentVolumeClaim.staticvolume.accessModes=ReadWriteOnce,ReadOnlyMany \
--config-property heron.kubernetes.volumes.persistentVolumeClaim.staticvolume.sizeLimit=512Gi \
--config-property heron.kubernetes.volumes.persistentVolumeClaim.staticvolume.volumeMode=Block \
--config-property heron.kubernetes.volumes.persistentVolumeClaim.staticvolume.path=path/to/mount/static/volume \
--config-property heron.kubernetes.volumes.persistentVolumeClaim.staticvolume.subPath=sub/path/to/mount/static/volume \
--config-property heron.kubernetes.volumes.persistentVolumeClaim.sharedvolume.claimName=requested-claim-by-user \
--config-property heron.kubernetes.volumes.persistentVolumeClaim.sharedvolume.path=path/to/mount/shared/volume \
--config-property heron.kubernetes.volumes.persistentVolumeClaim.sharedvolume.subPath=sub/path/to/mount/shared/volume
```
**Executor StatefulSet**

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  creationTimestamp: "2021-11-29T00:14:41Z"
  generation: 1
  labels:
    app: heron
    topology: acking
  name: acking-executors
  namespace: default
  resourceVersion: "2512"
  uid: f4aa0815-256d-4c73-8ce7-68ff0bb26597
spec:
  podManagementPolicy: Parallel
  replicas: 2
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: heron
      topology: acking
  serviceName: acking
  template:
    metadata:
      annotations:
        prometheus.io/port: "8080"
        prometheus.io/scrape: "true"
      creationTimestamp: null
      labels:
        app: heron
        topology: acking
    spec:
      containers:
      - command:
        - sh
        - -c
        - './heron-core/bin/heron-downloader-config kubernetes && ./heron-core/bin/heron-downloader distributedlog://zookeeper:2181/heronbkdl/acking-saad-tag-0-7629098208556017113.tar.gz . && SHARD_ID=$((${POD_NAME##*-} + 1)) && echo shardId=${SHARD_ID} && ./heron-core/bin/heron-executor --topology-name=acking --topology-id=acking6e4ded2b-ee0a-40ac-90ad-8780645bda9a --topology-defn-file=acking.defn --state-manager-connection=zookeeper:2181 --state-manager-root=/heron --state-manager-config-file=./heron-conf/statemgr.yaml --tmanager-binary=./heron-core/bin/heron-tmanager --stmgr-binary=./heron-core/bin/heron-stmgr --metrics-manager-classpath=./heron-core/lib/metricsmgr/* --instance-jvm-opts="LVhYOitIZWFwRHVtcE9uT3V0T2ZNZW1vcnlFcnJvcg(61)(61)" --classpath=heron-api-examples.jar --heron-internals-config-file=./heron-conf/heron_internals.yaml --override-config-file=./heron-conf/override.yaml --component-ram-map=exclaim1:1073741824,word:1073741824 --component-jvm-opts="" --pkg-type=jar --topology-binary-file=heron-api-examples.jar --heron-java-home=$JAVA_HOME --heron-shell-binary=./heron-core/bin/heron-shell --cluster=kubernetes --role=saad --environment=default --instance-classpath=./heron-core/lib/instance/* --metrics-sinks-config-file=./heron-conf/metrics_sinks.yaml --scheduler-classpath=./heron-core/lib/scheduler/*:./heron-core/lib/packing/*:./heron-core/lib/statemgr/* --python-instance-binary=./heron-core/bin/heron-python-instance --cpp-instance-binary=./heron-core/bin/heron-cpp-instance --metricscache-manager-classpath=./heron-core/lib/metricscachemgr/* --metricscache-manager-mode=disabled --is-stateful=false --checkpoint-manager-classpath=./heron-core/lib/ckptmgr/*:./heron-core/lib/statefulstorage/*: --stateful-config-file=./heron-conf/stateful.yaml --checkpoint-manager-ram=1073741824 --health-manager-mode=disabled --health-manager-classpath=./heron-core/lib/healthmgr/* --shard=$SHARD_ID --server-port=6001 --tmanager-controller-port=6002 --tmanager-stats-port=6003 --shell-port=6004 --metrics-manager-port=6005 --scheduler-port=6006 --metricscache-manager-server-port=6007 --metricscache-manager-stats-port=6008 --checkpoint-manager-port=6009'
        env:
        - name: HOST
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: status.podIP
        - name: POD_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.name
        - name: var_one
          value: variable one
        - name: var_three
          value: variable three
        - name: var_two
          value: variable two
        image: apache/heron:testbuild
        imagePullPolicy: IfNotPresent
        name: executor
        ports:
        - containerPort: 5555
          name: tcp-port-kept
          protocol: TCP
        - containerPort: 5556
          name: udp-port-kept
          protocol: UDP
        - containerPort: 6001
          name: server
          protocol: TCP
        - containerPort: 6002
          name: tmanager-ctl
          protocol: TCP
        - containerPort: 6003
          name: tmanager-stats
          protocol: TCP
        - containerPort: 6004
          name: shell-port
          protocol: TCP
        - containerPort: 6005
          name: metrics-mgr
          protocol: TCP
        - containerPort: 6006
          name: scheduler
          protocol: TCP
        - containerPort: 6007
          name: metrics-cache-m
          protocol: TCP
        - containerPort: 6008
          name: metrics-cache-s
          protocol: TCP
        - containerPort: 6009
          name: ckptmgr
          protocol: TCP
        resources:
          limits:
            cpu: "3"
            memory: 4Gi
          requests:
            cpu: "3"
            memory: 4Gi
        securityContext:
          allowPrivilegeEscalation: false
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: path/to/mount/dynamic/volume
          name: dynamicvolume
          subPath: sub/path/to/mount/dynamic/volume
        - mountPath: /shared_volume
          name: shared-volume
        - mountPath: path/to/mount/shared/volume
          name: sharedvolume
          subPath: sub/path/to/mount/shared/volume
        - mountPath: path/to/mount/static/volume
          name: staticvolume
          subPath: sub/path/to/mount/static/volume
      - image: alpine
        imagePullPolicy: Always
        name: sidecar-container
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /shared_volume
          name: shared-volume
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 0
      tolerations:
      - effect: NoExecute
        key: node.kubernetes.io/not-ready
        operator: Exists
        tolerationSeconds: 10
      - effect: NoExecute
        key: node.kubernetes.io/unreachable
        operator: Exists
        tolerationSeconds: 10
      volumes:
      - emptyDir: {}
        name: shared-volume
      - name: sharedvolume
        persistentVolumeClaim:
          claimName: requested-claim-by-user
  updateStrategy:
    rollingUpdate:
      partition: 0
    type: RollingUpdate
  volumeClaimTemplates:
  - apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      creationTimestamp: null
      labels:
        onDemand: "true"
        topology: acking
      name: dynamicvolume
    spec:
      accessModes:
      - ReadWriteOnce
      - ReadOnlyMany
      resources:
        requests:
          storage: 256Gi
      volumeMode: Block
    status:
      phase: Pending
  - apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      creationTimestamp: null
      labels:
        onDemand: "true"
        topology: acking
      name: staticvolume
    spec:
      accessModes:
      - ReadWriteOnce
      - ReadOnlyMany
      resources:
        requests:
          storage: 512Gi
      storageClassName: storage-class-name
      volumeMode: Block
    status:
      phase: Pending
status:
  collisionCount: 0
  currentReplicas: 2
  currentRevision: acking-executors-bc4fd98c4
  observedGeneration: 1
  replicas: 2
  updateRevision: acking-executors-bc4fd98c4
  updatedReplicas: 2
```
**Manager StatefulSet**

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  creationTimestamp: "2021-11-29T00:14:41Z"
  generation: 1
  labels:
    app: heron
    topology: acking
  name: acking-manager
  namespace: default
  resourceVersion: "2513"
  uid: 4e8e0e7a-8d20-4a1f-8cac-db9319af1cec
spec:
  podManagementPolicy: Parallel
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: heron
      topology: acking
  serviceName: acking
  template:
    metadata:
      annotations:
        prometheus.io/port: "8080"
        prometheus.io/scrape: "true"
      creationTimestamp: null
      labels:
        app: heron
        topology: acking
    spec:
      containers:
      - command:
        - sh
        - -c
        - './heron-core/bin/heron-downloader-config kubernetes && ./heron-core/bin/heron-downloader distributedlog://zookeeper:2181/heronbkdl/acking-saad-tag-0-7629098208556017113.tar.gz . && SHARD_ID=${POD_NAME##*-} && echo shardId=${SHARD_ID} && ./heron-core/bin/heron-executor --topology-name=acking --topology-id=acking6e4ded2b-ee0a-40ac-90ad-8780645bda9a --topology-defn-file=acking.defn --state-manager-connection=zookeeper:2181 --state-manager-root=/heron --state-manager-config-file=./heron-conf/statemgr.yaml --tmanager-binary=./heron-core/bin/heron-tmanager --stmgr-binary=./heron-core/bin/heron-stmgr --metrics-manager-classpath=./heron-core/lib/metricsmgr/* --instance-jvm-opts="LVhYOitIZWFwRHVtcE9uT3V0T2ZNZW1vcnlFcnJvcg(61)(61)" --classpath=heron-api-examples.jar --heron-internals-config-file=./heron-conf/heron_internals.yaml --override-config-file=./heron-conf/override.yaml --component-ram-map=exclaim1:1073741824,word:1073741824 --component-jvm-opts="" --pkg-type=jar --topology-binary-file=heron-api-examples.jar --heron-java-home=$JAVA_HOME --heron-shell-binary=./heron-core/bin/heron-shell --cluster=kubernetes --role=saad --environment=default --instance-classpath=./heron-core/lib/instance/* --metrics-sinks-config-file=./heron-conf/metrics_sinks.yaml --scheduler-classpath=./heron-core/lib/scheduler/*:./heron-core/lib/packing/*:./heron-core/lib/statemgr/* --python-instance-binary=./heron-core/bin/heron-python-instance --cpp-instance-binary=./heron-core/bin/heron-cpp-instance --metricscache-manager-classpath=./heron-core/lib/metricscachemgr/* --metricscache-manager-mode=disabled --is-stateful=false --checkpoint-manager-classpath=./heron-core/lib/ckptmgr/*:./heron-core/lib/statefulstorage/*: --stateful-config-file=./heron-conf/stateful.yaml --checkpoint-manager-ram=1073741824 --health-manager-mode=disabled --health-manager-classpath=./heron-core/lib/healthmgr/* --shard=$SHARD_ID --server-port=6001 --tmanager-controller-port=6002 --tmanager-stats-port=6003 --shell-port=6004 --metrics-manager-port=6005 --scheduler-port=6006 --metricscache-manager-server-port=6007 --metricscache-manager-stats-port=6008 --checkpoint-manager-port=6009'
        env:
        - name: HOST
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: status.podIP
        - name: POD_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.name
        - name: var_one_manager
          value: variable one on manager
        - name: var_three_manager
          value: variable three on manager
        - name: var_two_manager
          value: variable two on manager
        image: apache/heron:testbuild
        imagePullPolicy: IfNotPresent
        name: manager
        ports:
        - containerPort: 6001
          name: server
          protocol: TCP
        - containerPort: 6002
          name: tmanager-ctl
          protocol: TCP
        - containerPort: 6003
          name: tmanager-stats
          protocol: TCP
        - containerPort: 6004
          name: shell-port
          protocol: TCP
        - containerPort: 6005
          name: metrics-mgr
          protocol: TCP
        - containerPort: 6006
          name: scheduler
          protocol: TCP
        - containerPort: 6007
          name: metrics-cache-m
          protocol: TCP
        - containerPort: 6008
          name: metrics-cache-s
          protocol: TCP
        - containerPort: 6009
          name: ckptmgr
          protocol: TCP
        - containerPort: 7775
          name: tcp-port-kept
          protocol: TCP
        - containerPort: 7776
          name: udp-port-kept
          protocol: UDP
        resources:
          limits:
            cpu: "2"
            memory: 3Gi
          requests:
            cpu: "1"
            memory: 2Gi
        securityContext:
          allowPrivilegeEscalation: false
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: path/to/mount/dynamic/volume
          name: dynamicvolume
          subPath: sub/path/to/mount/dynamic/volume
        - mountPath: /shared_volume/manager
          name: shared-volume-manager
        - mountPath: path/to/mount/shared/volume
          name: sharedvolume
          subPath: sub/path/to/mount/shared/volume
        - mountPath: path/to/mount/static/volume
          name: staticvolume
          subPath: sub/path/to/mount/static/volume
      - image: alpine
        imagePullPolicy: Always
        name: manager-sidecar-container
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /shared_volume/manager
          name: shared-volume-manager
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 0
      tolerations:
      - effect: NoExecute
        key: node.kubernetes.io/not-ready
        operator: Exists
        tolerationSeconds: 10
      - effect: NoExecute
        key: node.kubernetes.io/unreachable
        operator: Exists
        tolerationSeconds: 10
      volumes:
      - emptyDir: {}
        name: shared-volume-manager
      - name: sharedvolume
        persistentVolumeClaim:
          claimName: requested-claim-by-user
  updateStrategy:
    rollingUpdate:
      partition: 0
    type: RollingUpdate
  volumeClaimTemplates:
  - apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      creationTimestamp: null
      labels:
        onDemand: "true"
        topology: acking
      name: dynamicvolume
    spec:
      accessModes:
      - ReadWriteOnce
      - ReadOnlyMany
      resources:
        requests:
          storage: 256Gi
      volumeMode: Block
    status:
      phase: Pending
  - apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      creationTimestamp: null
      labels:
        onDemand: "true"
        topology: acking
      name: staticvolume
    spec:
      accessModes:
      - ReadWriteOnce
      - ReadOnlyMany
      resources:
        requests:
          storage: 512Gi
      storageClassName: storage-class-name
      volumeMode: Block
    status:
      phase: Pending
status:
  collisionCount: 0
  currentReplicas: 1
  currentRevision: acking-manager-677c8b875b
  observedGeneration: 1
  replicas: 1
  updateRevision: acking-manager-677c8b875b
  updatedReplicas: 1
```
surahman commented 2 years ago

Phase 2

Please note that the commands for loading Pod Templates and PVCs have changed, so the existing docs are now mostly obsolete.

Commands:

~/bin/heron submit kubernetes ~/.heron/examples/heron-api-examples.jar \
org.apache.heron.examples.api.AckingTopology acking \
--verbose \
--deploy-deactivated \
--config-property heron.kubernetes.executor.pod.template=pod-templ-executor.pod-template-executor.yaml \
--config-property heron.kubernetes.manager.pod.template=pod-templ-manager.pod-template-manager.yaml \
--config-property heron.kubernetes.manager.limits.cpu=2 \
--config-property heron.kubernetes.manager.limits.memory=3 \
--config-property heron.kubernetes.manager.requests.cpu=1 \
--config-property heron.kubernetes.manager.requests.memory=2 \
--config-property heron.kubernetes.executor.volumes.persistentVolumeClaim.executor-dynamic-volume.claimName=OnDemand \
--config-property heron.kubernetes.executor.volumes.persistentVolumeClaim.executor-dynamic-volume.accessModes=ReadWriteOnce,ReadOnlyMany \
--config-property heron.kubernetes.executor.volumes.persistentVolumeClaim.executor-dynamic-volume.sizeLimit=256Gi \
--config-property heron.kubernetes.executor.volumes.persistentVolumeClaim.executor-dynamic-volume.volumeMode=Block \
--config-property heron.kubernetes.executor.volumes.persistentVolumeClaim.executor-dynamic-volume.path=path/to/mount/dynamic/volume \
--config-property heron.kubernetes.executor.volumes.persistentVolumeClaim.executor-dynamic-volume.subPath=sub/path/to/mount/dynamic/volume \
--config-property heron.kubernetes.executor.volumes.persistentVolumeClaim.executor-static-volume.claimName=OnDemand \
--config-property heron.kubernetes.executor.volumes.persistentVolumeClaim.executor-static-volume.storageClassName=storage-class-name \
--config-property heron.kubernetes.executor.volumes.persistentVolumeClaim.executor-static-volume.accessModes=ReadWriteOnce,ReadOnlyMany \
--config-property heron.kubernetes.executor.volumes.persistentVolumeClaim.executor-static-volume.sizeLimit=512Gi \
--config-property heron.kubernetes.executor.volumes.persistentVolumeClaim.executor-static-volume.volumeMode=Block \
--config-property heron.kubernetes.executor.volumes.persistentVolumeClaim.executor-static-volume.path=path/to/mount/static/volume \
--config-property heron.kubernetes.executor.volumes.persistentVolumeClaim.executor-static-volume.subPath=sub/path/to/mount/static/volume \
--config-property heron.kubernetes.executor.volumes.persistentVolumeClaim.executor-shared-volume.claimName=requested-claim-by-user \
--config-property heron.kubernetes.executor.volumes.persistentVolumeClaim.executor-shared-volume.path=path/to/mount/shared/volume \
--config-property heron.kubernetes.executor.volumes.persistentVolumeClaim.executor-shared-volume.subPath=sub/path/to/mount/shared/volume \
--config-property heron.kubernetes.manager.volumes.persistentVolumeClaim.manager-dynamic-volume.claimName=OnDemand \
--config-property heron.kubernetes.manager.volumes.persistentVolumeClaim.manager-dynamic-volume.accessModes=ReadWriteOnce,ReadOnlyMany \
--config-property heron.kubernetes.manager.volumes.persistentVolumeClaim.manager-dynamic-volume.sizeLimit=256Gi \
--config-property heron.kubernetes.manager.volumes.persistentVolumeClaim.manager-dynamic-volume.volumeMode=Block \
--config-property heron.kubernetes.manager.volumes.persistentVolumeClaim.manager-dynamic-volume.path=path/to/mount/dynamic/volume \
--config-property heron.kubernetes.manager.volumes.persistentVolumeClaim.manager-dynamic-volume.subPath=sub/path/to/mount/dynamic/volume \
--config-property heron.kubernetes.manager.volumes.persistentVolumeClaim.manager-static-volume.claimName=OnDemand \
--config-property heron.kubernetes.manager.volumes.persistentVolumeClaim.manager-static-volume.storageClassName=storage-class-name \
--config-property heron.kubernetes.manager.volumes.persistentVolumeClaim.manager-static-volume.accessModes=ReadWriteOnce,ReadOnlyMany \
--config-property heron.kubernetes.manager.volumes.persistentVolumeClaim.manager-static-volume.sizeLimit=512Gi \
--config-property heron.kubernetes.manager.volumes.persistentVolumeClaim.manager-static-volume.volumeMode=Block \
--config-property heron.kubernetes.manager.volumes.persistentVolumeClaim.manager-static-volume.path=path/to/mount/static/volume \
--config-property heron.kubernetes.manager.volumes.persistentVolumeClaim.manager-static-volume.subPath=sub/path/to/mount/static/volume \
--config-property heron.kubernetes.manager.volumes.persistentVolumeClaim.manager-shared-volume.claimName=requested-claim-by-user \
--config-property heron.kubernetes.manager.volumes.persistentVolumeClaim.manager-shared-volume.path=path/to/mount/shared/volume \
--config-property heron.kubernetes.manager.volumes.persistentVolumeClaim.manager-shared-volume.subPath=sub/path/to/mount/shared/volume
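For reference, a `claimName=OnDemand` setting instructs the scheduler to generate a dynamic Persistent Volume Claim. Based on the generated manifests, the executor dynamic-volume flags above are expected to translate into a `volumeClaimTemplates` entry along these lines (a sketch of the relevant fields, not verbatim scheduler output):

```yaml
volumeClaimTemplates:
- apiVersion: v1
  kind: PersistentVolumeClaim
  metadata:
    labels:
      onDemand: "true"      # marks the PVC as scheduler-generated
      topology: acking
    name: executor-dynamic-volume
  spec:
    accessModes:
    - ReadWriteOnce
    - ReadOnlyMany
    resources:
      requests:
        storage: 256Gi      # from ...executor-dynamic-volume.sizeLimit
    volumeMode: Block       # from ...executor-dynamic-volume.volumeMode
```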
Executor StatefulSet ```yaml apiVersion: apps/v1 kind: StatefulSet metadata: creationTimestamp: "2021-11-30T00:08:01Z" generation: 1 labels: app: heron topology: acking name: acking-executors namespace: default resourceVersion: "1650" uid: 24e8e2fc-fc33-4189-996c-dce430bcc68f spec: podManagementPolicy: Parallel replicas: 2 revisionHistoryLimit: 10 selector: matchLabels: app: heron topology: acking serviceName: acking template: metadata: annotations: prometheus.io/port: "8080" prometheus.io/scrape: "true" creationTimestamp: null labels: app: heron topology: acking spec: containers: - command: - sh - -c - './heron-core/bin/heron-downloader-config kubernetes && ./heron-core/bin/heron-downloader distributedlog://zookeeper:2181/heronbkdl/acking-saad-tag-0--1632273069134658892.tar.gz . && SHARD_ID=$((${POD_NAME##*-} + 1)) && echo shardId=${SHARD_ID} && ./heron-core/bin/heron-executor --topology-name=acking --topology-id=acking60a8ecb7-e031-4afc-9bff-8a18703aef3a --topology-defn-file=acking.defn --state-manager-connection=zookeeper:2181 --state-manager-root=/heron --state-manager-config-file=./heron-conf/statemgr.yaml --tmanager-binary=./heron-core/bin/heron-tmanager --stmgr-binary=./heron-core/bin/heron-stmgr --metrics-manager-classpath=./heron-core/lib/metricsmgr/* --instance-jvm-opts="LVhYOitIZWFwRHVtcE9uT3V0T2ZNZW1vcnlFcnJvcg(61)(61)" --classpath=heron-api-examples.jar --heron-internals-config-file=./heron-conf/heron_internals.yaml --override-config-file=./heron-conf/override.yaml --component-ram-map=exclaim1:1073741824,word:1073741824 --component-jvm-opts="" --pkg-type=jar --topology-binary-file=heron-api-examples.jar --heron-java-home=$JAVA_HOME --heron-shell-binary=./heron-core/bin/heron-shell --cluster=kubernetes --role=saad --environment=default --instance-classpath=./heron-core/lib/instance/* --metrics-sinks-config-file=./heron-conf/metrics_sinks.yaml --scheduler-classpath=./heron-core/lib/scheduler/*:./heron-core/lib/packing/*:./heron-core/lib/statemgr/* 
--python-instance-binary=./heron-core/bin/heron-python-instance --cpp-instance-binary=./heron-core/bin/heron-cpp-instance --metricscache-manager-classpath=./heron-core/lib/metricscachemgr/* --metricscache-manager-mode=disabled --is-stateful=false --checkpoint-manager-classpath=./heron-core/lib/ckptmgr/*:./heron-core/lib/statefulstorage/*: --stateful-config-file=./heron-conf/stateful.yaml --checkpoint-manager-ram=1073741824 --health-manager-mode=disabled --health-manager-classpath=./heron-core/lib/healthmgr/* --shard=$SHARD_ID --server-port=6001 --tmanager-controller-port=6002 --tmanager-stats-port=6003 --shell-port=6004 --metrics-manager-port=6005 --scheduler-port=6006 --metricscache-manager-server-port=6007 --metricscache-manager-stats-port=6008 --checkpoint-manager-port=6009' env: - name: HOST valueFrom: fieldRef: apiVersion: v1 fieldPath: status.podIP - name: POD_NAME valueFrom: fieldRef: apiVersion: v1 fieldPath: metadata.name - name: var_one value: variable one - name: var_three value: variable three - name: var_two value: variable two image: apache/heron:testbuild imagePullPolicy: IfNotPresent name: executor ports: - containerPort: 5555 name: tcp-port-kept protocol: TCP - containerPort: 5556 name: udp-port-kept protocol: UDP - containerPort: 6001 name: server protocol: TCP - containerPort: 6002 name: tmanager-ctl protocol: TCP - containerPort: 6003 name: tmanager-stats protocol: TCP - containerPort: 6004 name: shell-port protocol: TCP - containerPort: 6005 name: metrics-mgr protocol: TCP - containerPort: 6006 name: scheduler protocol: TCP - containerPort: 6007 name: metrics-cache-m protocol: TCP - containerPort: 6008 name: metrics-cache-s protocol: TCP - containerPort: 6009 name: ckptmgr protocol: TCP resources: limits: cpu: "3" memory: 4Gi requests: cpu: "3" memory: 4Gi securityContext: allowPrivilegeEscalation: false terminationMessagePath: /dev/termination-log terminationMessagePolicy: File volumeMounts: - mountPath: path/to/mount/dynamic/volume name: 
executor-dynamic-volume subPath: sub/path/to/mount/dynamic/volume - mountPath: path/to/mount/shared/volume name: executor-shared-volume subPath: sub/path/to/mount/shared/volume - mountPath: path/to/mount/static/volume name: executor-static-volume subPath: sub/path/to/mount/static/volume - mountPath: /shared_volume name: shared-volume - image: alpine imagePullPolicy: Always name: sidecar-container resources: {} terminationMessagePath: /dev/termination-log terminationMessagePolicy: File volumeMounts: - mountPath: /shared_volume name: shared-volume dnsPolicy: ClusterFirst restartPolicy: Always schedulerName: default-scheduler securityContext: {} terminationGracePeriodSeconds: 0 tolerations: - effect: NoExecute key: node.kubernetes.io/not-ready operator: Exists tolerationSeconds: 10 - effect: NoExecute key: node.kubernetes.io/unreachable operator: Exists tolerationSeconds: 10 volumes: - name: executor-shared-volume persistentVolumeClaim: claimName: requested-claim-by-user - emptyDir: {} name: shared-volume updateStrategy: rollingUpdate: partition: 0 type: RollingUpdate volumeClaimTemplates: - apiVersion: v1 kind: PersistentVolumeClaim metadata: creationTimestamp: null labels: onDemand: "true" topology: acking name: executor-dynamic-volume spec: accessModes: - ReadWriteOnce - ReadOnlyMany resources: requests: storage: 256Gi volumeMode: Block status: phase: Pending - apiVersion: v1 kind: PersistentVolumeClaim metadata: creationTimestamp: null labels: onDemand: "true" topology: acking name: executor-static-volume spec: accessModes: - ReadWriteOnce - ReadOnlyMany resources: requests: storage: 512Gi storageClassName: storage-class-name volumeMode: Block status: phase: Pending status: collisionCount: 0 currentReplicas: 2 currentRevision: acking-executors-648bfd4494 observedGeneration: 1 replicas: 2 updateRevision: acking-executors-648bfd4494 updatedReplicas: 2 ```
Manager StatefulSet ```yaml apiVersion: apps/v1 kind: StatefulSet metadata: creationTimestamp: "2021-11-30T00:08:01Z" generation: 1 labels: app: heron topology: acking name: acking-manager namespace: default resourceVersion: "1637" uid: 84f96cb2-093a-47d7-8882-98cf7833219d spec: podManagementPolicy: Parallel replicas: 1 revisionHistoryLimit: 10 selector: matchLabels: app: heron topology: acking serviceName: acking template: metadata: annotations: prometheus.io/port: "8080" prometheus.io/scrape: "true" creationTimestamp: null labels: app: heron topology: acking spec: containers: - command: - sh - -c - './heron-core/bin/heron-downloader-config kubernetes && ./heron-core/bin/heron-downloader distributedlog://zookeeper:2181/heronbkdl/acking-saad-tag-0--1632273069134658892.tar.gz . && SHARD_ID=${POD_NAME##*-} && echo shardId=${SHARD_ID} && ./heron-core/bin/heron-executor --topology-name=acking --topology-id=acking60a8ecb7-e031-4afc-9bff-8a18703aef3a --topology-defn-file=acking.defn --state-manager-connection=zookeeper:2181 --state-manager-root=/heron --state-manager-config-file=./heron-conf/statemgr.yaml --tmanager-binary=./heron-core/bin/heron-tmanager --stmgr-binary=./heron-core/bin/heron-stmgr --metrics-manager-classpath=./heron-core/lib/metricsmgr/* --instance-jvm-opts="LVhYOitIZWFwRHVtcE9uT3V0T2ZNZW1vcnlFcnJvcg(61)(61)" --classpath=heron-api-examples.jar --heron-internals-config-file=./heron-conf/heron_internals.yaml --override-config-file=./heron-conf/override.yaml --component-ram-map=exclaim1:1073741824,word:1073741824 --component-jvm-opts="" --pkg-type=jar --topology-binary-file=heron-api-examples.jar --heron-java-home=$JAVA_HOME --heron-shell-binary=./heron-core/bin/heron-shell --cluster=kubernetes --role=saad --environment=default --instance-classpath=./heron-core/lib/instance/* --metrics-sinks-config-file=./heron-conf/metrics_sinks.yaml --scheduler-classpath=./heron-core/lib/scheduler/*:./heron-core/lib/packing/*:./heron-core/lib/statemgr/* 
--python-instance-binary=./heron-core/bin/heron-python-instance --cpp-instance-binary=./heron-core/bin/heron-cpp-instance --metricscache-manager-classpath=./heron-core/lib/metricscachemgr/* --metricscache-manager-mode=disabled --is-stateful=false --checkpoint-manager-classpath=./heron-core/lib/ckptmgr/*:./heron-core/lib/statefulstorage/*: --stateful-config-file=./heron-conf/stateful.yaml --checkpoint-manager-ram=1073741824 --health-manager-mode=disabled --health-manager-classpath=./heron-core/lib/healthmgr/* --shard=$SHARD_ID --server-port=6001 --tmanager-controller-port=6002 --tmanager-stats-port=6003 --shell-port=6004 --metrics-manager-port=6005 --scheduler-port=6006 --metricscache-manager-server-port=6007 --metricscache-manager-stats-port=6008 --checkpoint-manager-port=6009' env: - name: HOST valueFrom: fieldRef: apiVersion: v1 fieldPath: status.podIP - name: POD_NAME valueFrom: fieldRef: apiVersion: v1 fieldPath: metadata.name - name: var_one_manager value: variable one on manager - name: var_three_manager value: variable three on manager - name: var_two_manager value: variable two on manager image: apache/heron:testbuild imagePullPolicy: IfNotPresent name: manager ports: - containerPort: 6001 name: server protocol: TCP - containerPort: 6002 name: tmanager-ctl protocol: TCP - containerPort: 6003 name: tmanager-stats protocol: TCP - containerPort: 6004 name: shell-port protocol: TCP - containerPort: 6005 name: metrics-mgr protocol: TCP - containerPort: 6006 name: scheduler protocol: TCP - containerPort: 6007 name: metrics-cache-m protocol: TCP - containerPort: 6008 name: metrics-cache-s protocol: TCP - containerPort: 6009 name: ckptmgr protocol: TCP - containerPort: 7775 name: tcp-port-kept protocol: TCP - containerPort: 7776 name: udp-port-kept protocol: UDP resources: limits: cpu: "2" memory: 3Gi requests: cpu: "1" memory: 2Gi securityContext: allowPrivilegeEscalation: false terminationMessagePath: /dev/termination-log terminationMessagePolicy: File 
volumeMounts: - mountPath: path/to/mount/dynamic/volume name: manager-dynamic-volume subPath: sub/path/to/mount/dynamic/volume - mountPath: path/to/mount/shared/volume name: manager-shared-volume subPath: sub/path/to/mount/shared/volume - mountPath: path/to/mount/static/volume name: manager-static-volume subPath: sub/path/to/mount/static/volume - mountPath: /shared_volume/manager name: shared-volume-manager - image: alpine imagePullPolicy: Always name: manager-sidecar-container resources: {} terminationMessagePath: /dev/termination-log terminationMessagePolicy: File volumeMounts: - mountPath: /shared_volume/manager name: shared-volume-manager dnsPolicy: ClusterFirst restartPolicy: Always schedulerName: default-scheduler securityContext: {} terminationGracePeriodSeconds: 0 tolerations: - effect: NoExecute key: node.kubernetes.io/not-ready operator: Exists tolerationSeconds: 10 - effect: NoExecute key: node.kubernetes.io/unreachable operator: Exists tolerationSeconds: 10 volumes: - name: manager-shared-volume persistentVolumeClaim: claimName: requested-claim-by-user - emptyDir: {} name: shared-volume-manager updateStrategy: rollingUpdate: partition: 0 type: RollingUpdate volumeClaimTemplates: - apiVersion: v1 kind: PersistentVolumeClaim metadata: creationTimestamp: null labels: onDemand: "true" topology: acking name: manager-static-volume spec: accessModes: - ReadWriteOnce - ReadOnlyMany resources: requests: storage: 512Gi storageClassName: storage-class-name volumeMode: Block status: phase: Pending - apiVersion: v1 kind: PersistentVolumeClaim metadata: creationTimestamp: null labels: onDemand: "true" topology: acking name: manager-dynamic-volume spec: accessModes: - ReadWriteOnce - ReadOnlyMany resources: requests: storage: 256Gi volumeMode: Block status: phase: Pending status: collisionCount: 0 currentReplicas: 1 currentRevision: acking-manager-56cff7454d observedGeneration: 1 replicas: 1 updateRevision: acking-manager-56cff7454d updatedReplicas: 1 ```
surahman commented 2 years ago

This PR combines all the functionality to customize the Heron execution environment in Kubernetes.

The documentation at this point in the PR can be found here.

I have completed some deployment testing and this PR is now available for review and broader testing.

Submit command ```bash ~/bin/heron submit kubernetes ~/.heron/examples/heron-api-examples.jar \ org.apache.heron.examples.api.AckingTopology acking \ --verbose \ --config-property heron.kubernetes.executor.pod.template=pod-templ-executor.pod-template-executor.yaml \ --config-property heron.kubernetes.manager.pod.template=pod-templ-manager.pod-template-manager.yaml \ --config-property heron.kubernetes.manager.limits.cpu=2 \ --config-property heron.kubernetes.manager.limits.memory=3 \ --config-property heron.kubernetes.manager.requests.cpu=1 \ --config-property heron.kubernetes.manager.requests.memory=2 \ --config-property heron.kubernetes.executor.limits.cpu=5 \ --config-property heron.kubernetes.executor.limits.memory=6 \ --config-property heron.kubernetes.executor.requests.cpu=2 \ --config-property heron.kubernetes.executor.requests.memory=1 \ --config-property heron.kubernetes.executor.volumes.persistentVolumeClaim.executor-dynamic-volume.claimName=OnDemand \ --config-property heron.kubernetes.executor.volumes.persistentVolumeClaim.executor-dynamic-volume.accessModes=ReadWriteOnce,ReadOnlyMany \ --config-property heron.kubernetes.executor.volumes.persistentVolumeClaim.executor-dynamic-volume.sizeLimit=256Gi \ --config-property heron.kubernetes.executor.volumes.persistentVolumeClaim.executor-dynamic-volume.volumeMode=Block \ --config-property heron.kubernetes.executor.volumes.persistentVolumeClaim.executor-dynamic-volume.path=path/to/mount/dynamic/volume \ --config-property heron.kubernetes.executor.volumes.persistentVolumeClaim.executor-dynamic-volume.subPath=sub/path/to/mount/dynamic/volume \ --config-property heron.kubernetes.executor.volumes.persistentVolumeClaim.executor-static-volume.claimName=OnDemand \ --config-property heron.kubernetes.executor.volumes.persistentVolumeClaim.executor-static-volume.storageClassName=storage-class-name \ --config-property 
heron.kubernetes.executor.volumes.persistentVolumeClaim.executor-static-volume.accessModes=ReadWriteOnce,ReadOnlyMany \ --config-property heron.kubernetes.executor.volumes.persistentVolumeClaim.executor-static-volume.sizeLimit=512Gi \ --config-property heron.kubernetes.executor.volumes.persistentVolumeClaim.executor-static-volume.volumeMode=Block \ --config-property heron.kubernetes.executor.volumes.persistentVolumeClaim.executor-static-volume.path=path/to/mount/static/volume \ --config-property heron.kubernetes.executor.volumes.persistentVolumeClaim.executor-static-volume.subPath=sub/path/to/mount/static/volume \ --config-property heron.kubernetes.executor.volumes.persistentVolumeClaim.executor-shared-volume.claimName=requested-claim-by-user \ --config-property heron.kubernetes.executor.volumes.persistentVolumeClaim.executor-shared-volume.path=path/to/mount/shared/volume \ --config-property heron.kubernetes.executor.volumes.persistentVolumeClaim.executor-shared-volume.subPath=sub/path/to/mount/shared/volume \ --config-property heron.kubernetes.manager.volumes.persistentVolumeClaim.manager-dynamic-volume.claimName=OnDemand \ --config-property heron.kubernetes.manager.volumes.persistentVolumeClaim.manager-dynamic-volume.accessModes=ReadWriteOnce,ReadOnlyMany \ --config-property heron.kubernetes.manager.volumes.persistentVolumeClaim.manager-dynamic-volume.sizeLimit=256Gi \ --config-property heron.kubernetes.manager.volumes.persistentVolumeClaim.manager-dynamic-volume.volumeMode=Block \ --config-property heron.kubernetes.manager.volumes.persistentVolumeClaim.manager-dynamic-volume.path=path/to/mount/dynamic/volume \ --config-property heron.kubernetes.manager.volumes.persistentVolumeClaim.manager-dynamic-volume.subPath=sub/path/to/mount/dynamic/volume \ --config-property heron.kubernetes.manager.volumes.persistentVolumeClaim.manager-static-volume.claimName=OnDemand \ --config-property 
heron.kubernetes.manager.volumes.persistentVolumeClaim.manager-static-volume.storageClassName=storage-class-name \ --config-property heron.kubernetes.manager.volumes.persistentVolumeClaim.manager-static-volume.accessModes=ReadWriteOnce,ReadOnlyMany \ --config-property heron.kubernetes.manager.volumes.persistentVolumeClaim.manager-static-volume.sizeLimit=512Gi \ --config-property heron.kubernetes.manager.volumes.persistentVolumeClaim.manager-static-volume.volumeMode=Block \ --config-property heron.kubernetes.manager.volumes.persistentVolumeClaim.manager-static-volume.path=path/to/mount/static/volume \ --config-property heron.kubernetes.manager.volumes.persistentVolumeClaim.manager-static-volume.subPath=sub/path/to/mount/static/volume \ --config-property heron.kubernetes.manager.volumes.persistentVolumeClaim.manager-shared-volume.claimName=requested-claim-by-user \ --config-property heron.kubernetes.manager.volumes.persistentVolumeClaim.manager-shared-volume.path=path/to/mount/shared/volume \ --config-property heron.kubernetes.manager.volumes.persistentVolumeClaim.manager-shared-volume.subPath=sub/path/to/mount/shared/volume ```
Manager StatefulSet ```yaml apiVersion: apps/v1 kind: StatefulSet metadata: creationTimestamp: "2021-12-02T00:12:20Z" generation: 1 labels: app: heron topology: acking name: acking-manager namespace: default resourceVersion: "1216" uid: c823bb62-c798-46e2-8f7c-ec7f66a663ac spec: podManagementPolicy: Parallel replicas: 1 revisionHistoryLimit: 10 selector: matchLabels: app: heron topology: acking serviceName: acking template: metadata: annotations: prometheus.io/port: "8080" prometheus.io/scrape: "true" creationTimestamp: null labels: app: heron topology: acking spec: containers: - command: - sh - -c - './heron-core/bin/heron-downloader-config kubernetes && ./heron-core/bin/heron-downloader distributedlog://zookeeper:2181/heronbkdl/acking-saad-tag-0-1634139749345622293.tar.gz . && SHARD_ID=${POD_NAME##*-} && echo shardId=${SHARD_ID} && ./heron-core/bin/heron-executor --topology-name=acking --topology-id=acking92ff5e65-2f7c-42c1-b8f3-aa3d9e3847d6 --topology-defn-file=acking.defn --state-manager-connection=zookeeper:2181 --state-manager-root=/heron --state-manager-config-file=./heron-conf/statemgr.yaml --tmanager-binary=./heron-core/bin/heron-tmanager --stmgr-binary=./heron-core/bin/heron-stmgr --metrics-manager-classpath=./heron-core/lib/metricsmgr/* --instance-jvm-opts="LVhYOitIZWFwRHVtcE9uT3V0T2ZNZW1vcnlFcnJvcg(61)(61)" --classpath=heron-api-examples.jar --heron-internals-config-file=./heron-conf/heron_internals.yaml --override-config-file=./heron-conf/override.yaml --component-ram-map=exclaim1:1073741824,word:1073741824 --component-jvm-opts="" --pkg-type=jar --topology-binary-file=heron-api-examples.jar --heron-java-home=$JAVA_HOME --heron-shell-binary=./heron-core/bin/heron-shell --cluster=kubernetes --role=saad --environment=default --instance-classpath=./heron-core/lib/instance/* --metrics-sinks-config-file=./heron-conf/metrics_sinks.yaml --scheduler-classpath=./heron-core/lib/scheduler/*:./heron-core/lib/packing/*:./heron-core/lib/statemgr/* 
--python-instance-binary=./heron-core/bin/heron-python-instance --cpp-instance-binary=./heron-core/bin/heron-cpp-instance --metricscache-manager-classpath=./heron-core/lib/metricscachemgr/* --metricscache-manager-mode=disabled --is-stateful=false --checkpoint-manager-classpath=./heron-core/lib/ckptmgr/*:./heron-core/lib/statefulstorage/*: --stateful-config-file=./heron-conf/stateful.yaml --checkpoint-manager-ram=1073741824 --health-manager-mode=disabled --health-manager-classpath=./heron-core/lib/healthmgr/* --shard=$SHARD_ID --server-port=6001 --tmanager-controller-port=6002 --tmanager-stats-port=6003 --shell-port=6004 --metrics-manager-port=6005 --scheduler-port=6006 --metricscache-manager-server-port=6007 --metricscache-manager-stats-port=6008 --checkpoint-manager-port=6009' env: - name: HOST valueFrom: fieldRef: apiVersion: v1 fieldPath: status.podIP - name: POD_NAME valueFrom: fieldRef: apiVersion: v1 fieldPath: metadata.name - name: var_one_manager value: variable one on manager - name: var_three_manager value: variable three on manager - name: var_two_manager value: variable two on manager image: apache/heron:testbuild imagePullPolicy: IfNotPresent name: manager ports: - containerPort: 6001 name: server protocol: TCP - containerPort: 6002 name: tmanager-ctl protocol: TCP - containerPort: 6003 name: tmanager-stats protocol: TCP - containerPort: 6004 name: shell-port protocol: TCP - containerPort: 6005 name: metrics-mgr protocol: TCP - containerPort: 6006 name: scheduler protocol: TCP - containerPort: 6007 name: metrics-cache-m protocol: TCP - containerPort: 6008 name: metrics-cache-s protocol: TCP - containerPort: 6009 name: ckptmgr protocol: TCP - containerPort: 7775 name: tcp-port-kept protocol: TCP - containerPort: 7776 name: udp-port-kept protocol: UDP resources: limits: cpu: "2" memory: 3Mi requests: cpu: "1" memory: 2Mi securityContext: allowPrivilegeEscalation: false terminationMessagePath: /dev/termination-log terminationMessagePolicy: File 
volumeMounts: - mountPath: path/to/mount/dynamic/volume name: manager-dynamic-volume subPath: sub/path/to/mount/dynamic/volume - mountPath: path/to/mount/shared/volume name: manager-shared-volume subPath: sub/path/to/mount/shared/volume - mountPath: path/to/mount/static/volume name: manager-static-volume subPath: sub/path/to/mount/static/volume - mountPath: /shared_volume/manager name: shared-volume-manager - image: alpine imagePullPolicy: Always name: manager-sidecar-container resources: {} terminationMessagePath: /dev/termination-log terminationMessagePolicy: File volumeMounts: - mountPath: /shared_volume/manager name: shared-volume-manager dnsPolicy: ClusterFirst restartPolicy: Always schedulerName: default-scheduler securityContext: {} terminationGracePeriodSeconds: 0 tolerations: - effect: NoExecute key: node.kubernetes.io/not-ready operator: Exists tolerationSeconds: 10 - effect: NoExecute key: node.kubernetes.io/unreachable operator: Exists tolerationSeconds: 10 volumes: - name: manager-shared-volume persistentVolumeClaim: claimName: requested-claim-by-user - emptyDir: {} name: shared-volume-manager updateStrategy: rollingUpdate: partition: 0 type: RollingUpdate volumeClaimTemplates: - apiVersion: v1 kind: PersistentVolumeClaim metadata: creationTimestamp: null labels: onDemand: "true" topology: acking name: manager-static-volume spec: accessModes: - ReadWriteOnce - ReadOnlyMany resources: requests: storage: 512Gi storageClassName: storage-class-name volumeMode: Block status: phase: Pending - apiVersion: v1 kind: PersistentVolumeClaim metadata: creationTimestamp: null labels: onDemand: "true" topology: acking name: manager-dynamic-volume spec: accessModes: - ReadWriteOnce - ReadOnlyMany resources: requests: storage: 256Gi volumeMode: Block status: phase: Pending status: collisionCount: 0 currentReplicas: 1 currentRevision: acking-manager-7596cff587 observedGeneration: 1 replicas: 1 updateRevision: acking-manager-7596cff587 updatedReplicas: 1 ```
Executor StatefulSet ```yaml apiVersion: apps/v1 kind: StatefulSet metadata: creationTimestamp: "2021-12-02T00:12:20Z" generation: 1 labels: app: heron topology: acking name: acking-executors namespace: default resourceVersion: "1211" uid: 3ec133e2-591e-4864-b054-478021b8062d spec: podManagementPolicy: Parallel replicas: 2 revisionHistoryLimit: 10 selector: matchLabels: app: heron topology: acking serviceName: acking template: metadata: annotations: prometheus.io/port: "8080" prometheus.io/scrape: "true" creationTimestamp: null labels: app: heron topology: acking spec: containers: - command: - sh - -c - './heron-core/bin/heron-downloader-config kubernetes && ./heron-core/bin/heron-downloader distributedlog://zookeeper:2181/heronbkdl/acking-saad-tag-0-1634139749345622293.tar.gz . && SHARD_ID=$((${POD_NAME##*-} + 1)) && echo shardId=${SHARD_ID} && ./heron-core/bin/heron-executor --topology-name=acking --topology-id=acking92ff5e65-2f7c-42c1-b8f3-aa3d9e3847d6 --topology-defn-file=acking.defn --state-manager-connection=zookeeper:2181 --state-manager-root=/heron --state-manager-config-file=./heron-conf/statemgr.yaml --tmanager-binary=./heron-core/bin/heron-tmanager --stmgr-binary=./heron-core/bin/heron-stmgr --metrics-manager-classpath=./heron-core/lib/metricsmgr/* --instance-jvm-opts="LVhYOitIZWFwRHVtcE9uT3V0T2ZNZW1vcnlFcnJvcg(61)(61)" --classpath=heron-api-examples.jar --heron-internals-config-file=./heron-conf/heron_internals.yaml --override-config-file=./heron-conf/override.yaml --component-ram-map=exclaim1:1073741824,word:1073741824 --component-jvm-opts="" --pkg-type=jar --topology-binary-file=heron-api-examples.jar --heron-java-home=$JAVA_HOME --heron-shell-binary=./heron-core/bin/heron-shell --cluster=kubernetes --role=saad --environment=default --instance-classpath=./heron-core/lib/instance/* --metrics-sinks-config-file=./heron-conf/metrics_sinks.yaml --scheduler-classpath=./heron-core/lib/scheduler/*:./heron-core/lib/packing/*:./heron-core/lib/statemgr/* 
--python-instance-binary=./heron-core/bin/heron-python-instance --cpp-instance-binary=./heron-core/bin/heron-cpp-instance --metricscache-manager-classpath=./heron-core/lib/metricscachemgr/* --metricscache-manager-mode=disabled --is-stateful=false --checkpoint-manager-classpath=./heron-core/lib/ckptmgr/*:./heron-core/lib/statefulstorage/*: --stateful-config-file=./heron-conf/stateful.yaml --checkpoint-manager-ram=1073741824 --health-manager-mode=disabled --health-manager-classpath=./heron-core/lib/healthmgr/* --shard=$SHARD_ID --server-port=6001 --tmanager-controller-port=6002 --tmanager-stats-port=6003 --shell-port=6004 --metrics-manager-port=6005 --scheduler-port=6006 --metricscache-manager-server-port=6007 --metricscache-manager-stats-port=6008 --checkpoint-manager-port=6009' env: - name: HOST valueFrom: fieldRef: apiVersion: v1 fieldPath: status.podIP - name: POD_NAME valueFrom: fieldRef: apiVersion: v1 fieldPath: metadata.name - name: var_one value: variable one - name: var_three value: variable three - name: var_two value: variable two image: apache/heron:testbuild imagePullPolicy: IfNotPresent name: executor ports: - containerPort: 5555 name: tcp-port-kept protocol: TCP - containerPort: 5556 name: udp-port-kept protocol: UDP - containerPort: 6001 name: server protocol: TCP - containerPort: 6002 name: tmanager-ctl protocol: TCP - containerPort: 6003 name: tmanager-stats protocol: TCP - containerPort: 6004 name: shell-port protocol: TCP - containerPort: 6005 name: metrics-mgr protocol: TCP - containerPort: 6006 name: scheduler protocol: TCP - containerPort: 6007 name: metrics-cache-m protocol: TCP - containerPort: 6008 name: metrics-cache-s protocol: TCP - containerPort: 6009 name: ckptmgr protocol: TCP resources: limits: cpu: "5" memory: 6Mi requests: cpu: "2" memory: 1Mi securityContext: allowPrivilegeEscalation: false terminationMessagePath: /dev/termination-log terminationMessagePolicy: File volumeMounts: - mountPath: path/to/mount/dynamic/volume name: 
executor-dynamic-volume subPath: sub/path/to/mount/dynamic/volume - mountPath: path/to/mount/shared/volume name: executor-shared-volume subPath: sub/path/to/mount/shared/volume - mountPath: path/to/mount/static/volume name: executor-static-volume subPath: sub/path/to/mount/static/volume - mountPath: /shared_volume name: shared-volume - image: alpine imagePullPolicy: Always name: sidecar-container resources: {} terminationMessagePath: /dev/termination-log terminationMessagePolicy: File volumeMounts: - mountPath: /shared_volume name: shared-volume dnsPolicy: ClusterFirst restartPolicy: Always schedulerName: default-scheduler securityContext: {} terminationGracePeriodSeconds: 0 tolerations: - effect: NoExecute key: node.kubernetes.io/not-ready operator: Exists tolerationSeconds: 10 - effect: NoExecute key: node.kubernetes.io/unreachable operator: Exists tolerationSeconds: 10 volumes: - name: executor-shared-volume persistentVolumeClaim: claimName: requested-claim-by-user - emptyDir: {} name: shared-volume updateStrategy: rollingUpdate: partition: 0 type: RollingUpdate volumeClaimTemplates: - apiVersion: v1 kind: PersistentVolumeClaim metadata: creationTimestamp: null labels: onDemand: "true" topology: acking name: executor-dynamic-volume spec: accessModes: - ReadWriteOnce - ReadOnlyMany resources: requests: storage: 256Gi volumeMode: Block status: phase: Pending - apiVersion: v1 kind: PersistentVolumeClaim metadata: creationTimestamp: null labels: onDemand: "true" topology: acking name: executor-static-volume spec: accessModes: - ReadWriteOnce - ReadOnlyMany resources: requests: storage: 512Gi storageClassName: storage-class-name volumeMode: Block status: phase: Pending status: collisionCount: 0 currentReplicas: 2 currentRevision: acking-executors-68f9654bd9 observedGeneration: 1 replicas: 2 updateRevision: acking-executors-68f9654bd9 updatedReplicas: 2 ```
nicknezis commented 2 years ago

Tested a deployment of the acking topology without specifying any parameters. When trying to kill the topology, I received the following stack trace in the heron-apiserver:

```
[2021-12-02 20:10:38 +0000] [INFO] org.apache.heron.statemgr.zookeeper.curator.CuratorStateManager: Closing the tunnel processes
2021-12-02 20:10:38,231 ERROR org.apache.heron.apiserver.resources.TopologyResource kill qtp2045766957-13 error killing topology acking
java.lang.IllegalStateException: closed
    at okio.RealBufferedSource.select(RealBufferedSource.java:93)
    at okhttp3.internal.Util.bomAwareCharset(Util.java:467)
    at okhttp3.ResponseBody.string(ResponseBody.java:181)
    at org.apache.heron.scheduler.kubernetes.KubernetesUtils.errorMessageFromResponse(KubernetesUtils.java:77)
    at org.apache.heron.scheduler.kubernetes.V1Controller.deleteStatefulSets(V1Controller.java:313)
    at org.apache.heron.scheduler.kubernetes.V1Controller.killTopology(V1Controller.java:165)
    at org.apache.heron.scheduler.kubernetes.KubernetesScheduler.onKill(KubernetesScheduler.java:113)
    at org.apache.heron.scheduler.client.LibrarySchedulerClient.killTopology(LibrarySchedulerClient.java:61)
    at org.apache.heron.scheduler.RuntimeManagerRunner.killTopologyHandler(RuntimeManagerRunner.java:173)
    at org.apache.heron.scheduler.RuntimeManagerRunner.call(RuntimeManagerRunner.java:98)
    at org.apache.heron.scheduler.RuntimeManagerMain.callRuntimeManagerRunner(RuntimeManagerMain.java:498)
    at org.apache.heron.scheduler.RuntimeManagerMain.manageTopology(RuntimeManagerMain.java:411)
    at org.apache.heron.apiserver.actions.TopologyRuntimeAction.execute(TopologyRuntimeAction.java:39)
    at org.apache.heron.apiserver.resources.TopologyResource.kill(TopologyResource.java:498)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:566)
```

Edit: Perhaps this is due to a missing deletecollection verb on StatefulSet in the Role? I'll test this out later tonight.

surahman commented 2 years ago

Edit: Perhaps this is due to a missing deletecollection verb on StatefulSet in the Role? I'll test this out later tonight.

I can confirm that I am able to shut down a topology cleanly on Minikube when deploying without Helm charts. You might be using an old Helm chart, or I could be missing additional required permissions; the PR currently has the following:

```yaml
- apiGroups:
  - apps
  resources:
  - statefulsets
  verbs:
  - create
  - delete
  - get
  - list
  - patch
  - update
  - watch
  - deletecollection
```
nicknezis commented 2 years ago

Perhaps this is due to a missing deletecollection verb on StatefulSet in the Role? I'll test this out later tonight.

That fixed it. I'll push up a code edit shortly.

nicknezis commented 2 years ago

Ok, just realized that I was using an old Helm chart. The branch already has the fix. Apologies.

I think this is ready to be merged. If we want to lower the default Request/Limit values for the manager pod, we can always do that in a future PR. I'll leave the decision to @surahman. I think it would be nice to have, but having the ability to customize the values is a huge improvement.

windhamwong commented 2 years ago

Regarding `--config-property heron.kubernetes.manager.limits.memory=3 \`: just wondering, is there a better way to set the amount of memory? i.e., using KiB as the unit, 3000 instead of just 3 for 3 MiB?

surahman commented 2 years ago

Thank you @nicknezis and @windhamwong for taking the time to test and review this PR 😄.

I have pushed some fixes to the documentation but the code remains unchanged.

If we want to lower the default Request/Limit values for the manager pod, we can always do that in a future PR.

You would need a large and diverse dataset to form a baseline for the default resource values. I feel we would need to solicit statistics from users for topologies with varying configurations (bolts and spouts), as well as varying data velocity and volume, to form that baseline.

just wondering if there is a better way to set the amount of memory?

The reason I chose Gigabytes is that we typically use Gigabytes as the unit when working with memory (volatile and non-volatile), and we may need to work with fractions of a Gigabyte. I do not feel we need the granularity of Kilobytes when working with memory, and it would make the command tedious to use (e.g., 1,000,000 KB vs 1,000 MB for 1 Gigabyte). That said, changing the units would not be difficult: it would mean one change in production code and a handful in the test suite.

Edit: I think we are good to merge.

joshfischer1108 commented 2 years ago

Nice work @surahman. All, let's wait until 24 hours after @nicknezis's approval before merging.

nicknezis commented 2 years ago

If we want to lower the default Request/Limit values for the manager pod, we can always do that in a future PR.

You would need a large and diverse dataset to form a baseline for the default resource values. I feel we would need to solicit statistics from users for topologies with varying configurations (bolts and spouts) as well as data velocity and volume to form a baseline.

So the bolts, spouts, and stmgr processes all live in the executor pods; data velocity and the like is handled entirely in the executors. The manager just has a few processes that collect metrics and coordinate checkpointing (if used). Also, if the physical plan changes, it coordinates the changes through ZooKeeper. (It does more, but I am trying to give examples of the types of operations.)

I still think we can default to a smaller value with low risk, but I agree the risk is not zero. After this is merged, I'll do some analysis on workloads at work to see what could be a good default. The numbers @windhamwong provided also give me confidence because they match what I've observed. But I'll run a more rigorous process to capture numbers.

But what I said above doesn't counter the decision to merge as is. I agree with @surahman's decision.

surahman commented 2 years ago

I still think we can default to a smaller value with low risk, but I agree the risk is not zero. After this is merged, I'll do some analysis on workloads at work to see what could be a good default. The numbers @windhamwong provided also give me confidence because they match what I've observed. But I'll run a more rigorous process to capture numbers.

I agree. Once we are sure of the changes that need to be made, we can go ahead and figure out how to effect this change in a future PR.

I have one last set of typo corrections I am making to the documentation, which I will merge in later today.

surahman commented 2 years ago

@nicknezis I have updated the Resource CLI commands to support K8s-native units, allowing the specification of unit suffixes. Sorry, but this will need your approval again :confounded:. The updated documentation is here, and a sample command with the YAML output for the Executor and Manager StatefulSets is below.

This should fulfill @windhamwong's request for Kilobyte support via the k/Ki suffixes.
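For reference, these suffixes follow the standard Kubernetes resource-quantity grammar: decimal SI suffixes (k, M, G), binary IEC suffixes (Ki, Mi, Gi), and m for millicores. The following is a minimal sketch of the scaling semantics (illustrative only, not Heron's parser):

```python
from fractions import Fraction

# Kubernetes-style quantity suffixes and their multipliers. This is a
# simplified illustration of the unit grammar, not Heron's implementation.
SUFFIXES = {
    "m": Fraction(1, 1000),                   # millicores: cpu=2000m -> 2 cores
    "k": 10**3, "M": 10**6, "G": 10**9,       # decimal (SI) suffixes
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30,    # binary (IEC) suffixes
}

def parse_quantity(quantity: str) -> float:
    """Return the plain numeric value of a quantity such as '300Mi' or '2000m'."""
    # Try the two-character suffixes first so 'Mi' is not mistaken for 'M'.
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if quantity.endswith(suffix):
            return float(Fraction(quantity[:-len(suffix)]) * SUFFIXES[suffix])
    return float(quantity)  # no suffix: raw cores or bytes

assert parse_quantity("2000m") == 2.0            # limits.cpu=2000m equals cpu=2
assert parse_quantity("300Mi") == 300 * 2**20    # limits.memory=300Mi, in bytes
assert parse_quantity("3k") == 3000.0            # decimal kilobyte suffix
```

So, for example, `limits.memory=300Mi` describes 300 × 2^20 bytes, and `limits.cpu=2000m` is equivalent to `limits.cpu=2`.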

Commands ```bash ~/bin/heron submit kubernetes ~/.heron/examples/heron-api-examples.jar \ org.apache.heron.examples.api.AckingTopology acking \ --verbose \ --deploy-deactivated \ --config-property heron.kubernetes.executor.pod.template=pod-templ-executor.pod-template-executor.yaml \ --config-property heron.kubernetes.manager.pod.template=pod-templ-manager.pod-template-manager.yaml \ --config-property heron.kubernetes.manager.limits.cpu=2000m \ --config-property heron.kubernetes.manager.limits.memory=300Mi \ --config-property heron.kubernetes.manager.requests.cpu=1000m \ --config-property heron.kubernetes.manager.requests.memory=200Mi \ --config-property heron.kubernetes.executor.limits.cpu=5 \ --config-property heron.kubernetes.executor.limits.memory=6Gi \ --config-property heron.kubernetes.executor.requests.cpu=2 \ --config-property heron.kubernetes.executor.requests.memory=1Gi \ --config-property heron.kubernetes.executor.volumes.persistentVolumeClaim.executor-dynamic-volume.claimName=OnDemand \ --config-property heron.kubernetes.executor.volumes.persistentVolumeClaim.executor-dynamic-volume.accessModes=ReadWriteOnce,ReadOnlyMany \ --config-property heron.kubernetes.executor.volumes.persistentVolumeClaim.executor-dynamic-volume.sizeLimit=256Gi \ --config-property heron.kubernetes.executor.volumes.persistentVolumeClaim.executor-dynamic-volume.volumeMode=Block \ --config-property heron.kubernetes.executor.volumes.persistentVolumeClaim.executor-dynamic-volume.path=path/to/mount/dynamic/volume \ --config-property heron.kubernetes.executor.volumes.persistentVolumeClaim.executor-dynamic-volume.subPath=sub/path/to/mount/dynamic/volume \ --config-property heron.kubernetes.executor.volumes.persistentVolumeClaim.executor-static-volume.claimName=OnDemand \ --config-property heron.kubernetes.executor.volumes.persistentVolumeClaim.executor-static-volume.storageClassName=storage-class-name \ --config-property 
heron.kubernetes.executor.volumes.persistentVolumeClaim.executor-static-volume.accessModes=ReadWriteOnce,ReadOnlyMany \ --config-property heron.kubernetes.executor.volumes.persistentVolumeClaim.executor-static-volume.sizeLimit=512Gi \ --config-property heron.kubernetes.executor.volumes.persistentVolumeClaim.executor-static-volume.volumeMode=Block \ --config-property heron.kubernetes.executor.volumes.persistentVolumeClaim.executor-static-volume.path=path/to/mount/static/volume \ --config-property heron.kubernetes.executor.volumes.persistentVolumeClaim.executor-static-volume.subPath=sub/path/to/mount/static/volume \ --config-property heron.kubernetes.executor.volumes.persistentVolumeClaim.executor-shared-volume.claimName=requested-claim-by-user \ --config-property heron.kubernetes.executor.volumes.persistentVolumeClaim.executor-shared-volume.path=path/to/mount/shared/volume \ --config-property heron.kubernetes.executor.volumes.persistentVolumeClaim.executor-shared-volume.subPath=sub/path/to/mount/shared/volume \ --config-property heron.kubernetes.manager.volumes.persistentVolumeClaim.manager-dynamic-volume.claimName=OnDemand \ --config-property heron.kubernetes.manager.volumes.persistentVolumeClaim.manager-dynamic-volume.accessModes=ReadWriteOnce,ReadOnlyMany \ --config-property heron.kubernetes.manager.volumes.persistentVolumeClaim.manager-dynamic-volume.sizeLimit=256Gi \ --config-property heron.kubernetes.manager.volumes.persistentVolumeClaim.manager-dynamic-volume.volumeMode=Block \ --config-property heron.kubernetes.manager.volumes.persistentVolumeClaim.manager-dynamic-volume.path=path/to/mount/dynamic/volume \ --config-property heron.kubernetes.manager.volumes.persistentVolumeClaim.manager-dynamic-volume.subPath=sub/path/to/mount/dynamic/volume \ --config-property heron.kubernetes.manager.volumes.persistentVolumeClaim.manager-static-volume.claimName=OnDemand \ --config-property 
heron.kubernetes.manager.volumes.persistentVolumeClaim.manager-static-volume.storageClassName=storage-class-name \ --config-property heron.kubernetes.manager.volumes.persistentVolumeClaim.manager-static-volume.accessModes=ReadWriteOnce,ReadOnlyMany \ --config-property heron.kubernetes.manager.volumes.persistentVolumeClaim.manager-static-volume.sizeLimit=512Gi \ --config-property heron.kubernetes.manager.volumes.persistentVolumeClaim.manager-static-volume.volumeMode=Block \ --config-property heron.kubernetes.manager.volumes.persistentVolumeClaim.manager-static-volume.path=path/to/mount/static/volume \ --config-property heron.kubernetes.manager.volumes.persistentVolumeClaim.manager-static-volume.subPath=sub/path/to/mount/static/volume \ --config-property heron.kubernetes.manager.volumes.persistentVolumeClaim.manager-shared-volume.claimName=requested-claim-by-user \ --config-property heron.kubernetes.manager.volumes.persistentVolumeClaim.manager-shared-volume.path=path/to/mount/shared/volume \ --config-property heron.kubernetes.manager.volumes.persistentVolumeClaim.manager-shared-volume.subPath=sub/path/to/mount/shared/volume ```
Manager StatefulSet ```yaml apiVersion: apps/v1 kind: StatefulSet metadata: creationTimestamp: "2021-12-03T22:36:48Z" generation: 1 labels: app: heron topology: acking name: acking-manager namespace: default resourceVersion: "787" uid: d93e7e8d-e690-4e72-96bd-2b327fff9ecc spec: podManagementPolicy: Parallel replicas: 1 revisionHistoryLimit: 10 selector: matchLabels: app: heron topology: acking serviceName: acking template: metadata: annotations: prometheus.io/port: "8080" prometheus.io/scrape: "true" creationTimestamp: null labels: app: heron topology: acking spec: containers: - command: - sh - -c - './heron-core/bin/heron-downloader-config kubernetes && ./heron-core/bin/heron-downloader distributedlog://zookeeper:2181/heronbkdl/acking-saad-tag-0-1268791470655715640.tar.gz . && SHARD_ID=${POD_NAME##*-} && echo shardId=${SHARD_ID} && ./heron-core/bin/heron-executor --topology-name=acking --topology-id=acking5d5d16b0-7b36-4662-9690-658afec32555 --topology-defn-file=acking.defn --state-manager-connection=zookeeper:2181 --state-manager-root=/heron --state-manager-config-file=./heron-conf/statemgr.yaml --tmanager-binary=./heron-core/bin/heron-tmanager --stmgr-binary=./heron-core/bin/heron-stmgr --metrics-manager-classpath=./heron-core/lib/metricsmgr/* --instance-jvm-opts="LVhYOitIZWFwRHVtcE9uT3V0T2ZNZW1vcnlFcnJvcg(61)(61)" --classpath=heron-api-examples.jar --heron-internals-config-file=./heron-conf/heron_internals.yaml --override-config-file=./heron-conf/override.yaml --component-ram-map=exclaim1:1073741824,word:1073741824 --component-jvm-opts="" --pkg-type=jar --topology-binary-file=heron-api-examples.jar --heron-java-home=$JAVA_HOME --heron-shell-binary=./heron-core/bin/heron-shell --cluster=kubernetes --role=saad --environment=default --instance-classpath=./heron-core/lib/instance/* --metrics-sinks-config-file=./heron-conf/metrics_sinks.yaml --scheduler-classpath=./heron-core/lib/scheduler/*:./heron-core/lib/packing/*:./heron-core/lib/statemgr/* 
--python-instance-binary=./heron-core/bin/heron-python-instance --cpp-instance-binary=./heron-core/bin/heron-cpp-instance --metricscache-manager-classpath=./heron-core/lib/metricscachemgr/* --metricscache-manager-mode=disabled --is-stateful=false --checkpoint-manager-classpath=./heron-core/lib/ckptmgr/*:./heron-core/lib/statefulstorage/*: --stateful-config-file=./heron-conf/stateful.yaml --checkpoint-manager-ram=1073741824 --health-manager-mode=disabled --health-manager-classpath=./heron-core/lib/healthmgr/* --shard=$SHARD_ID --server-port=6001 --tmanager-controller-port=6002 --tmanager-stats-port=6003 --shell-port=6004 --metrics-manager-port=6005 --scheduler-port=6006 --metricscache-manager-server-port=6007 --metricscache-manager-stats-port=6008 --checkpoint-manager-port=6009' env: - name: HOST valueFrom: fieldRef: apiVersion: v1 fieldPath: status.podIP - name: POD_NAME valueFrom: fieldRef: apiVersion: v1 fieldPath: metadata.name - name: var_one_manager value: variable one on manager - name: var_three_manager value: variable three on manager - name: var_two_manager value: variable two on manager image: apache/heron:testbuild imagePullPolicy: IfNotPresent name: manager ports: - containerPort: 6001 name: server protocol: TCP - containerPort: 6002 name: tmanager-ctl protocol: TCP - containerPort: 6003 name: tmanager-stats protocol: TCP - containerPort: 6004 name: shell-port protocol: TCP - containerPort: 6005 name: metrics-mgr protocol: TCP - containerPort: 6006 name: scheduler protocol: TCP - containerPort: 6007 name: metrics-cache-m protocol: TCP - containerPort: 6008 name: metrics-cache-s protocol: TCP - containerPort: 6009 name: ckptmgr protocol: TCP - containerPort: 7775 name: tcp-port-kept protocol: TCP - containerPort: 7776 name: udp-port-kept protocol: UDP resources: limits: cpu: "2" memory: 300Mi requests: cpu: "1" memory: 200Mi securityContext: allowPrivilegeEscalation: false terminationMessagePath: /dev/termination-log terminationMessagePolicy: File 
volumeMounts: - mountPath: path/to/mount/dynamic/volume name: manager-dynamic-volume subPath: sub/path/to/mount/dynamic/volume - mountPath: path/to/mount/shared/volume name: manager-shared-volume subPath: sub/path/to/mount/shared/volume - mountPath: path/to/mount/static/volume name: manager-static-volume subPath: sub/path/to/mount/static/volume - mountPath: /shared_volume/manager name: shared-volume-manager - image: alpine imagePullPolicy: Always name: manager-sidecar-container resources: {} terminationMessagePath: /dev/termination-log terminationMessagePolicy: File volumeMounts: - mountPath: /shared_volume/manager name: shared-volume-manager dnsPolicy: ClusterFirst restartPolicy: Always schedulerName: default-scheduler securityContext: {} terminationGracePeriodSeconds: 0 tolerations: - effect: NoExecute key: node.kubernetes.io/not-ready operator: Exists tolerationSeconds: 10 - effect: NoExecute key: node.kubernetes.io/unreachable operator: Exists tolerationSeconds: 10 volumes: - name: manager-shared-volume persistentVolumeClaim: claimName: requested-claim-by-user - emptyDir: {} name: shared-volume-manager updateStrategy: rollingUpdate: partition: 0 type: RollingUpdate volumeClaimTemplates: - apiVersion: v1 kind: PersistentVolumeClaim metadata: creationTimestamp: null labels: onDemand: "true" topology: acking name: manager-static-volume spec: accessModes: - ReadWriteOnce - ReadOnlyMany resources: requests: storage: 512Gi storageClassName: storage-class-name volumeMode: Block status: phase: Pending - apiVersion: v1 kind: PersistentVolumeClaim metadata: creationTimestamp: null labels: onDemand: "true" topology: acking name: manager-dynamic-volume spec: accessModes: - ReadWriteOnce - ReadOnlyMany resources: requests: storage: 256Gi volumeMode: Block status: phase: Pending status: collisionCount: 0 currentReplicas: 1 currentRevision: acking-manager-5f576f75cc observedGeneration: 1 replicas: 1 updateRevision: acking-manager-5f576f75cc updatedReplicas: 1 ```
Executor StatefulSet ```yaml apiVersion: apps/v1 kind: StatefulSet metadata: creationTimestamp: "2021-12-03T22:36:48Z" generation: 1 labels: app: heron topology: acking name: acking-executors namespace: default resourceVersion: "789" uid: ce141b3b-b7f6-43ba-8442-57c63b528be3 spec: podManagementPolicy: Parallel replicas: 2 revisionHistoryLimit: 10 selector: matchLabels: app: heron topology: acking serviceName: acking template: metadata: annotations: prometheus.io/port: "8080" prometheus.io/scrape: "true" creationTimestamp: null labels: app: heron topology: acking spec: containers: - command: - sh - -c - './heron-core/bin/heron-downloader-config kubernetes && ./heron-core/bin/heron-downloader distributedlog://zookeeper:2181/heronbkdl/acking-saad-tag-0-1268791470655715640.tar.gz . && SHARD_ID=$((${POD_NAME##*-} + 1)) && echo shardId=${SHARD_ID} && ./heron-core/bin/heron-executor --topology-name=acking --topology-id=acking5d5d16b0-7b36-4662-9690-658afec32555 --topology-defn-file=acking.defn --state-manager-connection=zookeeper:2181 --state-manager-root=/heron --state-manager-config-file=./heron-conf/statemgr.yaml --tmanager-binary=./heron-core/bin/heron-tmanager --stmgr-binary=./heron-core/bin/heron-stmgr --metrics-manager-classpath=./heron-core/lib/metricsmgr/* --instance-jvm-opts="LVhYOitIZWFwRHVtcE9uT3V0T2ZNZW1vcnlFcnJvcg(61)(61)" --classpath=heron-api-examples.jar --heron-internals-config-file=./heron-conf/heron_internals.yaml --override-config-file=./heron-conf/override.yaml --component-ram-map=exclaim1:1073741824,word:1073741824 --component-jvm-opts="" --pkg-type=jar --topology-binary-file=heron-api-examples.jar --heron-java-home=$JAVA_HOME --heron-shell-binary=./heron-core/bin/heron-shell --cluster=kubernetes --role=saad --environment=default --instance-classpath=./heron-core/lib/instance/* --metrics-sinks-config-file=./heron-conf/metrics_sinks.yaml --scheduler-classpath=./heron-core/lib/scheduler/*:./heron-core/lib/packing/*:./heron-core/lib/statemgr/* 
--python-instance-binary=./heron-core/bin/heron-python-instance --cpp-instance-binary=./heron-core/bin/heron-cpp-instance --metricscache-manager-classpath=./heron-core/lib/metricscachemgr/* --metricscache-manager-mode=disabled --is-stateful=false --checkpoint-manager-classpath=./heron-core/lib/ckptmgr/*:./heron-core/lib/statefulstorage/*: --stateful-config-file=./heron-conf/stateful.yaml --checkpoint-manager-ram=1073741824 --health-manager-mode=disabled --health-manager-classpath=./heron-core/lib/healthmgr/* --shard=$SHARD_ID --server-port=6001 --tmanager-controller-port=6002 --tmanager-stats-port=6003 --shell-port=6004 --metrics-manager-port=6005 --scheduler-port=6006 --metricscache-manager-server-port=6007 --metricscache-manager-stats-port=6008 --checkpoint-manager-port=6009' env: - name: HOST valueFrom: fieldRef: apiVersion: v1 fieldPath: status.podIP - name: POD_NAME valueFrom: fieldRef: apiVersion: v1 fieldPath: metadata.name - name: var_one value: variable one - name: var_three value: variable three - name: var_two value: variable two image: apache/heron:testbuild imagePullPolicy: IfNotPresent name: executor ports: - containerPort: 5555 name: tcp-port-kept protocol: TCP - containerPort: 5556 name: udp-port-kept protocol: UDP - containerPort: 6001 name: server protocol: TCP - containerPort: 6002 name: tmanager-ctl protocol: TCP - containerPort: 6003 name: tmanager-stats protocol: TCP - containerPort: 6004 name: shell-port protocol: TCP - containerPort: 6005 name: metrics-mgr protocol: TCP - containerPort: 6006 name: scheduler protocol: TCP - containerPort: 6007 name: metrics-cache-m protocol: TCP - containerPort: 6008 name: metrics-cache-s protocol: TCP - containerPort: 6009 name: ckptmgr protocol: TCP resources: limits: cpu: "5" memory: 6Gi requests: cpu: "2" memory: 1Gi securityContext: allowPrivilegeEscalation: false terminationMessagePath: /dev/termination-log terminationMessagePolicy: File volumeMounts: - mountPath: path/to/mount/dynamic/volume name: 
executor-dynamic-volume subPath: sub/path/to/mount/dynamic/volume - mountPath: path/to/mount/shared/volume name: executor-shared-volume subPath: sub/path/to/mount/shared/volume - mountPath: path/to/mount/static/volume name: executor-static-volume subPath: sub/path/to/mount/static/volume - mountPath: /shared_volume name: shared-volume - image: alpine imagePullPolicy: Always name: sidecar-container resources: {} terminationMessagePath: /dev/termination-log terminationMessagePolicy: File volumeMounts: - mountPath: /shared_volume name: shared-volume dnsPolicy: ClusterFirst restartPolicy: Always schedulerName: default-scheduler securityContext: {} terminationGracePeriodSeconds: 0 tolerations: - effect: NoExecute key: node.kubernetes.io/not-ready operator: Exists tolerationSeconds: 10 - effect: NoExecute key: node.kubernetes.io/unreachable operator: Exists tolerationSeconds: 10 volumes: - name: executor-shared-volume persistentVolumeClaim: claimName: requested-claim-by-user - emptyDir: {} name: shared-volume updateStrategy: rollingUpdate: partition: 0 type: RollingUpdate volumeClaimTemplates: - apiVersion: v1 kind: PersistentVolumeClaim metadata: creationTimestamp: null labels: onDemand: "true" topology: acking name: executor-dynamic-volume spec: accessModes: - ReadWriteOnce - ReadOnlyMany resources: requests: storage: 256Gi volumeMode: Block status: phase: Pending - apiVersion: v1 kind: PersistentVolumeClaim metadata: creationTimestamp: null labels: onDemand: "true" topology: acking name: executor-static-volume spec: accessModes: - ReadWriteOnce - ReadOnlyMany resources: requests: storage: 512Gi storageClassName: storage-class-name volumeMode: Block status: phase: Pending status: collisionCount: 0 currentReplicas: 2 currentRevision: acking-executors-675c888b5 observedGeneration: 1 replicas: 2 updateRevision: acking-executors-675c888b5 updatedReplicas: 2 ```
windhamwong commented 2 years ago

Thanks. Let me test out over the weekend.

nicknezis commented 2 years ago

I ran a test with the acking topology. I was able to submit with the defaults and also to override the values. I think it's good to merge.

surahman commented 2 years ago

Thank you, Nick, let us give Windham some time to test this out before merging.

windhamwong commented 2 years ago

I'm testing the build and wondering about the need for `heron.kubernetes.executor.pod.template=pod-templ-executor.pod-template-executor.yaml`. Is this necessary for deployment, and where is this file located?

windhamwong commented 2 years ago

I don't have a problem with the PVC, but just wondering: do we need

```bash
--config-property heron.kubernetes.executor.limits.cpu=5 \
--config-property heron.kubernetes.executor.limits.memory=6Gi \
--config-property heron.kubernetes.executor.requests.cpu=2 \
--config-property heron.kubernetes.executor.requests.memory=1Gi \
```

as we already have

```
constants.TOPOLOGY_CONTAINER_CPU_REQUESTED
constants.TOPOLOGY_CONTAINER_RAM_REQUESTED
```

in each topology.

Will there be a way to implement the config in the topology constants so we can change the values for a specific topology?

windhamwong commented 2 years ago

I think we have another bug here. Since the Python version used under your PR branch is 3.8, it warns about the Python library kazoo (used for the ZooKeeper connection). Heron Tracker uses kazoo 2.7.0, but it is not compatible with Python 3.8, so kazoo has to be upgraded to 2.8.0. Warning:

```
/root/.pex/installed_wheels/6e40458c80f1b6a2bb9c38603c9fe8a17f0aa169/kazoo-2.7.0-py2.py3-none-any.whl/kazoo/protocol/serialization.py:114: SyntaxWarning: "is" with a literal. Did you mean "=="?
  read_only = bool_struct.unpack_from(bytes, offset)[0] is 1
/root/.pex/installed_wheels/6e40458c80f1b6a2bb9c38603c9fe8a17f0aa169/kazoo-2.7.0-py2.py3-none-any.whl/kazoo/protocol/serialization.py:449: SyntaxWarning: "is" with a literal. Did you mean "=="?
```

Shall have another PR for this :D
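For context on the warning itself: `is` compares object identity while `==` compares values, so `... is 1` only happens to work because CPython interns small integers. A minimal standalone illustration (generic Python, not kazoo's actual code):

```python
def read_only_flag_bad(value):
    # Identity test against a literal; Python 3.8+ emits the SyntaxWarning shown above.
    return value is 1

def read_only_flag_good(value):
    # Value test; always correct regardless of interning.
    return value == 1

# 1.0 equals 1 by value but is never the same object as the int literal:
assert read_only_flag_good(1.0) is True
assert read_only_flag_bad(1.0) is False
```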

surahman commented 2 years ago

I'm testing the build and wondering about the need for `heron.kubernetes.executor.pod.template=pod-templ-executor.pod-template-executor.yaml`. Is this necessary for deployment, and where is this file located?

@windhamwong It is not required; if it is not supplied, default Executor and Manager pods will be deployed. You will need to load the Pod Template into a ConfigMap. Please see the first section of the documentation for usage and details. This functionality was initially introduced in an earlier PR.

I don't have a problem with the PVC, but just wondering: do we need

```bash
--config-property heron.kubernetes.executor.limits.cpu=5 \
--config-property heron.kubernetes.executor.limits.memory=6Gi \
--config-property heron.kubernetes.executor.requests.cpu=2 \
--config-property heron.kubernetes.executor.requests.memory=1Gi \
```

as we already have

```
constants.TOPOLOGY_CONTAINER_CPU_REQUESTED
constants.TOPOLOGY_CONTAINER_RAM_REQUESTED
```

in each topology.

Will there be a way to implement the config in the topology constants so we can change the values for a specific topology?

This functionality mirrors what is available in Spark and permits deploy-time tweaking of resources without having to repackage your topology's JAR file. The ability to configure resources via configs remains unchanged, but the CLI commands take precedence to facilitate the aforementioned functionality.
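That precedence can be sketched as a simple map merge; the keys below mimic the CLI property names, and none of the identifiers are Heron's actual implementation:

```python
# Hypothetical sketch: deploy-time CLI properties override the resource
# values packaged with the topology, while unset values fall through.
topology_config = {
    "heron.kubernetes.executor.requests.cpu": "1",
    "heron.kubernetes.executor.requests.memory": "2Gi",
}
cli_overrides = {
    "heron.kubernetes.executor.requests.cpu": "2",  # tweaked at submit time
}

effective = {**topology_config, **cli_overrides}  # CLI wins on conflicts

assert effective["heron.kubernetes.executor.requests.cpu"] == "2"
assert effective["heron.kubernetes.executor.requests.memory"] == "2Gi"
```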

I think we have another bug here. Since the Python version used under your PR branch is 3.8, it warns about the Python library kazoo (used for the ZooKeeper connection). Heron Tracker uses kazoo 2.7.0, but it is not compatible with Python 3.8, so kazoo has to be upgraded to 2.8.0. Warning:

```
/root/.pex/installed_wheels/6e40458c80f1b6a2bb9c38603c9fe8a17f0aa169/kazoo-2.7.0-py2.py3-none-any.whl/kazoo/protocol/serialization.py:114: SyntaxWarning: "is" with a literal. Did you mean "=="?
  read_only = bool_struct.unpack_from(bytes, offset)[0] is 1
/root/.pex/installed_wheels/6e40458c80f1b6a2bb9c38603c9fe8a17f0aa169/kazoo-2.7.0-py2.py3-none-any.whl/kazoo/protocol/serialization.py:449: SyntaxWarning: "is" with a literal. Did you mean "=="?
```

Shall have another PR for this :D

Good catch; please open a new issue if you have not already. These are not changes introduced in this PR, and they are most likely associated with the updates made to facilitate building on macOS.

surahman commented 2 years ago

Please double-check your commands and script. I see a trailing `"` in the command below.

```bash
heron submit heron /opt/src.pex - $2 --verbose \
--config-property heron.kubernetes.manager.limits.cpu=300m \
--config-property heron.kubernetes.manager.limits.memory=300Mi \
--config-property heron.kubernetes.manager.requests.cpu=20m \
--config-property heron.kubernetes.manager.requests.memory=100Mi"
```

Here is my run on acking with your settings, and everything is in order:

```bash
~/bin/heron submit kubernetes ~/.heron/examples/heron-api-examples.jar \
org.apache.heron.examples.api.AckingTopology acking \
--verbose \
--config-property heron.kubernetes.manager.limits.cpu=300m \
--config-property heron.kubernetes.manager.limits.memory=300Mi \
--config-property heron.kubernetes.manager.requests.cpu=20m \
--config-property heron.kubernetes.manager.requests.memory=100Mi
```
Manager StatefulSet

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  creationTimestamp: "2021-12-05T16:12:40Z"
  generation: 1
  labels:
    app: heron
    topology: acking
  name: acking-manager
  namespace: default
  resourceVersion: "858"
  uid: 6be69dad-2943-4813-9c26-7ed5e185a0e1
spec:
  podManagementPolicy: Parallel
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: heron
      topology: acking
  serviceName: acking
  template:
    metadata:
      annotations:
        prometheus.io/port: "8080"
        prometheus.io/scrape: "true"
      creationTimestamp: null
      labels:
        app: heron
        topology: acking
    spec:
      containers:
      - command:
        - sh
        - -c
        - './heron-core/bin/heron-downloader-config kubernetes && ./heron-core/bin/heron-downloader distributedlog://zookeeper:2181/heronbkdl/acking-saad-tag-0--4532610184000198972.tar.gz . && SHARD_ID=${POD_NAME##*-} && echo shardId=${SHARD_ID} && ./heron-core/bin/heron-executor --topology-name=acking --topology-id=acking2d4d1f63-90db-435d-9e2b-6be7f5bfc0ee --topology-defn-file=acking.defn --state-manager-connection=zookeeper:2181 --state-manager-root=/heron --state-manager-config-file=./heron-conf/statemgr.yaml --tmanager-binary=./heron-core/bin/heron-tmanager --stmgr-binary=./heron-core/bin/heron-stmgr --metrics-manager-classpath=./heron-core/lib/metricsmgr/* --instance-jvm-opts="LVhYOitIZWFwRHVtcE9uT3V0T2ZNZW1vcnlFcnJvcg(61)(61)" --classpath=heron-api-examples.jar --heron-internals-config-file=./heron-conf/heron_internals.yaml --override-config-file=./heron-conf/override.yaml --component-ram-map=exclaim1:1073741824,word:1073741824 --component-jvm-opts="" --pkg-type=jar --topology-binary-file=heron-api-examples.jar --heron-java-home=$JAVA_HOME --heron-shell-binary=./heron-core/bin/heron-shell --cluster=kubernetes --role=saad --environment=default --instance-classpath=./heron-core/lib/instance/* --metrics-sinks-config-file=./heron-conf/metrics_sinks.yaml --scheduler-classpath=./heron-core/lib/scheduler/*:./heron-core/lib/packing/*:./heron-core/lib/statemgr/* --python-instance-binary=./heron-core/bin/heron-python-instance --cpp-instance-binary=./heron-core/bin/heron-cpp-instance --metricscache-manager-classpath=./heron-core/lib/metricscachemgr/* --metricscache-manager-mode=disabled --is-stateful=false --checkpoint-manager-classpath=./heron-core/lib/ckptmgr/*:./heron-core/lib/statefulstorage/*: --stateful-config-file=./heron-conf/stateful.yaml --checkpoint-manager-ram=1073741824 --health-manager-mode=disabled --health-manager-classpath=./heron-core/lib/healthmgr/* --shard=$SHARD_ID --server-port=6001 --tmanager-controller-port=6002 --tmanager-stats-port=6003 --shell-port=6004 --metrics-manager-port=6005 --scheduler-port=6006 --metricscache-manager-server-port=6007 --metricscache-manager-stats-port=6008 --checkpoint-manager-port=6009'
        env:
        - name: HOST
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: status.podIP
        - name: POD_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.name
        image: apache/heron:testbuild
        imagePullPolicy: IfNotPresent
        name: manager
        ports:
        - containerPort: 6003
          name: tmanager-stats
          protocol: TCP
        - containerPort: 6007
          name: metrics-cache-m
          protocol: TCP
        - containerPort: 6004
          name: shell-port
          protocol: TCP
        - containerPort: 6001
          name: server
          protocol: TCP
        - containerPort: 6002
          name: tmanager-ctl
          protocol: TCP
        - containerPort: 6009
          name: ckptmgr
          protocol: TCP
        - containerPort: 6006
          name: scheduler
          protocol: TCP
        - containerPort: 6005
          name: metrics-mgr
          protocol: TCP
        - containerPort: 6008
          name: metrics-cache-s
          protocol: TCP
        resources:
          limits:
            cpu: 300m
            memory: 300Mi
          requests:
            cpu: 20m
            memory: 100Mi
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 0
      tolerations:
      - effect: NoExecute
        key: node.kubernetes.io/not-ready
        operator: Exists
        tolerationSeconds: 10
      - effect: NoExecute
        key: node.kubernetes.io/unreachable
        operator: Exists
        tolerationSeconds: 10
  updateStrategy:
    rollingUpdate:
      partition: 0
    type: RollingUpdate
status:
  availableReplicas: 1
  collisionCount: 0
  currentReplicas: 1
  currentRevision: acking-manager-f5f4784
  observedGeneration: 1
  readyReplicas: 1
  replicas: 1
  updateRevision: acking-manager-f5f4784
  updatedReplicas: 1
```
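For readers less familiar with Kubernetes quantity notation, the Manager's resource strings (`cpu: 300m`, `memory: 300Mi`, etc.) decode to fractional cores and binary-prefixed bytes. A minimal illustrative decoder (not Heron code; the function names are made up for this sketch) for the suffixes that appear in these manifests:

```python
# Illustrative decoder for the Kubernetes quantity suffixes used in the
# StatefulSet above. Not part of Heron; for explanation only.

def cpu_to_cores(quantity: str) -> float:
    """'300m' (millicores) -> 0.3 cores; a bare '3' -> 3.0 cores."""
    if quantity.endswith("m"):
        return float(quantity[:-1]) / 1000
    return float(quantity)

def mem_to_bytes(quantity: str) -> int:
    """Binary-prefixed memory: '300Mi' -> 314572800 bytes, '4Gi' -> 4294967296 bytes."""
    units = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30}
    for suffix, multiplier in units.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * multiplier
    return int(quantity)  # plain byte count

# Manager limits from the StatefulSet above.
print(cpu_to_cores("300m"))   # 0.3 cores
print(mem_to_bytes("300Mi"))  # 314572800 bytes
```

So the Manager is capped well below the executors' `cpu: "3"` / `memory: 4Gi` limits, which is the point of splitting the two StatefulSets.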
Executor StatefulSet

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  creationTimestamp: "2021-12-05T16:12:40Z"
  generation: 1
  labels:
    app: heron
    topology: acking
  name: acking-executors
  namespace: default
  resourceVersion: "862"
  uid: afe8b301-953d-49fa-af05-37289f9cf721
spec:
  podManagementPolicy: Parallel
  replicas: 2
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: heron
      topology: acking
  serviceName: acking
  template:
    metadata:
      annotations:
        prometheus.io/port: "8080"
        prometheus.io/scrape: "true"
      creationTimestamp: null
      labels:
        app: heron
        topology: acking
    spec:
      containers:
      - command:
        - sh
        - -c
        - './heron-core/bin/heron-downloader-config kubernetes && ./heron-core/bin/heron-downloader distributedlog://zookeeper:2181/heronbkdl/acking-saad-tag-0--4532610184000198972.tar.gz . && SHARD_ID=$((${POD_NAME##*-} + 1)) && echo shardId=${SHARD_ID} && ./heron-core/bin/heron-executor --topology-name=acking --topology-id=acking2d4d1f63-90db-435d-9e2b-6be7f5bfc0ee --topology-defn-file=acking.defn --state-manager-connection=zookeeper:2181 --state-manager-root=/heron --state-manager-config-file=./heron-conf/statemgr.yaml --tmanager-binary=./heron-core/bin/heron-tmanager --stmgr-binary=./heron-core/bin/heron-stmgr --metrics-manager-classpath=./heron-core/lib/metricsmgr/* --instance-jvm-opts="LVhYOitIZWFwRHVtcE9uT3V0T2ZNZW1vcnlFcnJvcg(61)(61)" --classpath=heron-api-examples.jar --heron-internals-config-file=./heron-conf/heron_internals.yaml --override-config-file=./heron-conf/override.yaml --component-ram-map=exclaim1:1073741824,word:1073741824 --component-jvm-opts="" --pkg-type=jar --topology-binary-file=heron-api-examples.jar --heron-java-home=$JAVA_HOME --heron-shell-binary=./heron-core/bin/heron-shell --cluster=kubernetes --role=saad --environment=default --instance-classpath=./heron-core/lib/instance/* --metrics-sinks-config-file=./heron-conf/metrics_sinks.yaml --scheduler-classpath=./heron-core/lib/scheduler/*:./heron-core/lib/packing/*:./heron-core/lib/statemgr/* --python-instance-binary=./heron-core/bin/heron-python-instance --cpp-instance-binary=./heron-core/bin/heron-cpp-instance --metricscache-manager-classpath=./heron-core/lib/metricscachemgr/* --metricscache-manager-mode=disabled --is-stateful=false --checkpoint-manager-classpath=./heron-core/lib/ckptmgr/*:./heron-core/lib/statefulstorage/*: --stateful-config-file=./heron-conf/stateful.yaml --checkpoint-manager-ram=1073741824 --health-manager-mode=disabled --health-manager-classpath=./heron-core/lib/healthmgr/* --shard=$SHARD_ID --server-port=6001 --tmanager-controller-port=6002 --tmanager-stats-port=6003 --shell-port=6004 --metrics-manager-port=6005 --scheduler-port=6006 --metricscache-manager-server-port=6007 --metricscache-manager-stats-port=6008 --checkpoint-manager-port=6009'
        env:
        - name: HOST
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: status.podIP
        - name: POD_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.name
        image: apache/heron:testbuild
        imagePullPolicy: IfNotPresent
        name: executor
        ports:
        - containerPort: 6003
          name: tmanager-stats
          protocol: TCP
        - containerPort: 6007
          name: metrics-cache-m
          protocol: TCP
        - containerPort: 6004
          name: shell-port
          protocol: TCP
        - containerPort: 6001
          name: server
          protocol: TCP
        - containerPort: 6002
          name: tmanager-ctl
          protocol: TCP
        - containerPort: 6009
          name: ckptmgr
          protocol: TCP
        - containerPort: 6006
          name: scheduler
          protocol: TCP
        - containerPort: 6005
          name: metrics-mgr
          protocol: TCP
        - containerPort: 6008
          name: metrics-cache-s
          protocol: TCP
        resources:
          limits:
            cpu: "3"
            memory: 4Gi
          requests:
            cpu: "3"
            memory: 4Gi
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 0
      tolerations:
      - effect: NoExecute
        key: node.kubernetes.io/not-ready
        operator: Exists
        tolerationSeconds: 10
      - effect: NoExecute
        key: node.kubernetes.io/unreachable
        operator: Exists
        tolerationSeconds: 10
  updateStrategy:
    rollingUpdate:
      partition: 0
    type: RollingUpdate
status:
  availableReplicas: 2
  collisionCount: 0
  currentReplicas: 2
  currentRevision: acking-executors-548c6dbd6c
  observedGeneration: 1
  readyReplicas: 2
  replicas: 2
  updateRevision: acking-executors-548c6dbd6c
  updatedReplicas: 2
```
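Aside from the resource blocks, the only substantive difference between the two container commands is how `SHARD_ID` is derived: the Manager takes its StatefulSet pod ordinal directly (shard 0), while executor pods add one to their ordinal so their shards start at 1. A minimal shell sketch of the two parameter expansions (pod names here are illustrative, following the `<statefulset-name>-<ordinal>` pattern seen above):

```shell
# Shard-ID derivation used in the two StatefulSet commands above.
# Pod names follow the StatefulSet convention <name>-<ordinal>.

# Manager pod: the ordinal maps directly to the shard, giving shard 0.
POD_NAME=acking-manager-0
SHARD_ID=${POD_NAME##*-}              # strip everything up to the last '-'
echo "manager shardId=${SHARD_ID}"    # manager shardId=0

# Executor pods: the ordinal is offset by one, so shard 0 stays with the manager.
POD_NAME=acking-executors-0
SHARD_ID=$(( ${POD_NAME##*-} + 1 ))
echo "executor shardId=${SHARD_ID}"   # executor shardId=1
```

This keeps shard numbering contiguous across the two StatefulSets: the manager pod owns shard 0, and executor pods 0..N-1 own shards 1..N.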
windhamwong commented 2 years ago

My test ran successfully, with the correct manager resource requests/limits applied.

surahman commented 2 years ago

@windhamwong thank you for also taking the time to review and test! 😄

joshfischer1108 commented 2 years ago

Alright, let's get this merged, @surahman. Nice work 💯