c-scale-community / use-case-aquamonitor


Install OpenEO Platform on INCD #9

Closed. backeb closed this issue 3 years ago.

backeb commented 3 years ago

Use the simple Spark batch job script:

https://github.com/Open-EO/openeo-geopyspark-driver/blob/master/openeogeotrellis/deploy/submit_batch_job.sh

VITO has Docker images available with all dependencies: https://github.com/Open-EO/openeo-geotrellis-kubernetes/blob/master/docker/CentOS/Dockerfile.centos8-openeo

https://artifactory.vgt.vito.be/webapp/#/artifacts/browse/simple/General/vito-docker/centos8-openeo/latest

Full K8S deploy info:

https://github.com/Open-EO/openeo-geotrellis-kubernetes
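
For reference, a minimal sketch of pulling and inspecting that prebuilt image locally before wiring it into Kubernetes; the exact registry path is an assumption based on the Artifactory link above:

```bash
# Pull the CentOS 8 image with the openEO/Geotrellis dependencies baked in
# (registry path assumed from the Artifactory browse link above)
docker pull vito-docker.artifactory.vgt.vito.be/centos8-openeo:latest

# Open a shell in the image to check what it ships
docker run --rm -it --entrypoint /bin/bash \
  vito-docker.artifactory.vgt.vito.be/centos8-openeo:latest
```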

mariojmdavid commented 3 years ago

@backeb someone mentioned terraform templates, can you check and put them here in this issue?

enolfc commented 3 years ago

The k8s setup in creodias has these docs: https://creodias.eu/faq-other/-/asset_publisher/SIs09LQL6Gct/content/how-to-configure-kubernetes using kubespray: https://github.com/kubernetes-sigs/kubespray
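
For reference, the usual kubespray flow looks roughly like this (a sketch following the kubespray README; the inventory name is a placeholder):

```bash
git clone https://github.com/kubernetes-sigs/kubespray.git
cd kubespray
pip install -r requirements.txt

# Copy the sample inventory and fill in your node IPs
cp -rfp inventory/sample inventory/mycluster
# edit inventory/mycluster/hosts.yaml, then run the cluster playbook:
ansible-playbook -i inventory/mycluster/hosts.yaml --become --become-user=root cluster.yml
```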

backeb commented 3 years ago

> @backeb someone mentioned terraform templates, can you check and put them here in this issue?

@jdries could you please provide links to the OpenEO / Dask terraform template for @mariojmdavid and his team?

mariojmdavid commented 3 years ago

Please add my colleagues Tiago (@tiagofglip), Zacarias (@zbenta) and Miguel (@miguelviana95).

backeb commented 3 years ago

@mariojmdavid @tiagofglip @zbenta @miguelviana95 could you update us on your progress installing openEO Platform? If you have any questions please contact @jdries

cc @gena @avgils

zbenta commented 3 years ago

> @mariojmdavid @tiagofglip @zbenta @miguelviana95 could you update us on your progress installing openEO Platform? If you have any questions please contact @jdries
>
> cc @gena @avgils

Sorry for the late reply @backeb, we were all on vacation. @tiagofglip and I will take a look at openEO during this week; as soon as we have more info we'll get back to you.

zbenta commented 3 years ago

We are having some issues while deploying openEO on our Kubernetes cluster.

This repo doesn't exist: helm repo add incubator http://storage.googleapis.com/kubernetes-charts-incubator

We had to change it to: helm repo add incubator https://charts.helm.sh/incubator

We are also having issues with the image referenced in the https://github.com/Open-EO/openeo-geotrellis-kubernetes/blob/master/kubernetes/openeo.yaml file. The image vito-docker.artifactory.vgt.vito.be/openeo-geotrellis:0.1.8 is not available, so we had to change the version to latest.

The next issue, and the one we are still trying to overcome is the fact that whenever we try to deploy the openEO spark job, we get the following error in octant:

MountVolume.SetUp failed for volume "spark-conf-volume" : configmap "openeo-geotrellis-1627996966941-driver-conf-map" not found

Any thoughts?

backeb commented 3 years ago

@jdries could you assist with the above?

jdries commented 3 years ago

I forwarded the problem to my devops colleague, who may know a bit better what this is about!

zbenta commented 3 years ago

We've found this issue report describing the same problem:

https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/issues/946

They state that the problem might be: "I found that operator creates the driver pod prior to the relative CM."

We also removed the version of spark-operator we had installed previously and installed the one from Google: https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/tree/master/charts/spark-operator-chart

But the result is always the same.

jdries commented 3 years ago

First feedback: it vaguely looks familiar. Is it possible that this is just a warning that can actually be ignored, because the rest of the steps do seem to work? @backeb Could you add github user 'tcassaert' to this project, so my colleague can interact directly if needed?

backeb commented 3 years ago

> @backeb Could you add github user 'tcassaert' to this project, so my colleague can interact directly if needed?

Done ✅

tiagofglip commented 3 years ago

Well, in the logs we can also see this error:

+ exec /usr/bin/tini -s -- /opt/spark/bin/spark-submit --conf spark.driver.bindAddress=10.233.123.51 --deploy-mode client --properties-file /opt/spark/conf/spark.properties --class org.apache.spark.deploy.PythonRunner local:///usr/local/lib/python3.7/dist-packages/openeogeotrellis/deploy/kubernetes.py
21/08/03 15:20:33 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
python: can't open file '/usr/local/lib/python3.7/dist-packages/openeogeotrellis/deploy/kubernetes.py': [Errno 2] No such file or directory

Don't know if it's because we're using a different image, vito-docker.artifactory.vgt.vito.be/openeo-geotrellis:latest, and not the 0.1.8 version.
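
One quick way to rule out a path mismatch is to list the deploy scripts the image actually ships, a debugging sketch along these lines (newer images use kube.py rather than kubernetes.py, as confirmed below):

```bash
# List the deploy entrypoints shipped in the image
docker run --rm --entrypoint ls \
  vito-docker.artifactory.vgt.vito.be/openeo-geotrellis:latest \
  /usr/local/lib/python3.7/dist-packages/openeogeotrellis/deploy/
```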

jdries commented 3 years ago

We had a look, and this is indeed the real issue. The image is correct and contains the latest software, but the deployment files do need an update, because we internally switched to a more automated deploy based on helm charts and terraform. We'll look into the best option to get you going again!

tcassaert commented 3 years ago

> They state that the problem might be: "I found that operator creates the driver pod prior to the relative CM."

That could indeed be the problem. We see the same error if we look into the pod events, but the pod starts without a problem and I haven't seen anything missing or haven't seen any problems regarding that configmap.
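
To check whether the configmap message is only a transient event and not a blocker, one could inspect the pod events and status, for instance (pod name here is an example, not from this deployment):

```bash
# Events for the whole namespace, newest last
kubectl -n spark-jobs get events --sort-by=.lastTimestamp

# Pod-level events and current state for the driver pod
kubectl -n spark-jobs describe pod openeo-geotrellis-driver
```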

tcassaert commented 3 years ago

We've looked into the best way to get you guys back on track to deploy everything.

The https://github.com/Open-EO/openeo-geotrellis-kubernetes/blob/master/kubernetes/openeo.yaml file is pretty old, and we've since switched to a Helm-based deployment ourselves.

This Helm-based deployment uses the sparkapplication Helm chart, located at https://github.com/Open-EO/openeo-geotrellis-kubernetes/tree/master/kubernetes/charts/sparkapplication.

The README.md contains a sample values.yaml file with the most important variables.

The Helm chart can be used with:

helm repo add helm-charts https://artifactory.vgt.vito.be/helm-charts

Version 0.3.6 is the best-tested one. The latest version uses another Ingress type, but we are not currently using it ourselves.
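
Putting those pieces together, the installation would look roughly like this (a sketch; the release name and values file are placeholders, and the chart is pinned to the best-tested 0.3.6 mentioned above):

```bash
helm repo add helm-charts https://artifactory.vgt.vito.be/helm-charts
helm repo update

# values.yaml based on the sample in the chart's README.md
helm install openeo helm-charts/sparkapplication --version 0.3.6 \
  --namespace spark-jobs -f values.yaml
```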

zbenta commented 3 years ago

Thanks for all your support.

We have rebuilt the cluster and tried to deploy OpenEO as follows:

helm repo add spark-operator https://googlecloudplatform.github.io/spark-on-k8s-operator
kubectl create namespace spark-jobs
helm install spark-operator/spark-operator --generate-name --create-namespace --namespace spark-operator --set sparkJobNamespace=spark-jobs --set enableWebhook=true
helm list -n spark-operator
kubectl get pods -n spark-operator
kubectl get serviceaccounts -n spark-jobs
cd openeo/
cd openeo-geotrellis-kubernetes/
cd kubernetes/
cd charts/
cd sparkapplication/
vim values_2.yaml
helm repo add sparkapp https://artifactory.vgt.vito.be/helm-charts
helm install sparkapp/sparkapplication --generate-name --namespace spark-jobs -f values_2.yaml

Our values_2.yaml file, which was created/copied as per your sample file, is as follows:

---
image: "vito-docker.artifactory.vgt.vito.be/openeo-geotrellis"
imageVersion: "latest"
jmxExporterJar: "/opt/jmx_prometheus_javaagent-0.13.0.jar"
mainApplicationFile: "local:///usr/local/lib/python3.7/dist-packages/openeogeotrellis/deploy/kube.py"
serviceAccount: "openeo"
volumes:

While taking a look at the logs we can see the following:

SparkApplication sparkapplication-1628151789 failed: failed to run spark-submit for SparkApplication spark-jobs/sparkapplication-1628151789:
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/opt/spark/jars/spark-unsafe_2.12-3.1.1.jar) to constructor java.nio.DirectByteBuffer(long,int)
WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
21/08/05 08:30:00 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
21/08/05 08:30:00 INFO SparkKubernetesClientFactory: Auto-configuring K8S client using current context from users K8S config file
21/08/05 08:30:00 INFO KerberosConfDriverFeatureStep: You have not specified a krb5.conf file locally or via a ConfigMap. Make sure that you have the krb5.conf locally on the driver image.
21/08/05 08:30:00 WARN DriverCommandFeatureStep: spark.kubernetes.pyspark.pythonVersion was deprecated in Spark 3.1. Please set 'spark.pyspark.python' and 'spark.pyspark.driver.python' configurations or PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON environment variables instead.
Exception in thread "main" io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: POST at: https://10.233.0.1/api/v1/namespaces/spark-jobs/pods. Message: Forbidden!Configured service account doesn't have access. Service account may have been revoked. pods "sparkapplication-1628151789-driver" is forbidden: error looking up service account spark-jobs/openeo: serviceaccount "openeo" not found.
	at io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:589)
	at io.fabric8.kubernetes.client.dsl.base.OperationSupport.assertResponseCode(OperationSupport.java:526)
	at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:492)
	at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:451)
	at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleCreate(OperationSupport.java:252)
	at io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleCreate(BaseOperation.java:879)
	at io.fabric8.kubernetes.client.dsl.base.BaseOperation.create(BaseOperation.java:341)
	at io.fabric8.kubernetes.client.dsl.base.BaseOperation.create(BaseOperation.java:84)
	at org.apache.spark.deploy.k8s.submit.Client.run(KubernetesClientApplication.scala:139)
	at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.$anonfun$run$3(KubernetesClientApplication.scala:213)
	at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.$anonfun$run$3$adapted(KubernetesClientApplication.scala:207)
	at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2611)
	at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.run(KubernetesClientApplication.scala:207)
	at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.start(KubernetesClientApplication.scala:179)
	at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:951)
	at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1030)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1039)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
21/08/05 08:30:01 INFO ShutdownHookManager: Shutdown hook called
21/08/05 08:30:01 INFO ShutdownHookManager: Deleting directory /tmp/spark-250423ae-61ec-40f6-9b1c-3d892a8b0af7

Any thoughts?

zbenta commented 3 years ago

Just noticed that we hadn't changed the service account value to the one existing in our setup; we are trying to deploy the spark application again. We'll get back to you soon with news.
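
For anyone hitting the same "serviceaccount not found" error: the serviceAccount value in the values file has to match an account that actually exists in the job namespace, which can be checked with something like:

```bash
# List the service accounts the operator created in the job namespace
kubectl -n spark-jobs get serviceaccounts
# Then set serviceAccount: in the values file to one of those names
```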

zbenta commented 3 years ago

We believe that our problem is related to the image we are using in the container; we have tried several versions and we always get the same info in the logs:

+ CMD=("$SPARK_HOME/bin/spark-submit" --conf "spark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS" --deploy-mode client "$@")

We are using images from this repo: https://vito-docker.artifactory.vgt.vito.be/webapp/#/packages/docker/openeo-geotrellis/?state=eyJxdWVyeSI6eyJwa2ciOiJvcGVuZW8ifX0%3D

tiagofglip commented 3 years ago

The last log zbenta posted here happened when commenting out jmxExporterJar: "/opt/jmx_prometheus_javaagent-0.13.0.jar" in values.yaml.

If that line is not commented out, the error is the following:

Exception in thread "main" java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at sun.instrument.InstrumentationImpl.loadClassAndStartAgent(InstrumentationImpl.java:386)
    at sun.instrument.InstrumentationImpl.loadClassAndCallPremain(InstrumentationImpl.java:401)
Caused by: java.io.FileNotFoundException: /etc/metrics/conf/prometheus.yaml (No such file or directory)
    at java.io.FileInputStream.open0(Native Method)
    at java.io.FileInputStream.open(FileInputStream.java:195)
    at java.io.FileInputStream.<init>(FileInputStream.java:138)
    at java.io.FileReader.<init>(FileReader.java:72)
    at io.prometheus.jmx.shaded.io.prometheus.jmx.JmxCollector.<init>(JmxCollector.java:75)
    at io.prometheus.jmx.shaded.io.prometheus.jmx.JavaAgent.premain(JavaAgent.java:29)
    ... 6 more
FATAL ERROR in native method: processing of -javaagent failed

jdries commented 3 years ago

Commenting out the Prometheus exporter is a good idea; it's an optional part for metrics. So we can focus on the first error, which claims something is wrong with Python syntax. We deployed the latest openeo-geotrellis image today, and that worked fine. @tcassaert does this syntax error look familiar?

zbenta commented 3 years ago

Thanks for the info @jdries, we've tested it with the latest image and the syntax error persists. It looks like the Python interpreter doesn't like the function definition. I even tried running the image on my local machine to see if there was any file missing, or if the paths in the yaml were wrong. While running the app in the local Docker image I get the following output:

root@4c29634cd856:/usr/local/lib/python3.7/dist-packages/openeogeotrellis/deploy# python3.7 kube.py
Adding process 'pi' without implementation
Adding process 'e' without implementation
starting spark context
21/08/05 12:37:37 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Traceback (most recent call last):
  File "kube.py", line 89, in <module>
    main()
  File "kube.py", line 63, in main
    app = build_app(backend_implementation=GeoPySparkBackendImplementation())
  File "/usr/local/lib/python3.7/dist-packages/openeogeotrellis/backend.py", line 257, in __init__
    else ZooKeeperServiceRegistry()
  File "/usr/local/lib/python3.7/dist-packages/openeogeotrellis/service_registry.py", line 121, in __init__
    with self._zk_client() as zk:
  File "/usr/lib/python3.7/contextlib.py", line 112, in __enter__
    return next(self.gen)
  File "/usr/local/lib/python3.7/dist-packages/openeogeotrellis/service_registry.py", line 201, in _zk_client
    zk.start()
  File "/usr/local/lib/python3.7/dist-packages/kazoo/client.py", line 635, in start
    raise self.handler.timeout_exception("Connection time-out")
kazoo.handlers.threading.KazooTimeoutError: Connection time-out

jdries commented 3 years ago

This second attempt doesn't seem to have the syntax issue, because it gets past line 50. It throws an error because of not finding zookeeper nodes. You can disable zookeeper usage by emulating a CI context. Can you try setting the environment variable 'TRAVIS' to 1? Zookeeper is also optional, so it should work like this.
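
For the local Docker test, that would look something like this (a sketch; image and path as used earlier in this thread):

```bash
# TRAVIS=1 emulates a CI context so the ZooKeeper-backed registry is skipped
docker run --rm -e TRAVIS=1 \
  --entrypoint python3.7 \
  vito-docker.artifactory.vgt.vito.be/openeo-geotrellis:latest \
  /usr/local/lib/python3.7/dist-packages/openeogeotrellis/deploy/kube.py
```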

zbenta commented 3 years ago

> This second attempt doesn't seem to have the syntax issue, because it gets past line 50. It throws an error because of not finding zookeeper nodes. You can disable zookeeper usage by emulating a CI context. Can you try setting the environment variable 'TRAVIS' to 1? Zookeeper is also optional, so it should work like this.

The log we showed before is from running the Docker image on our local machines; it has nothing to do with the log that Kubernetes gives us. The Kubernetes log shows the following:

21/08/05 12:29:09 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
  File "/usr/local/lib/python3.7/dist-packages/openeogeotrellis/deploy/kube.py", line 50
    def setup_batch_jobs() -> None:
                           ^
SyntaxError: invalid syntax
log4j:WARN No appenders could be found for logger (org.apache.spark.util.ShutdownHookManager).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.

zbenta commented 3 years ago

Where should we define the TRAVIS env var, in the executor or in the driver?

jdries commented 3 years ago

Specifying TRAVIS in the driver should be sufficient. I indeed understand that the syntax error only occurs when you run on Kubernetes, and not when you run it locally or in our Kubernetes. The only thing that's perhaps special about that line is the return type '-> None', but if you are using Python 3.7 (which seems to be the case), that should work...
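
Since '-> None' is only a syntax error under Python 2, one way to check which interpreter the driver actually resolves is something like the following (a debugging sketch; the pod name is an example, and the pod has to still be running for exec to work):

```bash
kubectl -n spark-jobs exec myspark-driver -- python -V
kubectl -n spark-jobs exec myspark-driver -- python3 -V
# PYSPARK_PYTHON / PYSPARK_DRIVER_PYTHON decide what spark-submit uses
kubectl -n spark-jobs exec myspark-driver -- env | grep -i python
```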

zbenta commented 3 years ago

This is going to be a long one, sorry.

Just to make things clear we have the spark operator up and running on our k8s cluster.

[root@openeo-cluster-k8s-master-nf-1 sparkapplication]# kubectl -n spark-operator get pods
NAME                                         READY   STATUS    RESTARTS   AGE
spark-operator-1628151073-5f74549799-xn27k   1/1     Running   0          6h7m

We have deployed the sparkapplication chart using the following command:

helm install myspark sparkapp/sparkapplication  --namespace spark-jobs -f values.yaml

The values.yaml is as follows; we've tried both with and without TRAVIS: "1" in the driver section:

image: "vito-docker.artifactory.vgt.vito.be/openeo-geotrellis"
imageVersion: "latest"
#jmxExporterJar: "/opt/jmx_prometheus_javaagent-0.13.0.jar"
mainApplicationFile: "local:///usr/local/lib/python3.7/dist-packages/openeogeotrellis/deploy/kube.py"
serviceAccount: "spark-operator-1628151073-spark"
volumes:
  - name: "eodata"
    hostPath:
      path: "/eodata"
      type: "DirectoryOrCreate"
volumeMounts:
  - name: "eodata"
    mountPath: "/eodata"
executor:
  memory: "4096m"
  cpu: 5
  envVars:
    OPENEO_CATALOG_FILES: "/opt/layercatalog.json"
    OPENEO_S1BACKSCATTER_ELEV_GEOID: "/opt/openeo-vito-aux-data/egm96.grd"
    OTB_HOME: "/opt/orfeo-toolbox"
    OTB_APPLICATION_PATH: "/opt/orfeo-toolbox/lib/otb/applications"
    KUBE: "true"
    GDAL_NUM_THREADS: "2"
  javaOptions: "-Dlog4j.configuration=log4j.properties -Dscala.concurrent.context.numThreads=4 -Dscala.concurrent.context.maxThreads=4"
driver:
  memory: "4096m"
  cpu: 5
  envVars:
    KUBE: "true"
    KUBE_OPENEO_API_PORT: "50001"
    DRIVER_IMPLEMENTATION_PACKAGE: "openeogeotrellis"
    OPENEO_CATALOG_FILES: "/opt/layercatalog.json"
    OPENEO_S1BACKSCATTER_ELEV_GEOID: "/opt/openeo-vito-aux-data/egm96.grd"
    OTB_HOME: "/opt/orfeo-toolbox"
    OTB_APPLICATION_PATH: "/opt/orfeo-toolbox/lib/otb/applications"
  javaOptions: "-Dlog4j.configuration=log4j.properties -Dscala.concurrent.context.numThreads=6 -Dpixels.treshold=1000000"
sparkConf:
  "spark.executorEnv.DRIVER_IMPLEMENTATION_PACKAGE": "openeogeotrellis"
  "spark.extraListeners": "org.openeo.sparklisteners.CancelRunawayJobListener"
  "spark.appMasterEnv.DRIVER_IMPLEMENTATION_PACKAGE": "openeogeotrellis"
  "spark.executorEnv.GDAL_NUM_THREADS": "2"
  "spark.executorEnv.GDAL_DISABLE_READDIR_ON_OPEN": "EMPTY_DIR"
jarDependencies:
  - 'local:///opt/geotrellis-extensions-2.2.0-SNAPSHOT.jar'
  - 'local:///opt/geotrellis-backend-assembly-0.4.6-openeo.jar'
fileDependencies:
  - 'local:///opt/layercatalog.json'
service:
  enabled: true
  port: 50001
# ingress:
#   annotations:
#     kubernetes.io/ingress.class: traefik
#   enabled: true
#   hosts:
#   - host: openeo.example.com
#     paths:
#       - '/'
rbac:
 create: false
 serviceAccountName: spark-operator-1628151073-spark
# spark_ui:
#   port: 4040
#   ingress:
#     enabled: true
#     annotations:
#       kubernetes.io/ingress.class: traefik
#     hosts:
#       - host: spark-ui.openeo.example.com
#         paths:
#           - '/'

What we get when we consult the log after deploying the chart is as follows:

[root@openeo-cluster-k8s-master-nf-1 sparkapplication]# kubectl -n spark-jobs logs  myspark-driver
++ id -u
+ myuid=0
++ id -g
+ mygid=0
+ set +e
++ getent passwd 0
+ uidentry=root:x:0:0:root:/root:/bin/bash
+ set -e
+ '[' -z root:x:0:0:root:/root:/bin/bash ']'
+ SPARK_K8S_CMD=driver
+ case "$SPARK_K8S_CMD" in
+ shift 1
+ SPARK_CLASSPATH=':/opt/spark/jars/*'
+ env
+ grep SPARK_JAVA_OPT_
+ sort -t_ -k4 -n
+ sed 's/[^=]*=\(.*\)/\1/g'
+ readarray -t SPARK_EXECUTOR_JAVA_OPTS
+ '[' -n '' ']'
+ '[' -n '' ']'
+ PYSPARK_ARGS=
+ '[' -n '' ']'
+ R_ARGS=
+ '[' -n '' ']'
+ '[' '' == 2 ']'
+ '[' '' == 3 ']'
+ case "$SPARK_K8S_CMD" in
+ CMD=("$SPARK_HOME/bin/spark-submit" --conf "spark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS" --deploy-mode client "$@")
+ exec /usr/bin/tini -s -- /opt/spark/bin/spark-submit --conf spark.driver.bindAddress=10.233.120.37 --deploy-mode client --properties-file /opt/spark/conf/spark.properties --class org.apache.spark.deploy.PythonRunner local:///usr/local/lib/python3.7/dist-packages/openeogeotrellis/deploy/kube.py
21/08/05 14:02:07 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
  File "/usr/local/lib/python3.7/dist-packages/openeogeotrellis/deploy/kube.py", line 50
    def setup_batch_jobs() -> None:
                           ^
SyntaxError: invalid syntax
log4j:WARN No appenders could be found for logger (org.apache.spark.util.ShutdownHookManager).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.

The driver is the only pod we have, but it is in an error state:

[root@openeo-cluster-k8s-master-nf-1 sparkapplication]# kubectl -n spark-jobs get pods
NAME             READY   STATUS   RESTARTS   AGE
myspark-driver   0/1     Error    0          20m

tcassaert commented 3 years ago

This is not something we encountered when setting it up.

What version of the spark operator are you using?

zbenta commented 3 years ago

> This is not something we encountered when setting it up.
>
> What version of the spark operator are you using?

We are using the following:

[root@openeo-cluster-k8s-master-nf-1 sparkapplication]# helm list -n spark-operator
NAME                            NAMESPACE       REVISION        UPDATED                                 STATUS          CHART                   APP VERSION        
spark-operator-1628151073       spark-operator  1               2021-08-05 08:11:22.672795838 +0000 UTC deployed        spark-operator-1.1.6    v1beta2-1.2.3-3.1.1

From google:

spark-operator  https://googlecloudplatform.github.io/spark-on-k8s-operator

Because the one in the documentation is deprecated and gave us errors while installing it.

tcassaert commented 3 years ago

We're currently on

spark-operator  spark-operator  1               2021-07-05 12:31:50.28283282 +0000 UTC  deployed        sparkoperator-0.8.4     v1beta2-1.2.0-3.0.0

So maybe you could try this version?

jdries commented 3 years ago

Next to that, I also just made a commit to remove the '-> None'; in fact, specifying an empty return type like this is not really necessary or helpful in Python. This may not solve the actual issue, but it would hopefully get us a bit further and maybe reveal the underlying issue a bit better.

zbenta commented 3 years ago

> We're currently on
>
> spark-operator  spark-operator  1               2021-07-05 12:31:50.28283282 +0000 UTC  deployed        sparkoperator-0.8.4     v1beta2-1.2.0-3.0.0
>
> So maybe you could try this version?

Since the chart repo shown in the documentation is not available, we searched for alternatives and found the following:

[root@openeo-cluster-k8s-master-nf-1 sparkapplication]# helm repo list
NAME            URL                                                        
incubator       https://charts.helm.sh/incubator                                                                   
spark-operator  https://googlecloudplatform.github.io/spark-on-k8s-operator

The versions available in each one of them are:

[root@openeo-cluster-k8s-master-nf-1 sparkapplication]# helm search repo incubator/sparkoperator
NAME                    CHART VERSION   APP VERSION             DESCRIPTION                                       
incubator/sparkoperator 0.8.6           v1beta2-1.2.0-3.0.0     DEPRECATED A Helm chart for Spark on Kubernetes...
[root@openeo-cluster-k8s-master-nf-1 sparkapplication]# helm search repo spark-operator/spark-operator
NAME                            CHART VERSION   APP VERSION             DESCRIPTION                                  
spark-operator/spark-operator   1.1.6           v1beta2-1.2.3-3.1.1     A Helm chart for Spark on Kubernetes operator

What repo are you using? Can you provide us with the URL?

tcassaert commented 3 years ago

I can't find the 0.8.4 version in any upstream repo anymore either.

But we've mirrored it to our Artifactory, so you should be able to find 0.8.4 in https://artifactory.vgt.vito.be/helm-charts. The chart itself is named sparkoperator.
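
Using the repo alias from earlier in this thread, fetching and installing the mirrored chart would look roughly like this (a sketch; the release name is a placeholder):

```bash
helm repo add sparkapp https://artifactory.vgt.vito.be/helm-charts
helm repo update
helm search repo sparkapp/sparkoperator --versions   # should list 0.8.4

helm install myoperator sparkapp/sparkoperator --version 0.8.4 \
  --namespace spark-operator --create-namespace \
  --set sparkJobNamespace=spark-jobs --set enableWebhook=true
```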

zbenta commented 3 years ago

Thanks @tcassaert and @jdries for your support.

Here goes another long one :smile:

We have taken a step forward, but the pods still won't run.

We've installed the chart versions as per your recommendations.

The sparkoperator version 0.8.4

helm install myoperator sparkapp/sparkoperator --create-namespace --namespace spark-operator --set sparkJobNamespace=spark-jobs --set enableWebhook=true --version=0.8.4

The sparkapplication version 0.3.6

helm install myspark sparkapp/sparkapplication --namespace spark-jobs -f values_2.yaml --version=0.3.6

The image we are using is vito-docker.artifactory.vgt.vito.be/openeo-geotrellis with the latest tag.

Here is some more information regarding our deployment of the sparkapplication; we hope this helps:

[root@openeo-cluster-k8s-master-nf-1 sparkapplication]# kubectl -n spark-jobs describe sparkapplications myspark
Name:         myspark
Namespace:    spark-jobs
Labels:       app.kubernetes.io/managed-by=Helm
              chartname=sparkapplication
              release=myspark
              revision=1
              sparkVersion=2.4.5
              version=0.3.6
Annotations:  meta.helm.sh/release-name: myspark
              meta.helm.sh/release-namespace: spark-jobs
API Version:  sparkoperator.k8s.io/v1beta2
Kind:         SparkApplication
Metadata:
  Creation Timestamp:  2021-08-06T09:08:57Z
  Generation:          1
  Managed Fields:
    API Version:  sparkoperator.k8s.io/v1beta2
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .:
          f:meta.helm.sh/release-name:
          f:meta.helm.sh/release-namespace:
        f:labels:
          .:
          f:app.kubernetes.io/managed-by:
          f:chartname:
          f:release:
          f:revision:
          f:sparkVersion:
          f:version:
      f:spec:
        .:
        f:deps:
          .:
          f:files:
          f:jars:
        f:driver:
          .:
          f:cores:
          f:envVars:
            .:
            f:DRIVER_IMPLEMENTATION_PACKAGE:
            f:IMAGE_NAME:
            f:KUBE:
            f:KUBE_OPENEO_API_PORT:
            f:OPENEO_CATALOG_FILES:
            f:OPENEO_S1BACKSCATTER_ELEV_GEOID:
            f:OTB_APPLICATION_PATH:
            f:OTB_HOME:
            f:TRAVIS:
          f:hostNetwork:
          f:javaOptions:
          f:labels:
            .:
            f:app.kubernetes.io/name:
            f:release:
            f:revision:
            f:sparkVersion:
            f:version:
          f:memory:
          f:serviceAccount:
          f:volumeMounts:
        f:executor:
          .:
          f:cores:
          f:envVars:
            .:
            f:GDAL_NUM_THREADS:
            f:KUBE:
            f:OPENEO_CATALOG_FILES:
            f:OPENEO_S1BACKSCATTER_ELEV_GEOID:
            f:OTB_APPLICATION_PATH:
            f:OTB_HOME:
          f:hostNetwork:
          f:instances:
          f:javaOptions:
          f:labels:
            .:
            f:release:
            f:revision:
            f:sparkVersion:
            f:version:
          f:memory:
          f:serviceAccount:
          f:volumeMounts:
        f:image:
        f:imagePullPolicy:
        f:mainApplicationFile:
        f:mode:
        f:pythonVersion:
        f:restartPolicy:
          .:
          f:onFailureRetries:
          f:onFailureRetryInterval:
          f:onSubmissionFailureRetries:
          f:onSubmissionFailureRetryInterval:
          f:type:
        f:sparkConf:
          .:
          f:spark.appMasterEnv.DRIVER_IMPLEMENTATION_PACKAGE:
          f:spark.executorEnv.DRIVER_IMPLEMENTATION_PACKAGE:
          f:spark.executorEnv.GDAL_DISABLE_READDIR_ON_OPEN:
          f:spark.executorEnv.GDAL_NUM_THREADS:
          f:spark.extraListeners:
        f:sparkVersion:
        f:type:
        f:volumes:
    Manager:      helm
    Operation:    Update
    Time:         2021-08-06T09:08:57Z
    API Version:  sparkoperator.k8s.io/v1beta2
    Fields Type:  FieldsV1
    fieldsV1:
      f:status:
        .:
        f:applicationState:
          .:
          f:errorMessage:
          f:state:
        f:driverInfo:
          .:
          f:podName:
          f:webUIAddress:
          f:webUIPort:
          f:webUIServiceName:
        f:executionAttempts:
        f:executorState:
          .:
          f:myspark-1628241205030-exec-1:
        f:lastSubmissionAttemptTime:
        f:sparkApplicationId:
        f:submissionAttempts:
        f:submissionID:
        f:terminationTime:
    Manager:         spark-operator
    Operation:       Update
    Time:            2021-08-06T09:13:56Z
  Resource Version:  371613
  UID:               3f08781d-9286-43ce-a1f0-3df5d5f8cec9
Spec:
  Deps:
    Files:
      local:///opt/layercatalog.json
    Jars:
      local:///opt/geotrellis-extensions-2.2.0-SNAPSHOT.jar
      local:///opt/geotrellis-backend-assembly-0.4.6-openeo.jar
  Driver:
    Cores:  2
    Env Vars:
      DRIVER_IMPLEMENTATION_PACKAGE:    openeogeotrellis
      IMAGE_NAME:                       vito-docker.artifactory.vgt.vito.be/openeo-geotrellis:latest
      KUBE:                             true
      KUBE_OPENEO_API_PORT:             50001
      OPENEO_CATALOG_FILES:             /opt/layercatalog.json
      OPENEO_S1BACKSCATTER_ELEV_GEOID:  /opt/openeo-vito-aux-data/egm96.grd
      OTB_APPLICATION_PATH:             /opt/orfeo-toolbox/lib/otb/applications
      OTB_HOME:                         /opt/orfeo-toolbox
      TRAVIS:                           1
    Host Network:                       false
    Java Options:                       -Dlog4j.configuration=log4j.properties -Dscala.concurrent.context.numThreads=6 -Dpixels.treshold=1000000
    Labels:
      app.kubernetes.io/name:  myspark-driver
      Release:                 myspark
      Revision:                1
      Spark Version:           2.4.5
      Version:                 0.3.6
    Memory:                    4096m
    Service Account:           myoperator-spark
    Volume Mounts:
      Mount Path:  /eodata
      Name:        eodata
  Executor:
    Cores:  2
    Env Vars:
      GDAL_NUM_THREADS:                 2
      KUBE:                             true
      OPENEO_CATALOG_FILES:             /opt/layercatalog.json
      OPENEO_S1BACKSCATTER_ELEV_GEOID:  /opt/openeo-vito-aux-data/egm96.grd
      OTB_APPLICATION_PATH:             /opt/orfeo-toolbox/lib/otb/applications
      OTB_HOME:                         /opt/orfeo-toolbox
    Host Network:                       false
    Instances:                          1
    Java Options:                       -Dlog4j.configuration=log4j.properties -Dscala.concurrent.context.numThreads=4 -Dscala.concurrent.context.maxThreads=4
    Labels:
      Release:        myspark
      Revision:       1
      Spark Version:  2.4.5
      Version:        0.3.6
    Memory:           4096m
    Service Account:  myoperator-spark
    Volume Mounts:
      Mount Path:         /eodata
      Name:               eodata
  Image:                  vito-docker.artifactory.vgt.vito.be/openeo-geotrellis:latest
  Image Pull Policy:      IfNotPresent
  Main Application File:  local:///usr/local/lib/python3.7/dist-packages/openeogeotrellis/deploy/kube.py
  Mode:                   cluster
  Python Version:         3
  Restart Policy:
    On Failure Retries:                    3
    On Failure Retry Interval:             10
    On Submission Failure Retries:         5
    On Submission Failure Retry Interval:  20
    Type:                                  OnFailure
  Spark Conf:
    spark.appMasterEnv.DRIVER_IMPLEMENTATION_PACKAGE:  openeogeotrellis
    spark.executorEnv.DRIVER_IMPLEMENTATION_PACKAGE:   openeogeotrellis
    spark.executorEnv.GDAL_DISABLE_READDIR_ON_OPEN:    EMPTY_DIR
    spark.executorEnv.GDAL_NUM_THREADS:                2
    spark.extraListeners:                              org.openeo.sparklisteners.CancelRunawayJobListener
  Spark Version:                                       2.4.5
  Type:                                                Python
  Volumes:
    Host Path:
      Path:  /eodata
      Type:  DirectoryOrCreate
    Name:    eodata
Status:
  Application State:
    Error Message:  driver container failed with ExitCode: 1, Reason: Error
    State:          FAILED
  Driver Info:
    Pod Name:             myspark-driver
    Web UI Address:       10.233.33.186:4040
    Web UI Port:          4040
    Web UI Service Name:  myspark-ui-svc
  Execution Attempts:     4
  Executor State:
    myspark-1628241205030-exec-1:  FAILED
  Last Submission Attempt Time:    2021-08-06T09:13:15Z
  Spark Application Id:            spark-fb1aeabd18bf492a894bc90788341642
  Submission Attempts:             1
  Submission ID:                   0c8e66a8-ea83-4120-a0e2-431becbd31e0
  Termination Time:                2021-08-06T09:13:54Z
Events:
  Type     Reason                        Age                From            Message
  ----     ------                        ----               ----            -------
  Normal   SparkApplicationAdded         18m                spark-operator  SparkApplication myspark was added, enqueuing it for submission
  Normal   SparkExecutorPending          18m                spark-operator  Executor myspark-1628240951293-exec-1 is pending
  Normal   SparkExecutorRunning          18m                spark-operator  Executor myspark-1628240951293-exec-1 is running
  Normal   SparkExecutorPending          17m (x2 over 17m)  spark-operator  Executor myspark-1628241024502-exec-1 is pending
  Normal   SparkExecutorRunning          17m                spark-operator  Executor myspark-1628241024502-exec-1 is running
  Normal   SparkExecutorPending          15m                spark-operator  Executor myspark-1628241115447-exec-1 is pending
  Normal   SparkExecutorRunning          15m                spark-operator  Executor myspark-1628241115447-exec-1 is running
  Warning  SparkApplicationPendingRerun  14m (x3 over 17m)  spark-operator  SparkApplication myspark is pending rerun
  Normal   SparkApplicationSubmitted     14m (x4 over 18m)  spark-operator  SparkApplication myspark was submitted successfully
  Normal   SparkDriverRunning            14m (x4 over 18m)  spark-operator  Driver myspark-driver is running
  Normal   SparkExecutorPending          14m                spark-operator  Executor myspark-1628241205030-exec-1 is pending
  Normal   SparkExecutorRunning          14m                spark-operator  Executor myspark-1628241205030-exec-1 is running
  Warning  SparkDriverFailed             13m (x4 over 17m)  spark-operator  Driver myspark-driver failed

The pods both start (this is new for us) and stay up for about 25 seconds, but then they get destroyed.

The log we get is the following; we hope this also helps:

[root@openeo-cluster-k8s-master-nf-1 sparkapplication]# kubectl -n spark-jobs logs -f myspark-driver
++ id -u
+ myuid=0
++ id -g
+ mygid=0
+ set +e
++ getent passwd 0
+ uidentry=root:x:0:0:root:/root:/bin/bash
+ set -e
+ '[' -z root:x:0:0:root:/root:/bin/bash ']'
+ SPARK_K8S_CMD=driver
+ case "$SPARK_K8S_CMD" in
+ shift 1
+ SPARK_CLASSPATH=':/opt/spark/jars/*'
+ env
+ grep SPARK_JAVA_OPT_
+ sort -t_ -k4 -n
+ sed 's/[^=]*=\(.*\)/\1/g'
+ readarray -t SPARK_EXECUTOR_JAVA_OPTS
+ '[' -n '' ']'
+ '[' -n '' ']'
+ PYSPARK_ARGS=
+ '[' -n '' ']'
+ R_ARGS=
+ '[' -n '' ']'
+ '[' 3 == 2 ']'
+ '[' 3 == 3 ']'
++ python3 -V
+ pyv3='Python 3.7.3'
+ export PYTHON_VERSION=3.7.3
+ PYTHON_VERSION=3.7.3
+ export PYSPARK_PYTHON=python3
+ PYSPARK_PYTHON=python3
+ export PYSPARK_DRIVER_PYTHON=python3
+ PYSPARK_DRIVER_PYTHON=python3
+ case "$SPARK_K8S_CMD" in
+ CMD=("$SPARK_HOME/bin/spark-submit" --conf "spark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS" --deploy-mode client "$@")
+ exec /usr/bin/tini -s -- /opt/spark/bin/spark-submit --conf spark.driver.bindAddress=10.233.120.51 --deploy-mode client --properties-file /opt/spark/conf/spark.properties --class org.apache.spark.deploy.PythonRunner local:///usr/local/lib/python3.7/dist-packages/openeogeotrellis/deploy/kube.py
21/08/06 09:13:21 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Adding process 'e' without implementation
Adding process 'pi' without implementation
starting spark context
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
21/08/06 09:13:22 INFO SparkContext: Running Spark version 2.4.5
21/08/06 09:13:22 INFO SparkContext: Submitted application: myspark
21/08/06 09:13:23 INFO SecurityManager: Changing view acls to: root
21/08/06 09:13:23 INFO SecurityManager: Changing modify acls to: root
21/08/06 09:13:23 INFO SecurityManager: Changing view acls groups to: 
21/08/06 09:13:23 INFO SecurityManager: Changing modify acls groups to: 
21/08/06 09:13:23 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(root); groups with view permissions: Set(); users  with modify permissions: Set(root); groups with modify permissions: Set()
21/08/06 09:13:23 INFO Utils: Successfully started service 'sparkDriver' on port 7078.
21/08/06 09:13:23 INFO SparkEnv: Registering MapOutputTracker
21/08/06 09:13:23 INFO SparkEnv: Registering BlockManagerMaster
21/08/06 09:13:23 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
21/08/06 09:13:23 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
21/08/06 09:13:23 INFO DiskBlockManager: Created local directory at /var/data/spark-7f813770-0240-4171-8149-2e472bb9d989/blockmgr-05e37d14-5ef9-4ffc-aff3-948117d3b1ac
21/08/06 09:13:23 INFO MemoryStore: MemoryStore started with capacity 2004.6 MB
21/08/06 09:13:23 INFO SparkEnv: Registering OutputCommitCoordinator
21/08/06 09:13:23 INFO Utils: Successfully started service 'SparkUI' on port 4040.
21/08/06 09:13:23 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://myspark-1620e47b1abce7d6-driver-svc.spark-jobs.svc:4040
21/08/06 09:13:23 INFO SparkContext: Added JAR local:///opt/geotrellis-extensions-2.2.0-SNAPSHOT.jar at file:/opt/geotrellis-extensions-2.2.0-SNAPSHOT.jar with timestamp 1628241203791
21/08/06 09:13:23 INFO SparkContext: Added JAR local:///opt/geotrellis-backend-assembly-0.4.6-openeo.jar at file:/opt/geotrellis-backend-assembly-0.4.6-openeo.jar with timestamp 1628241203792
21/08/06 09:13:23 WARN SparkContext: File with 'local' scheme is not supported to add to file server, since it is already available on every node.
21/08/06 09:13:25 INFO ExecutorPodsAllocator: Going to request 1 executors from Kubernetes.
21/08/06 09:13:25 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 7079.
21/08/06 09:13:25 INFO NettyBlockTransferService: Server created on myspark-1620e47b1abce7d6-driver-svc.spark-jobs.svc:7079
21/08/06 09:13:25 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
21/08/06 09:13:25 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, myspark-1620e47b1abce7d6-driver-svc.spark-jobs.svc, 7079, None)
21/08/06 09:13:25 INFO BlockManagerMasterEndpoint: Registering block manager myspark-1620e47b1abce7d6-driver-svc.spark-jobs.svc:7079 with 2004.6 MB RAM, BlockManagerId(driver, myspark-1620e47b1abce7d6-driver-svc.spark-jobs.svc, 7079, None)
21/08/06 09:13:25 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, myspark-1620e47b1abce7d6-driver-svc.spark-jobs.svc, 7079, None)
21/08/06 09:13:25 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, myspark-1620e47b1abce7d6-driver-svc.spark-jobs.svc, 7079, None)
21/08/06 09:13:25 INFO CancelRunawayJobListener: initialized with timeout PT15M
21/08/06 09:13:25 INFO SparkContext: Registered listener org.openeo.sparklisteners.CancelRunawayJobListener
21/08/06 09:13:32 INFO KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.233.120.52:44494) with ID 1
21/08/06 09:13:32 INFO KubernetesClusterSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.8
[2021-08-06 09:13:32,186] INFO in openeogeotrellis.service_registry: Creating new InMemoryServiceRegistry: <openeogeotrellis.service_registry.InMemoryServiceRegistry object at 0x7fcee900ce10>
[2021-08-06 09:13:32,187] INFO in openeogeotrellis.layercatalog: Reading layer catalog metadata from /opt/layercatalog.json
[2021-08-06 09:13:32,187] INFO in openeogeotrellis.layercatalog: Updating SENTINEL2_L1C metadata from https://finder.creodias.eu:Sentinel2
[2021-08-06 09:13:32,188] INFO in openeogeotrellis.opensearch: Getting collection metadata from https://finder.creodias.eu/resto/collections.json
21/08/06 09:13:32 INFO BlockManagerMasterEndpoint: Registering block manager 10.233.120.52:42675 with 2.1 GB RAM, BlockManagerId(1, 10.233.120.52, 42675, None)
[2021-08-06 09:13:32,671] INFO in openeogeotrellis.layercatalog: Updating SENTINEL2_L2A metadata from https://finder.creodias.eu:Sentinel2
[2021-08-06 09:13:32,672] INFO in openeogeotrellis.opensearch: Getting collection metadata from https://finder.creodias.eu/resto/collections.json
/usr/local/lib/python3.7/dist-packages/openeo_driver/views.py:208: UserWarning: The name 'openeo' is already registered for this blueprint. Use 'name=' to provide a unique name. This will become an error in Flask 2.1.
  app.register_blueprint(bp, url_prefix='/openeo/<version>')
[2021-08-06 09:13:33,254] INFO in openeo_driver.views: App info logging enabled!
[2021-08-06 09:13:33,255] DEBUG in openeo_driver.views: App debug logging enabled!
[2021-08-06 09:13:33,255] INFO in openeo_driver.server: StandaloneApplication options: {'bind': '10.233.120.51:50001', 'workers': 1, 'threads': 10, 'worker_class': 'gthread', 'timeout': 1000, 'loglevel': 'DEBUG', 'accesslog': '-', 'errorlog': '-'}
[2021-08-06 09:13:33,255] INFO in openeo_driver.server: Creating StandaloneApplication
[2021-08-06 09:13:33,257] INFO in openeo_driver.server: Running StandaloneApplication
[2021-08-06 09:13:33 +0000] [48] [DEBUG] Current configuration:
  config: ./gunicorn.conf.py
  wsgi_app: None
  bind: ['10.233.120.51:50001']
  backlog: 2048
  workers: 1
  worker_class: gthread
  threads: 10
  worker_connections: 1000
  max_requests: 0
  max_requests_jitter: 0
  timeout: 1000
  graceful_timeout: 30
  keepalive: 2
  limit_request_line: 4094
  limit_request_fields: 100
  limit_request_field_size: 8190
  reload: False
  reload_engine: auto
  reload_extra_files: []
  spew: False
  check_config: False
  print_config: False
  preload_app: False
  sendfile: None
  reuse_port: False
  chdir: /opt/spark/work-dir
  daemon: False
  raw_env: []
  pidfile: None
  worker_tmp_dir: None
  user: 0
  group: 0
  umask: 0
  initgroups: False
  tmp_upload_dir: None
  secure_scheme_headers: {'X-FORWARDED-PROTOCOL': 'ssl', 'X-FORWARDED-PROTO': 'https', 'X-FORWARDED-SSL': 'on'}
  forwarded_allow_ips: ['127.0.0.1']
  accesslog: -
  disable_redirect_access_to_syslog: False
  access_log_format: %(h)s %(l)s %(u)s %(t)s "%(r)s" %(s)s %(b)s "%(f)s" "%(a)s"
  errorlog: -
  loglevel: DEBUG
  capture_output: False
  logger_class: gunicorn.glogging.Logger
  logconfig: None
  logconfig_dict: {}
  syslog_addr: udp://localhost:514
  syslog: False
  syslog_prefix: None
  syslog_facility: user
  enable_stdio_inheritance: False
  statsd_host: None
  dogstatsd_tags: 
  statsd_prefix: 
  proc_name: None
  default_proc_name: gunicorn
  pythonpath: None
  paste: None
  on_starting: <function OnStarting.on_starting at 0x7fcf25436620>
  on_reload: <function OnReload.on_reload at 0x7fcf25436730>
  when_ready: <function run_gunicorn.<locals>.when_ready at 0x7fcee9049e18>
  pre_fork: <function Prefork.pre_fork at 0x7fcf25436950>
  post_fork: <function Postfork.post_fork at 0x7fcf25436a60>
  post_worker_init: <function PostWorkerInit.post_worker_init at 0x7fcf25436b70>
  worker_int: <function WorkerInt.worker_int at 0x7fcf25436c80>
  worker_abort: <function WorkerAbort.worker_abort at 0x7fcf25436d90>
  pre_exec: <function PreExec.pre_exec at 0x7fcf25436ea0>
  pre_request: <function PreRequest.pre_request at 0x7fcf253ce048>
  post_request: <function PostRequest.post_request at 0x7fcf253ce0d0>
  child_exit: <function ChildExit.child_exit at 0x7fcf253ce1e0>
  worker_exit: <function WorkerExit.worker_exit at 0x7fcf253ce2f0>
  nworkers_changed: <function NumWorkersChanged.nworkers_changed at 0x7fcf253ce400>
  on_exit: <function OnExit.on_exit at 0x7fcf253ce510>
  proxy_protocol: False
  proxy_allow_ips: ['127.0.0.1']
  keyfile: None
  certfile: None
  ssl_version: 2
  cert_reqs: 0
  ca_certs: None
  suppress_ragged_eofs: True
  do_handshake_on_connect: False
  ciphers: None
  raw_paste_global_conf: []
  strip_header_spaces: False
[2021-08-06 09:13:33 +0000] [48] [INFO] Starting gunicorn 20.1.0
[2021-08-06 09:13:33 +0000] [48] [DEBUG] Arbiter booted
[2021-08-06 09:13:33 +0000] [48] [INFO] Listening at: http://10.233.120.51:50001 (48)
[2021-08-06 09:13:33 +0000] [48] [INFO] Using worker: gthread
[2021-08-06 09:13:33,266] INFO in openeo_driver.server: when_ready: <gunicorn.arbiter.Arbiter object at 0x7fcee87360b8>
[2021-08-06 09:13:33 +0000] [48] [INFO] Gunicorn info logging enabled!
[2021-08-06 09:13:33,266] INFO in flask: Flask info logging enabled!
[2021-08-06 09:13:33,266] INFO in openeogeotrellis.deploy: Trying to load 'custom_processes' with PYTHONPATH ['/usr/local/lib/python3.7/dist-packages/openeogeotrellis/deploy', '/var/data/spark-7f813770-0240-4171-8149-2e472bb9d989/spark-275d7baf-5ae5-416d-aab7-b3335518b688/userFiles-db26d671-239b-47ac-a000-fd2dfda9158b', '/opt/spark/python/lib/pyspark.zip', '/opt/spark/python/lib/py4j-0.10.7-src.zip', '/opt/spark/jars/spark-core_2.11-2.4.5.jar', '/opt/spark/python/lib/py4j-*.zip', '/usr/lib/python37.zip', '/usr/lib/python3.7', '/usr/lib/python3.7/lib-dynload', '/usr/local/lib/python3.7/dist-packages', '/usr/lib/python3/dist-packages']
[2021-08-06 09:13:33,267] INFO in openeogeotrellis.deploy: 'custom_processes' not loaded: ModuleNotFoundError("No module named 'custom_processes'").
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/openeogeotrellis/deploy/kube.py", line 89, in <module>
    main()
  File "/usr/local/lib/python3.7/dist-packages/openeogeotrellis/deploy/kube.py", line 84, in main
    on_started=on_started
  File "/usr/local/lib/python3.7/dist-packages/openeo_driver/server.py", line 119, in run_gunicorn
    application.run()
  File "/usr/local/lib/python3.7/dist-packages/gunicorn/app/base.py", line 72, in run
    Arbiter(self).run()
  File "/usr/local/lib/python3.7/dist-packages/gunicorn/arbiter.py", line 198, in run
    self.start()
  File "/usr/local/lib/python3.7/dist-packages/gunicorn/arbiter.py", line 167, in start
    self.cfg.when_ready(self)
  File "/usr/local/lib/python3.7/dist-packages/openeo_driver/server.py", line 113, in when_ready
    on_started()
  File "/usr/local/lib/python3.7/dist-packages/openeogeotrellis/deploy/kube.py", line 60, in on_started
    setup_batch_jobs()
  File "/usr/local/lib/python3.7/dist-packages/openeogeotrellis/deploy/kube.py", line 51, in setup_batch_jobs
    with JobRegistry() as job_registry:
  File "/usr/local/lib/python3.7/dist-packages/openeogeotrellis/job_registry.py", line 183, in __enter__
    self._zk.start()
  File "/usr/local/lib/python3.7/dist-packages/kazoo/client.py", line 635, in start
    raise self.handler.timeout_exception("Connection time-out")
kazoo.handlers.threading.KazooTimeoutError: Connection time-out
21/08/06 09:13:53 INFO SparkContext: Invoking stop() from shutdown hook
21/08/06 09:13:53 INFO SparkUI: Stopped Spark web UI at http://myspark-1620e47b1abce7d6-driver-svc.spark-jobs.svc:4040
21/08/06 09:13:53 INFO KubernetesClusterSchedulerBackend: Shutting down all executors
21/08/06 09:13:53 INFO KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Asking each executor to shut down
21/08/06 09:13:53 WARN ExecutorPodsWatchSnapshotSource: Kubernetes client has been closed (this is expected if the application is shutting down.)
21/08/06 09:13:53 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
21/08/06 09:13:53 INFO MemoryStore: MemoryStore cleared
21/08/06 09:13:53 INFO BlockManager: BlockManager stopped
21/08/06 09:13:53 INFO BlockManagerMaster: BlockManagerMaster stopped
21/08/06 09:13:53 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
21/08/06 09:13:53 INFO SparkContext: Successfully stopped SparkContext
21/08/06 09:13:53 INFO ShutdownHookManager: Shutdown hook called
21/08/06 09:13:53 INFO ShutdownHookManager: Deleting directory /tmp/spark-4f3d9b89-3e4b-44f8-bbf9-9a71bd3e859a
21/08/06 09:13:53 INFO ShutdownHookManager: Deleting directory /var/data/spark-7f813770-0240-4171-8149-2e472bb9d989/spark-275d7baf-5ae5-416d-aab7-b3335518b688/pyspark-964de7f9-6a8c-43c9-9e4b-0809e34d5d93
21/08/06 09:13:53 INFO ShutdownHookManager: Deleting directory /var/data/spark-7f813770-0240-4171-8149-2e472bb9d989/spark-275d7baf-5ae5-416d-aab7-b3335518b688

The final error above is similar to the one we get when we run the docker image on our local machine and try to start kube.py manually:

root@ab697e5f3521:/usr/local/lib/python3.7/dist-packages/openeogeotrellis/deploy# python3
python3            python3-config     python3.7          python3.7-config   python3.7m         python3.7m-config  python3m           python3m-config    
root@ab697e5f3521:/usr/local/lib/python3.7/dist-packages/openeogeotrellis/deploy# python3.7 kube.py 
Adding process 'e' without implementation
Adding process 'pi' without implementation
starting spark context
21/08/06 09:14:08 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Traceback (most recent call last):
  File "kube.py", line 89, in <module>
    main()
  File "kube.py", line 63, in main
    app = build_app(backend_implementation=GeoPySparkBackendImplementation())
  File "/usr/local/lib/python3.7/dist-packages/openeogeotrellis/backend.py", line 257, in __init__
    else ZooKeeperServiceRegistry()
  File "/usr/local/lib/python3.7/dist-packages/openeogeotrellis/service_registry.py", line 121, in __init__
    with self._zk_client() as zk:
  File "/usr/lib/python3.7/contextlib.py", line 112, in __enter__
    return next(self.gen)
  File "/usr/local/lib/python3.7/dist-packages/openeogeotrellis/service_registry.py", line 201, in _zk_client
    zk.start()
  File "/usr/local/lib/python3.7/dist-packages/kazoo/client.py", line 635, in start
    raise self.handler.timeout_exception("Connection time-out")
kazoo.handlers.threading.KazooTimeoutError: Connection time-out
tcassaert commented 3 years ago

The KazooTimeoutError shows that it is still trying to connect to zookeeper nodes.

Have you added the TRAVIS=1 environment variable?

zbenta commented 3 years ago

> The KazooTimeoutError shows that it is still trying to connect to zookeeper nodes.
>
> Have you added the TRAVIS=1 environment variable?

Yes, we've added it in the driver section of our yaml file:

driver:
  memory: "4096m"
  cpu: 5
  envVars:
    TRAVIS: "1"
    KUBE: "true"

jdries commented 3 years ago

I had a look at the code; it seems there was still one service depending on zookeeper that does not have this check in place. The next image build should contain that fix. The other solution is to also start zookeeper on K8S; Thomas could inform you on how we do that. (Maybe another helm chart?)
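
If the ZooKeeper route is preferred, one common way to get it running on K8S is a public chart such as Bitnami's (an assumption; not necessarily what is used internally at VITO):

```bash
helm repo add bitnami https://charts.bitnami.com/bitnami
helm install zookeeper bitnami/zookeeper --namespace spark-jobs
# The driver then needs the resulting service host
# (e.g. zookeeper.spark-jobs.svc.cluster.local) in its ZooKeeper configuration
```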

On a side note, I'm about to leave on holiday for 3 weeks myself, but Thomas can still provide assistance.

What's also important here is that openEO will eventually have to connect to certain datasets. So as a minimum, this data needs to be available in object storage or on a shared disk. If on disk, openEO can discover it with some glob patterns. If in object storage, a STAC catalog is needed so openEO can find the data. This last solution is the most future-proof.

tiagofglip commented 3 years ago

We already tested with the latest image but it fails again, at the same step :/

jdries commented 3 years ago

Indeed, I double-checked the code and made a mistake in that commit; I hope it's better in the next build.

zbenta commented 3 years ago

Good morning,

Hope you all had a nice weekend. @jdries, thanks for the effort and the release you made on Saturday; unfortunately we have the same issue:

[root@openeo-cluster-k8s-master-nf-1 sparkapplication]# kubectl -n spark-jobs logs -f myspark-driver
++ id -u
+ myuid=0
++ id -g
+ mygid=0
+ set +e
++ getent passwd 0
+ uidentry=root:x:0:0:root:/root:/bin/bash
+ set -e
+ '[' -z root:x:0:0:root:/root:/bin/bash ']'
+ SPARK_K8S_CMD=driver
+ case "$SPARK_K8S_CMD" in
+ shift 1
+ SPARK_CLASSPATH=':/opt/spark/jars/*'
+ env
+ grep SPARK_JAVA_OPT_
+ sort -t_ -k4 -n
+ sed 's/[^=]*=\(.*\)/\1/g'
+ readarray -t SPARK_EXECUTOR_JAVA_OPTS
+ '[' -n '' ']'
+ '[' -n '' ']'
+ PYSPARK_ARGS=
+ '[' -n '' ']'
+ R_ARGS=
+ '[' -n '' ']'
+ '[' 3 == 2 ']'
+ '[' 3 == 3 ']'
++ python3 -V
+ pyv3='Python 3.7.3'
+ export PYTHON_VERSION=3.7.3
+ PYTHON_VERSION=3.7.3
+ export PYSPARK_PYTHON=python3
+ PYSPARK_PYTHON=python3
+ export PYSPARK_DRIVER_PYTHON=python3
+ PYSPARK_DRIVER_PYTHON=python3
+ case "$SPARK_K8S_CMD" in
+ CMD=("$SPARK_HOME/bin/spark-submit" --conf "spark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS" --deploy-mode client "$@")
+ exec /usr/bin/tini -s -- /opt/spark/bin/spark-submit --conf spark.driver.bindAddress=10.233.120.95 --deploy-mode client --properties-file /opt/spark/conf/spark.properties --class org.apache.spark.deploy.PythonRunner local:///usr/local/lib/python3.7/dist-packages/openeogeotrellis/deploy/kube.py
21/08/09 07:33:59 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Adding process 'e' without implementation
Adding process 'pi' without implementation
starting spark context
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
21/08/09 07:34:01 INFO SparkContext: Running Spark version 2.4.5
21/08/09 07:34:01 INFO SparkContext: Submitted application: myspark
21/08/09 07:34:01 INFO SecurityManager: Changing view acls to: root
21/08/09 07:34:01 INFO SecurityManager: Changing modify acls to: root
21/08/09 07:34:01 INFO SecurityManager: Changing view acls groups to: 
21/08/09 07:34:01 INFO SecurityManager: Changing modify acls groups to: 
21/08/09 07:34:01 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(root); groups with view permissions: Set(); users  with modify permissions: Set(root); groups with modify permissions: Set()
21/08/09 07:34:01 INFO Utils: Successfully started service 'sparkDriver' on port 7078.
21/08/09 07:34:01 INFO SparkEnv: Registering MapOutputTracker
21/08/09 07:34:01 INFO SparkEnv: Registering BlockManagerMaster
21/08/09 07:34:01 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
21/08/09 07:34:01 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
21/08/09 07:34:01 INFO DiskBlockManager: Created local directory at /var/data/spark-63a2105b-0616-4052-ad19-a2087d88988c/blockmgr-e72e24dc-b1a2-49f2-8523-555d7c40f06c
21/08/09 07:34:01 INFO MemoryStore: MemoryStore started with capacity 2004.6 MB
21/08/09 07:34:01 INFO SparkEnv: Registering OutputCommitCoordinator
21/08/09 07:34:01 INFO Utils: Successfully started service 'SparkUI' on port 4040.
21/08/09 07:34:02 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://myspark-6893917b29d50186-driver-svc.spark-jobs.svc:4040
21/08/09 07:34:02 INFO SparkContext: Added JAR local:///opt/geotrellis-extensions-2.2.0-SNAPSHOT.jar at file:/opt/geotrellis-extensions-2.2.0-SNAPSHOT.jar with timestamp 1628494442037
21/08/09 07:34:02 INFO SparkContext: Added JAR local:///opt/geotrellis-backend-assembly-0.4.6-openeo.jar at file:/opt/geotrellis-backend-assembly-0.4.6-openeo.jar with timestamp 1628494442037
21/08/09 07:34:02 WARN SparkContext: File with 'local' scheme is not supported to add to file server, since it is already available on every node.
21/08/09 07:34:03 INFO ExecutorPodsAllocator: Going to request 1 executors from Kubernetes.
21/08/09 07:34:03 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 7079.
21/08/09 07:34:03 INFO NettyBlockTransferService: Server created on myspark-6893917b29d50186-driver-svc.spark-jobs.svc:7079
21/08/09 07:34:03 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
21/08/09 07:34:03 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, myspark-6893917b29d50186-driver-svc.spark-jobs.svc, 7079, None)
21/08/09 07:34:03 INFO BlockManagerMasterEndpoint: Registering block manager myspark-6893917b29d50186-driver-svc.spark-jobs.svc:7079 with 2004.6 MB RAM, BlockManagerId(driver, myspark-6893917b29d50186-driver-svc.spark-jobs.svc, 7079, None)
21/08/09 07:34:03 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, myspark-6893917b29d50186-driver-svc.spark-jobs.svc, 7079, None)
21/08/09 07:34:03 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, myspark-6893917b29d50186-driver-svc.spark-jobs.svc, 7079, None)
21/08/09 07:34:03 INFO CancelRunawayJobListener: initialized with timeout PT15M
21/08/09 07:34:03 INFO SparkContext: Registered listener org.openeo.sparklisteners.CancelRunawayJobListener
21/08/09 07:34:09 INFO KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.233.120.96:56670) with ID 1
21/08/09 07:34:09 INFO KubernetesClusterSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.8
[2021-08-09 07:34:09,363] INFO in openeogeotrellis.service_registry: Creating new InMemoryServiceRegistry: <openeogeotrellis.service_registry.InMemoryServiceRegistry object at 0x7f5e034f7e10>
[2021-08-09 07:34:09,363] INFO in openeogeotrellis.layercatalog: Reading layer catalog metadata from /opt/layercatalog.json
[2021-08-09 07:34:09,364] INFO in openeogeotrellis.layercatalog: Updating SENTINEL2_L1C metadata from https://finder.creodias.eu:Sentinel2
[2021-08-09 07:34:09,364] INFO in openeogeotrellis.opensearch: Getting collection metadata from https://finder.creodias.eu/resto/collections.json
21/08/09 07:34:09 INFO BlockManagerMasterEndpoint: Registering block manager 10.233.120.96:45609 with 2.1 GB RAM, BlockManagerId(1, 10.233.120.96, 45609, None)
[2021-08-09 07:34:09,783] INFO in openeogeotrellis.layercatalog: Updating SENTINEL2_L2A metadata from https://finder.creodias.eu:Sentinel2
[2021-08-09 07:34:09,783] INFO in openeogeotrellis.opensearch: Getting collection metadata from https://finder.creodias.eu/resto/collections.json
/usr/local/lib/python3.7/dist-packages/openeo_driver/views.py:208: UserWarning: The name 'openeo' is already registered for this blueprint. Use 'name=' to provide a unique name. This will become an error in Flask 2.1.
  app.register_blueprint(bp, url_prefix='/openeo/<version>')
[2021-08-09 07:34:10,375] INFO in openeo_driver.views: App info logging enabled!
[2021-08-09 07:34:10,375] DEBUG in openeo_driver.views: App debug logging enabled!
[2021-08-09 07:34:10,375] INFO in openeo_driver.server: StandaloneApplication options: {'bind': '10.233.120.95:50001', 'workers': 1, 'threads': 10, 'worker_class': 'gthread', 'timeout': 1000, 'loglevel': 'DEBUG', 'accesslog': '-', 'errorlog': '-'}
[2021-08-09 07:34:10,376] INFO in openeo_driver.server: Creating StandaloneApplication
[2021-08-09 07:34:10,378] INFO in openeo_driver.server: Running StandaloneApplication
[2021-08-09 07:34:10 +0000] [47] [DEBUG] Current configuration:
  config: ./gunicorn.conf.py
  wsgi_app: None
  bind: ['10.233.120.95:50001']
  backlog: 2048
  workers: 1
  worker_class: gthread
  threads: 10
  worker_connections: 1000
  max_requests: 0
  max_requests_jitter: 0
  timeout: 1000
  graceful_timeout: 30
  keepalive: 2
  limit_request_line: 4094
  limit_request_fields: 100
  limit_request_field_size: 8190
  reload: False
  reload_engine: auto
  reload_extra_files: []
  spew: False
  check_config: False
  print_config: False
  preload_app: False
  sendfile: None
  reuse_port: False
  chdir: /opt/spark/work-dir
  daemon: False
  raw_env: []
  pidfile: None
  worker_tmp_dir: None
  user: 0
  group: 0
  umask: 0
  initgroups: False
  tmp_upload_dir: None
  secure_scheme_headers: {'X-FORWARDED-PROTOCOL': 'ssl', 'X-FORWARDED-PROTO': 'https', 'X-FORWARDED-SSL': 'on'}
  forwarded_allow_ips: ['127.0.0.1']
  accesslog: -
  disable_redirect_access_to_syslog: False
  access_log_format: %(h)s %(l)s %(u)s %(t)s "%(r)s" %(s)s %(b)s "%(f)s" "%(a)s"
  errorlog: -
  loglevel: DEBUG
  capture_output: False
  logger_class: gunicorn.glogging.Logger
  logconfig: None
  logconfig_dict: {}
  syslog_addr: udp://localhost:514
  syslog: False
  syslog_prefix: None
  syslog_facility: user
  enable_stdio_inheritance: False
  statsd_host: None
  dogstatsd_tags: 
  statsd_prefix: 
  proc_name: None
  default_proc_name: gunicorn
  pythonpath: None
  paste: None
  on_starting: <function OnStarting.on_starting at 0x7f5e3f921620>
  on_reload: <function OnReload.on_reload at 0x7f5e3f921730>
  when_ready: <function run_gunicorn.<locals>.when_ready at 0x7f5e03534e18>
  pre_fork: <function Prefork.pre_fork at 0x7f5e3f921950>
  post_fork: <function Postfork.post_fork at 0x7f5e3f921a60>
  post_worker_init: <function PostWorkerInit.post_worker_init at 0x7f5e3f921b70>
  worker_int: <function WorkerInt.worker_int at 0x7f5e3f921c80>
  worker_abort: <function WorkerAbort.worker_abort at 0x7f5e3f921d90>
  pre_exec: <function PreExec.pre_exec at 0x7f5e3f921ea0>
  pre_request: <function PreRequest.pre_request at 0x7f5e3f8b9048>
  post_request: <function PostRequest.post_request at 0x7f5e3f8b90d0>
  child_exit: <function ChildExit.child_exit at 0x7f5e3f8b91e0>
  worker_exit: <function WorkerExit.worker_exit at 0x7f5e3f8b92f0>
  nworkers_changed: <function NumWorkersChanged.nworkers_changed at 0x7f5e3f8b9400>
  on_exit: <function OnExit.on_exit at 0x7f5e3f8b9510>
  proxy_protocol: False
  proxy_allow_ips: ['127.0.0.1']
  keyfile: None
  certfile: None
  ssl_version: 2
  cert_reqs: 0
  ca_certs: None
  suppress_ragged_eofs: True
  do_handshake_on_connect: False
  ciphers: None
  raw_paste_global_conf: []
  strip_header_spaces: False
[2021-08-09 07:34:10 +0000] [47] [INFO] Starting gunicorn 20.1.0
[2021-08-09 07:34:10 +0000] [47] [DEBUG] Arbiter booted
[2021-08-09 07:34:10 +0000] [47] [INFO] Listening at: http://10.233.120.95:50001 (47)
[2021-08-09 07:34:10 +0000] [47] [INFO] Using worker: gthread
[2021-08-09 07:34:10,388] INFO in openeo_driver.server: when_ready: <gunicorn.arbiter.Arbiter object at 0x7f5e02c200b8>
[2021-08-09 07:34:10 +0000] [47] [INFO] Gunicorn info logging enabled!
[2021-08-09 07:34:10,388] INFO in flask: Flask info logging enabled!
[2021-08-09 07:34:10,388] INFO in openeogeotrellis.deploy: Trying to load 'custom_processes' with PYTHONPATH ['/usr/local/lib/python3.7/dist-packages/openeogeotrellis/deploy', '/var/data/spark-63a2105b-0616-4052-ad19-a2087d88988c/spark-55f80d6a-1a98-44c5-92dc-012de6f1c7ae/userFiles-d7c80393-a197-428b-9327-ba184d619ecf', '/opt/spark/python/lib/pyspark.zip', '/opt/spark/python/lib/py4j-0.10.7-src.zip', '/opt/spark/jars/spark-core_2.11-2.4.5.jar', '/opt/spark/python/lib/py4j-*.zip', '/usr/lib/python37.zip', '/usr/lib/python3.7', '/usr/lib/python3.7/lib-dynload', '/usr/local/lib/python3.7/dist-packages', '/usr/lib/python3/dist-packages']
[2021-08-09 07:34:10,389] INFO in openeogeotrellis.deploy: 'custom_processes' not loaded: ModuleNotFoundError("No module named 'custom_processes'").
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/openeogeotrellis/deploy/kube.py", line 89, in <module>
    main()
  File "/usr/local/lib/python3.7/dist-packages/openeogeotrellis/deploy/kube.py", line 84, in main
    on_started=on_started
  File "/usr/local/lib/python3.7/dist-packages/openeo_driver/server.py", line 119, in run_gunicorn
    application.run()
  File "/usr/local/lib/python3.7/dist-packages/gunicorn/app/base.py", line 72, in run
    Arbiter(self).run()
  File "/usr/local/lib/python3.7/dist-packages/gunicorn/arbiter.py", line 198, in run
    self.start()
  File "/usr/local/lib/python3.7/dist-packages/gunicorn/arbiter.py", line 167, in start
    self.cfg.when_ready(self)
  File "/usr/local/lib/python3.7/dist-packages/openeo_driver/server.py", line 113, in when_ready
    on_started()
  File "/usr/local/lib/python3.7/dist-packages/openeogeotrellis/deploy/kube.py", line 60, in on_started
    setup_batch_jobs()
  File "/usr/local/lib/python3.7/dist-packages/openeogeotrellis/deploy/kube.py", line 51, in setup_batch_jobs
    with JobRegistry() as job_registry:
  File "/usr/local/lib/python3.7/dist-packages/openeogeotrellis/job_registry.py", line 183, in __enter__
    self._zk.start()
  File "/usr/local/lib/python3.7/dist-packages/kazoo/client.py", line 635, in start
    raise self.handler.timeout_exception("Connection time-out")
kazoo.handlers.threading.KazooTimeoutError: Connection time-out
21/08/09 07:34:30 INFO SparkContext: Invoking stop() from shutdown hook
21/08/09 07:34:30 INFO SparkUI: Stopped Spark web UI at http://myspark-6893917b29d50186-driver-svc.spark-jobs.svc:4040
21/08/09 07:34:30 INFO KubernetesClusterSchedulerBackend: Shutting down all executors
21/08/09 07:34:30 INFO KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Asking each executor to shut down
21/08/09 07:34:30 WARN ExecutorPodsWatchSnapshotSource: Kubernetes client has been closed (this is expected if the application is shutting down.)
21/08/09 07:34:31 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
21/08/09 07:34:31 INFO MemoryStore: MemoryStore cleared
21/08/09 07:34:31 INFO BlockManager: BlockManager stopped
21/08/09 07:34:31 INFO BlockManagerMaster: BlockManagerMaster stopped
21/08/09 07:34:31 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
21/08/09 07:34:31 INFO SparkContext: Successfully stopped SparkContext
21/08/09 07:34:31 INFO ShutdownHookManager: Shutdown hook called
21/08/09 07:34:31 INFO ShutdownHookManager: Deleting directory /var/data/spark-63a2105b-0616-4052-ad19-a2087d88988c/spark-55f80d6a-1a98-44c5-92dc-012de6f1c7ae
21/08/09 07:34:31 INFO ShutdownHookManager: Deleting directory /tmp/spark-d164ad5f-f25e-4169-98c9-bc8228211600
21/08/09 07:34:31 INFO ShutdownHookManager: Deleting directory /var/data/spark-63a2105b-0616-4052-ad19-a2087d88988c/spark-55f80d6a-1a98-44c5-92dc-012de6f1c7ae/pyspark-419ee304-e4ed-4291-b365-6b15ca39aaa9
tcassaert commented 3 years ago

It's really weird that it's still trying to access Zookeeper. When I set TRAVIS: "1" in the driver envVars, it does skip the connection to Zookeeper.

We currently have a Zookeeper deployed with this chart: https://github.com/bitnami/charts/tree/master/bitnami/zookeeper

The values.yaml is as follows:

---
global:
  storageClass: "default"
replicaCount: 3

This is very minimal. You just need to make sure the storageClass is set up.
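
Deploying it boils down to something like the following; the release name and namespace are placeholders:

helm repo add bitnami https://charts.bitnami.com/bitnami
helm repo update
helm install zookeeper bitnami/zookeeper -n spark-jobs -f values.yaml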

tiagofglip commented 3 years ago

Hello, we finally got something working. We had to downgrade the Kubernetes version from v1.21.3 to v1.20.6. Since we don't have experience with Spark we don't know if it works as it is supposed to, but the driver and executor in the spark-jobs namespace stay running, and the user interface is reachable via the service on port 4040.

It was not necessary to use Zookeeper.

[screenshot: Spark web UI showing the running application]
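
For anyone else wanting to reach that UI without exposing the service, a port-forward along these lines works; the service name is taken from the driver logs earlier in this thread, and the hash suffix changes on every submission (kubectl -n spark-jobs get svc lists the current one):

kubectl -n spark-jobs port-forward svc/myspark-6893917b29d50186-driver-svc 4040:4040
# then open http://localhost:4040 in a browser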

backeb commented 3 years ago

Excellent work @tiagofglip, @zbenta, @tcassaert and @jdries ❗ Thank you very much for your efforts 🙏 If you could give @Jaapel and @avgils access to the VM, we could test it based on #3.

zbenta commented 3 years ago

We just need the SSH public keys for the users that need access. @Jaapel and @avgils, could you send them to us at zacarias@lip.pt or tiagofg@lip.pt?
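
In case it helps, a key pair can be generated like this; please send us only the .pub file, never the private key (the comment/email is just a label):

ssh-keygen -t ed25519 -C "you@example.org"
cat ~/.ssh/id_ed25519.pub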

backeb commented 3 years ago

Could we schedule a meeting next week, or the week after, to discuss next steps?

I know @jdries is on holiday until the end of the month. @tcassaert, would you feel comfortable fielding questions in his stead? It would be nice to put something together in time for the User Forum Kick-Off Meeting on 10 Sept.

@zbenta @tiagofglip please let me know if next week suits, then I will set up a doodle to find a good date and time.

cc @mariojmdavid @Jaapel @gena @avgils

P.S. I am on leave next week, but I am not critical to the discussions.

zbenta commented 3 years ago

I believe that works for us. We usually have two meetings scheduled every week, one on Monday and another on Tuesday, both after lunch. For the time being, there are no other meetings scheduled for next week. @mariojmdavid, are you still on holiday next week?

backeb commented 3 years ago

Ok great! @Jaapel, could you please schedule a meeting for next week and include @zbenta, @tiagofglip, @avgils and @tcassaert? Could you all please share your email addresses so @Jaapel can set it up?

One thing I was wondering: ❓ is it worth working with the Google Earth Engine backend to replicate the Aquamonitor functionality (#3) until the openEO backend on INCD has been fully implemented (i.e. including processes, data collections, etc.), and then just switching to the INCD backend?
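
My understanding (a rough sketch, not tested here) is that with the openEO Python client, switching backends is mostly a one-line change of the connection URL; the URLs and collection id below are illustrative and differ per backend:

import openeo

# Prototype against the Google Earth Engine backend (URL as assumed here),
# then switch the connect() call to the INCD backend once it is ready.
con = openeo.connect("https://earthengine.openeo.org")
# con = openeo.connect("https://<incd-backend>/openeo/1.0")

cube = con.load_collection(
    "COPERNICUS/S2",  # collection ids differ between backends
    spatial_extent={"west": -9.5, "south": 38.6, "east": -9.0, "north": 39.0},
    temporal_extent=["2020-01-01", "2020-12-31"],
    bands=["B3", "B8"],
)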

zbenta commented 3 years ago

We should also include @jopina, since both me and @tiagofglip are new to the project. @mariojmdavid will not be available next week.

backeb commented 3 years ago

Thanks @zbenta, could you please share your and @tiagofglip's email addresses?