Open-EO / openeo-geopyspark-driver

OpenEO driver for GeoPySpark (Geotrellis)
Apache License 2.0

terrascope prod job driver OOM during gdalinfo #809

Open jdries opened 1 week ago

jdries commented 1 week ago

This job was killed right at the end: j-240621ca3d724f398310a7435e979db3

The process dump contains a number of Python processes (see below). It is not clear whether we simply need to increase the default memory overhead here.

Application application_1718705245374_7956 failed 1 times (global limit =2; local limit is =1) due to AM Container for appattempt_1718705245374_7956_000001 exited with exitCode: -104
Failing this attempt.Diagnostics: [2024-06-21 10:03:43.835]Container [pid=337,containerID=container_e5130_1718705245374_7956_01_000001] is running 1982193664B beyond the 'PHYSICAL' memory limit. Current usage: 11.8 GB of 10 GB physical memory used; 47.4 GB of 21 GB virtual memory used. Killing container.
Dump of the process-tree for container_e5130_1718705245374_7956_01_000001 :
|- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
|- 6980 717 365 337 (python) 0 0 2337513472 51646 /opt/venv/bin/python batch_job.py j-240621ca3d724f398310a7435e979db3_fsjop7gp.in /data/projects/OpenEO/j-240621ca3d724f398310a7435e979db3 out job_metadata.json 1.1.0 [] 32cc7ddb-f7e1-4b1c-9796-f9fe39cb8feb 0.1 default
|- 6982 717 365 337 (python) 0 0 2337513472 51646 /opt/venv/bin/python batch_job.py j-240621ca3d724f398310a7435e979db3_fsjop7gp.in /data/projects/OpenEO/j-240621ca3d724f398310a7435e979db3 out job_metadata.json 1.1.0 [] 32cc7ddb-f7e1-4b1c-9796-f9fe39cb8feb 0.1 default
|- 6981 717 365 337 (python) 0 0 2337513472 51647 /opt/venv/bin/python batch_job.py j-240621ca3d724f398310a7435e979db3_fsjop7gp.in /data/projects/OpenEO/j-240621ca3d724f398310a7435e979db3 out job_metadata.json 1.1.0 [] 32cc7ddb-f7e1-4b1c-9796-f9fe39cb8feb 0.1 default
|- 6984 717 365 337 (python) 0 0 2337513472 51622 /opt/venv/bin/python batch_job.py j-240621ca3d724f398310a7435e979db3_fsjop7gp.in /data/projects/OpenEO/j-240621ca3d724f398310a7435e979db3 out job_metadata.json 1.1.0 [] 32cc7ddb-f7e1-4b1c-9796-f9fe39cb8feb 0.1 default
|- 6983 717 365 337 (python) 0 0 2337513472 51622 /opt/venv/bin/python batch_job.py j-240621ca3d724f398310a7435e979db3_fsjop7gp.in /data/projects/OpenEO/j-240621ca3d724f398310a7435e979db3 out job_metadata.json 1.1.0 [] 32cc7ddb-f7e1-4b1c-9796-f9fe39cb8feb 0.1 default
|- 365 337 365 337 (bash) 0 0 15302656 433 /bin/bash -c /usr/lib/jvm/jre/bin/java -server -Djava.net.preferIPv6Addresses=false -XX:+IgnoreUnrecognizedVMOptions --add-opens=java.base/java.lang=ALL-UNNAMED --add-opens=java.base/java.lang.invoke=ALL-UNNAMED --add-opens=java.base/java.lang.reflect=ALL-UNNAMED --add-opens=java.base/java.io=ALL-UNNAMED --add-opens=java.base/java.net=ALL-UNNAMED --add-opens=java.base/java.nio=ALL-UNNAMED --add-opens=java.base/java.util=ALL-UNNAMED --add-opens=java.base/java.util.concurrent=ALL-UNNAMED --add-opens=java.base/java.util.concurrent.atomic=ALL-UNNAMED --add-opens=java.base/sun.nio.ch=ALL-UNNAMED --add-opens=java.base/sun.nio.cs=ALL-UNNAMED --add-opens=java.base/sun.security.action=ALL-UNNAMED --add-opens=java.base/sun.util.calendar=ALL-UNNAMED --add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED -Djdk.reflect.useDirectMethodHandle=false -Xmx8192m -Djava.io.tmpdir=/data3/hadoop/yarn/local/usercache/openeo/appcache/application_1718705245374_7956/container_e5130_1718705245374_7956_01_000001/tmp '-Dscala.concurrent.context.maxThreads=2' '-Dpixels.treshold=100000000' '-XX:+HeapDumpOnOutOfMemoryError' '-XX:HeapDumpPath=/data/projects/OpenEO/j-240621ca3d724f398310a7435e979db3' '-XX:ErrorFile=/data/projects/OpenEO/j-240621ca3d724f398310a7435e979db3/hs_err_pid%p.log' '-Dlog4j2.configurationFile=file:/opt/venv/openeo-geopyspark-driver/batch_job_log4j2.xml' '-Dhdp.version=3.1.4.0-315' '-Dsoftware.amazon.awssdk.http.service.impl=software.amazon.awssdk.http.urlconnection.UrlConnectionSdkHttpService' '-Dopeneo.logging.threshold=INFO' -Dspark.yarn.app.container.log.dir=/data1/hadoop/yarn/log/application_1718705245374_7956/container_e5130_1718705245374_7956_01_000001 org.apache.spark.deploy.yarn.ApplicationMaster --class 'org.apache.spark.deploy.PythonRunner' --primary-py-file batch_job.py --arg 'j-240621ca3d724f398310a7435e979db3_fsjop7gp.in' --arg '/data/projects/OpenEO/j-240621ca3d724f398310a7435e979db3' --arg 'out' --arg 
'job_metadata.json' --arg '1.1.0' --arg '[]' --arg '32cc7ddb-f7e1-4b1c-9796-f9fe39cb8feb' --arg '0.1' --arg 'default' --properties-file /data3/hadoop/yarn/local/usercache/openeo/appcache/application_1718705245374_7956/container_e5130_1718705245374_7956_01_000001/__spark_conf__/__spark_conf__.properties --dist-cache-conf /data3/hadoop/yarn/local/usercache/openeo/appcache/application_1718705245374_7956/container_e5130_1718705245374_7956_01_000001/__spark_conf__/__spark_dist_cache__.properties 1> /data1/hadoop/yarn/log/application_1718705245374_7956/container_e5130_1718705245374_7956_01_000001/stdout 2> /data1/hadoop/yarn/log/application_1718705245374_7956/container_e5130_1718705245374_7956_01_000001/stderr
|- 337 317 337 337 (tini) 3 4 4464640 103 /usr/bin/tini -s -- bash /data3/hadoop/yarn/local/usercache/openeo/appcache/application_1718705245374_7956/container_e5130_1718705245374_7956_01_000001/launch_container.sh
|- 6975 717 365 337 (python) 420 88 3726344192 391799 /opt/venv/bin/python batch_job.py j-240621ca3d724f398310a7435e979db3_fsjop7gp.in /data/projects/OpenEO/j-240621ca3d724f398310a7435e979db3 out job_metadata.json 1.1.0 [] 32cc7ddb-f7e1-4b1c-9796-f9fe39cb8feb 0.1 default
|- 6977 717 365 337 (python) 384 120 4153794560 496122 /opt/venv/bin/python batch_job.py j-240621ca3d724f398310a7435e979db3_fsjop7gp.in /data/projects/OpenEO/j-240621ca3d724f398310a7435e979db3 out job_metadata.json 1.1.0 [] 32cc7ddb-f7e1-4b1c-9796-f9fe39cb8feb 0.1 default
|- 407 365 365 337 (java) 16030 2197 17170636800 559336 /usr/lib/jvm/jre/bin/java -server -Djava.net.preferIPv6Addresses=false -XX:+IgnoreUnrecognizedVMOptions --add-opens=java.base/java.lang=ALL-UNNAMED --add-opens=java.base/java.lang.invoke=ALL-UNNAMED --add-opens=java.base/java.lang.reflect=ALL-UNNAMED --add-opens=java.base/java.io=ALL-UNNAMED --add-opens=java.base/java.net=ALL-UNNAMED --add-opens=java.base/java.nio=ALL-UNNAMED --add-opens=java.base/java.util=ALL-UNNAMED --add-opens=java.base/java.util.concurrent=ALL-UNNAMED --add-opens=java.base/java.util.concurrent.atomic=ALL-UNNAMED --add-opens=java.base/sun.nio.ch=ALL-UNNAMED --add-opens=java.base/sun.nio.cs=ALL-UNNAMED --add-opens=java.base/sun.security.action=ALL-UNNAMED --add-opens=java.base/sun.util.calendar=ALL-UNNAMED --add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED -Djdk.reflect.useDirectMethodHandle=false -Xmx8192m -Djava.io.tmpdir=/data3/hadoop/yarn/local/usercache/openeo/appcache/application_1718705245374_7956/container_e5130_1718705245374_7956_01_000001/tmp -Dscala.concurrent.context.maxThreads=2 -Dpixels.treshold=100000000 -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/data/projects/OpenEO/j-240621ca3d724f398310a7435e979db3 -XX:ErrorFile=/data/projects/OpenEO/j-240621ca3d724f398310a7435e979db3/hs_err_pid%p.log -Dlog4j2.configurationFile=file:/opt/venv/openeo-geopyspark-driver/batch_job_log4j2.xml -Dhdp.version=3.1.4.0-315 -Dsoftware.amazon.awssdk.http.service.impl=software.amazon.awssdk.http.urlconnection.UrlConnectionSdkHttpService -Dopeneo.logging.threshold=INFO -Dspark.yarn.app.container.log.dir=/data1/hadoop/yarn/log/application_1718705245374_7956/container_e5130_1718705245374_7956_01_000001 org.apache.spark.deploy.yarn.ApplicationMaster --class org.apache.spark.deploy.PythonRunner --primary-py-file batch_job.py --arg j-240621ca3d724f398310a7435e979db3_fsjop7gp.in --arg /data/projects/OpenEO/j-240621ca3d724f398310a7435e979db3 --arg out --arg job_metadata.json --arg 1.1.0 --arg 
[] --arg 32cc7ddb-f7e1-4b1c-9796-f9fe39cb8feb --arg 0.1 --arg default --properties-file /data3/hadoop/yarn/local/usercache/openeo/appcache/application_1718705245374_7956/container_e5130_1718705245374_7956_01_000001/__spark_conf__/__spark_conf__.properties --dist-cache-conf /data3/hadoop/yarn/local/usercache/openeo/appcache/application_1718705245374_7956/container_e5130_1718705245374_7956_01_000001/__spark_conf__/__spark_dist_cache__.properties
|- 6976 717 365 337 (python) 388 104 3916599296 438277 /opt/venv/bin/python batch_job.py j-240621ca3d724f398310a7435e979db3_fsjop7gp.in /data/projects/OpenEO/j-240621ca3d724f398310a7435e979db3 out job_metadata.json 1.1.0 [] 32cc7ddb-f7e1-4b1c-9796-f9fe39cb8feb 0.1 default
|- 6979 717 365 337 (python) 383 119 4090806272 480717 /opt/venv/bin/python batch_job.py j-240621ca3d724f398310a7435e979db3_fsjop7gp.in /data/projects/OpenEO/j-240621ca3d724f398310a7435e979db3 out job_metadata.json 1.1.0 [] 32cc7ddb-f7e1-4b1c-9796-f9fe39cb8feb 0.1 default
|- 717 407 365 337 (python) 885 125 2337513472 69968 /opt/venv/bin/python batch_job.py j-240621ca3d724f398310a7435e979db3_fsjop7gp.in /data/projects/OpenEO/j-240621ca3d724f398310a7435e979db3 out job_metadata.json 1.1.0 [] 32cc7ddb-f7e1-4b1c-9796-f9fe39cb8feb 0.1 default
|- 6978 717 365 337 (python) 434 94 3802697728 410436 /opt/venv/bin/python batch_job.py j-240621ca3d724f398310a7435e979db3_fsjop7gp.in /data/projects/OpenEO/j-240621ca3d724f398310a7435e979db3 out job_metadata.json 1.1.0 [] 32cc7ddb-f7e1-4b1c-9796-f9fe39cb8feb 0.1 default
[2024-06-21 10:03:45.996]Container killed on request. Exit code is 143
[2024-06-21 10:03:46.183]Container exited with a non-zero exit code 143.
For more detailed output, check the application tracking page: https://epod-master2.vgt.vito.be:8090/cluster/app/application_1718705245374_7956 Then click on links to logs of each attempt.
. Failing the application.
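To see where the memory actually went, the RSSMEM_USAGE(PAGES) column of the dump can be summed per command name. The snippet below is only a rough sketch: it assumes a 4 KiB page size (not stated in the log) and uses page counts copied by hand from the dump above. Under that assumption the total comes out at exactly the 1982193664 B over the 10 GB limit that YARN reports, with the Python processes together holding roughly 9.7 GiB and the JVM roughly 2.1 GiB.

```python
from collections import defaultdict

PAGE_SIZE = 4096   # assumption: 4 KiB memory pages on the node
GIB = 1024 ** 3
CONTAINER_LIMIT = 10 * GIB  # "10 GB physical memory" from the YARN diagnostics

# (CMD_NAME, RSSMEM_USAGE in pages), copied from the process-tree dump above
PROCESSES = [
    ("python", 51646), ("python", 51646), ("python", 51647),
    ("python", 51622), ("python", 51622),
    ("bash", 433), ("tini", 103),
    ("python", 391799), ("python", 496122),
    ("java", 559336),
    ("python", 438277), ("python", 480717),
    ("python", 69968), ("python", 410436),
]


def rss_bytes_by_command(processes, page_size=PAGE_SIZE):
    """Sum resident memory in bytes per command name."""
    totals = defaultdict(int)
    for name, pages in processes:
        totals[name] += pages * page_size
    return dict(totals)


totals = rss_bytes_by_command(PROCESSES)
total_rss = sum(totals.values())

# Largest consumers first
for name, rss in sorted(totals.items(), key=lambda item: -item[1]):
    print(f"{name:8s} {rss / GIB:6.2f} GiB")
print(f"total    {total_rss / GIB:6.2f} GiB")
print(f"over limit by {total_rss - CONTAINER_LIMIT} B")
```

Note that RSS counts pages shared between forked Python workers once per process, so the per-process sum may overstate what each worker uniquely holds, even though it matches the figure YARN uses for the kill decision.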

The results generated by the job are quite modest in size:


-rw-rw-r--.     1 openeo eodata 6.1M Jun 21 10:05 openEO_2021-11-02Z.tif
-rw-rw-r--.     1 openeo eodata  16M Jun 21 10:05 openEO_2021-07-25Z.tif
-rw-rw-r--.     1 openeo eodata  15M Jun 21 10:05 openEO_2021-11-05Z.tif
-rw-rw-r--.     1 openeo eodata  38M Jun 21 10:05 openEO_2021-11-14Z.tif
-rw-rw-r--.     1 openeo eodata  44M Jun 21 10:05 openEO_2021-08-06Z.tif