Open-EO / openeo-geopyspark-driver

OpenEO driver for GeoPySpark (Geotrellis)
Apache License 2.0

no shub units for failed jobs #457

Closed jdries closed 1 year ago

jdries commented 1 year ago

This job failed: j-0844c56ad3ec4d5f99e997350a1196c0, but it probably used SH credits. I believe these credits are not properly tracked for failed jobs. Suggestion: also take care of credit tracking during error handling?
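
A minimal sketch of that suggestion, using hypothetical helper names (report_usage, run_batch_job) rather than the driver's actual code: do the usage/credit reporting from a finally block, so it also runs when a job errors out.

def report_usage(job_id: str, usage: dict) -> None:
    # placeholder for whatever pushes usage/credits to the accounting side
    print(f"reporting usage for {job_id}: {usage}")

def run_batch_job(job_id: str) -> None:
    usage = {"sentinelhub": {"unit": "sentinelhub_processing_unit", "value": 0.0}}
    try:
        # ... evaluate the process graph, accumulating usage (incl. SH PUs) along the way ...
        raise RuntimeError("simulated failure after some PUs were consumed")
    finally:
        # runs on success and on error, so failed jobs still get their credits tracked
        report_usage(job_id, usage)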

bossie commented 1 year ago

What back-end was this job created on?

bossie commented 1 year ago

Terrascope prod, got it.

bossie commented 1 year ago

Process graph:

{
  "process_graph": {
    "BIOMASS1": {
      "process_id": "BIOMASS",
      "arguments": {
        "date": [
          "2020-01-01",
          "2020-12-31"
        ],
        "polygon": {
          "type": "GeometryCollection",
          "geometries": [
            {
              "type": "Polygon",
              "coordinates": [
                [
                  [
                    8.388971,
                    55.701101
                  ],
                  [
                    8.404837,
                    55.701586
                  ],
                  [
                    8.404922,
                    55.70069
                  ],
                  [
                    8.389057,
                    55.700205
                  ],
                  [
                    8.388971,
                    55.701101
                  ]
                ]
              ]
            }
          ]
        }
      },
      "namespace": "vito"
    },
    "saveresult1": {
      "process_id": "save_result",
      "arguments": {
        "data": {
          "from_node": "BIOMASS1"
        },
        "format": "JSON",
        "options": {}
      },
      "result": true
    }
  },
  "job_options": {
    "driver-memory": "1gb",
    "driver-memoryOverhead": "500mb"
  }
}
bossie commented 1 year ago

job j-0844c56ad3ec4d5f99e997350a1196c0: status change from running to error

logging resource usage {'jobId': 'j-0844c56ad3ec4d5f99e997350a1196c0', 'jobName': 'BIOMASS_DEN_10ha_JSON', 'executionId': 'application_1682817307182_87601', 'userId': 'marketplace_benchmarks', 'sourceId': 'TerraScope/MEP', 'orchestrator': 'openeo', 'jobStart': 1684634968794.0, 'jobFinish': 1684635130521.0, 'state': 'FAILED', 'status': 'UNDEFINED', 'metrics': {'cpu': {'value': 6510, 'unit': 'cpu-seconds'}, 'memory': {'value': 14891649, 'unit': 'mb-seconds'}, 'time': {'value': 161727.0, 'unit': 'milliseconds'}, 'processing': {'value': 0.0, 'unit': 'shpu'}}}

The root cause was this:

YARN application status reports error diagnostics: Application application_1682817307182_87601 failed 1 times (global limit =2; local limit is =1) due to AM Container for appattempt_1682817307182_87601_000001 exited with  exitCode: -104
Failing this attempt.Diagnostics: [2023-05-21 04:12:09.753]Container [pid=31573,containerID=container_e5077_1682817307182_87601_01_000001] is running 30707712B beyond the 'PHYSICAL' memory limit. Current usage: 1.5 GB of 1.5 GB physical memory used; 10.0 GB of 3.1 GB virtual memory used. Killing container.
Dump of the process-tree for container_e5077_1682817307182_87601_01_000001 :
    |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
    |- 31588 31573 31588 31573 (bash) 0 0 15302656 431 /bin/bash -c /usr/lib/jvm/jre/bin/java -server -Djava.net.preferIPv6Addresses=false -XX:+IgnoreUnrecognizedVMOptions --add-opens=java.base/java.lang=ALL-UNNAMED --add-opens=java.base/java.lang.invoke=ALL-UNNAMED --add-opens=java.base/java.lang.reflect=ALL-UNNAMED --add-opens=java.base/java.io=ALL-UNNAMED --add-opens=java.base/java.net=ALL-UNNAMED --add-opens=java.base/java.nio=ALL-UNNAMED --add-opens=java.base/java.util=ALL-UNNAMED --add-opens=java.base/java.util.concurrent=ALL-UNNAMED --add-opens=java.base/java.util.concurrent.atomic=ALL-UNNAMED --add-opens=java.base/sun.nio.ch=ALL-UNNAMED --add-opens=java.base/sun.nio.cs=ALL-UNNAMED --add-opens=java.base/sun.security.action=ALL-UNNAMED --add-opens=java.base/sun.util.calendar=ALL-UNNAMED --add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED -Djdk.reflect.useDirectMethodHandle=false -Xmx1024m -Djava.io.tmpdir=/data4/hadoop/yarn/local/usercache/marketplace_benchmarks/appcache/application_1682817307182_87601/container_e5077_1682817307182_87601_01_000001/tmp '-Dscala.concurrent.context.maxThreads=2' '-Dpixels.treshold=100000000' '-XX:+HeapDumpOnOutOfMemoryError' '-XX:HeapDumpPath=/data/projects/OpenEO/1684634956.hprof' '-Dlog4j2.configurationFile=file:/opt/venv/openeo-geopyspark-driver/batch_job_log4j2.xml' '-Dhdp.version=3.1.4.0-315' '-Dsoftware.amazon.awssdk.http.service.impl=software.amazon.awssdk.http.urlconnection.UrlConnectionSdkHttpService' '-Dopeneo.logging.threshold=INFO' -Dspark.yarn.app.container.log.dir=/data1/hadoop/yarn/log/application_1682817307182_87601/container_e5077_1682817307182_87601_01_000001 org.apache.spark.deploy.yarn.ApplicationMaster --class 'org.apache.spark.deploy.PythonRunner' --primary-py-file batch_job.py --arg 'j-0844c56ad3ec4d5f99e997350a1196c0_ds0dg0nb.in' --arg '/data/projects/OpenEO/j-0844c56ad3ec4d5f99e997350a1196c0' --arg 'out' --arg 'log' --arg 'job_metadata.json' --arg '1.1.0' --arg '[]' --arg 'marketplace_benchmarks' --arg '0.0' --arg 'default' --properties-file /data4/hadoop/yarn/local/usercache/marketplace_benchmarks/appcache/application_1682817307182_87601/container_e5077_1682817307182_87601_01_000001/__spark_conf__/__spark_conf__.properties --dist-cache-conf /data4/hadoop/yarn/local/usercache/marketplace_benchmarks/appcache/application_1682817307182_87601/container_e5077_1682817307182_87601_01_000001/__spark_conf__/__spark_dist_cache__.properties 1> /data1/hadoop/yarn/log/application_1682817307182_87601/container_e5077_1682817307182_87601_01_000001/stdout 2> /data1/hadoop/yarn/log/application_1682817307182_87601/container_e5077_1682817307182_87601_01_000001/stderr 
    |- 31774 31628 31588 31573 (python) 678 144 2471317504 61988 /opt/venv/bin/python batch_job.py j-0844c56ad3ec4d5f99e997350a1196c0_ds0dg0nb.in /data/projects/OpenEO/j-0844c56ad3ec4d5f99e997350a1196c0 out log job_metadata.json 1.1.0 [] marketplace_benchmarks 0.0 default 
    |- 31573 31555 31573 31573 (tini) 3 3 4464640 103 /usr/bin/tini -s -- bash /data4/hadoop/yarn/local/usercache/marketplace_benchmarks/appcache/application_1682817307182_87601/container_e5077_1682817307182_87601_01_000001/launch_container.sh 
    |- 31628 31588 31588 31573 (java) 6836 1079 8268267520 338191 /usr/lib/jvm/jre/bin/java -server -Djava.net.preferIPv6Addresses=false -XX:+IgnoreUnrecognizedVMOptions --add-opens=java.base/java.lang=ALL-UNNAMED --add-opens=java.base/java.lang.invoke=ALL-UNNAMED --add-opens=java.base/java.lang.reflect=ALL-UNNAMED --add-opens=java.base/java.io=ALL-UNNAMED --add-opens=java.base/java.net=ALL-UNNAMED --add-opens=java.base/java.nio=ALL-UNNAMED --add-opens=java.base/java.util=ALL-UNNAMED --add-opens=java.base/java.util.concurrent=ALL-UNNAMED --add-opens=java.base/java.util.concurrent.atomic=ALL-UNNAMED --add-opens=java.base/sun.nio.ch=ALL-UNNAMED --add-opens=java.base/sun.nio.cs=ALL-UNNAMED --add-opens=java.base/sun.security.action=ALL-UNNAMED --add-opens=java.base/sun.util.calendar=ALL-UNNAMED --add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED -Djdk.reflect.useDirectMethodHandle=false -Xmx1024m -Djava.io.tmpdir=/data4/hadoop/yarn/local/usercache/marketplace_benchmarks/appcache/application_1682817307182_87601/container_e5077_1682817307182_87601_01_000001/tmp -Dscala.concurrent.context.maxThreads=2 -Dpixels.treshold=100000000 -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/data/projects/OpenEO/1684634956.hprof -Dlog4j2.configurationFile=file:/opt/venv/openeo-geopyspark-driver/batch_job_log4j2.xml -Dhdp.version=3.1.4.0-315 -Dsoftware.amazon.awssdk.http.service.impl=software.amazon.awssdk.http.urlconnection.UrlConnectionSdkHttpService -Dopeneo.logging.threshold=INFO -Dspark.yarn.app.container.log.dir=/data1/hadoop/yarn/log/application_1682817307182_87601/container_e5077_1682817307182_87601_01_000001 org.apache.spark.deploy.yarn.ApplicationMaster --class org.apache.spark.deploy.PythonRunner --primary-py-file batch_job.py --arg j-0844c56ad3ec4d5f99e997350a1196c0_ds0dg0nb.in --arg /data/projects/OpenEO/j-0844c56ad3ec4d5f99e997350a1196c0 --arg out --arg log --arg job_metadata.json --arg 1.1.0 --arg [] --arg marketplace_benchmarks --arg 0.0 --arg default --properties-file /data4/hadoop/yarn/local/usercache/marketplace_benchmarks/appcache/application_1682817307182_87601/container_e5077_1682817307182_87601_01_000001/__spark_conf__/__spark_conf__.properties --dist-cache-conf /data4/hadoop/yarn/local/usercache/marketplace_benchmarks/appcache/application_1682817307182_87601/container_e5077_1682817307182_87601_01_000001/__spark_conf__/__spark_dist_cache__.properties 

[2023-05-21 04:12:10.332]Container killed on request. Exit code is 143
[2023-05-21 04:12:10.488]Container exited with a non-zero exit code 143.
For more detailed output, check the application tracking page: https://epod-master2.vgt.vito.be:8090/cluster/app/application_1682817307182_87601 Then click on links to logs of each attempt.
. Failing the application.
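
Side note on the failure itself: the 1.5 GB limit in the diagnostics is exactly the "driver-memory": "1gb" plus "driver-memoryOverhead": "500mb" from the job options above, and the JVM and Python processes together exceeded it, so YARN killed the AM container (exit code -104). Purely as an illustration (example values, not a recommendation from this thread), job options with more driver headroom would look like:

job_options = {
    "driver-memory": "2gb",
    "driver-memoryOverhead": "1gb",
}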
bossie commented 1 year ago

As discussed: not a good example, because this job was killed by YARN and there's no straightforward way to recover the SH PUs in this case. That would require some external counter, for example.

The focus of this ticket is to still report accurate SH PUs for errors in the batch job itself (think: the catch block).
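
A rough sketch of that idea, assuming (with hypothetical names, not the actual batch_job.py code) that metadata is currently only written on the happy path: write job_metadata.json, with whatever usage was already accumulated, in the error path as well.

import json

def write_job_metadata(path: str, metadata: dict) -> None:
    with open(path, "w") as f:
        json.dump(metadata, f)

def run_job(metadata_path: str = "job_metadata.json") -> None:
    metadata = {"status": "running", "usage": {}}
    try:
        # ... evaluate the process graph; SH PUs get added to metadata["usage"] ...
        raise Exception("intentionally failing a SHub batch job")
    except Exception:
        metadata["status"] = "error"
        raise
    finally:
        write_job_metadata(metadata_path, metadata)  # also reached when the job fails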

bossie commented 1 year ago

The problem is not limited to "sentinelhub" usage: no job_metadata.json is created for a failed job at all.

As a test case, this process graph will consume SH PUs but ultimately fail because of an error within the batch job:

import textwrap

import openeo

connection = openeo.connect("openeo-dev.vito.be").authenticate_oidc()

data_cube = (connection
             .load_collection("SENTINEL1_GRD",
                              spatial_extent={"west": 16.06, "south": 48.06, "east": 16.07, "north": 48.07},
                              temporal_extent=["2017-03-01", "2017-03-10"],
                              bands=["VV", "VH"])
             .reduce_dimension("t", reducer="mean"))

udf = """
from openeo.udf import XarrayDataCube

def apply_datacube(cube: XarrayDataCube, context: dict) -> XarrayDataCube:
    raise Exception("intentionally failing a SHub batch job")
"""

udf = textwrap.dedent(udf)
udf = openeo.UDF(udf, runtime="Python", data={"from_parameter": "x"})

data_cube = data_cube.apply(process=udf)

data_cube.execute_batch("/tmp/test_fail_but_report_shub_pus_batch.tif",
                        title="test_fail_but_report_shub_pus failing UDF",
                        job_options={"logging-threshold": "debug"})
bossie commented 1 year ago

The Terrascope build is having a hard time, so I tested the above on CDSE dev with a different SHub collection instead:

connection = openeo.connect("openeo.dev.warsaw.openeo.dataspace.copernicus.eu").authenticate_oidc("CDSE")

data_cube = (connection.load_collection("SENTINEL3_OLCI_L1B")
             .filter_bands(["B02", "B17", "B19"])
             .filter_bbox([2.59003, 51.069, 2.8949, 51.2206])
             .filter_temporal(["2018-08-06T00:00:00Z", "2018-08-06T00:00:00Z"])
             .reduce_dimension("t", reducer="mean"))

...

Job metadata for this failed job, including "sentinelhub" usage (the PUs match those reported by ProcessApi.scala in the logs):

{
    "costs": 1.0,
    "created": "2023-06-28T13:36:29Z",
    "id": "j-b8c0e472ca2e4718a2af63735996922a",
    "process": {
        "process_graph": {
            "apply1": {
                "arguments": {
                    "data": {
                        "from_node": "reducedimension1"
                    },
                    "process": {
                        "process_graph": {
                            "runudf1": {
                                "arguments": {
                                    "data": {
                                        "from_parameter": "x"
                                    },
                                    "runtime": "Python",
                                    "udf": "\nfrom openeo.udf import XarrayDataCube\n\ndef apply_datacube(cube: XarrayDataCube, context: dict) -> XarrayDataCube:\n    raise Exception(\"intentionally failing a SHub batch job\")\n"
                                },
                                "process_id": "run_udf",
                                "result": true
                            }
                        }
                    }
                },
                "process_id": "apply"
            },
            "filterbands1": {
                "arguments": {
                    "bands": [
                        "B02",
                        "B17",
                        "B19"
                    ],
                    "data": {
                        "from_node": "loadcollection1"
                    }
                },
                "process_id": "filter_bands"
            },
            "filterbbox1": {
                "arguments": {
                    "data": {
                        "from_node": "filterbands1"
                    },
                    "extent": {
                        "east": 2.8949,
                        "north": 51.2206,
                        "south": 51.069,
                        "west": 2.59003
                    }
                },
                "process_id": "filter_bbox"
            },
            "filtertemporal1": {
                "arguments": {
                    "data": {
                        "from_node": "filterbbox1"
                    },
                    "extent": [
                        "2018-08-06T00:00:00Z",
                        "2018-08-06T00:00:00Z"
                    ]
                },
                "process_id": "filter_temporal"
            },
            "loadcollection1": {
                "arguments": {
                    "id": "SENTINEL3_OLCI_L1B",
                    "spatial_extent": null,
                    "temporal_extent": null
                },
                "process_id": "load_collection"
            },
            "reducedimension1": {
                "arguments": {
                    "data": {
                        "from_node": "filtertemporal1"
                    },
                    "dimension": "t",
                    "reducer": {
                        "process_graph": {
                            "mean1": {
                                "arguments": {
                                    "data": {
                                        "from_parameter": "data"
                                    }
                                },
                                "process_id": "mean",
                                "result": true
                            }
                        }
                    }
                },
                "process_id": "reduce_dimension"
            },
            "saveresult1": {
                "arguments": {
                    "data": {
                        "from_node": "apply1"
                    },
                    "format": "GTiff",
                    "options": {}
                },
                "process_id": "save_result",
                "result": true
            }
        }
    },
    "status": "error",
    "title": "test_fail_but_report_shub_pus failing UDF",
    "updated": "2023-06-28T13:36:29Z",
    "usage": {
        "duration": {
            "unit": "seconds",
            "value": 54
        },
        "sentinelhub": {
            "unit": "sentinelhub_processing_unit",
            "value": 2.0
        }
    }
}
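
For reference, this metadata (including the usage block) can be pulled back with the Python client along these lines (a sketch; the exact describe method may differ between client versions):

import openeo

connection = openeo.connect("openeo.dev.warsaw.openeo.dataspace.copernicus.eu").authenticate_oidc("CDSE")
job = connection.job("j-b8c0e472ca2e4718a2af63735996922a")
metadata = job.describe_job()
print(metadata["status"], metadata.get("usage"))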
bossie commented 1 year ago

These PUs are still correctly reported to the ETL API:

[screenshot: failed_shpu_etl_cdse_dev]

bossie commented 1 year ago

Works on Terrascope as well, including the ETL API:

[screenshot: failed_shpu_editor_terrascope_dev]