Open-EO / openeo-geopyspark-driver

OpenEO driver for GeoPySpark (Geotrellis)
Apache License 2.0

failed job resources logged as successful on YARN #565

Open bossie opened 8 months ago

bossie commented 8 months ago

The /resources endpoint of the ETL API accepts a state property, which in the case of YARN corresponds to the job's "state", and a status property, which corresponds to the job's "final state/status".

As far as the ETL API is concerned, only state actually drives its behavior; status is used for reporting only.
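For illustration, a minimal sketch of the state/status pair such a job would produce. Only the state and status properties are named in this issue; the payload shape and field values shown here are assumptions:

```python
# Hypothetical sketch of the record the driver would send to the ETL API
# /resources endpoint for a failed YARN job. Only "state" and "status"
# are named in this issue; everything else here is an assumption.
resource_record = {
    "state": "FINISHED",  # YARN "State": the field the ETL API acts on
    "status": "FAILED",   # YARN "Final-State": reported but not acted on
}

# Because only "state" is acted on, FINISHED + FAILED is accounted for
# the same way as a genuinely successful job.
```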

Jobs like these will report a state of FINISHED, even though they failed, as indicated by their final state:

Application Report : 
    Application-Id : application_1696843816575_110814
    Application-Name : openEO batch_test_fail_but_report_shub_pus failing UDF_j-2311033a2d7f4883be0d2d477966a8d9_user vdboschj
    Application-Type : SPARK
    User : vdboschj
    Queue : default
    Application Priority : 0
    Start-Time : 1699002526901
    Finish-Time : 1699002734905
    Progress : 100%
    State : FINISHED
    Final-State : FAILED
    Tracking-URL : epod-ha.vgt.vito.be:18481/history/application_1696843816575_110814/1
    RPC Port : 34771
    AM Host : epod134.vgt.vito.be
    Aggregate Resource Allocation : 6477431 MB-seconds, 2736 vcore-seconds
    Aggregate Resource Preempted : 0 MB-seconds, 0 vcore-seconds
    Log Aggregation Status : SUCCEEDED
    Diagnostics : User application exited with status 1
    Unmanaged Application : false
    Application Node Label Expression : <Not set>
    AM container Node Label Expression : <DEFAULT_PARTITION>
    TimeoutType : LIFETIME  ExpiryTime : UNLIMITED  RemainingTime : -1seconds

The result seems to be that the job's resources are logged as successful instead of failed, but I'm not sure what the implications are.
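One way to avoid this would be to collapse YARN's state and final state into a single outcome before reporting. A minimal sketch; the function name and the exact set of states handled are assumptions, not the driver's actual code:

```python
# Hypothetical helper: derive a single outcome label from a YARN
# application report. YARN's "State" only says whether the application
# terminated; for a FINISHED application the real outcome is in
# "Final-State" (see the report above: State FINISHED, Final-State FAILED).
def effective_status(state: str, final_state: str) -> str:
    if state == "FINISHED":
        # FINISHED merely means the application exited; trust Final-State.
        return "FINISHED" if final_state == "SUCCEEDED" else "FAILED"
    if state in ("FAILED", "KILLED"):
        # These terminal states already carry the outcome.
        return state
    # NEW / SUBMITTED / ACCEPTED / RUNNING: job still in flight.
    return "RUNNING"
```

With this mapping, the job above (State: FINISHED, Final-State: FAILED) would be reported as FAILED rather than FINISHED.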

Job tracker logs: (attachment: job_tracker)

Corresponding ETL API records: (attachment: etl_api)

bossie commented 8 months ago

To reproduce:

import textwrap

import openeo

connection = openeo.connect("openeo.vito.be").authenticate_basic("???", "!!!")

data_cube = (connection.load_collection("SENTINEL3_OLCI_L1B")
             .filter_bands(["B02", "B17", "B19"])
             .filter_bbox([2.59003, 51.069, 2.8949, 51.2206])
             .filter_temporal(["2018-08-06T00:00:00Z", "2018-08-06T00:00:00Z"])
             .reduce_dimension("t", reducer="mean"))

# UDF that always raises, so the batch job is guaranteed to fail
udf = """
from openeo.udf import XarrayDataCube

def apply_datacube(cube: XarrayDataCube, context: dict) -> XarrayDataCube:
    raise Exception("intentionally failing a SHub batch job")
"""

udf = textwrap.dedent(udf)
udf = openeo.UDF(udf, runtime="Python", data={"from_parameter": "x"})

data_cube = data_cube.apply(process=udf)

data_cube.execute_batch("/tmp/test_fail_but_report_shub_pus_batch.tif",
                        title="test_fail_but_report_shub_pus failing UDF",
                        job_options={"logging-threshold": "debug"})

bossie commented 7 months ago

Similar: https://github.com/Open-EO/FuseTS/issues/118

soxofaan commented 6 months ago

related to internal ticket MKTP-286