Open-EO / openeo-geopyspark-driver

OpenEO driver for GeoPySpark (Geotrellis)
Apache License 2.0

Java logs in batch jobs on k8s lack user/job ID #419

Closed by bossie 1 year ago

bossie commented 1 year ago

Using MDC to attach user ID, job ID etc. (https://github.com/Open-EO/openeo-geotrellis-extensions/issues/64) did not seem to cover all log entries, in particular Spark TaskSetManager logs. Unlike the OpenEO web app, a batch job can grab these values from environment variables, so in this case we decided to go back to a more reliable envar-based approach.

This meant different Log4j 2 configuration files for the web app and batch jobs on Terrascope: log4j2.xml, which uses OpenEOJsonLogLayout.json, and batch_job_log4j2.xml, which uses classpath:OpenEOBatchJobJsonLogLayout.json, respectively.

The code that sets up MDC has since been removed from batch jobs, but k8s hasn't yet been adapted to use the envar-based approach, which effectively removes user ID and job ID from batch job log entries.
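The envar-based approach described above can be sketched with Python's standard logging machinery: a `logging.Filter` that reads the identifiers from the batch job's environment and stamps them onto every record, so that even log lines from third-party code pick them up. The environment variable names (`OPENEO_USER_ID`, `OPENEO_BATCH_JOB_ID`) and the filter itself are illustrative assumptions, not the driver's actual implementation (which does this on the Java side via Log4j 2).

```python
import logging
import os


class BatchJobContextFilter(logging.Filter):
    """Enrich every log record with user/job IDs taken from the environment.

    The variable names below are hypothetical; the real driver may use
    different ones.
    """

    def filter(self, record: logging.LogRecord) -> bool:
        record.user_id = os.environ.get("OPENEO_USER_ID", "unknown")
        record.job_id = os.environ.get("OPENEO_BATCH_JOB_ID", "unknown")
        return True  # never drop records, only enrich them


# Attaching the filter to a handler (rather than one logger) means all
# records passing through it get the extra fields, mirroring how an
# envar-aware Log4j layout covers TaskSetManager logs too.
handler = logging.StreamHandler()
handler.setFormatter(
    logging.Formatter("%(levelname)s user=%(user_id)s job=%(job_id)s %(message)s")
)
handler.addFilter(BatchJobContextFilter())
```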

bossie commented 1 year ago

Ported the envar-based approach to k8s.

EmileSonneveld commented 1 year ago

Would these files need to be kept in sync? https://github.com/Open-EO/openeo-geotrellis-kubernetes/blob/master/docker/batch_job_log4j2.xml https://github.com/Open-EO/openeo-geopyspark-driver/blob/master/scripts/batch_job_log4j2.xml

For example, copying over `<Logger name="org.apache.spark.scheduler.TaskSetManager" level="warn"/>`

bossie commented 1 year ago

Sure, it makes sense to apply your logging enhancements to CDSE too.

EmileSonneveld commented 1 year ago

Settings applied: https://github.com/Open-EO/openeo-geotrellis-kubernetes/commit/7f13e8de2c8e1f420d939f8b84932e9d664dfd18

bossie commented 1 year ago

Fixed:

[screenshot: creo_job_java_logs]

Tested with:

import openeo

connection = openeo.connect("https://openeo-staging.creo.vito.be").authenticate_oidc("CDSE")

data_cube = (connection.load_collection("SENTINEL3_OLCI_L1B")
             .filter_bands(["B02", "B17", "B19"])
             .filter_bbox([2.59003, 51.069, 2.8949, 51.2206])
             .filter_temporal(["2018-08-06T00:00:00Z", "2018-08-06T00:00:00Z"])
             .reduce_dimension("t", reducer="mean"))

data_cube.execute_batch("/tmp/test_cdse_sentinel3_olci_staging_batch.tif",
                        job_options={"logging-threshold": "debug"})
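To check that the fix took effect, one can inspect the job's JSON log entries and confirm the identifier fields are present on Java-side lines such as TaskSetManager output. The snippet below is a minimal local sketch of that check; the field names (`user_id`, `job_id`) and the sample entry's shape are assumptions about what OpenEOBatchJobJsonLogLayout.json emits, not a verified schema.

```python
import json

# Hypothetical JSON log line, modeled on what an envar-aware batch-job
# layout might produce for a Spark TaskSetManager entry.
sample_entry = (
    '{"levelname": "DEBUG",'
    ' "name": "org.apache.spark.scheduler.TaskSetManager",'
    ' "message": "Starting task 0.0 in stage 1.0",'
    ' "user_id": "u-42", "job_id": "j-240101abc"}'
)


def has_job_context(line: str) -> bool:
    """Return True if a JSON log line carries both user and job identifiers."""
    entry = json.loads(line)
    return bool(entry.get("user_id")) and bool(entry.get("job_id"))
```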