Open-EO / openeo-geopyspark-driver

OpenEO driver for GeoPySpark (Geotrellis)
Apache License 2.0

improve JobTracker error mails from Nifi #707

Closed: bossie closed this issue 3 months ago

bossie commented 7 months ago

Nifi sends an e-mail if the JobTracker process exits with a non-zero exit status. The root cause is not always clear.

bossie commented 7 months ago

Problem 1: the stderr output of docker rmi is included in the execution.error attribute:

execution.error = '"docker rmi" requires at least 1 argument. See 'docker rmi --help'.

This is harmless and does not cause the run to fail, but it is confusing. I've adapted the configuration of the ExecuteStreamCommand processor to no longer include it in execution.error.

bossie commented 7 months ago

Problem 2: the execution.error attribute is cut off before the root cause; in particular, it is truncated to a fixed 4000 characters.
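Since a Python traceback puts the root cause on its last lines, a truncation that keeps the tail of the output instead of the head would preserve it. A minimal sketch of that idea (illustrative only; the actual truncation happens in Nifi's attribute handling, not in our code):

```python
def truncate_keep_tail(text: str, limit: int = 4000) -> str:
    """Truncate to `limit` characters, but keep the *end* of the text,
    where a Python traceback states the root cause."""
    if len(text) <= limit:
        return text
    marker = "...[truncated]..."
    return marker + text[-(limit - len(marker)):]
```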

JeroenVerstraelen commented 7 months ago

Perhaps we should also add a Kibana link to the Nifi mails.

bossie commented 7 months ago

Good suggestion by @soxofaan as well:

what would also help is to inject the nifi flow id as correlation id in the logs, so that you can zoom in on the logs of a particular run

soxofaan commented 7 months ago

Automatically injecting such a correlation id would be another use case for this PoC PR: https://github.com/Open-EO/openeo-python-driver/pull/233

bossie commented 6 months ago

@soxofaan I've been playing with this ExtraLoggingFilter and an issue that I'm encountering is that log entries for unhandled exceptions do not get the intended extras: https://jenkins.vgt.vito.be/job/openEO/job/openeo-geopyspark-driver/job/707_run_id_ExtraLoggingFilter/1/testReport/junit/tests.test_job_tracker_v2/TestCliApp/test_run_rotating_log/

I suppose it's to be expected, because the ExtraLoggingFilter will restore the extras upon leaving the with block, i.e. before sys.excepthook has a chance to output the error.
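For illustration, a toy reconstruction of the mechanism as I understand it (simplified and not thread-local; this is not the actual ExtraLoggingFilter code):

```python
import logging
from contextlib import contextmanager

_extras = {}  # extras currently in effect

class ContextExtraFilter(logging.Filter):
    """Copies the currently active extras onto every record that passes."""
    def filter(self, record: logging.LogRecord) -> bool:
        for key, value in _extras.items():
            setattr(record, key, value)
        return True

@contextmanager
def with_extra_logging(**extras):
    _extras.update(extras)
    try:
        yield
    finally:
        # The extras are removed here, when the with block exits; an
        # unhandled exception is only logged later by sys.excepthook,
        # so that log entry no longer carries e.g. the run_id.
        for key in extras:
            _extras.pop(key, None)
```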

I'm guessing you have the same issue when you use ExtraLoggingFilter in the queue_job implementation and an exception bubbles up.

Maybe BatchJobLoggingFilter is a better fit in my case? I'm thinking BatchJobLoggingFilter could also be renamed to something more generic like GlobalLoggingFilter (and ExtraLoggingFilter to ThreadLocalLoggingFilter).

Thoughts?

soxofaan commented 6 months ago

Indeed: ExtraLoggingFilter.with_extra_logging, by design, only injects extras into logging calls that happen inside the body of the with construct. Exceptions raised within the body but handled outside of it will not receive the injected extras.

If I understand that feature branch correctly, you're looking into setting a global "run_id" here. In that case, what is currently called BatchJobLoggingFilter indeed looks like a better match for that.

BatchJobLoggingFilter contains nothing inherently related to batch jobs (except for its documentation), so it makes sense to rename it to GlobalLoggingFilter or GlobalExtraLoggingFilter.

Regarding ExtraLoggingFilter, I would rename it to ContextExtraLoggingFilter, because its defining feature is that it only adds extras inside the with context. The thread-local aspect is not that vital.
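A minimal sketch of what such a global filter could look like (the implementation details here are assumptions, not the actual BatchJobLoggingFilter code):

```python
import logging

class GlobalExtraLoggingFilter(logging.Filter):
    """Process-wide extras: set once (e.g. the run_id at startup) and
    attached to every record, including records logged by sys.excepthook
    for unhandled exceptions."""
    extras: dict = {}

    @classmethod
    def set_extra(cls, key, value):
        cls.extras[key] = value

    def filter(self, record: logging.LogRecord) -> bool:
        for key, value in self.extras.items():
            setattr(record, key, value)
        return True

# Usage: set once at startup; every subsequent log record gets the extra.
GlobalExtraLoggingFilter.set_extra("run_id", "d9396056-7553-47e7-9f18-f8d721b15d12")
```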

soxofaan commented 6 months ago

I'm guessing you have the same issue when you use ExtraLoggingFilter in the queue_job implementation and an exception bubbles up.

Good point. I guess the current assumption is that for exceptions that bubble up through there, the extra logging fields are not essential (if they are essential, the exception should probably be handled inside the context).

bossie commented 5 months ago

Problem 3: because it's the Python process that writes JSON logs to a rotating log file to be picked up by Filebeat (/var/log/openeo/job_tracker_python*.log*), errors that occur outside of Python do not end up in Kibana (they are only written to stderr), e.g.:

execution.error = 'kinit: Cannot contact any KDC for realm 'VGT.VITO.BE' while getting initial credentials

and

execution.error = 'Error response from daemon: Get https://vito-docker-private.artifactory.vgt.vito.be/v2/: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
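One way to close that gap would be to run such non-Python steps through a small wrapper that mirrors their stderr into the same JSON log file that Filebeat tails. A sketch (run_and_log is a hypothetical helper; the exact log path and JSON fields are assumptions based on the setup above):

```python
import json
import subprocess
from datetime import datetime, timezone

LOG_FILE = "/var/log/openeo/job_tracker_python.log"  # tailed by Filebeat

def run_and_log(cmd: list) -> int:
    """Run a non-Python step (e.g. kinit, docker pull) and write its
    stderr as JSON log lines, so the errors also reach Kibana."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    with open(LOG_FILE, "a") as f:
        for line in result.stderr.splitlines():
            f.write(json.dumps({
                "@timestamp": datetime.now(timezone.utc).isoformat(),
                "levelname": "ERROR" if result.returncode else "INFO",
                "message": line,
            }) + "\n")
    return result.returncode
```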

bossie commented 4 months ago

A run_id is added to the log entries if you pass it to the job_tracker-entrypoint.sh script (optional).

TODO: add a Kibana link to the e-mails that looks like this:

https://kibana-infra.vgt.vito.be/app/kibana#/discover?_g=(time:(from:now-1w,to:now))&_a=(columns:!(levelname,message),filters:!(('$state':(store:appState),meta:(alias:!n,disabled:!f,index:'7fd36110-df93-11ee-8cc4-3747d5233c59',key:run_id,negate:!f,params:(query:d9396056-7553-47e7-9f18-f8d721b15d12),type:phrase),query:(match:(run_id:(query:d9396056-7553-47e7-9f18-f8d721b15d12,type:phrase))))),index:'7fd36110-df93-11ee-8cc4-3747d5233c59',interval:auto,query:(language:kuery,query:''),sort:!(!('@timestamp',desc)))

bossie commented 4 months ago

Improved the subject line to include the environment: prod, dev or integrationtests.

The body of the e-mail now has a link to this run's logs in Kibana; it also contains all FlowFile attributes, including the run_id, execution.error (stderr; important for errors that do not originate in Python, like kinit failures) and, again, deploy_env.

Stdout of the ExecuteStreamCommand is attached to the mail for context and debugging purposes.

bossie commented 4 months ago

Aligned the prod job tracker version in Nifi with that of the web app. Its logs are now being populated with run IDs, so we wait for the errors to come in.

bossie commented 4 months ago

Clicking the link in the first error mail opens Kibana, which responds with:

Unable to parse URL

Link: https://kibana-infra.vgt.vito.be/app/kibana#/discover?_g=(time:(from:now-1w,to:now))&_a=(columns:!(levelname,message),filters:!(('$state':(store:appState),meta:(alias:!n,disabled:!f,index:'7fd36110-df93-11ee-8cc4-3747d5233c59',key:run_id,negate:!f,params:(query:23881a3b-b007-4db2-8367-b9dc218593a1),type:phrase),query:(match:(run_id:(query:23881a3b-b007-4db2-8367-b9dc218593a1,type:phrase))))),index:'7fd36110-df93-11ee-8cc4-3747d5233c59',interval:auto,query:(language:kuery,query:''),sort:!(!('@timestamp',desc)))

bossie commented 4 months ago

A run_id of d9396056-7553-47e7-9f18-f8d721b15d12 (see above) in the URL works, whereas 23881a3b-b007-4db2-8367-b9dc218593a1 does not. My guess is that Kibana (rightfully) interprets the former as a string because it starts with a letter, whereas it considers the latter a number because it starts with a digit.

Both seem to work when the run_id is enclosed in single quotes:

https://kibana-infra.vgt.vito.be/app/kibana#/discover?_g=(time:(from:now-1w,to:now))&_a=(columns:!(levelname,message),filters:!(('$state':(store:appState),meta:(alias:!n,disabled:!f,index:'7fd36110-df93-11ee-8cc4-3747d5233c59',key:run_id,negate:!f,params:(query:'23881a3b-b007-4db2-8367-b9dc218593a1'),type:phrase),query:(match:(run_id:(query:'23881a3b-b007-4db2-8367-b9dc218593a1',type:phrase))))),index:'7fd36110-df93-11ee-8cc4-3747d5233c59',interval:auto,query:(language:kuery,query:''),sort:!(!('@timestamp',desc)))

bossie commented 4 months ago

Adapted the URL pattern in the e-mail processor to include single quotes around the run_id.
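For reference, the resulting link, parameterized on the run_id, looks like this as a Python sketch (kibana_run_url is a hypothetical helper; the actual fix lives in the Nifi e-mail processor's URL expression):

```python
KIBANA_URL_TEMPLATE = (
    "https://kibana-infra.vgt.vito.be/app/kibana#/discover"
    "?_g=(time:(from:now-1w,to:now))"
    "&_a=(columns:!(levelname,message),"
    "filters:!(('$state':(store:appState),"
    "meta:(alias:!n,disabled:!f,index:'{index}',key:run_id,negate:!f,"
    "params:(query:'{run_id}'),type:phrase),"
    "query:(match:(run_id:(query:'{run_id}',type:phrase))))),"
    "index:'{index}',interval:auto,"
    "query:(language:kuery,query:''),sort:!(!('@timestamp',desc)))"
)

def kibana_run_url(run_id: str, index: str = "7fd36110-df93-11ee-8cc4-3747d5233c59") -> str:
    """Build the Kibana Discover link for one run; the single quotes around
    the run_id make digit-leading UUIDs parse as strings (see above)."""
    return KIBANA_URL_TEMPLATE.format(index=index, run_id=run_id)
```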

bossie commented 4 months ago

Originally, the flow file ID (${uuid}) served as the run_id, so that a run could be looked up in Data Provenance. However, this ID turned out to be duplicated across prod/dev/integrationtests, i.e. not unique for a particular run; flow file IDs also seem to disappear from Data Provenance rather quickly, so ${UUID()} became the run_id instead.