Open-EO / openeo-geopyspark-driver

OpenEO driver for GeoPySpark (Geotrellis)
Apache License 2.0

improve JobTracker error mails from Nifi #707

Closed: bossie closed this issue 3 months ago

bossie commented 7 months ago

Nifi sends an e-mail if the JobTracker process exits with a non-zero exit status. The root cause is not always clear.

bossie commented 7 months ago

Problem 1: the stderr output of docker rmi is included in the execution.error attribute:

execution.error = '"docker rmi" requires at least 1 argument. See 'docker rmi --help'.

This is harmless and does not cause the run to fail, but it is confusing. I've adapted the configuration of the ExecuteStreamCommand processor to no longer include it in execution.error.

bossie commented 7 months ago

Problem 2: the execution.error attribute is cut off before the root cause; in particular, it is truncated to a fixed 4000 characters.
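Since a Python traceback puts the root cause on its last lines, a truncation that keeps the tail of the output instead of the head would preserve it. A minimal sketch of that idea (illustrative only; the actual truncation happens in Nifi's attribute handling, not in our code):

```python
def truncate_keep_tail(text: str, limit: int = 4000) -> str:
    """Truncate to `limit` characters, but keep the *end* of the text,
    where a Python traceback states the root cause."""
    if len(text) <= limit:
        return text
    marker = "...[truncated]..."
    return marker + text[-(limit - len(marker)):]
```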

JeroenVerstraelen commented 7 months ago

Perhaps we should also add a Kibana link to the Nifi mails.

bossie commented 7 months ago

Good suggestion by @soxofaan as well:

what would also help is to inject the nifi flow id as correlation id in the logs, so that you can zoom in on the logs of a particular run

soxofaan commented 7 months ago

Automatically injecting such a correlation id would be another use case for this PoC PR: https://github.com/Open-EO/openeo-python-driver/pull/233

bossie commented 6 months ago

@soxofaan I've been playing with this ExtraLoggingFilter and an issue that I'm encountering is that log entries for unhandled exceptions do not get the intended extras: https://jenkins.vgt.vito.be/job/openEO/job/openeo-geopyspark-driver/job/707_run_id_ExtraLoggingFilter/1/testReport/junit/tests.test_job_tracker_v2/TestCliApp/test_run_rotating_log/

I suppose it's to be expected, because the ExtraLoggingFilter will restore the extras upon leaving the with block, i.e. before sys.excepthook has a chance to output the error.
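For illustration, a toy reconstruction of the mechanism as I understand it (simplified and not thread-local; this is not the actual ExtraLoggingFilter code):

```python
import logging
from contextlib import contextmanager

_extras = {}  # extras currently in effect

class ContextExtraFilter(logging.Filter):
    """Copies the currently active extras onto every record that passes."""
    def filter(self, record: logging.LogRecord) -> bool:
        for key, value in _extras.items():
            setattr(record, key, value)
        return True

@contextmanager
def with_extra_logging(**extras):
    _extras.update(extras)
    try:
        yield
    finally:
        # The extras are removed here, when the with block exits; an
        # unhandled exception is only logged later by sys.excepthook,
        # so that log entry no longer carries e.g. the run_id.
        for key in extras:
            _extras.pop(key, None)
```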

I'm guessing you have the same issue when you use ExtraLoggingFilter in the queue_job implementation and an exception bubbles up.

Maybe BatchJobLoggingFilter is a better fit in my case? I'm thinking BatchJobLoggingFilter could also be renamed to something more generic like GlobalLoggingFilter (and ExtraLoggingFilter to ThreadLocalLoggingFilter).

Thoughts?

soxofaan commented 6 months ago

Indeed: ExtraLoggingFilter.with_extra_logging, by design, only injects extras into logging calls that happen inside the body of the with construct. Exceptions raised within the body but handled outside of it will not receive the injected extras.

If I understand that feature branch correctly, you're looking into setting a global "run_id" here. In that case, what is currently called BatchJobLoggingFilter indeed looks like a better match for that.

BatchJobLoggingFilter contains nothing inherently related to batch jobs (except for its documentation), so it makes sense to rename it to GlobalLoggingFilter or GlobalExtraLoggingFilter.

Regarding ExtraLoggingFilter, I would rename it to ContextExtraLoggingFilter, because its defining feature is that it only adds extras inside the with context. The thread-local aspect is not that vital.
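A minimal sketch of what such a global filter could look like (the implementation details here are assumptions, not the actual BatchJobLoggingFilter code):

```python
import logging

class GlobalExtraLoggingFilter(logging.Filter):
    """Process-wide extras: set once (e.g. the run_id at startup) and
    attached to every record, including records logged by sys.excepthook
    for unhandled exceptions."""
    extras: dict = {}

    @classmethod
    def set_extra(cls, key, value):
        cls.extras[key] = value

    def filter(self, record: logging.LogRecord) -> bool:
        for key, value in self.extras.items():
            setattr(record, key, value)
        return True

# Usage: set once at startup; every subsequent log record gets the extra.
GlobalExtraLoggingFilter.set_extra("run_id", "d9396056-7553-47e7-9f18-f8d721b15d12")
```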

soxofaan commented 6 months ago

I'm guessing you have the same issue when you use ExtraLoggingFilter in the queue_job implementation and an exception bubbles up.

Good point. I guess the current assumption is that for exceptions that bubble up through there, the extra logging fields are not essential (if they are essential, the exception should probably be handled inside the context).

bossie commented 5 months ago

Problem 3: because it's the Python process that writes JSON logs to a rotating log file to be picked up by Filebeat (/var/log/openeo/job_tracker_python*.log*), errors that occur outside of Python do not end up in Kibana (they are only written to stderr), e.g.:

execution.error = 'kinit: Cannot contact any KDC for realm 'VGT.VITO.BE' while getting initial credentials

and

execution.error = 'Error response from daemon: Get https://vito-docker-private.artifactory.vgt.vito.be/v2/: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
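One way to close that gap would be to run such non-Python steps through a small wrapper that mirrors their stderr into the same JSON log file that Filebeat tails. A sketch (run_and_log is a hypothetical helper; the exact log path and JSON fields are assumptions based on the setup above):

```python
import json
import subprocess
from datetime import datetime, timezone

LOG_FILE = "/var/log/openeo/job_tracker_python.log"  # tailed by Filebeat

def run_and_log(cmd: list) -> int:
    """Run a non-Python step (e.g. kinit, docker pull) and write its
    stderr as JSON log lines, so the errors also reach Kibana."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    with open(LOG_FILE, "a") as f:
        for line in result.stderr.splitlines():
            f.write(json.dumps({
                "@timestamp": datetime.now(timezone.utc).isoformat(),
                "levelname": "ERROR" if result.returncode else "INFO",
                "message": line,
            }) + "\n")
    return result.returncode
```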

bossie commented 4 months ago

A run_id is added to the log entries if you pass it to the job_tracker-entrypoint.sh script (optional).

TODO: add a Kibana link to the e-mails that looks like this:

https://kibana-infra.vgt.vito.be/app/kibana#/discover?_g=(time:(from:now-1w,to:now))&_a=(columns:!(levelname,message),filters:!(('$state':(store:appState),meta:(alias:!n,disabled:!f,index:'7fd36110-df93-11ee-8cc4-3747d5233c59',key:run_id,negate:!f,params:(query:d9396056-7553-47e7-9f18-f8d721b15d12),type:phrase),query:(match:(run_id:(query:d9396056-7553-47e7-9f18-f8d721b15d12,type:phrase))))),index:'7fd36110-df93-11ee-8cc4-3747d5233c59',interval:auto,query:(language:kuery,query:''),sort:!(!('@timestamp',desc)))

bossie commented 4 months ago

Improved the subject line to include the environment: prod, dev or integrationtests.

The body of the e-mail now has a link to this run's logs in Kibana; it also contains all FlowFile attributes, including the run_id, execution.error (stderr; important for errors that do not originate in Python, like kinit failures) and, again, deploy_env.

Stdout of the ExecuteStreamCommand is attached to the mail for context and debugging purposes.

bossie commented 4 months ago

Aligned the prod job tracker version in Nifi with that of the web app. Its logs are now being populated with run IDs, so we wait for the errors to come in.

bossie commented 4 months ago

Clicking the link in the first error mail opens Kibana, which responds with:

Unable to parse URL

Link: https://kibana-infra.vgt.vito.be/app/kibana#/discover?_g=(time:(from:now-1w,to:now))&_a=(columns:!(levelname,message),filters:!(('$state':(store:appState),meta:(alias:!n,disabled:!f,index:'7fd36110-df93-11ee-8cc4-3747d5233c59',key:run_id,negate:!f,params:(query:23881a3b-b007-4db2-8367-b9dc218593a1),type:phrase),query:(match:(run_id:(query:23881a3b-b007-4db2-8367-b9dc218593a1,type:phrase))))),index:'7fd36110-df93-11ee-8cc4-3747d5233c59',interval:auto,query:(language:kuery,query:''),sort:!(!('@timestamp',desc)))

bossie commented 4 months ago

A run_id of d9396056-7553-47e7-9f18-f8d721b15d12 (see above) in the URL works, whereas 23881a3b-b007-4db2-8367-b9dc218593a1 does not. My guess is that Kibana (rightfully) interprets the former as a string because it starts with a letter, whereas it considers the latter a number because it starts with a digit.

Both seem to work when the run_id is enclosed in single quotes:

https://kibana-infra.vgt.vito.be/app/kibana#/discover?_g=(time:(from:now-1w,to:now))&_a=(columns:!(levelname,message),filters:!(('$state':(store:appState),meta:(alias:!n,disabled:!f,index:'7fd36110-df93-11ee-8cc4-3747d5233c59',key:run_id,negate:!f,params:(query:'23881a3b-b007-4db2-8367-b9dc218593a1'),type:phrase),query:(match:(run_id:(query:'23881a3b-b007-4db2-8367-b9dc218593a1',type:phrase))))),index:'7fd36110-df93-11ee-8cc4-3747d5233c59',interval:auto,query:(language:kuery,query:''),sort:!(!('@timestamp',desc)))

bossie commented 4 months ago

Adapted the URL pattern in the e-mail processor to include single quotes around the run_id.
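For reference, the resulting link, parameterized on the run_id, looks like this as a Python sketch (kibana_run_url is a hypothetical helper; the actual fix lives in the Nifi e-mail processor's URL expression):

```python
KIBANA_URL_TEMPLATE = (
    "https://kibana-infra.vgt.vito.be/app/kibana#/discover"
    "?_g=(time:(from:now-1w,to:now))"
    "&_a=(columns:!(levelname,message),"
    "filters:!(('$state':(store:appState),"
    "meta:(alias:!n,disabled:!f,index:'{index}',key:run_id,negate:!f,"
    "params:(query:'{run_id}'),type:phrase),"
    "query:(match:(run_id:(query:'{run_id}',type:phrase))))),"
    "index:'{index}',interval:auto,"
    "query:(language:kuery,query:''),sort:!(!('@timestamp',desc)))"
)

def kibana_run_url(run_id: str, index: str = "7fd36110-df93-11ee-8cc4-3747d5233c59") -> str:
    """Build the Kibana Discover link for one run; the single quotes around
    the run_id make digit-leading UUIDs parse as strings (see above)."""
    return KIBANA_URL_TEMPLATE.format(index=index, run_id=run_id)
```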

bossie commented 4 months ago

Originally, the flow file ID (${uuid}) served as the run_id, so that a run could be looked up in Data Provenance. However, this ID turned out to be duplicated across prod/dev/integrationtests, i.e. not unique for a particular run; flow file IDs also seem to disappear from Data Provenance rather quickly, so ${UUID()} became the run_id instead.