Problem 1: the stderr output of `docker rmi` is included in the `execution.error` attribute:

```
execution.error = '"docker rmi" requires at least 1 argument. See 'docker rmi --help'.
```

This is harmless and does not cause the run to fail, but it is confusing. I've adapted the configuration of the `ExecuteStreamCommand` processor to not include it in `execution.error`.
Problem 2: the `execution.error` attribute is cut off before the root cause; in particular, it is truncated to a fixed 4000 characters.
Perhaps we should also add a Kibana link to the NiFi mails.
Good suggestion by @soxofaan as well:

> what would also help is to inject the nifi flow id as correlation id in the logs, so that you can zoom in on the logs of a particular run
> automatically injecting such a correlation id would be another use case for this PoC PR https://github.com/Open-EO/openeo-python-driver/pull/233
@soxofaan I've been playing with this `ExtraLoggingFilter` and an issue that I'm encountering is that log entries for unhandled exceptions do not get the intended extras: https://jenkins.vgt.vito.be/job/openEO/job/openeo-geopyspark-driver/job/707_run_id_ExtraLoggingFilter/1/testReport/junit/tests.test_job_tracker_v2/TestCliApp/test_run_rotating_log/

I suppose it's to be expected because the `ExtraLoggingFilter` will restore the extras upon leaving the `with` block, i.e. before `sys.excepthook` has a chance to output the error.

I'm guessing you have the same issue when you use `ExtraLoggingFilter` in the `queue_job` implementation and an exception bubbles up.
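For illustration, here is a minimal, self-contained sketch of the failure mode (the filter and `with_extra_logging` helper below are simplified stand-ins, not the actual `ExtraLoggingFilter` implementation; requires Python 3.10+ for `logging.Formatter(defaults=...)`). The `finally` clause clears the extras before control reaches `sys.excepthook`, so the unhandled-exception record is logged without them:

```python
import logging
import sys
import threading
from contextlib import contextmanager

_local = threading.local()

class ExtraInjectingFilter(logging.Filter):
    """Copy thread-local extras onto every record passing through the handler."""
    def filter(self, record):
        for key, value in getattr(_local, "extras", {}).items():
            setattr(record, key, value)
        return True

@contextmanager
def with_extra_logging(**extras):
    _local.extras = extras
    try:
        yield
    finally:
        _local.extras = {}  # cleared here, *before* sys.excepthook runs

handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("%(message)s run_id=%(run_id)s", defaults={"run_id": "-"}))
handler.addFilter(ExtraInjectingFilter())
logging.basicConfig(level=logging.INFO, handlers=[handler])

# Route unhandled exceptions through logging, like a typical excepthook would.
sys.excepthook = lambda *exc_info: logging.error("unhandled exception", exc_info=exc_info)

with with_extra_logging(run_id="d9396056"):
    logging.info("inside the block")  # -> "inside the block run_id=d9396056"
    raise RuntimeError("boom")        # excepthook logs "unhandled exception run_id=-"
```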
Maybe `BatchJobLoggingFilter` is a better fit in my case? I'm thinking `BatchJobLoggingFilter` could also be renamed to something more generic like `GlobalLoggingFilter` (and `ExtraLoggingFilter` to `ThreadLocalLoggingFilter`).

Thoughts?
Indeed, `ExtraLoggingFilter.with_extra_logging`, by design, only injects extras in logging calls that happen during the body of the `with` construct. Exceptions triggered from within the body, but handled outside of it, will not receive injections.

If I understand that feature branch correctly, you're looking here into setting a global "run_id". In that case, what is currently called `BatchJobLoggingFilter` indeed looks like a better match to do that.
`BatchJobLoggingFilter` contains nothing inherently related to batch jobs (except for its documentation), so it makes sense to rename it to `GlobalLoggingFilter` or `GlobalExtraLoggingFilter`.
Regarding `ExtraLoggingFilter`: I would rename it to `ContextExtraLoggingFilter`, because the defining feature there is to only add extras inside the `with` context. The thread-local aspect is not that vital.
> I'm guessing you have the same issue when you use `ExtraLoggingFilter` in the `queue_job` implementation and an exception bubbles up.
Good point. I guess the current assumption is that for exceptions that bubble through there, the extra logging fields are not essential (if they are essential, the exception should probably be handled inside the context).
Problem 3: because it's the Python process that writes JSON logs to a rotating log file to be picked up by Filebeat (`/var/log/openeo/job_tracker_python*.log*`), errors that occur outside of Python do not end up in Kibana (they are only written to stderr), e.g.:

```
execution.error = 'kinit: Cannot contact any KDC for realm 'VGT.VITO.BE' while getting initial credentials
```

and

```
execution.error = 'Error response from daemon: Get https://vito-docker-private.artifactory.vgt.vito.be/v2/: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
```
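As a sketch of that pipeline (the file name, fields and rotation settings below are illustrative assumptions, not the actual openeo-geopyspark-driver configuration): only records emitted through Python's logging machinery reach the JSON file that Filebeat ships to Kibana; stderr from external tools like `kinit` or `docker` bypasses it entirely.

```python
import json
import logging
from logging.handlers import RotatingFileHandler

class JsonLinesFormatter(logging.Formatter):
    """Render each record as one JSON object per line, as Filebeat expects."""
    def format(self, record):
        return json.dumps({
            "levelname": record.levelname,
            "name": record.name,
            "message": record.getMessage(),
            "run_id": getattr(record, "run_id", None),
        })

handler = RotatingFileHandler(
    "/var/log/openeo/job_tracker_python.log",  # matches the Filebeat glob above
    maxBytes=10 * 1024 * 1024,  # illustrative rotation settings
    backupCount=1,
)
handler.setFormatter(JsonLinesFormatter())
logging.getLogger().addHandler(handler)
```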
A `run_id` is added to the log entries if you pass it to the `job_tracker-entrypoint.sh` script (optional).
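A minimal sketch of how such a process-wide `run_id` could be attached to every log record (modeled on the `BatchJobLoggingFilter`/`GlobalLoggingFilter` idea discussed above; the argument name and wiring are illustrative, not the actual entrypoint script):

```python
import argparse
import logging

class GlobalExtraLoggingFilter(logging.Filter):
    """Process-wide extras: set once at startup, injected into every record."""
    extras: dict = {}

    def filter(self, record):
        for key, value in self.extras.items():
            setattr(record, key, value)
        return True

parser = argparse.ArgumentParser()
parser.add_argument("--run-id", default=None, help="correlation id for this run (optional)")
args = parser.parse_args()
if args.run_id:
    GlobalExtraLoggingFilter.extras["run_id"] = args.run_id

handler = logging.StreamHandler()
handler.addFilter(GlobalExtraLoggingFilter())
logging.basicConfig(level=logging.INFO, handlers=[handler])
```

Because the extras are process-global, even a record logged from `sys.excepthook` after an unhandled exception still carries the `run_id`, unlike the context-scoped `ExtraLoggingFilter` above.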
TODO: add a Kibana link to the e-mails that looks like this:
Improved the subject line to include the environment: `prod`, `dev` or `integrationtests`.
The body of the e-mail now has a link to this run's logs in Kibana; it also contains all FlowFile attributes, including the `run_id`, `execution.error` (stderr, important for errors that do not originate in Python, like `kinit`) and, again, `deploy_env`.
Stdout of the `ExecuteStreamCommand` is attached to the mail for context and debugging purposes.
Aligned the prod job tracker version in NiFi to that of the web app. Its logs are being populated with run IDs, so now we wait for the errors to come in.
Clicking the link in the first error mail opens Kibana, which reports:

```
Unable to parse URL
```
A `run_id` of d9396056-7553-47e7-9f18-f8d721b15d12 (see above) works in the URL, whereas 23881a3b-b007-4db2-8367-b9dc218593a1 does not. My guess is that Kibana (rightfully) interprets the former as a string because it starts with a letter, whereas it considers the latter a number.
Both seem to work by enclosing the `run_id` in single quotes:

Adapted the URL pattern in the e-mail processor to include single quotes around the `run_id`.
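Concretely, the quoted form that worked looks like this in the Kibana query (the surrounding URL template in the e-mail processor is omitted here):

```
run_id:'23881a3b-b007-4db2-8367-b9dc218593a1'
```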
Originally, the flow file ID (`${uuid}`) was used as the `run_id`, so a run could be looked up in Data Provenance. However, it was apparently still duplicated across prod/dev/integrationtests, i.e. not unique for a particular run; these flow file IDs also seem to disappear from Data Provenance rather quickly, so the freshly generated `${UUID()}` became the `run_id` instead.
NiFi sends an e-mail if the JobTracker process exits with a non-zero exit status. The root cause is not always clear.