Worker jobs seem to fail intermittently because of transient request errors, but we currently have no logging or other visibility into worker behaviour.
Some metrics to collect per worker:
Number of succeeded jobs
Number of failed jobs
Mean time to job completion (possibly segmented by outcome: succeeded or failed)
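
As a starting point, the metrics above could be tracked with a small in-process collector. This is only a rough sketch, assuming workers run jobs as Python callables and that a failure surfaces as a raised exception; `WorkerMetrics` and `run_job` are hypothetical names, not existing code in this project.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional


@dataclass
class WorkerMetrics:
    """Hypothetical per-worker collector: job counts and durations by outcome."""

    succeeded: int = 0
    failed: int = 0
    # Total seconds spent in jobs, keyed by outcome ("succeeded" / "failed").
    total_seconds: Dict[str, float] = field(
        default_factory=lambda: {"succeeded": 0.0, "failed": 0.0}
    )

    def record(self, ok: bool, seconds: float) -> None:
        """Count one finished job and add its duration to the matching bucket."""
        if ok:
            self.succeeded += 1
            self.total_seconds["succeeded"] += seconds
        else:
            self.failed += 1
            self.total_seconds["failed"] += seconds

    def mean_seconds(self, outcome: str) -> Optional[float]:
        """Mean job duration for an outcome, or None if no such jobs ran yet."""
        count = self.succeeded if outcome == "succeeded" else self.failed
        return self.total_seconds[outcome] / count if count else None


def run_job(metrics: WorkerMetrics, job: Callable[[], object]) -> object:
    """Run one job, timing it and recording success or failure (assumed wrapper)."""
    start = time.monotonic()
    try:
        result = job()
    except Exception:
        # Assumption: any exception counts as a failed job; re-raise so the
        # worker's existing error handling still sees it.
        metrics.record(False, time.monotonic() - start)
        raise
    metrics.record(True, time.monotonic() - start)
    return result
```

Each worker would hold one `WorkerMetrics` instance and pass its jobs through `run_job`. Keeping total durations per outcome, rather than a single running mean, lets the mean be reported both overall and split by succeeded or failed.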
Some things to log: