External logging (Splunk) sends "job_events" for only some of the hosts per task, not for all

ISSUE TYPE

Bug Report

COMPONENT NAME

SUMMARY

I am using Splunk external logging. I have noticed that the "awx.analytics.job_events" events are sent to the Splunk HEC for only some of the hosts (with their results), not for all of them.

For example, given the following job output in the AWX UI:

TASK [Gathering Facts] ********************************************************* 14:03:49
ok: [host1]
ok: [host2]
ok: [host3]
ok: [host4]
ok: [host5]
 [WARNING]: flush_handlers task does not support when conditional

In Splunk, I can do the following search:

index=ansible sourcetype=ansible:awx_events logger_name="awx.analytics.job_events" job="836" task="Gathering facts"

...and I find 4 events:

one with event="playbook_on_task_start" which just says "task has started"
one with event="runner_on_ok", stdout="OK: host3"
one with event="runner_on_ok", stdout="OK: host4"
one with event="warning", stdout="Warning: flush_handlers task does not support when conditional"

This is just one example. I see this behavior (missing runner_on_ok and also missing runner_on_failed events) for almost ALL tasks, which makes the resulting statistics wrong. For example, in the play run above, all five hosts failed at one task, but AWX only sent errors for two of them (this time, host1 and host4).

To make sure it's not an issue with Splunk rejecting or losing messages, I set up Charles Proxy and configured AWX to push all HTTP requests through it. This let me see every HTTP request issued by all AWX-related Docker containers. I saw exactly the events that showed up in Splunk, no more, and no rejected requests.

I have no clue what to do to analyze this further...
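One way to narrow this down might be to compute, per task, which hosts appear in the job output but have no matching event in Splunk. The following is only a minimal sketch: the helper names are hypothetical, the set of received hosts is typed in by hand from the Splunk search above, and the regular expression assumes the default Ansible stdout format shown in the UI.

```python
import re

# Matches per-host result lines such as "ok: [host1]" or
# "fatal: [host2]: FAILED! => ..." (assumes the default stdout format).
RESULT_LINE = re.compile(
    r"^(ok|changed|skipping|fatal|failed|unreachable): \[([^\]]+)\]"
)


def hosts_in_output(stdout_text):
    """Return the set of hosts that have a per-host result line."""
    hosts = set()
    for line in stdout_text.splitlines():
        match = RESULT_LINE.match(line.strip())
        if match:
            hosts.add(match.group(2))
    return hosts


def missing_hosts(stdout_text, received_hosts):
    """Hosts present in the job output but absent from the received events."""
    return sorted(hosts_in_output(stdout_text) - set(received_hosts))


ui_output = """TASK [Gathering Facts] *****
ok: [host1]
ok: [host2]
ok: [host3]
ok: [host4]
ok: [host5]
 [WARNING]: flush_handlers task does not support when conditional
"""

# Hosts for which Splunk returned a runner_on_ok event in the example above.
received = {"host3", "host4"}

print(missing_hosts(ui_output, received))
```

Running this per task over a whole job would show whether the same hosts are always dropped or whether the loss is spread randomly.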

ENVIRONMENT

AWX version: 3.0.1.0
AWX install method: docker on linux
Ansible version: 2.7.7
Operating System: CentOS
Web Browser: Safari, Chrome

STEPS TO REPRODUCE

Set up external logging to Splunk for the loggers: awx, activity_stream, job_events, system_tracking
Start a job containing at least 5 hosts and 5 tasks
View the results in Splunk

EXPECTED RESULTS

For the test case above, a total of 5 events of type "runner_on_ok", one for each host, should be sent to Splunk.

ACTUAL RESULTS

Events are missing; only some of the events are received by Splunk.

I see the same issue on Tower 3.7.3.

Maybe the volume of events is too big.

Calling the /api/v2/job_events/ endpoint results in '504 Gateway time-out'
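If the unfiltered endpoint times out, it may be worth querying a single job with a filter and a small page size instead. This is a rough sketch, not a verified fix: the base URL is a placeholder, the helper names are hypothetical, and it assumes the per-job sublist `/api/v2/jobs/<id>/job_events/` with `event`, `page_size` and `page` query parameters, and `task` and `host_name` fields on each result. Whether a narrower query avoids the 504 is also an assumption.

```python
from collections import defaultdict
from urllib.parse import urlencode


def job_events_url(base_url, job_id, event=None, page_size=50, page=1):
    """Build a filtered, paginated job events URL for one job."""
    params = {"page_size": page_size, "page": page}
    if event is not None:
        params["event"] = event
    return (
        f"{base_url.rstrip('/')}/api/v2/jobs/{job_id}/job_events/"
        f"?{urlencode(params)}"
    )


def hosts_by_task(results):
    """Group host names by task from the 'results' list of one API page."""
    grouped = defaultdict(set)
    for item in results:
        if item.get("host_name"):
            grouped[item.get("task", "")].add(item["host_name"])
    return {task: sorted(hosts) for task, hosts in grouped.items()}


# Hand-written sample shaped like the assumed API response, not real data.
sample_results = [
    {"event": "runner_on_ok", "task": "Gathering Facts", "host_name": "host3"},
    {"event": "runner_on_ok", "task": "Gathering Facts", "host_name": "host4"},
]

print(job_events_url("https://awx.example.com", 836, event="runner_on_ok"))
print(hosts_by_task(sample_results))
```

Comparing the hosts the API returns for a task with the hosts Splunk received would show whether events are lost before or after they are stored in the database.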

I'm considering modifying the rsyslog rate limits: https://www.rsyslog.com/changing-the-settings/

SystemLogRateLimitInterval sets the length of the time window that is measured for rate limiting. By default this is 5 seconds.

SystemLogRateLimitBurst defines the number of messages that have to occur within SystemLogRateLimitInterval to trigger rate limiting. Here, the default is 200 messages. For creating a more effective test, we will alter the default values.

We're using the default values.

Changing them to, e.g.:

$SystemLogRateLimitInterval 1
$SystemLogRateLimitBurst 100

might help.
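To see what the two settings would mean for a burst of events, here is a rough fixed-window model of interval/burst rate limiting. It is a simplified illustration only, not rsyslog's actual implementation, and the burst of 500 events in 5 seconds is a made-up workload. Times are in milliseconds to keep the arithmetic exact.

```python
def dropped_messages(timestamps_ms, interval_ms, burst):
    """Count messages dropped by a simple fixed-window rate limiter.

    A window opens at the first message and lasts interval_ms. Within a
    window, every message after the first `burst` messages is dropped.
    """
    window_start = None
    count = 0
    dropped = 0
    for t in sorted(timestamps_ms):
        if window_start is None or t - window_start >= interval_ms:
            window_start = t  # open a new window and reset the counter
            count = 0
        count += 1
        if count > burst:
            dropped += 1
    return dropped


# Hypothetical workload: 500 events, one every 10 ms, over 5 seconds.
events = list(range(0, 5000, 10))

# Defaults quoted above: 200 messages per 5 seconds (40 per second sustained).
print(dropped_messages(events, interval_ms=5000, burst=200))

# Proposed values: 100 messages per 1 second (100 per second sustained).
print(dropped_messages(events, interval_ms=1000, burst=100))
```

Under this model the defaults drop 300 of the 500 events, while the proposed values drop none, because the sustained limit rises from 40 to 100 messages per second.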
