ansible / awx

AWX provides a web-based user interface, REST API, and task engine built on top of Ansible. It is one of the upstream projects for Red Hat Ansible Automation Platform.

External logging (splunk) sends "job_events" only for some of the hosts per task, not for all #3489

Open fibbs opened 5 years ago

fibbs commented 5 years ago
ISSUE TYPE
COMPONENT NAME
SUMMARY

I am using the Splunk external logging. I have noticed that the "awx.analytics.job_events" events are sent to the Splunk HEC for only some of the hosts (containing their results), not for all of them.

For example, take the following job output in the AWX UI:

TASK [Gathering Facts] ********************************************************* 14:03:49
ok: [host1]
ok: [host2]
ok: [host3]
ok: [host4]
ok: [host5]
 [WARNING]: flush_handlers task does not support when conditional

In Splunk, I can do the following search:

index=ansible sourcetype=ansible:awx_events logger_name="awx.analytics.job_events" job="836" task="Gathering facts"

...and I find 4 events:

This is just one example. I see this behavior (missing runner_on_ok and also missing runner_on_failed events) for almost ALL tasks. This leads to completely wrong statistics. For example, in the play run above, all five hosts failed at one task, but AWX only sent errors for two of them (this time, host1 and host4).

To make sure it's not an issue with Splunk rejecting or losing messages, I set up Charles Proxy and configured AWX to push all HTTP requests through it. This way, I was able to see every HTTP request issued by all AWX-related Docker containers. The requests I saw matched exactly the events that showed up in Splunk, no more, and there were no rejected requests or anything similar.

I have no clue what to do to analyze this further...

ENVIRONMENT
STEPS TO REPRODUCE

Set up external logging to Splunk for the loggers: awx, activity_stream, job_events, system_tracking.
Start a job containing at least 5 hosts and 5 tasks.
View the results in Splunk.

EXPECTED RESULTS

For the test case mentioned above, a total of 5 events of type "runner_on_ok", one for each host, should be sent to Splunk.

ACTUAL RESULTS

Events are missing; only some of the events are received by Splunk.
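
The comparison between expected and actual results can also be done mechanically on events exported from Splunk. The following is a minimal sketch, not part of AWX or Splunk: the function name `missing_hosts_per_task` is made up for illustration, the field names `event`, `task` and `host_name` are assumed to match the exported job event fields, and the sample data is hypothetical (the issue does not say which host was missing).

```python
from collections import defaultdict

# Ansible callback event names that carry a per-host result.
RESULT_EVENTS = {
    "runner_on_ok",
    "runner_on_failed",
    "runner_on_unreachable",
    "runner_on_skipped",
}


def missing_hosts_per_task(events, expected_hosts):
    """Return {task: [hosts with no result event]} for the given events.

    `events` is a list of dicts, e.g. parsed from a Splunk JSON export.
    The keys "event", "task" and "host_name" are assumptions about the
    exported field names. A task for which no event arrived at all will
    not appear in the result.
    """
    seen = defaultdict(set)
    for ev in events:
        if ev.get("event") in RESULT_EVENTS:
            seen[ev.get("task")].add(ev.get("host_name"))

    expected = set(expected_hosts)
    return {
        task: sorted(expected - hosts)
        for task, hosts in seen.items()
        if expected - hosts
    }


# Hypothetical sample: four of five hosts reported for one task.
hosts = ["host1", "host2", "host3", "host4", "host5"]
sample = [
    {"event": "runner_on_ok", "task": "Gathering Facts", "host_name": h}
    for h in ["host1", "host2", "host4", "host5"]
]
missing = missing_hosts_per_task(sample, hosts)
print(missing)
```

Running this per job makes it easier to see whether the same hosts are always missing or whether the gaps move around between tasks.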

jeczkor commented 3 years ago

Hi

I see the same issue on Tower 3.7.3.

Maybe the volume of events is too big.

Calling the /api/v2/job_events/ endpoint results in '504 Gateway time-out'

I'm considering modifying the rsyslog rate limits: https://www.rsyslog.com/changing-the-settings/

The SystemLogRateLimitInterval determines the amount of time that is being measured for rate limiting. By default this is set to 5 seconds.

The SystemLogRateLimitBurst defines the amount of messages, that have to occur in the time limit of SystemLogRateLimitInterval, to trigger rate limiting. Here, the default is 200 messages. For creating a more effective test, we will alter the default values.

We're using the default values.

Changing them to, e.g.:

$SystemLogRateLimitInterval 1
$SystemLogRateLimitBurst 100

might help.
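
To get a feel for what these two settings mean in numbers, here is a rough sketch of interval/burst rate limiting. It is a simplified model based only on the description quoted above, not rsyslog's actual implementation, and whether these limits apply to the AWX logging path at all is still an open question in this thread. The function name `count_dropped` and the message rate (1000 messages spread evenly over 5 seconds) are made up for illustration.

```python
def count_dropped(timestamps_ms, interval_ms=5000, burst=200):
    """Count messages dropped under a simple interval/burst limit.

    Simplified model (an assumption, not rsyslog's code): a window
    starts at the first message; within each window of `interval_ms`
    the first `burst` messages pass and the rest are dropped. A new
    window starts with the first message after the old one expired.
    """
    window_start = None
    passed_in_window = 0
    dropped = 0
    for t in sorted(timestamps_ms):
        if window_start is None or t - window_start >= interval_ms:
            window_start = t
            passed_in_window = 0
        if passed_in_window < burst:
            passed_in_window += 1
        else:
            dropped += 1
    return dropped


# Hypothetical load: 1000 messages, one every 5 ms (200 per second).
timestamps = [i * 5 for i in range(1000)]

# Defaults quoted above: interval of 5 seconds, burst of 200 messages.
dropped_default = count_dropped(timestamps)

# Proposed values: interval of 1 second, burst of 100 messages.
dropped_proposed = count_dropped(timestamps, interval_ms=1000, burst=100)

print(dropped_default, dropped_proposed)
```

In this model the proposed values let more messages through in total (100 per second instead of 200 per 5 seconds), but a sustained rate above the limit still loses messages, so raising the burst value further, or checking whether rate limiting is involved at all, may be needed.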