ansible / awx

AWX provides a web-based user interface, REST API, and task engine built on top of Ansible. It is one of the upstream projects for Red Hat Ansible Automation Platform.

splunk logging is not working with the default installation #335

Closed oguzy closed 6 years ago

oguzy commented 6 years ago
SUMMARY

After setting up Splunk logging in the AWX UI, I am not able to see any logs in the Splunk interface.

docker logs -f awx_web

2017-10-02 08:48:39,088 WARNING  awx.api.generics status 500 received by user admin attempting to access /api/v2/settings/logging/test/ from 10.35.100.2
[pid: 443|app: 0|req: 391/8226] 10.35.100.2 () {52 vars in 2293 bytes} [Mon Oct  2 08:48:39 2017] POST /api/v2/settings/logging/test/ => generated 26 bytes in 95 msecs (HTTP/1.1 500) 7 headers in 216 bytes (1 switches on core 0)
10.35.100.2 - - [02/Oct/2017:08:48:39 +0000] "POST /api/v2/settings/logging/test/ HTTP/1.1" 500 37 "https://tower.app.foo.net/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36" 
STEPS TO REPRODUCE

Below are my settings:

(screenshot: logging settings, 2017-10-02 13:07)

EXPECTED RESULTS

To see some output in the Splunk search:

index=awx

ACTUAL RESULTS

No results

ADDITIONAL INFORMATION

A token was created for submitting logs. My inventory file is as below:

localhost ansible_connection=local ansible_python_interpreter="/usr/bin/env python"
[all:vars]
use_container_for_build=true
awx_official=false
postgres_data_dir=/mnt/pgdocker
host_port=80
pg_username=awx
pg_password=xxx!
pg_database=awx
pg_port=5432
awx_secret_key=xxx
AlanCoding commented 6 years ago

A couple of things that might be helpful to try here:

matburt commented 6 years ago

poke, can you provide any more information here?

oguzy commented 6 years ago

There is no HTTP option, only HTTPS, TCP and UDP are available for Splunk. I tried HTTPS and it didn't work either. I never tried restarting the Docker image.

oguzy commented 6 years ago

Tried restarting the image as well. Still getting errors when I click test.

oguzy commented 6 years ago

What kind of information can I share?

AlanCoding commented 6 years ago

The test endpoint is supposed to return a fairly verbose message about what went wrong with the request (the request the server makes to your logging aggregator). If you haven't seen this, it could be that the UI just isn't passing it through. If you could either use the API browser to submit the data for the test, or test in the UI and gather the (redacted, of course) request and response data, that should tell us either what configuration is causing trouble or what error message the server is responding with.

Independently check collector

For Splunk, we always like to start with independent verification of the event receiver, from their docs:

http://docs.splunk.com/Documentation/SplunkCloud/6.6.3/Data/UsetheHTTPEventCollector

curl -k https://hec.example.com:8088/services/collector/event -H "Authorization: Splunk B5A79AAD-D822-46CC-80D1-819F80D7BFB0" -d '{"event": "hello world"}'
{"text": "Success", "code": 0}

If you have replicated this, great, that supports the expectation that AWX should be sending the message successfully with those exact values.

Configuration references

Additionally, if you have a look at the README.md in this folder:

https://github.com/ansible/awx/tree/devel/tools/docker-compose

You can see request data that is verified to work and completely reproducible with the container-based logstash we have in a docker-compose target.

Splunk is only slightly different. It also handles the port differently: AWX could replace the port value in the URL with the port you give, but it's easier not to set a port at all, so AWX just uses the URL as-is.
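To illustrate that URL handling, here is a hypothetical helper (not AWX's actual code, and the function name is made up): leaving the port null means the configured URL is used as-is, while a port value would be swapped into the URL.

```python
from urllib.parse import urlsplit, urlunsplit

def build_aggregator_url(host, port=None):
    """Hypothetical sketch: with port=None the configured URL is used
    as-is; with a port, it replaces the port embedded in the URL."""
    if port is None:
        return host
    parts = urlsplit(host)
    netloc = "{}:{}".format(parts.hostname, port)
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))

# With the Splunk HEC config below, leaving LOG_AGGREGATOR_PORT null
# keeps the full collector URL untouched.
url = build_aggregator_url("https://hec.example.com:8088/services/collector/event")
```

This matches the advice in the thread: put the complete collector endpoint in the host field and leave the port setting null.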

{
    "LOG_AGGREGATOR_HOST": "https://hec.example.com:8088/services/collector/event",
    "LOG_AGGREGATOR_PORT": null,
    "LOG_AGGREGATOR_TYPE": "splunk",
    "LOG_AGGREGATOR_USERNAME": "",
    "LOG_AGGREGATOR_PASSWORD": "B5A79AAD-D822-46CC-80D1-819F80D7BFB0",
    "LOG_AGGREGATOR_LOGGERS": [
        "awx",
        "activity_stream",
        "job_events",
        "system_tracking"
    ],
    "LOG_AGGREGATOR_INDIVIDUAL_FACTS": false,
    "LOG_AGGREGATOR_ENABLED": true,
    "LOG_AGGREGATOR_LEVEL": "INFO"
}

(The URL and token are from the Splunk documentation example and are not meant to be used.)

In addition to this, you might want to try toggling the protocol and verify-certificate parameters. But once you confirm the endpoint accepts messages with your token, and then see the test response, you should know what type of error the server is responding with.

oguzy commented 6 years ago
# curl -k http://10.35.75.26:8088/services/collector/event -H "Authorization: Splunk 759D1202-BD2A-4F09-B74A-0328B32D66FE" -d '{"event": "hello world"}'
{"text":"Success","code":0}

I found the problem:

Changing the logging aggregator field to http://10.35.75.26:8088/services/collector/event worked; before, I was entering just the IP address. The test is successful now. But I am still not able to see the template run outputs in Splunk, though I do see the test output.

oguzy commented 6 years ago

I am choosing a job template log level of normal and a log aggregator level of INFO. I tried different options too; none of them helped me see the output logs in Splunk.

oguzy commented 6 years ago

Can it be because my project's ansible.cfg has the lines below?

[defaults]
host_key_checking = False
stdout_callback = json
AlanCoding commented 6 years ago

No, I don't think so. When AWX runs a job, it will set environment variables that should override those anyway.

Given that you have the logging test working, then I would investigate the logs during the container startup (to see if something obviously failed applying the logging configuration), and the logs during an action that is expected to be emitting a log. If both of these are silent, while you have the config that should work saved to the database, then that makes this tricky, and there could be something specific to this particular install method that I haven't thought of.

oguzy commented 6 years ago

I updated the git repo and reinstalled. Without changing the connection method from HTTPS, just adding the URL address, I can now see AWX activity logs. (screenshot: Splunk search results, 2017-10-17)

But I am still not able to see the job template logs. Is this normal? Should I use some callback to send job logs to Splunk?

AlanCoding commented 6 years ago

Yes, the configuration you posted was obviously enabling the above loggers in addition to the job_events. These are emitted as you run a job on a task-by-task basis, containing task-related data. Actually, if that's the configuration you used, the DEBUG messages from celery beat shouldn't be showing up either. Something is likely wrong related to the container setup combined with the logging configuration. I don't have any good ideas right now, because both beat (which you are getting messages from) and the callback receiver are running under supervisor in the same container.

matburt commented 6 years ago

We may be setting up something incorrectly for base logging here? https://github.com/ansible/awx/blob/devel/installer/image_build/files/settings.py

AlanCoding commented 6 years ago

^ I think that LOGGING['loggers']['awx']['handlers'] = ['console'] should lead to a conflict. For the awx logger, we leave in a NullHandler specifically for the purpose of swapping out for the external logger, depending on what the DB settings are on startup. Since the awx logger only has the console handler there, something probably won't work. That should cause the beat messages and other task system logs to be dropped, but shouldn't affect the specifically enumerated analytics loggers, like job_events. That's the opposite of what was observed.
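The NullHandler-swap pattern described above can be sketched with plain Python logging (a hedged illustration, not AWX's actual code; the logger name is made up to avoid clashing with a real awx logger):

```python
import logging

class CapturingHandler(logging.Handler):
    """Stand-in for the external aggregator handler."""
    def __init__(self):
        super().__init__()
        self.records = []

    def emit(self, record):
        self.records.append(record.getMessage())

# The logger ships with a NullHandler placeholder...
logger = logging.getLogger("awx_demo")
logger.setLevel(logging.INFO)
logger.addHandler(logging.NullHandler())

# ...which startup code swaps for the external handler. If settings.py
# replaced the placeholder with ['console'], there is no NullHandler
# left to swap, which is the suspected conflict.
external = CapturingHandler()
logger.handlers = [
    external if isinstance(h, logging.NullHandler) else h
    for h in logger.handlers
]

logger.info("Cluster node heartbeat task.")
```

With the placeholder present the swap succeeds and the message reaches the external handler; with only a console handler configured, the swap would find nothing to replace.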

FloThinksPi commented 6 years ago

I am also experiencing this with Logstash; no logs are transferred, only test messages get sent correctly. @matburt I also suspect some initial logging settings, because the AWX container logged to stdout and not to the files shown in the Tower documentation.

Best regards Flo

FloThinksPi commented 6 years ago

So my settings are following on a stock Container setup straight from Docker Hub:

{
    "LOG_AGGREGATOR_HOST": "172.17.0.1",
    "LOG_AGGREGATOR_PORT": 5000,
    "LOG_AGGREGATOR_TYPE": "logstash",
    "LOG_AGGREGATOR_USERNAME": "",
    "LOG_AGGREGATOR_PASSWORD": "",
    "LOG_AGGREGATOR_LOGGERS": [
        "awx",
        "activity_stream",
        "job_events",
        "system_tracking"
    ],
    "LOG_AGGREGATOR_INDIVIDUAL_FACTS": true,
    "LOG_AGGREGATOR_ENABLED": true,
    "LOG_AGGREGATOR_TOWER_UUID": "",
    "LOG_AGGREGATOR_PROTOCOL": "tcp",
    "LOG_AGGREGATOR_TCP_TIMEOUT": 5,
    "LOG_AGGREGATOR_VERIFY_CERT": true,
    "LOG_AGGREGATOR_LEVEL": "INFO"
}

And I can confirm the logs only show up via the Docker logs, e.g. docker logs -f awx_task. Furthermore, when using the docker-latest of RHEL, all logs get smashed together in /var/log/messages, so I need external logging :). AFAIK the external logging system seems not to be functioning at all. @AlanCoding do you have a recommendation for how to debug this issue properly?

I already tried modifying /etc/tower/settings.py on both containers, but after restarting them I could not observe any change in the logging behaviour.

AlanCoding commented 6 years ago

The thing that I would be hoping for when debugging this is for you to hit this log message:

https://github.com/ansible/awx/blob/0579db1162146d65a8cc3aed19da76e81b149ae6/awx/main/utils/handlers.py#L87

In the event of connection errors, it will log (just once) to the local log files the traceback related to the connection error. In the absence of that, there's little I can speculate about. The only avenue is to start replicating from the ground up with the open-source installer, tracing the code that applies the logger according to the database settings, which is pretty non-trivial.
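The "log just once" behaviour described there can be sketched like this (a hypothetical illustration of the pattern, not the code in handlers.py; all names here are made up):

```python
attempts = []

class LocalLog:
    """Stand-in for the local file logger."""
    def exception(self, msg):
        attempts.append(msg)

class LogOnceOnError:
    """On repeated connection errors, only the first traceback is
    written to the local logs, to avoid flooding them."""
    def __init__(self, local_logger):
        self._local = local_logger
        self._already_logged = False

    def send(self, payload, transport):
        try:
            transport(payload)
            self._already_logged = False  # reset after a success
        except ConnectionError:
            if not self._already_logged:
                self._local.exception("Failed sending to log aggregator")
                self._already_logged = True

def broken_transport(payload):
    raise ConnectionError("aggregator unreachable")

sender = LogOnceOnError(LocalLog())
sender.send("event-1", broken_transport)
sender.send("event-2", broken_transport)
```

The practical consequence for debugging: if the connection to the aggregator is failing, you should find exactly one traceback in the local logs, not one per event.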

FloThinksPi commented 6 years ago

@AlanCoding I think the logger works fine. I didn't see that exception on container creation, nor later while using AWX. In my opinion, the code for external logging works because

https://github.com/ansible/awx/blob/0579db1162146d65a8cc3aed19da76e81b149ae6/awx/main/utils/handlers.py#L208

gives a valid entry in Logstash. So I suspect the configuration of the complete log setup is messed up, not the actual logging mechanism.

So how about these two files?

https://github.com/ansible/awx/blob/bfea00f6dc6af0fb01057ce38e9d0337e6c589aa/awx/settings/development.py#L22-L28

and

https://github.com/ansible/awx/blob/3118092e51ff037bbccd48b22cf8dabfda53516b/awx/settings/production.py#L53-L59

Do the containers accidentally use the development.py settings instead of the production.py values?

AlanCoding commented 6 years ago

I think the local settings file is awx/installer/image_build/files/settings.py, put there by the Dockerfile. In the open source installer it should still use production settings, so if I'm correct, the final settings are a combination of defaults.py, production.py, and settings.py. The final settings.py gets the final say in the logging settings. These interact with the dynamic logger setup here (using the values from the database which you save in the UI / API browser):

https://github.com/ansible/awx/blob/8666512d994bf9d81061d8df4955604cb66db57e/awx/main/utils/handlers.py#L341

But nothing is jumping out at me right now. Even though the handler is set to console, it should append the new handler for the external logger to that logger. This might inconsistently happen between celery and the web server containers, I really don't know.
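The layering can be sketched as plain dict merges (file names from the comment above; the key names and merge logic here are only illustrative of override order, not Django's actual settings machinery):

```python
# Each later settings file overrides keys set by earlier ones.
defaults_py = {"LOG_AGGREGATOR_ENABLED": False, "AWX_HANDLERS": ["null"]}
production_py = {"DEBUG": False}
image_settings_py = {"AWX_HANDLERS": ["console"]}  # installer/image_build/files/settings.py

final_settings = {}
for layer in (defaults_py, production_py, image_settings_py):
    final_settings.update(layer)

# The image's settings.py wins, leaving only the console handler,
# which is the situation discussed above.
```

Under this model, whatever the image's settings.py sets for the logging handlers is what the dynamic logger setup finds at startup.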

mattelphick commented 6 years ago

I think I am experiencing the same/similar issue. I can see test log messages in Splunk and also messages that look like they are coming from the main awx service_logs but I don't think I'm seeing anything from activity_stream, job_events or system_tracking.

Not sure if it's relevant but when I click "REVERT" on the "LOGGERS SENDING DATA TO LOG AGGREGATOR FORM" field it removes all the spaces. Then if I click save it just hangs with the "Working" cog spinning.

I am running version AWX 1.0.1.153 and Ansible 2.4.1.0

Any ideas? Thanks

AlanCoding commented 6 years ago

^ I could not replicate the UI thing on that specific "revert" button, but clicking around other places I got it to hang.

I saw this in the JS logs:

angular.js:14794 TypeError: input.replace is not a function
    at Object.listToArray (configurationUtils.service.js:31)
    at configuration.controller.js:414
    at arrayEach (index.js:1361)
    at Function.<anonymous> (index.js:3405)
    at getFormPayload (configuration.controller.js:390)
    at exports.default.formSave (configuration.controller.js:442)
    at fn (eval at compile (angular.js:15645), <anonymous>:4:164)
    at callback (angular.js:27451)
    at Scope.$eval (angular.js:18521)
    at Scope.$apply (angular.js:18620)

This should be a new UI issue. Are these logs the same as what you can see?

mattelphick commented 6 years ago

Sorry, I'm fairly new to AWX. Where would I find those logs?

mattelphick commented 6 years ago

I have also noticed that I am getting debug messages through to Splunk even though AWX is only configured for INFO messages.

I had changed the level to DEBUG as part of my testing but I have changed it back through the GUI.

{
  @timestamp: 2017-12-12T19:22:21.687Z
  cluster_host_id: awx
  host: awx
  level: DEBUG
  logger_name: awx.main.tasks
  message: Cluster node heartbeat task.
  type: splunk
}

AlanCoding commented 6 years ago

The JavaScript logs like the one I pasted can be recorded in Chrome by right clicking -> inspect -> Console

mattelphick commented 6 years ago

Here's what I see...

Failed to load resource: the server responded with a status of 400 (BAD REQUEST)
vendor.4c1b9c38e5c8ab72ad49.js:436 TypeError: input.replace is not a function
    at Object.listToArray (app.b6a5e3c7df10a96d5ccf.js:403)
    at app.b6a5e3c7df10a96d5ccf.js:403
    at arrayEach (app.b6a5e3c7df10a96d5ccf.js:29)
    at Function.<anonymous> (app.b6a5e3c7df10a96d5ccf.js:29)
    at getFormPayload (app.b6a5e3c7df10a96d5ccf.js:403)
    at exports.default.formSave (app.b6a5e3c7df10a96d5ccf.js:403)
    at fn (eval at compile (vendor.4c1b9c38e5c8ab72ad49.js:436), <anonymous>:4:164)
    at callback (vendor.4c1b9c38e5c8ab72ad49.js:436)
    at Scope.$eval (vendor.4c1b9c38e5c8ab72ad49.js:436)
    at Scope.$apply (vendor.4c1b9c38e5c8ab72ad49.js:436)
(anonymous) @ vendor.4c1b9c38e5c8ab72ad49.js:436
oguzy commented 6 years ago

Mine is working, by luck.


mattelphick commented 6 years ago

Does anyone have any further insight into what the issue might be? Is there an underlying config file that I can check to see what's happening? Thanks.

FloThinksPi commented 6 years ago

Hi all,

Not sure if it's relevant but when I click "REVERT" on the "LOGGERS SENDING DATA TO LOG AGGREGATOR FORM" field it removes all the spaces. Then if I click save it just hangs with the "Working" cog spinning.

I also experience this. AWX 1.0.1.240

(screenshot: loggers field before clicking revert)

When clicking revert:

(screenshot: loggers field after revert, with spaces removed)

When clicking revert it requests https://awx.xxxx.com/api/v2/settings/all/, and in the response the aggregators are present: "LOG_AGGREGATOR_LOGGERS": ["awx", "activity_stream", "job_events", "system_tracking"]

When saving, there is this JS error, and the page hangs with a loading wheel.

vendor.de7dd9b7b75a5ad9b8e5.js:436 TypeError: input.replace is not a function
    at Object.listToArray (app.26612aa730614b7b6215.js:403)
    at app.26612aa730614b7b6215.js:403
    at arrayEach (app.26612aa730614b7b6215.js:29)
    at Function.<anonymous> (app.26612aa730614b7b6215.js:29)
    at getFormPayload (app.26612aa730614b7b6215.js:403)
    at exports.default.formSave (app.26612aa730614b7b6215.js:403)
    at fn (eval at compile (vendor.de7dd9b7b75a5ad9b8e5.js:436), <anonymous>:4:164)
    at callback (vendor.de7dd9b7b75a5ad9b8e5.js:436)
    at Scope.$eval (vendor.de7dd9b7b75a5ad9b8e5.js:436)
    at Scope.$apply (vendor.de7dd9b7b75a5ad9b8e5.js:436)
(anonymous) @ vendor.de7dd9b7b75a5ad9b8e5.js:436
(anonymous) @ vendor.de7dd9b7b75a5ad9b8e5.js:436
$apply @ vendor.de7dd9b7b75a5ad9b8e5.js:436
(anonymous) @ vendor.de7dd9b7b75a5ad9b8e5.js:436
dispatch @ vendor.de7dd9b7b75a5ad9b8e5.js:29
elemData.handle @ vendor.de7dd9b7b75a5ad9b8e5.js:29

But I don't suspect this to be the actual logging problem where the external logging mechanism is not functional. The streams are correctly set, as you can see from https://awx.xxxx.com/api/v2/settings/all/. This issue is just a UI problem; maybe it parses the JSON response wrong and inserts it without spaces.


For a production environment, a functional logging mechanism is an important feature: with a central automation system, one has to comply with company regulations that require actions to be traceable to each administrator. So it is important for us to get this working, and I am willing to invest some time in this issue.

I think I am experiencing the same/similar issue. I can see test log messages in Splunk and also messages that look like they are coming from the main awx service_logs but I don't think I'm seeing anything from activity_stream, job_events or system_tracking.

What you described here is also the case with an ELK stack as the external logger, and that's the real point of this issue: external logging seems not to function regardless of the external logging software. So @AlanCoding, how could we advance this issue? We need some info on how logging works, or should work, in the Docker environment. What can we do to debug this, as it seems to be easy to reproduce?

Best regards from Germany Flo

FloThinksPi commented 6 years ago

Quite frankly, my external logger now seems to work. I did an update to 1.0.2; the external logger was still not working and only printing my test calls. Then I flipped the external logger switch ON -> OFF -> ON.

Strange 😃. @oguzy and @mattelphick, can you confirm this for Splunk?

mattelphick commented 6 years ago

I am still on 1.0.1.153. I toggled the External Logger Switch and clicked save. However I don't think I am seeing the logs for "job_events" still.

That said, having never seen what successful logging looks like maybe I am interpreting the output incorrectly.

I do see a message that looks like this...

(screenshot: Splunk 7.0.0 search results)

...but I am not sure which logger generates this message.

What I am hoping to see from the "job_events" logger is more detail like this...

(screenshot: AWX job output for "623 - confluence_netbox_device_types")

Am I misinterpreting things? What sort of detail should I see from the "job_events" logger?

Thanks

mattelphick commented 6 years ago

Can anyone help with the question above? Thanks

mattelphick commented 6 years ago

We have upgraded to version 1.0.2.303. However I am still not clear on whether the loggers are sending data externally.

This is the list of logger_names that I see. How do the logger_names in the log output relate to the four loggers in the configuration section (awx, activity_stream, job_events, system_tracking)?

(screenshot: list of logger_name values seen in the log output)

FloThinksPi commented 6 years ago

Hi @mattelphick

I can look it up at the company on Monday, but as far as I remember I have similar logs in ELK. Mainly it was important for us to log changes by users on templates, projects etc., which get correctly written to the external logger, as well as what the specific user changed. I don't know if the actual output of a playbook gets logged. You can search your logs for User; there should be log entries for each changed object, what was changed, and by whom. I can confirm that I get those logs for sure. As far as I can see, the logging itself seems to work: you get more than just test messages in Splunk.

I also don't know how the log channels in the settings relate to the log tags, nor does the Ansible Tower documentation say much more about these, other than:

https://docs.ansible.com/ansible-tower/latest/html/administration/logging.html#set-up-logging-with-tower

Loggers to Send Data to the Log Aggregator Form: All four types of data are pre-populated by default. Click the tooltip help icon next to the field for additional information on each data type. Delete the data types you do not want.

Best Regards Flo
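One detail that may help interpret the logger_name values: Python loggers are hierarchical, so a record from a dotted name like awx.main.tasks propagates up to a handler attached to the awx logger, while job_events, activity_stream, and system_tracking are separate top-level loggers. A small sketch (demo logger names, not the real AWX ones):

```python
import logging

captured = []

class Capture(logging.Handler):
    def emit(self, record):
        captured.append(record.name)

# Handler attached to the parent logger, as the "awx" entry in
# LOG_AGGREGATOR_LOGGERS effectively arranges.
parent = logging.getLogger("awx_hier_demo")
parent.setLevel(logging.INFO)
parent.addHandler(Capture())

# A record logged under a dotted child name reaches the parent's
# handler, which is why logger_name shows e.g. "awx.main.tasks".
logging.getLogger("awx_hier_demo.main.tasks").info("heartbeat")
```

So any logger_name starting with awx. belongs to the awx channel, while the other three channels emit records under their own names.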

matburt commented 6 years ago

No comment in 3 months, closing. Please provide more detail if you have it.