ansible / awx

AWX provides a web-based user interface, REST API, and task engine built on top of Ansible. It is one of the upstream projects for Red Hat Ansible Automation Platform.
Other
14k stars 3.42k forks source link

AWX HA (2 instances) Invalid search term entered. GET returned: 500 A server error has occurred #7440

Closed mjpalacios closed 4 years ago

mjpalacios commented 4 years ago
ISSUE TYPE
SUMMARY

I have installed two instances of AWX into separate data centers for HA purposes and put an F5 GTM in front. The GTM is configured active/active. Both instances are using the same database. Issue is, when both instances of AWX are running simultaneously, job submissions succeed, but job execution log retrievals fail. AWX displays an error message "Invalid search term entered. GET returned: 500 A server error has occurred." This happens even when one instance is disabled in the GTM. The only way to prevent it from happening is to leave only one instance running and fully shutdown the other.

ENVIRONMENT
STEPS TO REPRODUCE
  1. Log into AWX
  2. Launch any job template
  3. Display the job list
  4. Attempt to open the job execution log from the job list.
EXPECTED RESULTS

Job execution details

ACTUAL RESULTS

Window pops up with message:

ERROR!
Invalid search term entered. GET returned: 500 A server error has occurred.
ADDITIONAL INFORMATION

image

ryanpetrello commented 4 years ago

@mjpalacios,

Can you check out the web container logs and see if there's a traceback with details?

mjpalacios commented 4 years ago

I attempted opening the execution log of a job (id=10730) that was submitted while both instances were running and the entries below showed up in the container's log: (URLs and IP addresses have been sanitized)

192.168.1.1 - - [25/Jun/2020:02:40:42 +0000] "GET / HTTP/1.1" 200 11459 "-" "-" "192.168.2.8, 192.168.3.11"
2020-06-25 02:40:42,840 DEBUG    awx.main.consumers Channel specific.NGGqtyoc!CruwJTPDabWH left groups {'schedules-changed'} and joined {'jobs-summary', 'job_events-10730'}
2020-06-25 02:40:42,840 DEBUG    Channel specific.NGGqtyoc!CruwJTPDabWH left groups {'schedules-changed'} and joined {'jobs-summary', 'job_events-10730'}
2020-06-25 02:40:43,238 ERROR    django.request Internal Server Error: /api/v2/jobs/10730/
Traceback (most recent call last):
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/django/core/handlers/exception.py", line 34, in inner
    response = get_response(request)
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/django/core/handlers/base.py", line 115, in _get_response
    response = self.process_exception_by_middleware(e, request)
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/django/core/handlers/base.py", line 113, in _get_response
    response = wrapped_callback(request, *callback_args, **callback_kwargs)
  File "/usr/lib64/python3.6/contextlib.py", line 52, in inner
return func(*args, **kwds)
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/django/views/decorators/csrf.py", line 54, in wrapped_view
return view_func(*args, **kwargs)
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/django/views/generic/base.py", line 71, in view
return self.dispatch(request, *args, **kwargs)
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/awx/api/generics.py", line 299, in dispatch
return super(APIView, self).dispatch(request, *args, **kwargs)
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/rest_framework/views.py", line 505, in dispatch
    response = self.handle_exception(exc)
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/rest_framework/views.py", line 465, in handle_exception
self.raise_uncaught_exception(exc)
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/rest_framework/views.py", line 476, in raise_uncaught_exception
raise exc
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/rest_framework/views.py", line 502, in dispatch
    response = handler(request, *args, **kwargs)
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/rest_framework/generics.py", line 208, in get
    return self.retrieve(request, *args, **kwargs)
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/rest_framework/mixins.py", line 56, in retrieve
    return Response(serializer.data)
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/rest_framework/serializers.py", line 562, in data
    ret = super().data
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/rest_framework/serializers.py", line 260, in data
    self._data = self.to_representation(self.instance)
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/awx/api/serializers.py", line 3018, in to_representation
    ret = super(JobSerializer, self).to_representation(obj)
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/awx/api/serializers.py", line 802, in to_representation
    ret = super(UnifiedJobSerializer, self).to_representation(obj)
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/awx/api/serializers.py", line 2779, in to_representation
    ret = super(JobOptionsSerializer, self).to_representation(obj)
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/rest_framework/serializers.py", line 529, in to_representation
    ret[field.field_name] = field.to_representation(attribute)
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/rest_framework/fields.py", line 1905, in to_representation
    return method(value)
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/awx/api/serializers.py", line 3074, in get_host_status_counts
    counts = obj.job_events.only('event_data').get(event='playbook_on_stats').get_host_status_counts()
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/django/db/models/query.py", line 412, in get
    (self.model._meta.object_name, num)
awx.main.models.events.JobEvent.MultipleObjectsReturned: get() returned more than one JobEvent -- it returned 2!
2020-06-25 02:40:43,238 ERROR    django.request Internal Server Error: /api/v2/jobs/10730/
Traceback (most recent call last):
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/django/core/handlers/exception.py", line 34, in inner
    response = get_response(request)
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/django/core/handlers/base.py", line 115, in _get_response
    response = self.process_exception_by_middleware(e, request)
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/django/core/handlers/base.py", line 113, in _get_response
    response = wrapped_callback(request, *callback_args, **callback_kwargs)
  File "/usr/lib64/python3.6/contextlib.py", line 52, in inner
return func(*args, **kwds)
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/django/views/decorators/csrf.py", line 54, in wrapped_view
return view_func(*args, **kwargs)
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/django/views/generic/base.py", line 71, in view
return self.dispatch(request, *args, **kwargs)
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/awx/api/generics.py", line 299, in dispatch
return super(APIView, self).dispatch(request, *args, **kwargs)
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/rest_framework/views.py", line 505, in dispatch
    response = self.handle_exception(exc)
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/rest_framework/views.py", line 465, in handle_exception
self.raise_uncaught_exception(exc)
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/rest_framework/views.py", line 476, in raise_uncaught_exception
raise exc
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/rest_framework/views.py", line 502, in dispatch
    response = handler(request, *args, **kwargs)
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/rest_framework/generics.py", line 208, in get
    return self.retrieve(request, *args, **kwargs)
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/rest_framework/mixins.py", line 56, in retrieve
    return Response(serializer.data)
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/rest_framework/serializers.py", line 562, in data
    ret = super().data
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/rest_framework/serializers.py", line 260, in data
    self._data = self.to_representation(self.instance)
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/awx/api/serializers.py", line 3018, in to_representation
    ret = super(JobSerializer, self).to_representation(obj)
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/awx/api/serializers.py", line 802, in to_representation
    ret = super(UnifiedJobSerializer, self).to_representation(obj)
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/awx/api/serializers.py", line 2779, in to_representation
    ret = super(JobOptionsSerializer, self).to_representation(obj)
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/rest_framework/serializers.py", line 529, in to_representation
    ret[field.field_name] = field.to_representation(attribute)
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/rest_framework/fields.py", line 1905, in to_representation
    return method(value)
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/awx/api/serializers.py", line 3074, in get_host_status_counts
    counts = obj.job_events.only('event_data').get(event='playbook_on_stats').get_host_status_counts()
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/django/db/models/query.py", line 412, in get
    (self.model._meta.object_name, num)
awx.main.models.events.JobEvent.MultipleObjectsReturned: get() returned more than one JobEvent -- it returned 2!
[pid: 794|app: 0|req: 366/41029] 192.168.5.1 () {52 vars in 2833 bytes} [Thu Jun 25 02:40:43 2020] GET /api/v2/jobs/10730/ => generated 41 bytes in 226 msecs (HTTP/1.1 500) 6 headers in 208 bytes (1 switches on core 0)
192.168.5.1 - - [25/Jun/2020:02:40:43 +0000] "GET /api/v2/jobs/10730/ HTTP/1.1" 500 41 "https://awx-web-svc-app-ops.origin-dc1-core.domain.org/ " "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.1 Safari/605.1.15" "192.168.9.123"
2020-06-25 02:40:43,300 DEBUG    awx.main.consumers Channel specific.NGGqtyoc!CruwJTPDabWH left groups {'jobs-summary', 'job_events-10730'} and joined {'schedules-changed'}
2020-06-25 02:40:43,300 DEBUG    Channel specific.NGGqtyoc!CruwJTPDabWH left groups {'jobs-summary', 'job_events-10730'} and joined {'schedules-changed'}
[pid: 788|app: 0|req: 266/41030] 192.168.5.1 () {56 vars in 2930 bytes} [Thu Jun 25 02:40:43 2020] OPTIONS /api/v2/jobs/10730/ => generated 11873 bytes in 229 msecs (HTTP/1.1 200) 10 headers in 408 bytes (1 switches on core 0)
192.168.5.1 - - [25/Jun/2020:02:40:43 +0000] "OPTIONS /api/v2/jobs/10730/ HTTP/1.1" 200 11873 "https://awx-web-svc-app-ops.origin-dc1-core.domain.org/ " "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.1 Safari/605.1.15" "192.168.9.122"
[pid: 677|app: 0|req: 944/41031] 192.168.5.1 () {52 vars in 2936 bytes} [Thu Jun 25 02:40:43 2020] GET /api/v2/jobs/10730/job_events/?order_by=-start_line&page=1&page_size=50 => generated 104951 bytes in 265 msecs (HTTP/1.1 200) 11 headers in 424 bytes (1 switches on core 0)
192.168.5.1 - - [25/Jun/2020:02:40:43 +0000] "GET /api/v2/jobs/10730/job_events/?order_by=-start_line&page=1&page_size=50 HTTP/1.1" 200 104951 "https://awx-web-svc-app-ops.origin-dc1-core.domain.org/ " "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.1 Safari/605.1.15" "192.168.9.126"
[pid: 794|app: 0|req: 367/41032] 192.168.5.1 () {56 vars in 2934 bytes} [Thu Jun 25 02:40:43 2020] OPTIONS /api/v2/unified_jobs/ => generated 10718 bytes in 130 msecs (HTTP/1.1 200) 10 headers in 400 bytes (1 switches on core 0)
192.168.5.1 - - [25/Jun/2020:02:40:43 +0000] "OPTIONS /api/v2/unified_jobs/ HTTP/1.1" 200 10718 "https://awx-web-svc-app-ops.origin-dc1-core.domain.org/ " "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.1 Safari/605.1.15" "192.168.9.123"
[pid: 788|app: 0|req: 267/41033] 192.168.5.1 () {54 vars in 2989 bytes} [Thu Jun 25 02:40:43 2020] GET /api/v2/unified_jobs/?page_size=20&order_by=-finished&not__launch_type=sync&page=4 => generated 53350 bytes in 737 msecs (HTTP/1.1 200) 8 headers in 235 bytes (1 switches on core 0)
192.168.5.1 - - [25/Jun/2020:02:40:44 +0000] "GET /api/v2/unified_jobs/?page_size=20&order_by=-finished&not__launch_type=sync&page=4 HTTP/1.1" 200 53350 "https://awx-web-svc-app-ops.origin-dc1-core.domain.org/ " "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.1 Safari/605.1.15" "192.168.9.122"
192.168.5.1 - - [25/Jun/2020:02:40:44 +0000] "OPTIONS /api/v2/unified_jobs/ HTTP/1.1" 200 10718 "https://awx-web-svc-app-ops.origin-dc1-core.domain.org/ " "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.1 Safari/605.1.15" "192.168.9.123"
[pid: 779|app: 0|req: 626/41034] 192.168.5.1 () {60 vars in 3034 bytes} [Thu Jun 25 02:40:44 2020] OPTIONS /api/v2/unified_jobs/ => generated 10718 bytes in 112 msecs (HTTP/1.1 200) 8 headers in 235 bytes (1 switches on core 0)
[pid: 794|app: 0|req: 368/41035] 192.168.5.1 () {42 vars in 607 bytes} [Thu Jun 25 02:40:44 2020] GET / => generated 11459 bytes in 53 msecs (HTTP/1.1 200) 5 headers in 169 bytes (1 switches on core 0)
192.168.5.1 - - [25/Jun/2020:02:40:44 +0000] "GET / HTTP/1.1" 200 11459 "-" "-" "192.168.2.9, 192.168.3.14"
[pid: 794|app: 0|req: 369/41036] 192.168.7.1 () {42 vars in 607 bytes} [Thu Jun 25 02:40:47 2020] GET / => generated 11459 bytes in 53 msecs (HTTP/1.1 200) 5 headers in 169 bytes (1 switches on core 0)
192.168.7.1 - - [25/Jun/2020:02:40:47 +0000] "GET / HTTP/1.1" 200 11459 "-" "-" "192.168.2.8, 192.168.3.10"
[pid: 794|app: 0|req: 370/41037] 192.168.7.1 () {42 vars in 607 bytes} [Thu Jun 25 02:40:49 2020] GET / => generated 11459 bytes in 54 msecs (HTTP/1.1 200) 5 headers in 169 bytes (1 switches on core 0)
192.168.7.1 - - [25/Jun/2020:02:40:49 +0000] "GET / HTTP/1.1" 200 11459 "-" "-" "192.168.2.9, 192.168.3.14"
[pid: 779|app: 0|req: 627/41038] 192.168.1.1 () {42 vars in 607 bytes} [Thu Jun 25 02:40:52 2020] GET / => generated 11459 bytes in 60 msecs (HTTP/1.1 200) 5 headers in 169 bytes (1 switches on core 0)
192.168.1.1 - - [25/Jun/2020:02:40:52 +0000] "GET / HTTP/1.1" 200 11459 "-" "-" "192.168.2.8, 192.168.3.13"
[pid: 794|app: 0|req: 371/41039] 192.168.7.1 () {42 vars in 607 bytes} [Thu Jun 25 02:40:54 2020] GET / => generated 11459 bytes in 52 msecs (HTTP/1.1 200) 5 headers in 169 bytes (1 switches on core 0)
192.168.7.1 - - [25/Jun/2020:02:40:54 +0000] "GET / HTTP/1.1" 200 11459 "-" "-" "192.168.2.9, 192.168.3.11"
[pid: 794|app: 0|req: 372/41040] 192.168.8.1 () {42 vars in 607 bytes} [Thu Jun 25 02:40:57 2020] GET / => generated 11459 bytes in 58 msecs (HTTP/1.1 200) 5 headers in 169 bytes (1 switches on core 0)
192.168.8.1 - - [25/Jun/2020:02:40:57 +0000] "GET / HTTP/1.1" 200 11459 "-" "-" "192.168.2.8, 192.168.3.10"
[pid: 779|app: 0|req: 628/41041] 192.168.7.1 () {42 vars in 607 bytes} [Thu Jun 25 02:40:59 2020] GET / => generated 11459 bytes in 55 msecs (HTTP/1.1 200) 5 headers in 169 bytes (1 switches on core 0)
192.168.7.1 - - [25/Jun/2020:02:40:59 +0000] "GET / HTTP/1.1" 200 11459 "-" "-" "192.168.2.9, 192.168.3.10"
RESULT 2
OKREADY
ryanpetrello commented 4 years ago

This looks like a duplicate of https://github.com/ansible/awx/issues/6982 to me.

So we've seen this before, but I'm not really sure how folks have managed to get data into this state: https://github.com/ansible/awx/issues/6982#issuecomment-629380382

ryanpetrello commented 4 years ago

@mjpalacios if you run this query, can you share the output?

select created, modified, uuid, parent_uuid playbook from main_jobevent where job_id=10730 and event='playbook_on_stats';
mjpalacios commented 4 years ago

@ryanpetrello , please find below the query and its output

awx=> select created, modified, uuid, parent_uuid playbook from main_jobevent where job_id=10730 and event='playbook_on_stats';
            created            |           modified            |                 uuid                 |               playbook               
-------------------------------+-------------------------------+--------------------------------------+--------------------------------------
 2020-06-24 14:15:48.682015-05 | 2020-06-24 14:15:49.574287-05 | 0830edea-81a5-465d-ae08-eab8c33495ed | 87f97b99-543b-450f-81f8-ee4aeaaf9ff5
 2020-06-24 14:15:40.935085-05 | 2020-06-24 14:15:41.717927-05 | 4fb0079a-d088-49c3-afe6-ca8bf5a2eece | f76da9c5-3392-48c1-9b0e-5f3fb5f18715
ryanpetrello commented 4 years ago

@mjpalacios can you put a , between parent_uuid and playbook - sorry about that.

select created, modified, uuid, parent_uuid, playbook from main_jobevent where job_id=10730 and event='playbook_on_stats';

mjpalacios commented 4 years ago

@ryanpetrello no problem, see below:

awx=> select created, modified, uuid, parent_uuid, playbook from main_jobevent where job_id=10730 and event='playbook_on_stats';
            created            |           modified            |                 uuid                 |             parent_uuid              |    playbook     
-------------------------------+-------------------------------+--------------------------------------+--------------------------------------+-----------------
 2020-06-24 14:15:48.682015-05 | 2020-06-24 14:15:49.574287-05 | 0830edea-81a5-465d-ae08-eab8c33495ed | 87f97b99-543b-450f-81f8-ee4aeaaf9ff5 | deploy_main.yml
 2020-06-24 14:15:40.935085-05 | 2020-06-24 14:15:41.717927-05 | 4fb0079a-d088-49c3-afe6-ca8bf5a2eece | f76da9c5-3392-48c1-9b0e-5f3fb5f18715 | deploy_main.yml
(2 rows)
ryanpetrello commented 4 years ago

@mjpalacios I'd really like to try to reproduce something like this - is there any way you could share your playbook (or some sanitized, simpler version of it)?

Also, does this happen for every run of this specific job template, or is this the only job you know of that encountered this problem?

ryanpetrello commented 4 years ago

Also, could you run and share (sanitized) this adhoc shell command run under the same (or similar) inventory where you ran for this job?

ansible-config dump

ryanpetrello commented 4 years ago

You know, I wonder if the way you have your RabbitMQ HA setup, if sometimes playbooks are running twice.

In other words, you launch a job, and the same ansible-playbook ends up running on two distinct nodes...

Do you have any evidence of other duplicate events for this one job?

ryanpetrello commented 4 years ago

@mjpalacios does this job have any events that represent fact gathering? If so, to confirm this theory, we could see if there are any duplicate fact gathers that have distinct IPs - that would suggest that your one job ran in duplicate on multiple hosts in your cluster.

mjpalacios commented 4 years ago

@mjpalacios I'd really like to try to reproduce something like this - is there any way you could share your playbook (or some sanitized, simpler version of it)?

Not this playbook in particular, but I can share another one which creates a .kubeconfig file and exhibits the same behavior (job_id=10719)

Also, does this happen for every run of this specific job template, or is this the only job you know of that encountered this problem?

This happened for every run of every job template that was launched while both AWX instances were running. The anomalous behavior stopped as soon as I stopped one of the instances (the one that was disabled at the GTM)

Also, could you run and share (sanitized) this adhoc shell command run under the same (or similar) inventory where you ran for this job? ansible-config dump

Here the output of the command in the context of the 'awx-web' container

ACTION_WARNINGS(default) = True
AGNOSTIC_BECOME_PROMPT(default) = True
ALLOW_WORLD_READABLE_TMPFILES(default) = False
ANSIBLE_CONNECTION_PATH(default) = None
ANSIBLE_COW_PATH(default) = None
ANSIBLE_COW_SELECTION(default) = default
ANSIBLE_COW_WHITELIST(default) = ['bud-frogs', 'bunny', 'cheese', 'daemon', 'default', 'dragon', 'elephant-in-snake', 'elephant', 'eyes', 'hellokitty', 'kitty', 'luke-koala', 'meow', 'milk', 'moofasa', 'moose', 'ren', 'sheep', 'small', 'stegosaurus', 'stimpy', 'supermilker', 'three-eyes', 'turkey', 'turtle', 'tux', 'udder', 'vader-koala', 'vader', 'www']
ANSIBLE_FORCE_COLOR(default) = False
ANSIBLE_NOCOLOR(default) = False
ANSIBLE_NOCOWS(default) = False
ANSIBLE_PIPELINING(default) = False
ANSIBLE_SSH_ARGS(default) = -C -o ControlMaster=auto -o ControlPersist=60s
ANSIBLE_SSH_CONTROL_PATH(default) = None
ANSIBLE_SSH_CONTROL_PATH_DIR(default) = ~/.ansible/cp
ANSIBLE_SSH_EXECUTABLE(default) = ssh
ANSIBLE_SSH_RETRIES(default) = 0
ANY_ERRORS_FATAL(default) = False
BECOME_ALLOW_SAME_USER(default) = False
BECOME_PLUGIN_PATH(default) = ['/home/awx/.ansible/plugins/become', '/usr/share/ansible/plugins/become']
CACHE_PLUGIN(default) = memory
CACHE_PLUGIN_CONNECTION(default) = None
CACHE_PLUGIN_PREFIX(default) = ansible_facts
CACHE_PLUGIN_TIMEOUT(default) = 86400
COLLECTIONS_PATHS(default) = ['/home/awx/.ansible/collections', '/usr/share/ansible/collections']
COLOR_CHANGED(default) = yellow
COLOR_CONSOLE_PROMPT(default) = white
COLOR_DEBUG(default) = dark gray
COLOR_DEPRECATE(default) = purple
COLOR_DIFF_ADD(default) = green
COLOR_DIFF_LINES(default) = cyan
COLOR_DIFF_REMOVE(default) = red
COLOR_ERROR(default) = red
COLOR_HIGHLIGHT(default) = white
COLOR_OK(default) = green
COLOR_SKIP(default) = cyan
COLOR_UNREACHABLE(default) = bright red
COLOR_VERBOSE(default) = blue
COLOR_WARN(default) = bright purple
COMMAND_WARNINGS(default) = True
CONDITIONAL_BARE_VARS(default) = True
CONNECTION_FACTS_MODULES(default) = {'eos': 'eos_facts', 'frr': 'frr_facts', 'ios': 'ios_facts', 'iosxr': 'iosxr_facts', 'junos': 'junos_facts', 'nxos': 'nxos_facts', 'vyos': 'vyos_facts'}
COVERAGE_REMOTE_OUTPUT(default) = None
COVERAGE_REMOTE_WHITELIST(default) = *
DEFAULT_ACTION_PLUGIN_PATH(default) = ['/home/awx/.ansible/plugins/action', '/usr/share/ansible/plugins/action']
DEFAULT_ALLOW_UNSAFE_LOOKUPS(default) = False
DEFAULT_ASK_PASS(default) = False
DEFAULT_ASK_VAULT_PASS(default) = False
DEFAULT_BECOME(default) = False
DEFAULT_BECOME_ASK_PASS(default) = False
DEFAULT_BECOME_EXE(default) = None
DEFAULT_BECOME_FLAGS(default) = 
DEFAULT_BECOME_METHOD(default) = sudo
DEFAULT_BECOME_USER(default) = root
DEFAULT_CACHE_PLUGIN_PATH(default) = ['/home/awx/.ansible/plugins/cache', '/usr/share/ansible/plugins/cache']
DEFAULT_CALLABLE_WHITELIST(default) = []
DEFAULT_CALLBACK_PLUGIN_PATH(default) = ['/home/awx/.ansible/plugins/callback', '/usr/share/ansible/plugins/callback']
DEFAULT_CALLBACK_WHITELIST(default) = []
DEFAULT_CLICONF_PLUGIN_PATH(default) = ['/home/awx/.ansible/plugins/cliconf', '/usr/share/ansible/plugins/cliconf']
DEFAULT_CONNECTION_PLUGIN_PATH(default) = ['/home/awx/.ansible/plugins/connection', '/usr/share/ansible/plugins/connection']
DEFAULT_DEBUG(default) = False
DEFAULT_EXECUTABLE(default) = /bin/sh
DEFAULT_FACT_PATH(default) = None
DEFAULT_FILTER_PLUGIN_PATH(default) = ['/home/awx/.ansible/plugins/filter', '/usr/share/ansible/plugins/filter']
DEFAULT_FORCE_HANDLERS(default) = False
DEFAULT_FORKS(default) = 5
DEFAULT_GATHERING(default) = implicit
DEFAULT_GATHER_SUBSET(default) = ['all']
DEFAULT_GATHER_TIMEOUT(default) = 10
DEFAULT_HANDLER_INCLUDES_STATIC(default) = False
DEFAULT_HASH_BEHAVIOUR(default) = replace
DEFAULT_HOST_LIST(default) = ['/etc/ansible/hosts']
DEFAULT_HTTPAPI_PLUGIN_PATH(default) = ['/home/awx/.ansible/plugins/httpapi', '/usr/share/ansible/plugins/httpapi']
DEFAULT_INTERNAL_POLL_INTERVAL(default) = 0.001
DEFAULT_INVENTORY_PLUGIN_PATH(default) = ['/home/awx/.ansible/plugins/inventory', '/usr/share/ansible/plugins/inventory']
DEFAULT_JINJA2_EXTENSIONS(default) = []
DEFAULT_JINJA2_NATIVE(default) = False
DEFAULT_KEEP_REMOTE_FILES(default) = False
DEFAULT_LIBVIRT_LXC_NOSECLABEL(default) = False
DEFAULT_LOAD_CALLBACK_PLUGINS(default) = False
DEFAULT_LOCAL_TMP(default) = /home/awx/.ansible/tmp/ansible-local-1227ghoww7kz
DEFAULT_LOG_FILTER(default) = []
DEFAULT_LOG_PATH(default) = None
DEFAULT_LOOKUP_PLUGIN_PATH(default) = ['/home/awx/.ansible/plugins/lookup', '/usr/share/ansible/plugins/lookup']
DEFAULT_MANAGED_STR(default) = Ansible managed
DEFAULT_MODULE_ARGS(default) = 
DEFAULT_MODULE_COMPRESSION(default) = ZIP_DEFLATED
DEFAULT_MODULE_NAME(default) = command
DEFAULT_MODULE_PATH(default) = ['/home/awx/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
DEFAULT_MODULE_UTILS_PATH(default) = ['/home/awx/.ansible/plugins/module_utils', '/usr/share/ansible/plugins/module_utils']
DEFAULT_NETCONF_PLUGIN_PATH(default) = ['/home/awx/.ansible/plugins/netconf', '/usr/share/ansible/plugins/netconf']
DEFAULT_NO_LOG(default) = False
DEFAULT_NO_TARGET_SYSLOG(default) = False
DEFAULT_NULL_REPRESENTATION(default) = None
DEFAULT_POLL_INTERVAL(default) = 15
DEFAULT_PRIVATE_KEY_FILE(default) = None
DEFAULT_PRIVATE_ROLE_VARS(default) = False
DEFAULT_REMOTE_PORT(default) = None
DEFAULT_REMOTE_USER(default) = None
DEFAULT_ROLES_PATH(default) = ['/home/awx/.ansible/roles', '/usr/share/ansible/roles', '/etc/ansible/roles']
DEFAULT_SCP_IF_SSH(default) = smart
DEFAULT_SELINUX_SPECIAL_FS(default) = ['fuse', 'nfs', 'vboxsf', 'ramfs', '9p', 'vfat']
DEFAULT_SFTP_BATCH_MODE(default) = True
DEFAULT_SQUASH_ACTIONS(default) = ['apk', 'apt', 'dnf', 'homebrew', 'openbsd_pkg', 'pacman', 'pip', 'pkgng', 'yum', 'zypper']
DEFAULT_SSH_TRANSFER_METHOD(default) = None
DEFAULT_STDOUT_CALLBACK(default) = default
DEFAULT_STRATEGY(default) = linear
DEFAULT_STRATEGY_PLUGIN_PATH(default) = ['/home/awx/.ansible/plugins/strategy', '/usr/share/ansible/plugins/strategy']
DEFAULT_SU(default) = False
DEFAULT_SYSLOG_FACILITY(default) = LOG_USER
DEFAULT_TASK_INCLUDES_STATIC(default) = False
DEFAULT_TERMINAL_PLUGIN_PATH(default) = ['/home/awx/.ansible/plugins/terminal', '/usr/share/ansible/plugins/terminal']
DEFAULT_TEST_PLUGIN_PATH(default) = ['/home/awx/.ansible/plugins/test', '/usr/share/ansible/plugins/test']
DEFAULT_TIMEOUT(default) = 10
DEFAULT_TRANSPORT(default) = smart
DEFAULT_UNDEFINED_VAR_BEHAVIOR(default) = True
DEFAULT_VARS_PLUGIN_PATH(default) = ['/home/awx/.ansible/plugins/vars', '/usr/share/ansible/plugins/vars']
DEFAULT_VAULT_ENCRYPT_IDENTITY(default) = None
DEFAULT_VAULT_IDENTITY(default) = default
DEFAULT_VAULT_IDENTITY_LIST(default) = []
DEFAULT_VAULT_ID_MATCH(default) = False
DEFAULT_VAULT_PASSWORD_FILE(default) = None
DEFAULT_VERBOSITY(default) = 0
DEPRECATION_WARNINGS(default) = True
DIFF_ALWAYS(default) = False
DIFF_CONTEXT(default) = 3
DISPLAY_ARGS_TO_STDOUT(default) = False
DISPLAY_SKIPPED_HOSTS(default) = True
DOCSITE_ROOT_URL(default) = https://docs.ansible.com/ansible/
DOC_FRAGMENT_PLUGIN_PATH(default) = ['/home/awx/.ansible/plugins/doc_fragments', '/usr/share/ansible/plugins/doc_fragments']
DUPLICATE_YAML_DICT_KEY(default) = warn
ENABLE_TASK_DEBUGGER(default) = False
ERROR_ON_MISSING_HANDLER(default) = True
FACTS_MODULES(default) = ['smart']
GALAXY_IGNORE_CERTS(default) = False
GALAXY_ROLE_SKELETON(default) = None
GALAXY_ROLE_SKELETON_IGNORE(default) = ['^.git$', '^.*/.git_keep$']
GALAXY_SERVER(default) = https://galaxy.ansible.com
GALAXY_SERVER_LIST(default) = None
GALAXY_TOKEN(default) = None
GALAXY_TOKEN_PATH(default) = /home/awx/.ansible/galaxy_token
HOST_KEY_CHECKING(default) = True
HOST_PATTERN_MISMATCH(default) = warning
INJECT_FACTS_AS_VARS(default) = True
INTERPRETER_PYTHON(default) = auto_legacy
INTERPRETER_PYTHON_DISTRO_MAP(default) = {'centos': {'6': '/usr/bin/python', '8': '/usr/libexec/platform-python'}, 'fedora': {'23': '/usr/bin/python3'}, 'redhat': {'6': '/usr/bin/python', '8': '/usr/libexec/platform-python'}, 'rhel': {'6': '/usr/bin/python', '8': '/usr/libexec/platform-python'}, 'ubuntu': {'14': '/usr/bin/python', '16': '/usr/bin/python3'}}
INTERPRETER_PYTHON_FALLBACK(default) = ['/usr/bin/python', 'python3.7', 'python3.6', 'python3.5', 'python2.7', 'python2.6', '/usr/libexec/platform-python', '/usr/bin/python3', 'python']
INVALID_TASK_ATTRIBUTE_FAILED(default) = True
INVENTORY_ANY_UNPARSED_IS_FAILED(default) = False
INVENTORY_CACHE_ENABLED(default) = False
INVENTORY_CACHE_PLUGIN(default) = None
INVENTORY_CACHE_PLUGIN_CONNECTION(default) = None
INVENTORY_CACHE_PLUGIN_PREFIX(default) = ansible_facts
INVENTORY_CACHE_TIMEOUT(default) = 3600
INVENTORY_ENABLED(default) = ['host_list', 'script', 'auto', 'yaml', 'ini', 'toml']
INVENTORY_EXPORT(default) = False
INVENTORY_IGNORE_EXTS(default) = {{(BLACKLIST_EXTS + ( '.orig', '.ini', '.cfg', '.retry'))}}
INVENTORY_IGNORE_PATTERNS(default) = []
INVENTORY_UNPARSED_IS_FAILED(default) = False
LOCALHOST_WARNING(default) = True
MAX_FILE_SIZE_FOR_DIFF(default) = 104448
NETCONF_SSH_CONFIG(default) = None
NETWORK_GROUP_MODULES(default) = ['eos', 'nxos', 'ios', 'iosxr', 'junos', 'enos', 'ce', 'vyos', 'sros', 'dellos9', 'dellos10', 'dellos6', 'asa', 'aruba', 'aireos', 'bigip', 'ironware', 'onyx', 'netconf']
OLD_PLUGIN_CACHE_CLEARING(default) = False
PARAMIKO_HOST_KEY_AUTO_ADD(default) = False
PARAMIKO_LOOK_FOR_KEYS(default) = True
PERSISTENT_COMMAND_TIMEOUT(default) = 30
PERSISTENT_CONNECT_RETRY_TIMEOUT(default) = 15
PERSISTENT_CONNECT_TIMEOUT(default) = 30
PERSISTENT_CONTROL_PATH_DIR(default) = /home/awx/.ansible/pc
PLAYBOOK_DIR(default) = None
PLAYBOOK_VARS_ROOT(default) = top
PLUGIN_FILTERS_CFG(default) = None
PYTHON_MODULE_RLIMIT_NOFILE(default) = 0
RETRY_FILES_ENABLED(default) = False
RETRY_FILES_SAVE_PATH(default) = None
SHOW_CUSTOM_STATS(default) = False
STRING_CONVERSION_ACTION(default) = warn
STRING_TYPE_FILTERS(default) = ['string', 'to_json', 'to_nice_json', 'to_yaml', 'ppretty', 'json']
SYSTEM_WARNINGS(default) = True
TAGS_RUN(default) = []
TAGS_SKIP(default) = []
TASK_DEBUGGER_IGNORE_ERRORS(default) = True
TRANSFORM_INVALID_GROUP_CHARS(default) = never
USE_PERSISTENT_CONNECTIONS(default) = False
VARIABLE_PRECEDENCE(default) = ['all_inventory', 'groups_inventory', 'all_plugins_inventory', 'all_plugins_play', 'groups_plugins_inventory', 'groups_plugins_play']
VERBOSE_TO_STDERR(default) = False
YAML_FILENAME_EXTENSIONS(default) = ['.yml', '.yaml', '.json']

You know, I wonder if the way you have your RabbitMQ HA setup, if sometimes playbooks are running twice.

I didn’t mess with RabbitMQ at all, and the AWX version I have (9.3.0) seems to be using Redis instead of RabbitMQ, as I see an 'awx-redis' container but not an 'awx-rabbitmq' one.

In other words, you launch a job, and the same ansible-playbook ends up running on two distinct nodes... Do you have any evidence of other duplicate events for this one job?

Same answer as above, i.e. all the job templates that were submitted while both AWX instances were running were impacted. Now, specific to your question, yes, here’s evidence of duplicates for other jobs resulting from launching the same template:

Job #10709

awx=> select created, modified, uuid, parent_uuid, playbook from main_jobevent where job_id=10709 and event='playbook_on_stats';
created            |           modified            |                 uuid                 |             parent_uuid              |    playbook     
-------------------------------+-------------------------------+--------------------------------------+--------------------------------------+-----------------
2020-06-24 13:15:45.685405-05 | 2020-06-24 13:15:45.990309-05 | 0f1699fc-21a5-4a60-bef2-071a22378210 | 5715f4d1-0c31-4c88-b111-f2ac61f242b5 | deploy_main.yml
2020-06-24 13:15:44.150874-05 | 2020-06-24 13:15:44.883841-05 | e0fb8810-8397-4c68-9e3c-744c3f7cd361 | b68b122a-48b1-4927-b557-6fc908c690cc | deploy_main.yml
(2 rows)

Job #10733

created            |           modified            |                 uuid                 |             parent_uuid              |    playbook     
-------------------------------+-------------------------------+--------------------------------------+--------------------------------------+-----------------
2020-06-24 14:16:33.882129-05 | 2020-06-24 14:16:34.346917-05 | a9d1c0e3-31f4-4ed0-ba58-f96851760bbb | 8af7af44-2d43-49d4-8788-37e4402169e0 | deploy_main.yml
2020-06-24 14:16:32.485674-05 | 2020-06-24 14:16:33.023548-05 | d8d4c7b2-4f84-427f-a0d8-1635277f3fd8 | 14a68061-294e-41f4-9a79-306fa34e190d | deploy_main.yml
(2 rows)

@mjpalacios does this job have any events that represent fact gathering? If so, to confirm this theory, we could see if there are any duplicate fact gathers that have distinct IPs - that would suggest that your one job ran in duplicate on multiple hosts in your cluster.

How do I identify those?

mjpalacios commented 4 years ago

@ryanpetrello Please find below the files comprising the kubeconfig creation playbook (sanitized): This is the playbook corresponding to job 10719 exhibiting the same anomalous behavior as 10709, 10730, 10733

pb-create-kubeconfig.yml

---
- hosts: localhost
  tasks:
  - name: Bail out when required playbook variables are missing
    assert:
      that:
      - (pdc_token is defined and pdc_token)
      - (bdc_token is defined and bdc_token)
      - (svc_acct is defined and svc_acct)
  - name: Include role for creating kubeconfig
    include_role:
      name: create-kubeconfig

roles/create-kubeconfig/tasks/main.yml

---
- name: Instantiate kubeconfig template
  template:
    src: kubeconfig.yml.j2
    dest: "{{target_directory}}/{{kubeconfig_file}}"

roles/create-kubeconfig/templates/kubeconfig.yml.j2

---
apiVersion: v1
clusters:
- cluster:
    server: https://{{pdc_server}}
  name: {{pdc_cluster}}
- cluster:
    server: https://{{bdc_server}}
  name: {{bdc_cluster}}
contexts:
- context:
    cluster: {{pdc_cluster}}
    user: {{svc_acct}}/{{pdc_server}}
  name: {{pdc_cluster}}
- context:
    cluster: {{bdc_cluster}}
    user: {{svc_acct}}/{{bdc_server}}
  name: {{bdc_cluster}}
kind: Config
preferences: {}
users:
- name: {{svc_acct}}/{{pdc_server}}
  user:
    token: {{pdc_token}}
- name: {{svc_acct}}/{{bdc_server}}
  user:
    token: {{bdc_token}}

roles/create-kubeconfig/vars/main.yml

---
target_directory: /tmp
kubeconfig_file: kubeconfig.yml

pdc_server: pdc-app.domain.org:6443
pdc_cluster: app-pdc

bdc_server:  bdc-app.domain.org:6443
bdc_cluster: app-bdc
awx=> select created, modified, uuid, parent_uuid, playbook from main_jobevent where job_id=10719 and event='playbook_on_stats';
            created            |           modified            |                 uuid                 |             parent_uuid              |         playbook         
-------------------------------+-------------------------------+--------------------------------------+--------------------------------------+--------------------------
 2020-06-24 13:48:13.54507-05  | 2020-06-24 13:48:13.765564-05 | 047d2cb8-cdf8-4e37-862f-99b4b0e8c095 | aed09452-c40a-44e6-aebd-b49c5e528b82 | pb-create-kubeconfig.yml
 2020-06-24 13:48:13.534681-05 | 2020-06-24 13:48:13.734054-05 | c16bab3d-c7e5-497b-a822-c9758097ece0 | b9087855-1668-4022-bf5b-ced5c4ac3cc1 | pb-create-kubeconfig.yml
(2 rows)
ryanpetrello commented 4 years ago

@mjpalacios,

I didn’t mess with RabbitMQ at all, and the AWX version I have (9.3.0) seems to be using Redis instead of RabbitMQ, as I see an 'awx-redis' container but not an 'awx-rabbitmq' one.

Something has gone terribly, terribly wrong. RabbitMQ wasn't removed until AWX 10.0.0: https://github.com/ansible/awx/blob/devel/CHANGELOG.md#1000-mar-30-2020

mjpalacios commented 4 years ago

Not sure how to proceed from here. Any suggestions?

ryanpetrello commented 4 years ago

Are you certain you're actually deploying 9.3.0? It doesn't sound like it to me if you're not seeing RabbitMQ anywhere. This makes me wonder if you've got some sort of disparity between your image and the AWX source code you're deploying with.

If awx-manage --version reports 9.3.0, but you're not seeing RabbitMQ containers, then you've got something messed up in your deployment, and it's possible it's causing whatever weirdness you're seeing here.

mjpalacios commented 4 years ago

@ryanpetrello, found that the code version is 10.0.0 but version label was not updated and still had 9.3.0. I'm going to download the latest version and work the way up from dev to prod. This is going to take some time.

ryanpetrello commented 4 years ago

Could be related to https://github.com/ansible/awx/issues/7253?

mjpalacios commented 4 years ago

I have built docker images and deployed AWX 13.0.0 into a non-production OpenShift environment. Will deploy into a production environment (single instance) later this week. Will update this issue after that.

ryanpetrello commented 4 years ago

@mjpalacios I'm still not totally sure what's up here, but I've seen a few similar reports and would like to track all of this one one issue to avoid duplicates:

https://github.com/ansible/awx/issues/7253