geerlingguy closed this issue 4 years ago
Follow-up to #1.
Just switching the tags like so:
tower_task_image: registry.access.redhat.com/ansible-tower-35/ansible-tower:3.5.3
tower_web_image: registry.access.redhat.com/ansible-tower-35/ansible-tower:3.5.3
Results in the following errors in the logs:
2019-11-11 19:17:17,444 WARNING awx.conf.settings Database settings are not available, using defaults, error:
Traceback (most recent call last):
File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/django/core/handlers/wsgi.py", line 157, in __call__
response = self.get_response(request)
File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/django/core/handlers/base.py", line 131, in get_response
response = middleware_method(request, response)
File "/middleware.py", line 54, in process_response
File "/opt/rh/rh-python36/root/usr/lib64/python3.6/logging/__init__.py", line 1306, in info
self._log(INFO, msg, args, **kwargs)
File "/opt/rh/rh-python36/root/usr/lib64/python3.6/logging/__init__.py", line 1442, in _log
self.handle(record)
File "/opt/rh/rh-python36/root/usr/lib64/python3.6/logging/__init__.py", line 1452, in handle
self.callHandlers(record)
File "/opt/rh/rh-python36/root/usr/lib64/python3.6/logging/__init__.py", line 1514, in callHandlers
hdlr.handle(record)
File "/opt/rh/rh-python36/root/usr/lib64/python3.6/logging/__init__.py", line 859, in handle
rv = self.filter(record)
File "/opt/rh/rh-python36/root/usr/lib64/python3.6/logging/__init__.py", line 718, in filter
result = f.filter(record)
File "/filters.py", line 91, in filter
File "/filters.py", line 38, in __get__
File "/settings.py", line 543, in __getattr_without_cache__
File "/settings.py", line 447, in __getattr__
File "/settings.py", line 390, in _get_local
File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/django/db/models/query.py", line 567, in first
objects = list((self if self.ordered else self.order_by('pk'))[:1])
File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/django/db/models/query.py", line 250, in __iter__
self._fetch_all()
File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/django/db/models/query.py", line 1121, in _fetch_all
self._result_cache = list(self._iterable_class(self))
File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/django/db/models/query.py", line 53, in __iter__
results = compiler.execute_sql(chunked_fetch=self.chunked_fetch)
File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/django/db/models/sql/compiler.py", line 899, in execute_sql
raise original_exception
File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/django/db/models/sql/compiler.py", line 889, in execute_sql
cursor.execute(sql, params)
File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/django/db/backends/utils.py", line 64, in execute
return self.cursor.execute(sql, params)
File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/django/db/utils.py", line 94, in __exit__
six.reraise(dj_exc_type, dj_exc_value, traceback)
File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/django/utils/six.py", line 685, in reraise
raise value.with_traceback(tb)
File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/django/db/backends/utils.py", line 64, in execute
return self.cursor.execute(sql, params)
ProgrammingError: relation "conf_setting" does not exist
LINE 1: ...f_setting"."value", "conf_setting"."user_id" FROM "conf_sett...
^
So it seems something is different in the config between Tower 3.5 and AWX 9.x?
It looks like the database initialization is not done automatically for Tower, only for AWX. So I had to:
$ kubectl exec -it -n example-tower example-tower-tower-6858559bcd-crc75 bash
bash$ awx-manage migrate --noinput
I'll have to add something to the operator that checks if this is a fresh install, and runs the migration if it needs to be run.
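A sketch of what that operator task pair might look like (kubernetes_namespace and tower_pod_name are assumed variables; awx-manage showmigrations marks unapplied migrations with `[ ]`, so migrating only when that marker appears keeps the task from re-running on every reconcile):

```yaml
# Sketch only: run migrations when any are unapplied, instead of on
# every reconcile. kubernetes_namespace and tower_pod_name are assumed
# to be set earlier in the play.
- name: Check for unapplied database migrations.
  command: >-
    kubectl -n {{ kubernetes_namespace }} exec {{ tower_pod_name }} --
    bash -c "awx-manage showmigrations"
  register: migration_status
  changed_when: false

- name: Run database migrations on a fresh install.
  command: >-
    kubectl -n {{ kubernetes_namespace }} exec {{ tower_pod_name }} --
    bash -c "awx-manage migrate --noinput"
  when: "'[ ]' in migration_status.stdout"
```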
After that, it looks like the tower_admin_user and tower_admin_password variables weren't consumed the way they are when installing AWX... so I need to figure out what the credentials actually are. The install guide hints at them being admin and password, but that didn't work either.
RE: the above two comments: for Tower, the OpenShift setup playbook contains the following tasks. (AWX appears to do all of this automatically on first setup, as long as your env vars and config are correct, so I'm not sure why it's not the same for Tower.)
- name: Migrate database
shell: |
{{ kubectl_or_oc }} -n {{ kubernetes_namespace }} exec ansible-tower-management -- \
bash -c "awx-manage migrate --noinput"
- name: Check for Tower Super users
shell: |
{{ kubectl_or_oc }} -n {{ kubernetes_namespace }} exec ansible-tower-management -- \
bash -c "echo 'from django.contrib.auth.models import User; nsu = User.objects.filter(is_superuser=True).count(); exit(0 if nsu > 0 else 1)' | awx-manage shell"
register: super_check
ignore_errors: yes
changed_when: super_check.rc > 0
- name: create django super user if it does not exist
shell: |
{{ kubectl_or_oc }} -n {{ kubernetes_namespace }} exec ansible-tower-management -- \
bash -c "echo \"from django.contrib.auth.models import User; User.objects.create_superuser('{{ admin_user }}', '{{ admin_email }}', '{{ admin_password }}')\" | awx-manage shell"
no_log: yes
when: super_check.rc > 0
- name: update django super user password
shell: |
{{ kubectl_or_oc }} -n {{ kubernetes_namespace }} exec ansible-tower-management -- \
bash -c "awx-manage update_password --username='{{ admin_user }}' --password='{{ admin_password }}'"
no_log: yes
register: result
changed_when: "'Password updated' in result.stdout"
- name: Create the default organization if it is needed.
shell: |
{{ kubectl_or_oc }} -n {{ kubernetes_namespace }} exec ansible-tower-management -- \
bash -c "awx-manage create_preload_data"
register: cdo
changed_when: "'added' in cdo.stdout"
when: create_preload_data | bool
After asking some Ansible devs about this, I found out that the automatic setup is part of a convenience script baked into the AWX Docker image.
For OpenShift/Kubernetes installs, it looks like the command used for the task container is /usr/bin/launch_awx_task.sh, and the default Dockerfile CMD is also set to it (CMD /usr/bin/launch_awx_task.sh).
So... I guess I'll just have to detect if we're installing Tower or AWX, and from that decide whether to do the extra steps.
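One way that detection could be sketched in the operator playbook (the tower_task_image variable is the one set in the CR above; is_tower and tower_setup.yml are hypothetical names for illustration):

```yaml
# Sketch: decide whether the extra Tower-only setup steps are needed,
# based on the configured task image. is_tower and tower_setup.yml are
# hypothetical names.
- name: Determine whether this is a Tower (vs. AWX) install.
  set_fact:
    is_tower: "{{ 'ansible-tower' in tower_task_image }}"

- name: Include the extra Tower-only setup tasks.
  include_tasks: tower_setup.yml
  when: is_tower | bool
```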
Looks like there is no user account (I used psql to connect inside the Tower container):
awx=# select * from auth_user
awx-# ;
id | password | last_login | is_superuser | username | first_name | last_name | email | is_staff | is_active | date_joined
----+----------+------------+--------------+----------+------------+-----------+-------+----------+-----------+-------------
(0 rows)
So I ran:
echo "from django.contrib.auth.models import User; User.objects.create_superuser('test', 'test@example.com', 'changeme')" | awx-manage shell
And now I'm on the license page, logged in. Nice!
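For the operator, that one-liner could be made idempotent along these lines (a sketch; admin_user, admin_email, and admin_password are assumed variables, and the `or` short-circuit skips creation when a superuser already exists):

```yaml
# Sketch: create the superuser only if none exists yet, so re-running
# the reconcile loop is safe. admin_user, admin_email, and
# admin_password are assumed variables.
- name: Create a Django superuser if one does not exist.
  shell: >-
    echo "from django.contrib.auth.models import User;
    User.objects.filter(is_superuser=True).exists() or
    User.objects.create_superuser('{{ admin_user }}', '{{ admin_email }}', '{{ admin_password }}')"
    | awx-manage shell
  no_log: true
```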
To automate all of this, I'm going to need the k8s_exec module that's in this PR: https://github.com/ansible/ansible/pull/55029. I'll probably toss it into the tower role's library directory and call it a day for now... I just wish it could've been merged into Ansible sooner :P
That module is giving me:
File "/usr/lib/python2.7/site-packages/kubernetes/stream/ws_client.py", line 255, in websocket_call
raise ApiException(status=0, reason=str(e))
kubernetes.client.rest.ApiException: (0)
Reason: Handshake status 403 Forbidden
In the above commit, I split the example CRs, with one for AWX and one for Tower. That way I can continue using the AWX one in the CI tests (at least for now... eventually I'll want to test both AWX and Tower).
At this point I'm getting:
TASK [tower : Migrate the database if the K8s resources were updated.] *********
task path: /opt/ansible/roles/tower/tasks/main.yml:32
fatal: [localhost]: FAILED! => {"changed": false, "module_stderr": "/usr/lib/python2.7/site-packages/kubernetes/config/kube_config.py:496: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
config_dict=yaml.load(f),
Traceback (most recent call last):
File \"/opt/ansible/.ansible/tmp/ansible-tmp-1573847294.36-128275569375865/AnsiballZ_k8s_exec.py\", line 114, in <module>
_ansiballz_main()
File \"/opt/ansible/.ansible/tmp/ansible-tmp-1573847294.36-128275569375865/AnsiballZ_k8s_exec.py\", line 106, in _ansiballz_main
invoke_module(zipped_mod, temp_path, ANSIBALLZ_PARAMS)
File \"/opt/ansible/.ansible/tmp/ansible-tmp-1573847294.36-128275569375865/AnsiballZ_k8s_exec.py\", line 49, in invoke_module
imp.load_module('__main__', mod, module, MOD_DESC)
File \"/tmp/ansible_k8s_exec_payload_bavjVr/__main__.py\", line 136, in <module>
File \"/tmp/ansible_k8s_exec_payload_bavjVr/__main__.py\", line 123, in main
File \"/usr/lib/python2.7/site-packages/kubernetes/stream/stream.py\", line 32, in stream
return func(*args, **kwargs)
File \"/usr/lib/python2.7/site-packages/kubernetes/client/apis/core_v1_api.py\", line 835, in connect_get_namespaced_pod_exec
(data) = self.connect_get_namespaced_pod_exec_with_http_info(name, namespace, **kwargs)
File \"/usr/lib/python2.7/site-packages/kubernetes/client/apis/core_v1_api.py\", line 935, in connect_get_namespaced_pod_exec_with_http_info
collection_formats=collection_formats)
File \"/usr/lib/python2.7/site-packages/kubernetes/client/api_client.py\", line 321, in call_api
_return_http_data_only, collection_formats, _preload_content, _request_timeout)
File \"/usr/lib/python2.7/site-packages/kubernetes/client/api_client.py\", line 155, in __call_api
_request_timeout=_request_timeout)
File \"/usr/lib/python2.7/site-packages/kubernetes/stream/stream.py\", line 27, in _intercept_request_call
return ws_client.websocket_call(config, *args, **kwargs)
File \"/usr/lib/python2.7/site-packages/kubernetes/stream/ws_client.py\", line 255, in websocket_call
raise ApiException(status=0, reason=str(e))
kubernetes.client.rest.ApiException: (0)
Reason: Handshake status 200 OK
", "module_stdout": "", "msg": "MODULE FAILURE
See stdout/stderr for the exact error", "rc": 1}
Some testing—on the command line, I can run:
$ kubectl exec -n example-tower example-tower-tower-6858559bcd-pbghh date
Fri Nov 15 20:10:51 UTC 2019
Testing in the operator playbook:
- name: Test a simple command.
k8s_exec:
namespace: '{{ meta.namespace }}'
pod: '{{ tower_pod_name }}'
command: date
register: date_result
- debug: var=date_result
It results in:
raise ApiException(status=0, reason=str(e))
kubernetes.client.rest.ApiException: (0)
Reason: Handshake status 200 OK
Digging a little bit, it seems that can happen if you're hitting an endpoint that's not actually a websocket; see https://stackoverflow.com/a/40110656/100134
So maybe the module's not finding the right URL to hit when it's running inside the Operator? Could it be an Ansible 2.8 issue (I believe I'm running 2.9 externally)? Going to do some more digging...
Running the same task on my host against Minikube with ansible===2.9.1, I had no problem. I ran pip3 uninstall ansible and pip3 install ansible===2.8.0 to match the version inside the Operator image, and the command still worked fine. So it's definitely something with running it from inside the cluster vs. running it from outside :/
Inside the container I was hitting:
bash-4.2$ ansible-playbook test.yml
PLAY [localhost] *******************************************************************************************************
TASK [Get the Tower web pod information.] ******************************************************************************
An exception occurred during task execution. To see the full traceback, use -vvv. The error was: KeyError: 'getpwuid(): uid not found: 1001'
fatal: [localhost]: FAILED! => {"msg": "Unexpected failure during module execution.", "stdout": ""}
It looks like the Ansible Operator Dockerfile adds environment information for an ansible-operator user (see: https://github.com/operator-framework/operator-sdk/blob/master/ci/dockerfiles/ansible.Dockerfile#L16-L19), but since OpenShift assigns a random UID on container start, that user is not added to /etc/passwd. I added the following line:
ansible-operator:x:1001:1001:ansible-operator user:/opt/ansible:/sbin/nologin
And the Ansible/Python getpwuid errors went away.
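For reference, the usual pattern for OpenShift's arbitrary UIDs is to generate that passwd entry at container start rather than hardcoding UID 1001. A minimal sketch (username and home dir match the operator image above; making /etc/passwd group-writable at build time is assumed):

```shell
# Build a passwd entry for whatever UID/GID the container was started
# with, so getpwuid() lookups succeed under OpenShift's random UIDs.
passwd_entry() {
  echo "ansible-operator:x:$(id -u):$(id -g):ansible-operator user:/opt/ansible:/sbin/nologin"
}

# In an entrypoint, append it only when the current UID has no entry:
#   whoami >/dev/null 2>&1 || passwd_entry >> /etc/passwd
passwd_entry
```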
So this is fun. If I create the following playbook inside the running ansible container of the tower-operator Pod:
- hosts: localhost
connection: local
gather_facts: false
tasks:
- name: Get the Tower web pod information.
# TODO: Change to k8s_info after Ansible 2.9.0 is available in Operator image.
k8s_facts:
kind: Pod
namespace: example-tower
label_selectors:
- app=tower
register: tower_pods
- name: Set the tower pod name as a variable.
set_fact:
tower_pod_name: "{{ tower_pods['resources'][0]['metadata']['name'] }}"
- name: Verify tower_pod_name is populated.
assert:
that: tower_pod_name != ''
fail_msg: "Could not find the tower pod's name."
- name: Test a simple command.
k8s_exec:
namespace: example-tower
pod: '{{ tower_pod_name }}'
command: date
register: date_result
- debug: var=date_result
Then I get the result:
TASK [Test a simple command.] ******************************************************************************************
changed: [localhost]
TASK [debug] ***********************************************************************************************************
ok: [localhost] => {
"date_result": {
"changed": true,
"failed": false,
"stderr": "",
"stderr_lines": [],
"stdout": "Fri Nov 15 20:52:14 UTC 2019\n",
"stdout_lines": [
"Fri Nov 15 20:52:14 UTC 2019"
]
}
}
PLAY RECAP *************************************************************************************************************
localhost : ok=5 changed=1 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
So it seems that something is different when it runs through ansible-runner? This is extremely puzzling.
@fabianvf and I were discussing this in the CoreOS Slack; it could be that the proxy the Ansible Operator sets up between K8s and the Ansible runs is intercepting the websocket request and not proxying the connection cleanly. I was glancing through https://github.com/operator-framework/operator-sdk/tree/424a61d56000e6e3d91d352faa1bd4f7c814661f/internal/scaffold/ansible and will have to dig a little deeper.
One other possibility: install kubectl inside the operator image, and use command: kubectl exec [stuff].
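A sketch of that fallback, assuming kubectl ends up on the operator image's PATH (meta.namespace and tower_pod_name as in the tasks above):

```yaml
# Fallback sketch: shell out to kubectl instead of using k8s_exec,
# sidestepping the operator proxy's websocket handling entirely.
- name: Create the preload data via kubectl exec.
  command: >-
    kubectl -n {{ meta.namespace }} exec {{ tower_pod_name }} --
    bash -c "awx-manage create_preload_data"
  register: preload_result
  changed_when: "'added' in preload_result.stdout"
```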
Opened an upstream issue https://github.com/operator-framework/operator-sdk/issues/2204, as it does seem related to the ansible operator's proxy.
I have everything working (I think) to get Tower automatically installed and operating now, but using kubectl instead of k8s_exec. I'm going to work on finishing this issue up, and move the work of getting k8s_exec working into https://github.com/geerlingguy/tower-operator/issues/8
Now when I run jobs, they never start, and the logs on the task Pod instance seem to indicate there could be some issues:
celery.beat Removing corrupted schedule file '/var/lib/awx/beat.db': error(11, 'Resource temporarily unavailable')
...
psycopg2.errors.UndefinedColumn: column main_instancegroup.credential_id does not exist
... [much later] ...
2019-11-18 21:18:13,932 DEBUG awx.main.scheduler Running Tower task manager.
2019-11-18 21:18:13,940 DEBUG awx.main.scheduler Starting Scheduler
2019-11-18 21:18:14,016 DEBUG awx.main.scheduler project_update 2 (pending) couldn't be scheduled on graph, waiting for next cycle
2019-11-18 21:18:14,066 DEBUG awx.main.scheduler Dependent project_update 2 (pending) couldn't be scheduled on graph, waiting for next cycle
2019-11-18 21:18:14,078 DEBUG awx.main.scheduler job 1 (pending) is blocked from running
2019-11-18 21:18:14,147 DEBUG awx.main.scheduler Dependent project_update 2 (pending) couldn't be scheduled on graph, waiting for next cycle
2019-11-18 21:18:14,159 DEBUG awx.main.scheduler job 3 (pending) is blocked from running
2019-11-18 21:18:14,165 DEBUG awx.main.dispatch task 743791cf-dac7-49db-870a-a44b482b4530 is finished
And the last messages repeat over and over as it seems to be trying to kick off jobs but is not successful.
(For the first item, see #3).
It looks like the AWX/Tower OpenShift installer uses a sidecar pod to provide celery... or something strange like that. It runs the command /usr/bin/launch_awx_task.sh and has the privileged security context (which is a little odd... but maybe it needs it?).
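Adding that to the operator's task container spec looks roughly like this (a sketch; the container name and image variable are assumptions based on the templates in this repo, and the securityContext field names are standard Kubernetes):

```yaml
# Sketch of the task container spec with the privileged context added.
containers:
  - name: tower-task
    image: "{{ tower_task_image }}"
    command: ["/usr/bin/launch_awx_task.sh"]
    securityContext:
      privileged: true
```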
So I added the privileged context and started it up again, and now am getting:
RuntimeError: The project revision for this job template is unknown due to a failed update.
And in the backend:
2019-11-18 21:30:39,791 ERROR awx.main.tasks job 1 (running) Exception occurred while running task
Traceback (most recent call last):
File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/awx/main/tasks.py", line 1255, in run
self.pre_run_hook(self.instance, private_data_dir)
File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/awx/main/tasks.py", line 1761, in pre_run_hook
raise RuntimeError(msg)
RuntimeError: The project revision for this job template is unknown due to a failed update.
That seems to be related to the initial SCM sync job, which errored out with the following after I restarted the tower task container:
Task was marked as running in Tower but was not present in the job queue, so it has been marked as failed.
And now everything seems to be working, after manually re-running the SCM sync job for the Demo Project...
Looking good. Next up: time to delete everything and build from scratch to verify it works OOTB.
It takes about 10m for everything to come up on first run, but the task container still runs into the following when I run the first job on it:
2019-11-18 22:19:01,692 DEBUG awx.main.scheduler project_update 1 (pending) couldn't be scheduled on graph, waiting for next cycle
If I delete the task pod, then wait for its replacement, then monitor it, it seems to at least bump jobs from 'Pending' to 'Waiting'... and then it takes some time for new jobs to be processed. Maybe just a weird first-time setup thing. But I'll probably take a deeper look at it later. Don't want to have to be restarting the task container all the time...
Side note—one other error that occurs on startup every time:
Using /etc/ansible/ansible.cfg as config file
An exception occurred during task execution. To see the full traceback, use -vvv. The error was: ImportError: No module named psycopg2
127.0.0.1 | FAILED! => {
"ansible_facts": {
"discovered_interpreter_python": "/usr/bin/python"
},
"changed": false,
"msg": "Failed to import the required Python library (psycopg2) on example-tower-tower-task-dcbf4bdcb-k8hjg's Python /usr/bin/python. Please read module documentation and install in the appropriate location"
}
If this next test passes, I'm going to test that AWX still works the same, and if so, close out this issue as complete.
Yay, test passed! Just need to test that AWX works similarly to Tower, then I'll close the issue. Day is wrapping up so it'll have to be later or tomorrow.
AWX worked just fine, but it also needed the task Pod to be deleted/restarted before it would start running Jobs. Strange, but whatever for now...
CI tests are now passing, too, so I'm going to go ahead and merge to master and close out this issue. Yay!
Right now I'm building out everything using open source AWX, just for convenience's sake. But I'm working on building the operator in a way where users could choose between AWX and Tower (if they want support and a license, and all that).