ansible / awx

AWX provides a web-based user interface, REST API, and task engine built on top of Ansible. It is one of the upstream projects for Red Hat Ansible Automation Platform.

No facts cached when job slicing is greater than 1 #13863

Open krutaw opened 1 year ago

krutaw commented 1 year ago

Please confirm the following

Bug Summary

While running a job template that has slicing set to 5 and fact storage enabled, against an inventory with a total of 10 hosts, the job runs successfully; however, facts are not cached. If I then set slicing back to 1 and re-run the same job, facts are cached properly.

AWX version

22.0.0

Select the relevant components

Installation method

kubernetes

Modifications

no

Ansible version

2.14.4

Operating system

Tested with alpine and stream-8 against Windows hosts

Web browser

No response

Steps to reproduce

Expected results

Standard facts from the playbook should be cached at the host level.

Actual results

Facts were not cached, leaving the default {} in the host facts
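For anyone trying to confirm the symptom quickly, the stored facts can be checked through the API after the job finishes. A minimal sketch, not part of the original report; the URL and token are placeholders, and inventory 49 is the one referenced later in this thread:

```python
# Minimal sketch: after a sliced run, check whether AWX stored facts for each
# host in the inventory. AWX_URL and the token are placeholders.
import requests

AWX_URL = "https://awx.example.com"
HEADERS = {"Authorization": "Bearer REDACTED"}
INVENTORY_ID = 49

hosts = requests.get(
    f"{AWX_URL}/api/v2/inventories/{INVENTORY_ID}/hosts/", headers=HEADERS
).json()["results"]

for host in hosts:
    facts = requests.get(
        f"{AWX_URL}/api/v2/hosts/{host['id']}/ansible_facts/", headers=HEADERS
    ).json()
    # An empty dict here is the "default {}" described above.
    print(host["name"], "has facts" if facts else "has NO facts")
```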

Additional information

This is based on the conversation found here: https://groups.google.com/g/awx-project/c/Cd_SgiYEfVk.

Also, I'd just like to say thank you for all the hard work you folks have put into this. It's very much appreciated. :)

fosterseth commented 1 year ago

We had some issues reproducing this.

With 6 Linux hosts and job slicing set to 3, we were able to get facts back from each host.

Can you run a job against Linux boxes to see if the problem exists there too? If not, that might help us narrow the issue down to Windows specifically.

krutaw commented 1 year ago

Absolutely, I'll get that done today and report back.

krutaw commented 1 year ago

So sorry it took me this long to get back to you. It's really interesting that you were able to get facts back on Linux servers, because I wasn't. Here are the specific settings I have on the job template (pulled from the API):

{ "name": "test", "description": "", "job_type": "run", "inventory": 49, "project": 114, "playbook": "main.yml", "scm_branch": "", "forks": 0, "limit": "", "verbosity": 0, "extra_vars": "---", "job_tags": "", "force_handlers": false, "skip_tags": "", "start_at_task": "", "timeout": 0, "use_fact_cache": true, "execution_environment": null, "host_config_key": "", "ask_scm_branch_on_launch": false, "ask_diff_mode_on_launch": false, "ask_variables_on_launch": false, "ask_limit_on_launch": true, "ask_tags_on_launch": false, "ask_skip_tags_on_launch": false, "ask_job_type_on_launch": false, "ask_verbosity_on_launch": false, "ask_inventory_on_launch": false, "ask_credential_on_launch": false, "ask_execution_environment_on_launch": false, "ask_labels_on_launch": false, "ask_forks_on_launch": false, "ask_job_slice_count_on_launch": false, "ask_timeout_on_launch": false, "ask_instance_groups_on_launch": false, "survey_enabled": false, "become_enabled": false, "diff_mode": false, "allow_simultaneous": false, "job_slice_count": 3, "webhook_service": "", "webhook_credential": null, "prevent_instance_group_fallback": false }

When I look at one of the sliced jobs, I can clearly see that fact caching is enabled, but no facts were cached:

{ "id": 1659, "type": "job", "url": "/api/v2/jobs/1659/", "related": { "created_by": "/api/v2/users/1/", "labels": "/api/v2/jobs/1659/labels/", "inventory": "/api/v2/inventories/49/", "project": "/api/v2/projects/114/", "organization": "/api/v2/organizations/36/", "credentials": "/api/v2/jobs/1659/credentials/", "unified_job_template": "/api/v2/job_templates/117/", "stdout": "/api/v2/jobs/1659/stdout/", "source_workflow_job": "/api/v2/workflow_jobs/1656/", "execution_environment": "/api/v2/execution_environments/36/", "job_events": "/api/v2/jobs/1659/job_events/", "job_host_summaries": "/api/v2/jobs/1659/job_host_summaries/", "activity_stream": "/api/v2/jobs/1659/activity_stream/", "notifications": "/api/v2/jobs/1659/notifications/", "create_schedule": "/api/v2/jobs/1659/create_schedule/", "job_template": "/api/v2/job_templates/117/", "cancel": "/api/v2/jobs/1659/cancel/", "relaunch": "/api/v2/jobs/1659/relaunch/" }, "summary_fields": { "organization": { "id": 36, "name": "REDACTED", "description": "Organization managed by ansible playbook REDACTED" }, "inventory": { "id": 49, "name": "vCenter - REDACTED", "description": "Inventory for vCenter Server.", "has_active_failures": true, "total_hosts": 1234, "hosts_with_active_failures": 12, "total_groups": 13, "has_inventory_sources": true, "total_inventory_sources": 1, "inventory_sources_with_failures": 0, "organization_id": 36, "kind": "" }, "execution_environment": { "id": 36, "name": "standardized_ee", "description": "", "image": "REDACTED" }, "project": { "id": 114, "name": "ap_test_facts", "description": "", "status": "successful", "scm_type": "git", "allow_override": false }, "job_template": { "id": 117, "name": "test", "description": "" }, "unified_job_template": { "id": 117, "name": "test", "description": "", "unified_job_type": "job" }, "instance_group": { "id": 35, "name": "REDACTED_exec", "is_container_group": false }, "created_by": { "id": 1, "username": "admin", "first_name": "", "last_name": "" }, "user_capabilities": { "delete": true, "start": true }, "labels": { "count": 0, "results": [] }, "source_workflow_job": { "id": 1656, "name": "test", "description": "", "status": "failed", "failed": true, "elapsed": 20.426 }, "ancestor_job": { "id": 1656, "name": "test", "type": "workflow_job", "url": "/api/v2/workflow_jobs/1656/" }, "credentials": [ { "id": 59, "name": "REDACTED_Linux_awx_REDACTED", "description": "", "kind": "ssh", "cloud": false } ] }, "created": "2023-04-21T14:37:35.670429Z", "modified": "2023-04-21T14:37:36.649589Z", "name": "test", "description": "", "job_type": "run", "inventory": 49, "project": 114, "playbook": "main.yml", "scm_branch": "", "forks": 0, "limit": "oldawx", "verbosity": 0, "extra_vars": "{}", "job_tags": "", "force_handlers": false, "skip_tags": "", "start_at_task": "", "timeout": 0, "use_fact_cache": true, "organization": 36, "unified_job_template": 117, "launch_type": "workflow", "status": "failed", "execution_environment": 36, "failed": true, "started": "2023-04-21T14:37:37.174105Z", "finished": "2023-04-21T14:37:55.095867Z", "canceled_on": null, "elapsed": 17.922, "job_args": "[\"podman\", \"run\", \"--rm\", \"--tty\", \"--interactive\", \"--workdir\", \"/runner/project\", \"-v\", \"/tmp/awx_1659_b59k2crh/:/runner/:Z\", \"-v\", \"/etc/pki/ca-trust/:/etc/pki/ca-trust/:O\", \"-v\", \"/usr/share/pki/:/usr/share/pki/:O\", \"--env-file\", \"/tmp/awx_1659_b59k2crh/artifacts/1659/env.list\", \"--quiet\", \"--name\", \"ansible_runner_1659\", \"--user=root\", \"--network\", 
\"slirp4netns:enable_ipv6=true\", \"--pull=always\", \"REDACTED\", \"ansible-playbook\", \"-u\", \"REDACTED\", \"--ask-pass\", \"--become-method\", \"sudo\", \"--ask-become-pass\", \"-l\", \"oldawx\", \"-i\", \"/runner/inventory/hosts\", \"-e\", \"@/runner/env/extravars\", \"main.yml\"]", "job_cwd": "/runner/project", "job_env": { "ANSIBLE_UNSAFE_WRITES": "1", "AWX_ISOLATED_DATA_DIR": "/runner/artifacts/1659", "ANSIBLE_CACHE_PLUGIN_CONNECTION": "/runner/artifacts/1659/fact_cache", "ANSIBLE_FORCE_COLOR": "True", "ANSIBLE_HOST_KEY_CHECKING": "False", "ANSIBLE_INVENTORY_UNPARSED_FAILED": "True", "ANSIBLE_PARAMIKO_RECORD_HOST_KEYS": "False", "AWX_PRIVATE_DATA_DIR": "/tmp/awx_1659_b59k2crh", "JOB_ID": "1659", "INVENTORY_ID": "49", "PROJECT_REVISION": "e892b7fcdd50f5fefbb5cb1020619898de9c9909", "ANSIBLE_RETRY_FILES_ENABLED": "False", "MAX_EVENT_RES": "700000", "AWX_HOST": "REDACTED", "ANSIBLE_SSH_CONTROL_PATH_DIR": "/runner/cp", "ANSIBLE_COLLECTIONS_PATHS": "/runner/requirements_collections:~/.ansible/collections:/usr/share/ansible/collections", "ANSIBLE_ROLES_PATH": "/runner/requirements_roles:~/.ansible/roles:/usr/share/ansible/roles:/etc/ansible/roles", "ANSIBLE_CALLBACK_PLUGINS": "/runner/artifacts/1659/callback", "ANSIBLE_STDOUT_CALLBACK": "awx_display", "ANSIBLE_CACHE_PLUGIN": "jsonfile", "RUNNER_OMIT_EVENTS": "False", "RUNNER_ONLY_FAILED_EVENTS": "False" }, "job_explanation": "", "execution_node": "REDACTED", "controller_node": "awx-k8s-task-64498697c6-jnxqx", "result_traceback": "", "event_processing_finished": true, "launched_by": { "id": 1, "name": "admin", "type": "user", "url": "/api/v2/users/1/" }, "work_unit_id": "3TP5Gd0Z", "job_template": 117, "passwords_needed_to_start": [], "allow_simultaneous": true, "artifacts": {}, "scm_revision": "e892b7fcdd50f5fefbb5cb1020619898de9c9909", "instance_group": 35, "diff_mode": false, "job_slice_number": 3, "job_slice_count": 3, "webhook_service": "", "webhook_credential": null, "webhook_guid": "", "host_status_counts": { "dark": 1 }, "playbook_counts": { "play_count": 1, "task_count": 1 }, "custom_virtualenv": null }

Am I missing something?
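For what it's worth, the job_env above shows the jsonfile cache plugin pointed at /runner/artifacts/1659/fact_cache, which maps to the job's private data directory on the execution node. One way to narrow this down would be to check whether per-host fact files get written there at all during the sliced run. A rough sketch only; it assumes the job directory still exists (e.g. while the job is running, or with job-directory cleanup disabled):

```python
# Rough diagnostic sketch, run on the execution node: list the per-host JSON
# files written by Ansible's jsonfile cache plugin during the sliced run.
# /tmp/awx_1659_b59k2crh is the AWX_PRIVATE_DATA_DIR from the job output above;
# it is normally removed after the job, so this assumes cleanup is disabled or
# the job is still in flight.
import json
from pathlib import Path

fact_cache = Path("/tmp/awx_1659_b59k2crh/artifacts/1659/fact_cache")

for path in sorted(fact_cache.glob("*")):
    data = json.loads(path.read_text())
    print(path.name, "->", len(data), "fact keys")
```

If the files show up with facts but the hosts in AWX stay at {}, that would point at the step where AWX consumes the artifacts after the slice finishes rather than at the playbook run itself.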

krutaw commented 1 year ago

Okay, here's something neat. My original test included three Linux hosts, with slicing set to three. On a whim, I tried setting slicing to two and running again... voila, I got facts. Then I deleted the hosts, recreated them, set slicing back to three, and once again got no facts cached. Then I tried setting slicing to four (with three hosts, so obviously one slice will get no hosts) and once again got no facts. As time permits, I'll try some other scenarios as well.

djyasin commented 1 year ago

@krutaw Thank you for providing this additional information. We will need to spend some time on this issue. If you have any other tips regarding reproducing it, they would be very much appreciated.

krutaw commented 1 year ago

@djyasin Understood. I'm hoping to have time today to run a few more tests and will report on the various permutations and results.

AlanCoding commented 1 year ago

I set up an inventory with 7 hosts and then tried slicing at 2, 3, 4, 5, ... and I wasn't able to find any case where the facts were not saved as expected. I'm still interested if you have any further leads, but I can't reproduce this with the current information.

krutaw commented 1 year ago

We recently upgraded to 22.1.0, and with that I tested again. I set the number of devices to 7, as per your test. I then set the slicing to 5, and voila, it worked: I had facts cached at the host level in AWX. I was ready to call it, but thought I should probably test one more time, so I set the slicing to 6, deleted the hosts from AWX, and re-ran the job template. No facts. Then I deleted the hosts and tried the exact same test again with slicing set to 6. Again, no facts. Then I deleted the hosts and tried again with slicing set to 5: voila, facts. Then I deleted the hosts again, changed slicing to 7, and ran: no facts.

So, then I repeated the steps with slicing set to 4, 3, and 2, and here are the results:

4: No facts
3: Facts populated (I repeated this test after deleting the hosts, and facts populated again)
2: No facts (I relaunched without deleting and still got no facts)

After the failure with slicing at 2, I moved it back up to 4 and again, no facts. Noticing the pattern, I set the slicing to 5, and voila the facts were once again cached at the host level in AWX.

If it would help, I'd be happy to do a screen share and show you exactly what I'm doing and the behaviors involved.

AlanCoding commented 1 year ago

> and set the slicing to 6, deleting the hosts from AWX and then re-running the job template.

So far: you have an inventory with 7 hosts (created manually, I assume). You ran your JT against that inventory with a slice count of 5, and all 7 hosts have facts. Then you deleted your 7 hosts, changed the slice count to 6, and... at this point, did you re-create the 7 hosts and then relaunch that same job?

This is close to my steps, but with seemingly minor differences around what is and isn't re-created. If you could spell this part out a little better, I'll retry my tests to better reflect yours.

krutaw commented 1 year ago

I'm not using a manually created inventory; I'm using an inventory synced from VMware. When I deleted the hosts between JT runs, I would perform a re-sync of the VMware vCenter source in order to re-create the hosts. I did not relaunch the same job, but would instead click through from the JT manually, just in case relaunching would have cached the slicing (I wasn't sure, so I was trying to be overly cautious).

Oh, and all of the hosts are linux servers - I'm no longer testing against windows servers.

krutaw commented 1 year ago

FWIW, we just upgraded to 22.2.0 and are still facing the issue.

krutaw commented 1 year ago

Has no one been able to duplicate the behavior?

AlanCoding commented 1 year ago

> I deleted the hosts in between jt runs, I would perform a re-sync of the VMware vCenter in order to re-create the hosts.

Would it clear things up to note that, when a host is deleted, any previously cached facts are lost? We cache facts in relation to the host model, not the host name (which is what you might expect coming from the CLI).
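As a side note for anyone debugging this, the stored facts can also be inspected directly on the AWX side. A small sketch only, assuming the Host model fields ansible_facts and ansible_facts_modified, run from something like awx-manage shell_plus in the task container:

```python
# Sketch: inspect cached facts as AWX stores them, keyed to the Host rows of
# one inventory. Field names (ansible_facts, ansible_facts_modified) are
# assumed from the current Host model; inventory 49 is the one in this report.
from awx.main.models import Host

for host in Host.objects.filter(inventory_id=49).order_by("name"):
    print(host.name, bool(host.ansible_facts), host.ansible_facts_modified)
```

Because the facts hang off the Host row, deleting the hosts and re-syncing the inventory always starts them back at {}, independent of this bug.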

krutaw commented 1 year ago

That doesn't clear up a thing. I deleted the hosts to rule out anything that would prevent the facts from being populated (i.e., to perform a clean test). The fact remains that the facts aren't being populated.

krutaw commented 1 year ago

Just circling back, I have confirmed this issue still exists in 22.3.0.

krutaw commented 1 year ago

Just verified that the same problem still exists in 22.4.0

krutaw commented 1 year ago

Just verified the same issue still exists in 22.5.0.

krutaw commented 1 year ago

Any updates?

AlanCoding commented 1 year ago

I'm sorry, I was never able to reproduce this. I believe there is a bug here and would be interested in it, but I cannot think of anything to try that would make a difference and is different from what I already tried. A common issue is that out-of-sync clocks on the control node vs. the execution node can cause facts to be lost, which might be worth looking into? Kind of grasping at straws.
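If it helps, here is a quick way to spot that kind of skew. A small sketch only; the hostnames are placeholders, and it assumes SSH access to the nodes with a POSIX `date` on the remote side:

```python
# Quick clock-skew check between this machine and the AWX nodes, since the
# comment above suggests out-of-sync clocks can cause facts to be lost.
# Hostnames are placeholders; assumes SSH access and `date` on the remote side.
import subprocess
import time

NODES = ["awx-control.example.com", "exec-node.example.com"]  # placeholders

local_now = int(time.time())
for node in NODES:
    out = subprocess.check_output(["ssh", node, "date", "+%s"]).decode().strip()
    print(f"{node}: {int(out) - local_now:+d}s offset from this machine")
```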

krutaw commented 1 year ago

I'll double check, but I'm relatively certain that we've already checked that.

AlanCoding commented 1 year ago

Maybe this is a case where community-facing integration testing could help. We want to work towards having that. Right now we have a "dev-env" check, which spins up the docker-compose environment. I'd love to get to the point where I can actually submit the test I was using to try to reproduce this; that would give us a better chance at pinning down the issue.
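For the record, here is roughly the kind of test that could be submitted once that exists. A sketch only: the URL and token are placeholders (8013 assumes the dev-env HTTP port), the IDs reuse the ones from this thread, and the fact check at the end is the same loop as the earlier sketch.

```python
# Rough pytest-style sketch of the reproduction to automate against a dev-env
# deployment: launch a sliced job template, wait for it to finish, then assert
# every host has non-empty cached facts. URL, token, and IDs are placeholders.
import time

import requests

AWX_URL = "http://localhost:8013"
HEADERS = {"Authorization": "Bearer TOKEN"}
JT_ID = 117
INVENTORY_ID = 49


def test_sliced_job_caches_facts():
    # Launching a JT with job_slice_count > 1 produces a workflow job whose
    # detail URL comes back in the launch response.
    launch = requests.post(
        f"{AWX_URL}/api/v2/job_templates/{JT_ID}/launch/", headers=HEADERS
    ).json()
    job_url = f"{AWX_URL}{launch['url']}"

    while requests.get(job_url, headers=HEADERS).json()["status"] not in (
        "successful", "failed", "error", "canceled"
    ):
        time.sleep(5)

    hosts = requests.get(
        f"{AWX_URL}/api/v2/inventories/{INVENTORY_ID}/hosts/", headers=HEADERS
    ).json()["results"]
    for host in hosts:
        facts = requests.get(
            f"{AWX_URL}/api/v2/hosts/{host['id']}/ansible_facts/", headers=HEADERS
        ).json()
        assert facts, f"no facts cached for {host['name']}"
```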

krutaw commented 1 year ago

Interesting thought there. I think half of it would be configuring the various things like credentials, vCenters, etc. Cool idea, though.