krutaw opened this issue 1 year ago
We had some issues reproducing this: with 6 Linux hosts and job slicing set to 3, we were able to get facts back from each host.
Can you run a job against Linux boxes to see if the problem exists there too? If not, this might help us narrow down the issue to Windows specifically.
Absolutely, I'll get that done today and report back.
So sorry it took me this long to get back to you. That's really interesting that you were able to get facts back on Linux servers, because I wasn't. Here are the specific settings I have set on the job template (pulled from the API):
{ "name": "test", "description": "", "job_type": "run", "inventory": 49, "project": 114, "playbook": "main.yml", "scm_branch": "", "forks": 0, "limit": "", "verbosity": 0, "extra_vars": "---", "job_tags": "", "force_handlers": false, "skip_tags": "", "start_at_task": "", "timeout": 0, "use_fact_cache": true, "execution_environment": null, "host_config_key": "", "ask_scm_branch_on_launch": false, "ask_diff_mode_on_launch": false, "ask_variables_on_launch": false, "ask_limit_on_launch": true, "ask_tags_on_launch": false, "ask_skip_tags_on_launch": false, "ask_job_type_on_launch": false, "ask_verbosity_on_launch": false, "ask_inventory_on_launch": false, "ask_credential_on_launch": false, "ask_execution_environment_on_launch": false, "ask_labels_on_launch": false, "ask_forks_on_launch": false, "ask_job_slice_count_on_launch": false, "ask_timeout_on_launch": false, "ask_instance_groups_on_launch": false, "survey_enabled": false, "become_enabled": false, "diff_mode": false, "allow_simultaneous": false, "job_slice_count": 3, "webhook_service": "", "webhook_credential": null, "prevent_instance_group_fallback": false }
When I look at one of the sliced jobs, I can clearly see fact caching is enabled, but no facts were cached:
{ "id": 1659, "type": "job", "url": "/api/v2/jobs/1659/", "related": { "created_by": "/api/v2/users/1/", "labels": "/api/v2/jobs/1659/labels/", "inventory": "/api/v2/inventories/49/", "project": "/api/v2/projects/114/", "organization": "/api/v2/organizations/36/", "credentials": "/api/v2/jobs/1659/credentials/", "unified_job_template": "/api/v2/job_templates/117/", "stdout": "/api/v2/jobs/1659/stdout/", "source_workflow_job": "/api/v2/workflow_jobs/1656/", "execution_environment": "/api/v2/execution_environments/36/", "job_events": "/api/v2/jobs/1659/job_events/", "job_host_summaries": "/api/v2/jobs/1659/job_host_summaries/", "activity_stream": "/api/v2/jobs/1659/activity_stream/", "notifications": "/api/v2/jobs/1659/notifications/", "create_schedule": "/api/v2/jobs/1659/create_schedule/", "job_template": "/api/v2/job_templates/117/", "cancel": "/api/v2/jobs/1659/cancel/", "relaunch": "/api/v2/jobs/1659/relaunch/" }, "summary_fields": { "organization": { "id": 36, "name": "REDACTED", "description": "Organization managed by ansible playbook REDACTED" }, "inventory": { "id": 49, "name": "vCenter - REDACTED", "description": "Inventory for vCenter Server.", "has_active_failures": true, "total_hosts": 1234, "hosts_with_active_failures": 12, "total_groups": 13, "has_inventory_sources": true, "total_inventory_sources": 1, "inventory_sources_with_failures": 0, "organization_id": 36, "kind": "" }, "execution_environment": { "id": 36, "name": "standardized_ee", "description": "", "image": "REDACTED" }, "project": { "id": 114, "name": "ap_test_facts", "description": "", "status": "successful", "scm_type": "git", "allow_override": false }, "job_template": { "id": 117, "name": "test", "description": "" }, "unified_job_template": { "id": 117, "name": "test", "description": "", "unified_job_type": "job" }, "instance_group": { "id": 35, "name": "REDACTED_exec", "is_container_group": false }, "created_by": { "id": 1, "username": "admin", "first_name": "", "last_name": "" }, "user_capabilities": { "delete": true, "start": true }, "labels": { "count": 0, "results": [] }, "source_workflow_job": { "id": 1656, "name": "test", "description": "", "status": "failed", "failed": true, "elapsed": 20.426 }, "ancestor_job": { "id": 1656, "name": "test", "type": "workflow_job", "url": "/api/v2/workflow_jobs/1656/" }, "credentials": [ { "id": 59, "name": "REDACTED_Linux_awx_REDACTED", "description": "", "kind": "ssh", "cloud": false } ] }, "created": "2023-04-21T14:37:35.670429Z", "modified": "2023-04-21T14:37:36.649589Z", "name": "test", "description": "", "job_type": "run", "inventory": 49, "project": 114, "playbook": "main.yml", "scm_branch": "", "forks": 0, "limit": "oldawx", "verbosity": 0, "extra_vars": "{}", "job_tags": "", "force_handlers": false, "skip_tags": "", "start_at_task": "", "timeout": 0, "use_fact_cache": true, "organization": 36, "unified_job_template": 117, "launch_type": "workflow", "status": "failed", "execution_environment": 36, "failed": true, "started": "2023-04-21T14:37:37.174105Z", "finished": "2023-04-21T14:37:55.095867Z", "canceled_on": null, "elapsed": 17.922, "job_args": "[\"podman\", \"run\", \"--rm\", \"--tty\", \"--interactive\", \"--workdir\", \"/runner/project\", \"-v\", \"/tmp/awx_1659_b59k2crh/:/runner/:Z\", \"-v\", \"/etc/pki/ca-trust/:/etc/pki/ca-trust/:O\", \"-v\", \"/usr/share/pki/:/usr/share/pki/:O\", \"--env-file\", \"/tmp/awx_1659_b59k2crh/artifacts/1659/env.list\", \"--quiet\", \"--name\", \"ansible_runner_1659\", \"--user=root\", \"--network\", 
\"slirp4netns:enable_ipv6=true\", \"--pull=always\", \"REDACTED\", \"ansible-playbook\", \"-u\", \"REDACTED\", \"--ask-pass\", \"--become-method\", \"sudo\", \"--ask-become-pass\", \"-l\", \"oldawx\", \"-i\", \"/runner/inventory/hosts\", \"-e\", \"@/runner/env/extravars\", \"main.yml\"]", "job_cwd": "/runner/project", "job_env": { "ANSIBLE_UNSAFE_WRITES": "1", "AWX_ISOLATED_DATA_DIR": "/runner/artifacts/1659", "ANSIBLE_CACHE_PLUGIN_CONNECTION": "/runner/artifacts/1659/fact_cache", "ANSIBLE_FORCE_COLOR": "True", "ANSIBLE_HOST_KEY_CHECKING": "False", "ANSIBLE_INVENTORY_UNPARSED_FAILED": "True", "ANSIBLE_PARAMIKO_RECORD_HOST_KEYS": "False", "AWX_PRIVATE_DATA_DIR": "/tmp/awx_1659_b59k2crh", "JOB_ID": "1659", "INVENTORY_ID": "49", "PROJECT_REVISION": "e892b7fcdd50f5fefbb5cb1020619898de9c9909", "ANSIBLE_RETRY_FILES_ENABLED": "False", "MAX_EVENT_RES": "700000", "AWX_HOST": "REDACTED", "ANSIBLE_SSH_CONTROL_PATH_DIR": "/runner/cp", "ANSIBLE_COLLECTIONS_PATHS": "/runner/requirements_collections:~/.ansible/collections:/usr/share/ansible/collections", "ANSIBLE_ROLES_PATH": "/runner/requirements_roles:~/.ansible/roles:/usr/share/ansible/roles:/etc/ansible/roles", "ANSIBLE_CALLBACK_PLUGINS": "/runner/artifacts/1659/callback", "ANSIBLE_STDOUT_CALLBACK": "awx_display", "ANSIBLE_CACHE_PLUGIN": "jsonfile", "RUNNER_OMIT_EVENTS": "False", "RUNNER_ONLY_FAILED_EVENTS": "False" }, "job_explanation": "", "execution_node": "REDACTED", "controller_node": "awx-k8s-task-64498697c6-jnxqx", "result_traceback": "", "event_processing_finished": true, "launched_by": { "id": 1, "name": "admin", "type": "user", "url": "/api/v2/users/1/" }, "work_unit_id": "3TP5Gd0Z", "job_template": 117, "passwords_needed_to_start": [], "allow_simultaneous": true, "artifacts": {}, "scm_revision": "e892b7fcdd50f5fefbb5cb1020619898de9c9909", "instance_group": 35, "diff_mode": false, "job_slice_number": 3, "job_slice_count": 3, "webhook_service": "", "webhook_credential": null, "webhook_guid": "", "host_status_counts": { "dark": 1 }, "playbook_counts": { "play_count": 1, "task_count": 1 }, "custom_virtualenv": null }
Am I missing something?
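For reference, here is a minimal sketch of one way to read the cached facts back through the AWX API; the URL, token, and host IDs below are placeholders, not values from this environment:

```python
# Sketch: check whether AWX has cached facts for a set of hosts.
# AWX_URL, TOKEN, and HOST_IDS are placeholders.
import requests

AWX_URL = "https://awx.example.com"
TOKEN = "REDACTED"
HOST_IDS = [101, 102, 103]  # hypothetical host IDs from /api/v2/hosts/

headers = {"Authorization": f"Bearer {TOKEN}"}

for host_id in HOST_IDS:
    resp = requests.get(
        f"{AWX_URL}/api/v2/hosts/{host_id}/ansible_facts/",
        headers=headers,
    )
    resp.raise_for_status()
    facts = resp.json()
    # An empty dict here corresponds to the "{}" shown for the host in the UI.
    print(host_id, "facts cached" if facts else "no facts cached")
```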
Okay, here's something neat. My original test included three Linux hosts, with the slicing set to three. On a whim, I tried setting slicing to two and running again... voila, I got facts. Then I deleted the hosts, recreated them, set slicing back to three, and once again got no facts cached. Then I tried setting slicing to four (with three hosts, so obviously one slice will get no hosts) and once again got no facts. As time permits, I'll try some other scenarios as well.
@krutaw Thank you for providing this additional information. We will need to spend some time on this issue. If you have any other tips regarding reproducing this issue, it would be very much appreciated.
@djyasin Understood. I'm hoping to have time today to run a few more tests and will report on the various permutations and results.
I set up an inventory with 7 hosts and then tried slicing at 2, 3, 4, 5, ... and I wasn't able to find any case where the facts were not saved as expected. I'm still interested if you have any further leads, but I can't reproduce it with the current information.
We recently upgraded to 22.1.0, and with that I tested again. I set the number of devices to 7 as per your test. I then set the slicing to 5 and, voila, it worked: I had facts cached at the host level in AWX. I was ready to call it but thought, gosh, I should probably test one more time and set the slicing to 6, deleting the hosts from AWX and then re-running the job template. No facts. Then I deleted the hosts and tried the exact same test again with slicing set to 6. Again, no facts. Then I deleted the hosts and tried again with slicing set to 5: voila, facts. Then I deleted the hosts again, changed slicing to 7, and ran: no facts.
So, then I repeated the steps for slicing set to 4, 3, and 2, and here are the results:
4: No facts
3: Facts populated - I repeated this test after deleting and facts populated again
2: No facts - I relaunched without deleting and still no facts
After the failure with slicing at 2, I moved it back up to 4 and again, no facts. Noticing the pattern, I set the slicing to 5, and voila the facts were once again cached at the host level in AWX.
If it would help, I'd be happy to do a screen share and show you exactly what I'm doing and the behaviors involved.
and set the slicing to 6, deleting the hosts from AWX and then re-running the job template.
So far, you have an inventory with 7 hosts (created manually, I assume). You ran your JT against that inventory with a slice count of 5, and all 7 hosts have facts. Then you deleted your 7 hosts, changed the slice count to 6, and... at this point did you re-create the 7 hosts? Then relaunch that same job?
This is close to my steps, but with seemingly minor differences around what is and isn't re-created. If you could spell this part out a little better, then I'll retry my tests to better reflect yours.
I'm not using a manually created inventory. I am using an inventory sync'd from VMware. When I deleted the hosts in between jt runs, I would perform a re-sync of the VMware vCenter in order to re-create the hosts. I did not re-launch the same job, but would instead click through from the jt manually just in case it would have cached the slicing (I wasn't sure so I was trying to be overly cautious.)
Oh, and all of the hosts are linux servers - I'm no longer testing against windows servers.
FWIW, we just upgraded to 22.2.0 and are still facing the issue.
Has no one been able to duplicate the behavior?
I deleted the hosts in between jt runs, I would perform a re-sync of the VMware vCenter in order to re-create the hosts.
Would it clear it up to note that, when a host is deleted, any previously cached facts are lost? We cache facts in relation to the host model, not the host name (which you may expect from the CLI).
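To illustrate that point, here is a rough sketch (the URL, token, and hostname are placeholders): a host that is deleted and then re-created by an inventory sync comes back as a new host record with a new id and empty facts, even though the name is unchanged.

```python
# Rough illustration: facts are tied to the host record (its id), not the hostname.
# AWX_URL, TOKEN, and HOSTNAME are placeholders.
import requests

AWX_URL = "https://awx.example.com"
TOKEN = "REDACTED"
HOSTNAME = "linux01.example.com"

headers = {"Authorization": f"Bearer {TOKEN}"}

# Look the host up by name; after a delete + re-sync the id will be different
# and the cached facts start out empty.
hosts = requests.get(
    f"{AWX_URL}/api/v2/hosts/", headers=headers, params={"name": HOSTNAME}
).json()["results"]

for host in hosts:
    facts = requests.get(
        f"{AWX_URL}/api/v2/hosts/{host['id']}/ansible_facts/", headers=headers
    ).json()
    print(host["id"], host["name"], "facts present" if facts else "facts empty")
```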
Doesn't clear up a thing. I deleted the hosts to rule out anything that might have prevented the facts from being populated (i.e., to perform a clean test). The fact still remains that the facts aren't being populated.
Just circling back, I have confirmed this issue still exists in 22.3.0.
Just verified that the same problem still exists in 22.4.0
Just verified the same issue still exists in 22.5.0.
Any updates?
I'm sorry, I was never able to reproduce this. I believe there is a bug here and would be interested in it, but I cannot think of anything to try that would make a difference and is different from what I already tried. A common issue is that out-of-sync time settings on the control node vs. the execution node can cause facts to be lost, which might be worth looking into? Kind of grasping at straws.
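To make the clock-skew idea concrete, here is a hypothetical simplification (not AWX's actual code) of the kind of timestamp check that could silently drop freshly gathered facts if the execution node's clock lags behind the controller's:

```python
# Hypothetical simplification of a timestamp-based fact-cache save step.
# Facts written by the job are only kept if the cache file looks newer than
# the recorded job start time; clock skew can make fresh files look "old".
import os
import time


def facts_to_keep(fact_cache_dir: str, job_start: float) -> list[str]:
    """Keep only fact files that appear to have been written after job_start."""
    keep = []
    if not os.path.isdir(fact_cache_dir):
        return keep
    for name in os.listdir(fact_cache_dir):
        path = os.path.join(fact_cache_dir, name)
        # If the node writing these files has a lagging clock, a file written
        # during the job can still have an mtime before job_start and be skipped.
        if os.path.getmtime(path) >= job_start:
            keep.append(name)
    return keep


if __name__ == "__main__":
    # e.g. a runner artifacts dir like /runner/artifacts/<job_id>/fact_cache
    print(facts_to_keep("/tmp/fact_cache", time.time() - 300))
```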
I'll double check, but I'm relatively certain that we've already checked that.
Maybe this is a case where community-facing integration testing could help; we want to work towards having that. Right now we have a "dev-env" check, which spins up the docker-compose environment. I'd love to get to the point where I could actually submit the test I was trying to use for reproducing this, which would give us a better chance at pinning down the issue.
Interesting thought there, I think half of it would be configuration of the various things like credentials, vCenters, etc. Cool idea though.
Please confirm the following
Bug Summary
While running a job template that has slicing set to 5 and fact storage enabled against an inventory with a total of 10 hosts, the job runs successfully; however, facts are not cached. If I then set slicing back to 1 and re-run the same job, facts are properly cached.
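For reference, a minimal sketch of how this can be checked end-to-end via the API, assuming an admin token; the URL and token are placeholders, and the template/inventory IDs are the ones from this report:

```python
# Sketch of a reproduction check: launch the sliced job template, wait for the
# run to finish, then verify whether each inventory host has cached facts.
# AWX_URL and TOKEN are placeholders; the IDs come from this report.
import requests

AWX_URL = "https://awx.example.com"
TOKEN = "REDACTED"
JOB_TEMPLATE_ID = 117   # the "test" template from this thread
INVENTORY_ID = 49       # the vCenter inventory from this thread

headers = {"Authorization": f"Bearer {TOKEN}"}

# Launch the job template; with job_slice_count > 1 this produces a sliced
# workflow job rather than a single job.
launch = requests.post(
    f"{AWX_URL}/api/v2/job_templates/{JOB_TEMPLATE_ID}/launch/", headers=headers
)
launch.raise_for_status()
print("launched:", launch.json().get("url"))

# ...once the run has finished, check every host in the inventory for facts.
hosts = requests.get(
    f"{AWX_URL}/api/v2/inventories/{INVENTORY_ID}/hosts/", headers=headers
).json()["results"]
for host in hosts:
    facts = requests.get(
        f"{AWX_URL}/api/v2/hosts/{host['id']}/ansible_facts/", headers=headers
    ).json()
    print(host["name"], "facts cached" if facts else "{} (no facts)")
```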
AWX version
22.0.0
Select the relevant components
Installation method
kubernetes
Modifications
no
Ansible version
2.14.4
Operating system
Tested with alpine and stream-8 against Windows hosts
Web browser
No response
Steps to reproduce
Expected results
Standard facts from the playbook should be cached at the host level.
Actual results
Facts were not cached, leaving the default {} in the host facts
Additional information
This is based on the conversation found here: https://groups.google.com/g/awx-project/c/Cd_SgiYEfVk.
Also, I'd just like to say thank you for all the hard work you folks have put into this. It's very much appreciated. :)