geerlingguy / tower-operator

DEPRECATED: This project was moved and renamed to: https://github.com/ansible/awx-operator

Migration failure due to pods stopping #21

Closed rust84 closed 4 years ago

rust84 commented 4 years ago

I tried to upgrade from AWX 9.1.1 to 9.2.0 today and hit an error during the migration. It looks as though the operator tried to run the migration in the old container, which was being replaced.

I had tested the same upgrade earlier in the day and it succeeded, so this looks timing-related and doesn't happen on every run. A longer delay may be needed to ensure that the new container has been created before the migration is attempted.

I see that there is already a 5-second delay here, so it may need to be longer. Better still, could we make it configurable through the operator config?

- name: Get the Tower pod information.
  # TODO: Change to k8s_info after Ansible 2.9.0 is available in Operator image.
  k8s_facts:
    kind: Pod
    namespace: '{{ meta.namespace }}'
    label_selectors:
      - app=tower
  register: tower_pods
  until: "tower_pods['resources'][0]['status']['phase'] == 'Running'"
  delay: 5
  retries: 60

Operator output:

--------------------------- Ansible Task StdOut -------------------------------

 TASK [Migrate the database if the K8s resources were updated.] ******************************** 
fatal: [localhost]: FAILED! => {
    "changed": true,
    "cmd": "kubectl exec -n awx awx-tower-tower-web-89c99cb89-6lxgl -- bash -c \"awx-manage migrate --noinput\"",
    "delta": "0:00:00.127667",
    "end": "2020-02-12 14:11:24.460539",
    "invocation": {
        "module_args": {
            "_raw_params": "kubectl exec -n awx awx-tower-tower-web-89c99cb89-6lxgl -- bash -c \"awx-manage migrate --noinput\"",
            "_uses_shell": true,
            "argv": null,
            "chdir": null,
            "creates": null,
            "executable": null,
            "removes": null,
            "stdin": null,
            "stdin_add_newline": true,
            "strip_empty_ends": true,
            "warn": true
        }
    },
    "msg": "non-zero return code",
    "rc": 1,
    "start": "2020-02-12 14:11:24.332872",
    "stderr": "error: unable to upgrade connection: container not found (\"tower\")",
    "stderr_lines": [
        "error: unable to upgrade connection: container not found (\"tower\")"
    ],
    "stdout": "",
    "stdout_lines": []
}

I was able to execute the migration against the new container:

kubectl exec -n awx awx-tower-tower-web-797cd6487f-dc2vh -- bash -c "awx-manage migrate --noinput"
Operations to perform:
  Apply all migrations: auth, conf, contenttypes, main, oauth2_provider, sessions, sites, social_django, sso, taggit
Running migrations:
  Applying main.0102_v370_unifiedjob_canceled... OK
  Applying main.0103_v370_remove_computed_fields... OK
  Applying main.0104_v370_cleanup_old_scan_jts... OK
  Applying main.0105_v370_remove_jobevent_parent_and_hosts... OK
  Applying main.0106_v370_remove_inventory_groups_with_active_failures... OK
  Applying main.0107_v370_workflow_convergence_api_toggle... OK
  Applying main.0108_v370_unifiedjob_dependencies_processed... OK

Events:

24m         Normal    ScalingReplicaSet   deployment/awx-tower-tower-task              Scaled down replica set awx-tower-tower-task-5c4799bdf to 0
25m         Normal    Scheduled           pod/awx-tower-tower-web-797cd6487f-dc2vh     Successfully assigned awx/awx-tower-tower-web-797cd6487f-dc2vh to ip-10-16-2-184.eu-west-1.compute.internal
25m         Normal    Pulling             pod/awx-tower-tower-web-797cd6487f-dc2vh     Pulling image "ansible/awx_web:9.2.0"
24m         Normal    Pulled              pod/awx-tower-tower-web-797cd6487f-dc2vh     Successfully pulled image "ansible/awx_web:9.2.0"
24m         Normal    Created             pod/awx-tower-tower-web-797cd6487f-dc2vh     Created container tower
24m         Normal    Started             pod/awx-tower-tower-web-797cd6487f-dc2vh     Started container tower
25m         Normal    SuccessfulCreate    replicaset/awx-tower-tower-web-797cd6487f    Created pod: awx-tower-tower-web-797cd6487f-dc2vh
24m         Normal    Killing             pod/awx-tower-tower-web-89c99cb89-6lxgl      Stopping container tower
24m         Normal    SuccessfulDelete    replicaset/awx-tower-tower-web-89c99cb89     Deleted pod: awx-tower-tower-web-89c99cb89-6lxgl
25m         Normal    ScalingReplicaSet   deployment/awx-tower-tower-web               Scaled up replica set awx-tower-tower-web-797cd6487f to 1
24m         Normal    ScalingReplicaSet   deployment/awx-tower-tower-web               Scaled down replica set awx-tower-tower-web-89c99cb89 to 0

geerlingguy commented 4 years ago

Interesting. And yeah, it seems like the delay may need to be longer, or we need some other method to make sure the new version is running before migrating. The pull can take some time, especially if you don't have a very large pipe to download the image. And the time it takes to stop the previous pod seems quite variable, from seconds to a minute or two sometimes.
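
One possible alternative to a longer fixed delay, offered as an untested sketch and not as the operator's actual fix: block on `kubectl rollout status` before the migration task, so the play only continues once the Deployment reports the new ReplicaSet as rolled out. The deployment name pattern `{{ meta.name }}-tower-web` and the 300-second timeout are assumptions for illustration; the events above only show the rendered name `awx-tower-tower-web`.

```yaml
# Sketch only: wait for the web Deployment rollout to complete before migrating.
# The deployment name pattern and the timeout are assumed, not taken from the operator.
- name: Wait for the Tower web rollout to finish.
  command: >-
    kubectl rollout status deployment/{{ meta.name }}-tower-web
    -n {{ meta.namespace }}
    --timeout=300s
  changed_when: false
```

`kubectl rollout status` exits non-zero if the timeout expires, so the play would fail at this step instead of at the migration. A terminating old pod may still be listed briefly after the rollout completes, so the pod lookup may also need to skip pods that have `metadata.deletionTimestamp` set instead of relying on `resources[0]`.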

stale[bot] commented 4 years ago

This issue has been marked 'stale' due to lack of recent activity. If there is no further activity, the issue will be closed in another 30 days. Thank you for your contribution!

stale[bot] commented 4 years ago

This issue has been closed due to inactivity. If you feel this is in error, please reopen the issue or file a new issue with the relevant details.