dell / omnia

An open-source toolkit for deploying and managing high performance clusters for HPC, AI, and data analytics workloads.
https://omnia-doc.readthedocs.io/en/latest/index.html
Apache License 2.0

Wait for mngmnt-network-container to become ready before attempting to configure #736

Closed j0hnL closed 2 years ago

j0hnL commented 2 years ago

Describe the bug: Building the control plane fails because it does not wait for mngmnt-network-container to become ready. The check currently gets the pod name and then attempts to configure the container. The configure step often fails because the container is not yet ready.

To Reproduce: Install the control plane on a freshly installed server.

Expected behavior: Wait until the container is in the ready state before attempting configuration.
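
The expected behavior can be sketched as a polling loop with a deadline. This is a minimal illustration in Python, not Omnia's actual implementation (which is an Ansible role). The helper names `pod_is_ready` and `wait_until`, the 300-second timeout, and the 5-second interval are assumptions made for the example. The readiness check is passed in as a callable because a pod reporting Ready may not guarantee that services inside the container have finished starting, so a stricter check can be swapped in.

```python
import json
import subprocess
import time
from typing import Callable


def pod_is_ready(pod: str, namespace: str) -> bool:
    """Return True when the pod's Ready condition is "True".

    Assumes kubectl is installed and on PATH; any kubectl error is
    treated as "not ready yet".
    """
    result = subprocess.run(
        ["kubectl", "get", "pod", pod, "-n", namespace, "-o", "json"],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        return False
    conditions = json.loads(result.stdout).get("status", {}).get("conditions", [])
    return any(
        c.get("type") == "Ready" and c.get("status") == "True" for c in conditions
    )


def wait_until(
    check: Callable[[], bool],
    timeout: float = 300.0,   # hypothetical default, tune for the environment
    interval: float = 5.0,    # hypothetical default
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll check() until it returns True or the timeout elapses."""
    deadline = clock() + timeout
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

A caller would run the configure step only if `wait_until(lambda: pod_is_ready(name, "network-config"))` returns `True`, and fail with a clear timeout message otherwise.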

j0hnL commented 2 years ago

It looks like there is a wait, but perhaps it doesn't wait long enough?


TASK [control_plane_device : Wait for mngmnt_network pod to come to ready state] *********************************************************************************
ok: [localhost]

TASK [control_plane_device : Get mngmnt_network pod name] ********************************************************************************************************
ok: [localhost]

TASK [control_plane_device : Configuring mngmnt_network container] ***********************************************************************************************
fatal: [localhost]: FAILED! => {"changed": false, "cmd": ["kubectl", "exec", "--stdin", "--tty", "-n", "network-config", "mngmnt-network-container-596cc975c7-92ggk", "--", "ansible-playbook", "/root/mngmnt_container_configure.yml"], "delta": "0:00:03.898257", "end": "2022-01-13 15:43:36.150691", "msg": "non-zero return code", "rc": 2, "start": "2022-01-13 15:43:32.252434", "stderr": "Unable to use a TTY - input is not a terminal or the right kind of file\n[WARNING]: provided hosts list is empty, only localhost is available. Note that\nthe implicit localhost does not match 'all'\n[WARNING]: The value ['Socket'] (type list) in a string field was converted to\n\"['Socket']\" (type string). If this does not look like what you expect, quote\nthe entire value to ensure it does not change.\ncommand terminated with exit code 2", "stderr_lines": ["Unable to use a TTY - input is not a terminal or the right kind of file", "[WARNING]: provided hosts list is empty, only localhost is available. Note that", "the implicit localhost does not match 'all'", "[WARNING]: The value ['Socket'] (type list) in a string field was converted to", "\"['Socket']\" (type string). 
If this does not look like what you expect, quote", "the entire value to ensure it does not change.", "command terminated with exit code 2"], "stdout": "\nPLAY [Initial  setup] **********************************************************\n\nTASK [Change mode of tftpboot] *************************************************\nchanged: [localhost]\n\nTASK [Link for tftp services] **************************************************\nchanged: [localhost]\n\nTASK [Link for tftp services] **************************************************\nchanged: [localhost]\n\nTASK [Edit the tftp-server service file] ***************************************\nchanged: [localhost]\n\nTASK [Edit the tftp-server service file] ***************************************\nchanged: [localhost]\n\nTASK [Edit the tftp-server service file] ***************************************\nok: [localhost]\n\nTASK [Edit the tftp-server service file] ***************************************\nchanged: [localhost]\n\nTASK [Edit the tftp-server socket file] ****************************************\nchanged: [localhost]\n\nTASK [Start tftp services] *****************************************************\nfatal: [localhost]: FAILED! 
=> {\"changed\": false, \"msg\": \"Could not find the requested service tftp-server: host\"}\n\nPLAY RECAP *********************************************************************\nlocalhost                  : ok=8    changed=7    unreachable=0    failed=1    skipped=0    rescued=0    ignored=0   ", "stdout_lines": ["", "PLAY [Initial  setup] **********************************************************", "", "TASK [Change mode of tftpboot] *************************************************", "changed: [localhost]", "", "TASK [Link for tftp services] **************************************************", "changed: [localhost]", "", "TASK [Link for tftp services] **************************************************", "changed: [localhost]", "", "TASK [Edit the tftp-server service file] ***************************************", "changed: [localhost]", "", "TASK [Edit the tftp-server service file] ***************************************", "changed: [localhost]", "", "TASK [Edit the tftp-server service file] ***************************************", "ok: [localhost]", "", "TASK [Edit the tftp-server service file] ***************************************", "changed: [localhost]", "", "TASK [Edit the tftp-server socket file] ****************************************", "changed: [localhost]", "", "TASK [Start tftp services] *****************************************************", "fatal: [localhost]: FAILED! => {\"changed\": false, \"msg\": \"Could not find the requested service tftp-server: host\"}", "", "PLAY RECAP *********************************************************************", "localhost                  : ok=8    changed=7    unreachable=0    failed=1    skipped=0    rescued=0    ignored=0   "]}

Suggestions:

Shubhangi-dell commented 2 years ago

Hi John,


I tried to run Omnia 1.1.1 on Rocky OS and was not able to reproduce the issue. The pod is created. While configuring the pod, it gives the error: FAILED! => {"changed": false, "msg": "Could not find the requested service tftp-server: host"}. Can you try deleting the image and re-running this role? Also, can you provide some more information, such as OS details and which release branch you are using?

j0hnL commented 2 years ago

@Shubhangi-dell This error appears when we are not configuring switches, that is, when ethernet_switch_support: false is set in control_plane/input_params/base_vars.yml.

It looks like we are missing some logic for the case where that setting is false. Do we need the management network container if we are not configuring switches? Is this container needed if we are not using iDRAC?
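
One possible shape for the missing logic is sketched below in Python, purely as an illustration. It assumes the answer to both questions is "no", which this thread does not confirm. `ethernet_switch_support` is the setting named above; `idrac_support` and the function names are hypothetical and stand in for whatever Omnia actually uses.

```python
def as_bool(value) -> bool:
    """Interpret YAML-style truthy values, including strings such as "false"."""
    if isinstance(value, str):
        return value.strip().lower() in {"true", "yes", "1", "on"}
    return bool(value)


def needs_mngmnt_container(base_vars: dict) -> bool:
    """Decide whether to deploy and configure the management network container.

    Assumption (not confirmed in this thread): the container is only needed
    when switch configuration or iDRAC support is enabled.
    """
    switch = as_bool(base_vars.get("ethernet_switch_support", False))
    idrac = as_bool(base_vars.get("idrac_support", False))  # hypothetical key
    return switch or idrac
```

With this guard, a run with `ethernet_switch_support: false` and no iDRAC support would skip the configure step instead of failing in it.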

Shubhangi-dell commented 2 years ago

Solved in the devel branch by pull request https://github.com/dellhpc/omnia/pull/834