clustervision / trinityX

TrinityX is the new generation of ClusterVision's open-source HPC, A/I and cloudbursting platform. It is designed from the ground up to provide all services required in a modern HPC and A/I system, and to allow full customization of the installation.
GNU General Public License v3.0

Image creation fails #401

Open javree opened 10 months ago

javree commented 10 months ago

I followed the install guide at https://docs.clustervision.com/install/install/ on Rocky Linux 8.9. The controller install went fine and ansible finished without issues. However, image creation fails:

TASK [init : Install init packages] ****
failed: [compute.osimages.luna] (item=python3-libselinux) => {"ansible_loop_var": "item", "changed": false, "item": "python3-libselinux", "msg": "Could not import the dnf python module using /usr/libexec/platform-python (3.6.8 (default, Jan 15 2024, 23:09:02) [GCC 8.5.0 20210514 (Red Hat 8.5.0-20)]). Please install python3-dnf or python2-dnf package or ensure you have specified the correct ansible_python_interpreter. (attempted ['/usr/libexec/platform-python', '/usr/bin/python3', '/usr/bin/python2', '/usr/bin/python'])", "results": []}

PLAY RECAP *****
compute.osimages.luna : ok=3 changed=0 unreachable=0 failed=1 skipped=1 rescued=0 ignored=0
controller1 : ok=52 changed=5 unreachable=0 failed=0 skipped=34 rescued=0 ignored=0

[root@marclus0 site]# cat /etc/redhat-release
Rocky Linux release 8.9 (Green Obsidian)

[root@marclus0 site]# rpm -qa | grep -i ansible
ansible-8.3.0-1.el8.noarch
ansible-core-2.15.3-1.el8.x86_64

We've not edited anything in the playbook

msteggink commented 10 months ago

I assume you just executed ansible-playbook compute-redhat.yml? I just reinstalled Rocky 8.9 and rolled out controller.yml and compute-redhat.yml, and they work here.

PLAY RECAP ****************************************************************************************************************************************************************
compute.osimages.luna      : ok=124  changed=74   unreachable=0    failed=0    skipped=136  rescued=0    ignored=1
controller1                : ok=55   changed=20   unreachable=0    failed=0    skipped=33   rescued=0    ignored=0

Do you have python3-dnf packages installed, e.g.:

rpm -qa | grep python3-dnf
python3-dnf-plugin-versionlock-4.0.21-23.el8.noarch
python3-dnf-4.7.0-19.el8.noarch
python3-dnf-plugins-core-4.0.21-23.el8.noarch

javree commented 10 months ago

I have exactly those RPMs installed:

python3-dnf-plugin-versionlock-4.0.21-23.el8.noarch
python3-dnf-4.7.0-19.el8.noarch
python3-dnf-plugins-core-4.0.21-23.el8.noarch

I did indeed execute that exact command. I've just run ansible-playbook -vvvv compute-redhat.yml >> log.txt 2>&1 and attached its output, as well as a full RPM list: log.txt rpmlist.txt

msteggink commented 10 months ago

Hi @javree, for some reason it picked up Python 3.6 in the image. The python36 package is pulled in by gdm, OpenHPC and OOD. Was this your second ansible run? Did you change anything in your environment (env/set)?

Can you run

ansible --version | grep python

Can you also retry the run by adding the following to ansible.cfg?

interpreter_python=/usr/bin/python3.11
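
Ansible reads interpreter_python from the [defaults] section of ansible.cfg, so the line only takes effect there. A minimal Python sketch for checking that the setting landed in the right section; the helper name configured_interpreter is hypothetical and not part of Ansible or TrinityX:

```python
import configparser


def configured_interpreter(cfg_text):
    """Return interpreter_python from [defaults], or None if it is not set there."""
    # interpolation=None keeps any literal '%' in the file from raising errors
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(cfg_text)
    # fallback=None also covers a missing [defaults] section
    return parser.get("defaults", "interpreter_python", fallback=None)
```

For example, calling it on the contents of the ansible.cfg next to the playbooks should return /usr/bin/python3.11 once the line above is in place, and None if it was added under a different section.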

msteggink commented 10 months ago

@javree did that line fix it for compute-redhat.yml?

javree commented 10 months ago

Unfortunately not. Since you mentioned it might have something to do with running the playbook multiple times, I am now fully redeploying the controller to start fresh.

xdkreij commented 10 months ago

> Unfortunately not. Since you mentioned it might have something to do with running the playbook multiple times, I am now fully redeploying the controller to start fresh.

So what you're basically telling us is that the Ansible playbooks are not idempotent?

I wonder if starting from fresh solved the issue 🤔

javree commented 10 months ago

I will report next week on how the fresh install went.

javree commented 9 months ago

Sorry for the delay in getting back. I did a full reinstall of the controller from a Rocky 8.9 USB key, ran through the procedure again and changed nothing else. Yet again, exactly the same issue... I have not touched the compute-redhat.yml file in any way.

Note: on a default Rocky 8 install, Python 3.6 is the system default. Ansible on Rocky 8 now uses Python 3.11, but there is no python3.11-dnf package, so adding python3.11 will break things elsewhere... I'm seriously wondering how this can work at all.
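
The mismatch described here can be checked directly by asking each interpreter whether it can import the dnf module. A rough Python sketch, assuming it is run inside the image environment where the task fails; the candidate list is the one from the error message plus /usr/bin/python3.11, and the helper names can_import and report are hypothetical:

```python
import subprocess

# Interpreters Ansible reported trying, plus the python3.11 discussed above
CANDIDATES = [
    "/usr/libexec/platform-python",
    "/usr/bin/python3",
    "/usr/bin/python2",
    "/usr/bin/python",
    "/usr/bin/python3.11",
]


def can_import(interpreter, module="dnf"):
    """Return True if `interpreter` exists and can import `module`."""
    try:
        result = subprocess.run(
            [interpreter, "-c", f"import {module}"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except OSError:
        # Interpreter is missing or not executable
        return False
    return result.returncode == 0


def report(candidates=CANDIDATES, module="dnf"):
    """Return {interpreter path: True/False} for every candidate."""
    return {path: can_import(path, module) for path in candidates}
```

Printing report() shows at a glance which interpreters, if any, have dnf bindings. If none of the paths Ansible tries returns True, the "Could not import the dnf python module" failure is expected regardless of what the controller itself has installed.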

javree commented 9 months ago

Just for giggles I tried the compute-ubuntu playbook and that completed fine, so I can at least boot a node soon hopefully... But the issue regarding ansible using python 3.11 vs dnf using python 3.6 remains when trying to build a RHEL image

javree commented 9 months ago

I've tried this again, but this time using Rocky Linux 9.3 on the controller and there all appears to work just fine.

msteggink commented 9 months ago

@javree thank you for your feedback! I think for the 8.x we had a fix but I'll need to double check that.

aphmschonewille commented 8 months ago

There have been quite a few changes in how we prepare (install) the ansible environment before running the playbook. Though these have not been pushed to GitHub yet, I expect (hope) that these issues will be a thing of the past. Our target for pushing is about 2-3 weeks from today. We are finalizing the new monitoring stack and H/A.

aphmschonewille commented 7 months ago

Latest greatest has been pushed.

javree commented 7 months ago

Very happy to report that with the new release all is well on Rocky 8 too!

javree commented 5 months ago

Hate to reopen this ...

I did a fresh checkout on a fully up-to-date Rocky 8.9 machine. Running

marclus0 18:43:54 [root@marclus0 site]# ansible-playbook compute-redhat.yml

Gives me

TASK [trix-tree : Create Trinity H/A directory structure on controllers] ****
skipping: [compute.osimages.luna]

TASK [init : Install init packages] *****
failed: [compute.osimages.luna] (item=python3-libselinux) => {"ansible_loop_var": "item", "changed": false, "item": "python3-libselinux", "msg": "Could not import the dnf python module using /usr/libexec/platform-python (3.6.8 (default, Apr 24 2024, 21:55:04) [GCC 8.5.0 20210514 (Red Hat 8.5.0-22)]). Please install python3-dnf or python2-dnf package or ensure you have specified the correct ansible_python_interpreter. (attempted ['/usr/libexec/platform-python', '/usr/bin/python3', '/usr/bin/python2', '/usr/bin/python'])", "results": []}

PLAY RECAP **
compute.osimages.luna : ok=3 changed=1 unreachable=0 failed=1 skipped=2 rescued=0 ignored=0
controller1 : ok=59 changed=20 unreachable=0 failed=0 skipped=43 rescued=0 ignored=0

marclus0 18:55:33 [root@marclus0 site]#

Again the conflict between python 3.6 (system default) and the ansible python 3.11

aphmschonewille commented 5 months ago

One thing truly amazes me every time: how something supposedly 'generic' like a Rocky install (or Red Hat, or Alma, or...) can be so different from one place to another. I'll get back to you, as on Rocky 8.10 (of which I've done more than 10 installs today alone) everything works as expected. Not sure if Rocky 8.9 is now deviating? Last week 8.9 was also just fine... It truly amazes me. -A

xdkreij commented 5 months ago

A hint, as I encountered the same issue today: verify your subscription within the image.

Also, within the image, run watch -n 0.1 "cat /etc/yum.repos.d/redhat.repo" (I am using Red Hat instead of Rocky).

What happens on RHEL 8.10 relates to the default baseurl in /etc/rhsm/rhsm.conf. Somewhere down the line, redhat.repo starts pointing to cdn.redhat.com instead of our own Satellite server.

I changed the rhsm.conf baseurl to our own Satellite server, and redhat.repo 'stopped' changing to cdn.redhat.com, which resulted in the correct packages being installed.

To test, try installing python3-libselinux manually within the image, before and after 'fixing' the subscription.

Note: redhat.repo is still configured correctly when this task reports OK: TASK [trinity/image-create : Install redhat-release package in /trinity/images/compute] ****

But right before or during the tasks that install the external RPM packages, redhat.repo gets 'overruled' by rhsm.conf.
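
The redhat.repo drift described above can also be checked from a script instead of watching the file by eye. A minimal Python sketch that lists which repositories in a .repo file point at a given host; the helper names repo_hosts and repos_on_host are hypothetical, and it assumes the usual INI layout with one baseurl entry per repository section:

```python
import configparser
from urllib.parse import urlparse


def repo_hosts(repo_text):
    """Map each repository id in a .repo file to the hostname of its baseurl."""
    # interpolation=None keeps '%' characters in URLs from being interpreted
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(repo_text)
    hosts = {}
    for repo_id in parser.sections():
        baseurl = parser.get(repo_id, "baseurl", fallback=None)
        if baseurl:
            # Only the first URL is inspected if several are listed
            hosts[repo_id] = urlparse(baseurl.split()[0]).hostname
    return hosts


def repos_on_host(repo_text, host="cdn.redhat.com"):
    """Return the sorted ids of repositories whose baseurl points at `host`."""
    return sorted(r for r, h in repo_hosts(repo_text).items() if h == host)
```

Running repos_on_host on the contents of /etc/yum.repos.d/redhat.repo inside the image, once after the redhat-release task and once after the external RPM tasks, should show whether entries have switched to cdn.redhat.com in between.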

Second: please use interpreter_python=/usr/libexec/platform-python within ansible.cfg. It resolves a lot of issues, with Red Hat at least, including this issue (together with the above solution).

Last note: the controller has a different range of supported Python interpreters than the targets, which is why you will also have problems on RHEL 8 if you use Ansible 2.17.

(I've used ansible 2.15.x on the controller instead)
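
The controller/target distinction can be written down as a small lookup. This is only a sketch: it assumes "Ansible 2.17" refers to ansible-core 2.17, and the minimum target versions in the table are taken from the ansible-core support matrix as I understand it, so they should be verified against the official documentation. The names MIN_TARGET_PYTHON3 and target_supported are hypothetical:

```python
# Lowest Python 3 release accepted on managed nodes, per ansible-core release.
# Assumption: values as listed in the ansible-core support matrix; verify them.
MIN_TARGET_PYTHON3 = {
    (2, 15): (3, 5),
    (2, 16): (3, 6),
    (2, 17): (3, 7),
}


def parse_version(text, parts=2):
    """Turn '3.6.8' into (3, 6), keeping only the first `parts` components."""
    return tuple(int(p) for p in text.split(".")[:parts])


def target_supported(core_version, target_python):
    """Return True if `target_python` meets the minimum for `core_version`."""
    key = parse_version(core_version)
    if key not in MIN_TARGET_PYTHON3:
        raise KeyError(f"ansible-core {core_version} is not in the table")
    return parse_version(target_python) >= MIN_TARGET_PYTHON3[key]
```

With this table, target_supported("2.15.3", "3.6.8") is True while target_supported("2.17.0", "3.6.8") is False, which matches the observation that platform-python 3.6 on RHEL 8 targets works with 2.15.x on the controller but not with 2.17.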