clustervision / trinityX

TrinityX is the new generation of ClusterVision's open-source HPC, A/I and cloudbursting platform. It is designed from the ground up to provide all services required in a modern HPC and A/I system, and to allow full customization of the installation.
GNU General Public License v3.0

rhel 8.10 - compute-redhat.yml breaks image build #419

Open xdkreij opened 5 days ago

xdkreij commented 5 days ago

Problem description: During iPXE boot, the following problem pops up. Has anyone encountered this before?

image

Command used: ansible-playbook compute-redhat.yml -v

Expected results: A working image that boots successfully :-)

xdkreij commented 5 days ago

A dump of the 'issues' found:

000000 04:59:47 [root@cpu site]# luna osimage kernel compute
Traceback (most recent call last):
  File "/bin/luna", line 7, in <module>
    CLI = Cli().main()
  File "/trinity/local/python/lib/python3.10/site-packages/luna/cli.py", line 109, in main
    self.call_class()
  File "/trinity/local/python/lib/python3.10/site-packages/luna/cli.py", line 137, in call_class
    call(self.args, self.parser, self.subparsers)
  File "/trinity/local/python/lib/python3.10/site-packages/luna/osimage.py", line 67, in __init__
    call(self)
  File "/trinity/local/python/lib/python3.10/site-packages/luna/osimage.py", line 299, in kernel_osimage
    http_response = result.json()
AttributeError: 'types.SimpleNamespace' object has no attribute 'json'

Adding a print statement to the Python code, print(result.content), results in:

{'message': 'osimage pack for compute already queued', 'request_id': '1719565192.3494275247818686'}
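
Taken together, the traceback and the printed payload suggest that the call returned an object that exposes content but has no json() method. Below is a minimal sketch of a defensive way to read such a result, assuming only what the output above shows. The helper name extract_payload is hypothetical; this is neither the actual luna code nor the fix that was later released.

```python
import json
from types import SimpleNamespace


def extract_payload(result):
    """Return the decoded body of a response-like object.

    Hypothetical helper: handles objects with a callable json() as well as
    plain namespaces that only carry a `content` attribute.
    """
    json_method = getattr(result, "json", None)
    if callable(json_method):
        return json_method()
    content = getattr(result, "content", None)
    if isinstance(content, (bytes, str)):
        # Raw body: decode it as JSON.
        return json.loads(content)
    # Already decoded (for example a dict), or None if nothing was returned.
    return content


if __name__ == "__main__":
    # Mimics the object seen in the traceback: a SimpleNamespace with content.
    queued = SimpleNamespace(
        content={
            "message": "osimage pack for compute already queued",
            "request_id": "1719565192.3494275247818686",
        }
    )
    print(extract_payload(queued)["message"])
```

With a helper like this, the 'already queued' message would be shown to the user instead of an AttributeError.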

aphmschonewille commented 1 day ago

"osimage pack for compute already queued" normally means that another packing for that image was already in progress. It prevents it from being packed twice at the same time. However if changes were made while the other packing was already in progress, things will go wrong. Was there only one packing active at that time, or were there concurrent operations going or something else?

xdkreij commented 1 day ago

Only one - via the compute-redhat.yml :-)
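
The 'already queued' behaviour described above (a second pack request for the same image is refused while one is in progress) can be sketched as follows. This is a toy in-memory model, not luna's implementation: the class PackQueue, its methods, and the message returned for an accepted request are hypothetical. Only the 'already queued' message text comes from the output in this thread.

```python
import threading


class PackQueue:
    """Toy guard that rejects a second pack request for the same image."""

    def __init__(self):
        self._lock = threading.Lock()
        self._active = set()  # names of images currently being packed

    def request_pack(self, image):
        with self._lock:
            if image in self._active:
                # Same wording as the message seen in the thread.
                return {"message": f"osimage pack for {image} already queued"}
            self._active.add(image)
        # Hypothetical wording for an accepted request.
        return {"message": f"osimage pack for {image} queued"}

    def finish_pack(self, image):
        """Mark the pack as done so the image can be packed again."""
        with self._lock:
            self._active.discard(image)


if __name__ == "__main__":
    queue = PackQueue()
    print(queue.request_pack("compute"))
    print(queue.request_pack("compute"))  # refused: still in progress
    queue.finish_pack("compute")
    print(queue.request_pack("compute"))  # accepted again
```

In a model like this, a pack that never reaches finish_pack (for example after a crash) would leave the image permanently 'already queued', which is one way the message could appear with only a single playbook run.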

I wonder if this would eventually result in the 'kernel panic'. The playbook seems to complete successfully, but apparently something goes terribly wrong with the image (build?) itself.

(Side note: I do have to fix rhsm.conf within the image itself halfway through the play, otherwise redhat.repo gets overwritten and redirects to cdn.redhat.com instead. I doubt that this causes the image issues, though, since everything kicks off fine afterwards.)

xdkreij commented 1 day ago

w00000t.... I think I may have solved it...

image

What I did was posted here: https://www.linuxquestions.org/questions/linux-server-73/centos-7-does-not-boot-4175619015/

Like so...

cp /sbin/init /trinity/images/compute/sbin/init
cp /lib/systemd/systemd /trinity/images/compute/lib/systemd/systemd

luna osimage pack compute
luna node change -o compute node001
# --- reboot node ---

I've got no clue whatsoever why it doesn't work without this, but I'll test compute-redhat.yml again with a new image soon to verify whether this actually solved it.
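
To check a freshly built image for the same symptom before packing it, a small script along these lines could help. It is only a sketch: the function name missing_boot_files is hypothetical, and the list of required paths is limited to the two files copied above.

```python
import os
import tempfile

# The two paths that had to be copied into the image in this thread.
REQUIRED = ("sbin/init", "lib/systemd/systemd")


def missing_boot_files(image_root, required=REQUIRED):
    """Return the required paths that have no entry under image_root.

    os.path.lexists is used so that a symlink such as sbin/init counts as
    present even when its target only resolves inside the image. It does
    not verify that the symlink target itself exists.
    """
    return [
        rel for rel in required
        if not os.path.lexists(os.path.join(image_root, rel))
    ]


if __name__ == "__main__":
    # Demo on a throwaway directory tree that lacks sbin/init.
    with tempfile.TemporaryDirectory() as root:
        os.makedirs(os.path.join(root, "lib/systemd"))
        open(os.path.join(root, "lib/systemd/systemd"), "w").close()
        print(missing_boot_files(root))
```

Calling missing_boot_files("/trinity/images/compute") before running luna osimage pack compute would list the missing files, or return an empty list if both are present.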

aphmschonewille commented 4 hours ago

You hit two problems. There was indeed a bug in the CLI where a returned call caused the Python traceback. That has been fixed and will be released soon. The other problem, the missing /sbin/init, is something I cannot really explain yet. May I ask when you cloned the TrinityX repo? This helps us determine whether this is an ongoing problem or something that has already been solved through other fixes.

xdkreij commented 2 hours ago

The repo was cloned on the 24th (luckily, I keep track of things using ARA), during the redeployment of the entire controller on RHEL 8.10.