OpenFabrics / fsdp_docs

Verify beaker harness is available for Rawhide #99

Closed lylavoie closed 1 year ago

lylavoie commented 2 years ago

Veronika reported: it looks like the Rawhide import is not complete, as the harness for the distribution is missing. Job submissions are still failing for us with "Failed to provision recipeid 1020, Failed to find repo for harness".

JSpewock commented 2 years ago

This issue should be fixed. I've updated the test harness repositories for our Beaker instance, and they now include Rawhide.

lylavoie commented 2 years ago

Veronika Kabátová commented on a discussion:

Thank you! The import looks good now, the installation starts. However, it gets stuck on storage configuration checks :/

Here is a test job with a few recipes as an example: https://beaker.ofa.iol.unh.edu/bkr/jobs/795

JSpewock commented 2 years ago

I managed to rerun the job and get it to move past this point of error and timeout. The problem ended up being these lines:

<repos>
        <repo name="beaker-harness" url="http://beaker.engineering.redhat.com/harness/Fedorarawhide/"/>
</repos>

The installation would reach this stage, try to fetch the repo, and then time out, which I can only assume was caused by an error accessing the repo. It seems to try to fetch the repo and fail silently until the Beaker watchdog runs out.
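A minimal sketch of removing that entry from a job definition before resubmitting it, assuming the job XML is available as a string. The function name `strip_harness_repo` and the surrounding `<job>`/`<recipeSet>`/`<recipe>` wrapper in the sample are illustrative assumptions; only the `<repos>` and `<repo>` lines come from the snippet above.

```python
import xml.etree.ElementTree as ET


def strip_harness_repo(job_xml: str, repo_name: str = "beaker-harness") -> str:
    """Return the job XML with every <repo name=repo_name> element removed.

    <repos> containers that end up with no children are removed as well.
    """
    root = ET.fromstring(job_xml)

    # Snapshot the element lists first so removal does not disturb iteration.
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag == "repo" and child.get("name") == repo_name:
                parent.remove(child)

    # Second pass: drop <repos> elements that no longer contain any <repo>.
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag == "repos" and len(child) == 0:
                parent.remove(child)

    return ET.tostring(root, encoding="unicode")


# Hypothetical job wrapper around the <repos> lines quoted in this comment.
SAMPLE_JOB = """<job><recipeSet><recipe>
<repos>
<repo name="beaker-harness" url="http://beaker.engineering.redhat.com/harness/Fedorarawhide/"/>
</repos>
</recipe></recipeSet></job>"""

cleaned = strip_harness_repo(SAMPLE_JOB)
print(cleaned)
```

Other `<repo>` entries in the same job are left untouched, so a job that needs additional custom repos keeps them.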

veruu commented 2 years ago

Thank you! We have removed that from the jobs, but the retry picked up the old XML to submit :see_no_evil: I have verified that the following ones don't have the harness links, and they do complete the installation:

https://beaker.ofa.iol.unh.edu/bkr/recipes/1064
https://beaker.ofa.iol.unh.edu/bkr/recipes/1069

These runs do get stuck too, unfortunately. Not sure yet if it's something machine-related, since the distro works on the internal RDMA machines.

Also ccing @mh21 on this issue directly for info

veruu commented 2 years ago

Quick update: Bruno had the idea of disabling the mgag200 module, and that did get us further:

https://beaker.ofa.iol.unh.edu/bkr/recipes/1079

And now we have a fancy kernel panic we don't recognize on our hands

JSpewock commented 2 years ago

Looking through the console logs for this failed job, it looks like the rdma_setup.sh script fails to run and throws many errors just before the reboot that ends in a kernel panic. I'm not sure it even makes it to the post section you added in your job. The errors from the script all resemble "/etc/sysconfig/network-scripts/*: No such file or directory".

JSpewock commented 2 years ago

@veruu It seems like the kernel panic and the failing setup scripts are unrelated issues. I just installed the latest Rawhide image from June 1st, hoping that other people had hit the same kernel panic and that it had since been patched, but I didn't read any of the release notes to verify this.

veruu commented 2 years ago

@JSpewock thanks for the information! We actually had a similar idea to rerun a test earlier this week but instead figured out the VPN config on the runners got busted, so we have to fix that up first :see_no_evil:

dledford commented 2 years ago

We think adding selinux=0 to the default kernel command line for rawhide is likely to resolve the kernel panic and get things moving forward on this topic again. @JSpewock will try the command line option and see if that resolves the kernel panic.
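For anyone who wants to try the option on a single job rather than on the instance-wide default, here is a minimal sketch. It is an assumption about one possible per-job approach, not the change that was made on the Beaker server: it relies on the `kernel_options` attribute of `<recipe>` in Beaker job XML, and the helper name `add_kernel_option` and the sample job are hypothetical. Pass `attr="kernel_options_post"` if the option should apply to the installed system rather than the installer.

```python
import xml.etree.ElementTree as ET


def add_kernel_option(job_xml: str, option: str = "selinux=0",
                      attr: str = "kernel_options") -> str:
    """Append `option` to the given attribute of every <recipe>, if missing."""
    root = ET.fromstring(job_xml)
    for recipe in root.iter("recipe"):
        # The attribute holds a space-separated list of kernel arguments.
        current = recipe.get(attr, "").split()
        if option not in current:
            current.append(option)
        recipe.set(attr, " ".join(current))
    return ET.tostring(root, encoding="unicode")


# Hypothetical job used only to exercise the helper.
SAMPLE_JOB = '<job><recipeSet><recipe kernel_options="console=ttyS0"/></recipeSet></job>'

patched = add_kernel_option(SAMPLE_JOB)
print(patched)
```

The helper is idempotent: running it again on already patched XML does not add the option a second time.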

JSpewock commented 2 years ago

With the addition of selinux=0, it no longer seems to throw kernel panics when I provision systems, but I did notice some odd behavior on node-05 in testing. It gets stuck partway through installation, but only around 50% of the time. I'll have to look further into what could be causing it; the other nodes I tested (node-01, node-06, and node-07) all provisioned cleanly every time.

JSpewock commented 2 years ago

After more testing, it doesn't look like we're rid of kernel panics yet. It's no longer an audit panic, but across multiple hosts (not just node-05, as I had originally thought) it sometimes throws a fortify_panic because of some lib/string methods. I'll have to look into it some more.
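Since the failures vary between runs and hosts, a small tally over downloaded console logs can help compare them. This is only a rough sketch: the substrings are the ones mentioned in this thread, while the label names and the function `classify_console_log` are made up for illustration and are not part of Beaker.

```python
import re
from collections import Counter
from typing import Iterable

# Substrings taken from the failures described in this thread; the labels on
# the left are invented for this sketch.
SIGNATURES = {
    "kernel_panic": re.compile(r"kernel panic", re.IGNORECASE),
    "fortify_panic": re.compile(r"fortify_panic"),
    "missing_path": re.compile(r"No such file or directory"),
}


def classify_console_log(lines: Iterable[str]) -> Counter:
    """Count how many lines match each failure signature."""
    counts: Counter = Counter()
    for line in lines:
        for label, pattern in SIGNATURES.items():
            if pattern.search(line):
                counts[label] += 1
    return counts
```

To use it, open a saved console.log with `open(path, errors="replace")` and pass the file object to `classify_console_log`; a line that matches more than one signature is counted under each.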

mh21 commented 2 years ago

We got the VPN tunnel setup on the CKI side sorted, and now we are back to installations failing, e.g. https://beaker.ofa.iol.unh.edu/bkr/recipes/1155 🙈 Logs are at http://beaker.ofa.iol.unh.edu/beaker/logs/recipes/1+/1155/console.log

JSpewock commented 2 years ago

@mh21 I'm not sure exactly what caused that job to fail, but it looks like jobs 883-899 all provisioned fully and made it to login. I think they only aborted because the "reservesys" task does not report back after its timer ends, so they keep going until the watchdog timer runs out.

However, I think it might be a good idea, if at all possible, to move off of Rawhide for this testing, as it seems inconsistent. Previously, when we had the latest version, it would kernel panic the majority of the time for an unknown reason when provisioned through Beaker. Every now and then one would sneak through and complete the installation, but that wasn't common and definitely not reliable. It seems to be working at this very moment, but since it is a rolling release that is known to have had issues when provisioned through our Beaker environment, it might be better to transition to a more stable and consistent option.

veruu commented 2 years ago

Hi @JSpewock, based on the console logs @mh21 sent me, the problem doesn't seem to be caused by Rawhide but by not having restraint installed on the host. It also appears that the regular Fedora repos on the hosts are configured to require GPG checks but the keys are not installed, which would prevent installation from the regular repos too:

You have enabled checking of packages via GPG keys. This is a good thing.
However, you do not have any GPG public keys installed. You need to download
the keys for packages you wish to install and install them.
You can do that by running the command:
    rpm --import public.gpg.key

For the restraint installation, can you enable the beaker-harness repository? The repo file is published at https://beaker-project.org/yum/beaker-harness-Fedora.repo

Fixing these two issues should move this forward. If restraint is not installed on the hosts, it cannot reach out to the labcontroller to retrieve the recipe and run it, which is why the runs got stuck.

JSpewock commented 2 years ago

@veruu I had thought GPG checks were automatically disabled on the yum repositories through the snippets, but that wasn't working for the Fedora-Everything repo. This has been fixed, and it should no longer check for GPG keys. I believe the harness repo is already added at provision time, since it is already present when I provision a host with Rawhide. We could add it again in the snippets, but that would be redundant.

mh21 commented 2 years ago

Retriggering a job in https://beaker.ofa.iol.unh.edu/bkr/recipes/1192 via https://gitlab.com/redhat/red-hat-ci-tools/kernel/cki-ofa-pipelines/-/jobs/3053933163, it doesn't seem like restraint is actually installed 🤔

JSpewock commented 2 years ago

I looked into this, and it looks like your job has the line harness='restraint-rhts beakerlib-redhat' under ks_meta. This adds the line $package_command -y install restraint-rhts beakerlib-redhat to the kickstart file, as described in the "adding an alternate harness" documentation. The problem is that when this command runs it cannot find the package beakerlib-redhat, so the command fails and neither package gets installed. Is the beakerlib-redhat package necessary? I was unable to find the repo it would come from, but there is a beakerlib package that could be installed instead if it achieves something similar.
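To make the mapping concrete, here is a small sketch that reproduces it outside of Beaker. It assumes ks_meta is a space-separated list of key=value pairs with shell-style quoting; the helper names `harness_packages` and `harness_install_line` are hypothetical, and this is not Beaker's own template code.

```python
import shlex
from typing import List


def harness_packages(ks_meta: str) -> List[str]:
    """Return the packages named by the harness= variable in ks_meta, if any."""
    # shlex.split keeps the quoted value together as a single token.
    for token in shlex.split(ks_meta):
        key, sep, value = token.partition("=")
        if sep and key == "harness":
            return value.split()
    return []


def harness_install_line(ks_meta: str) -> str:
    """Build the kickstart install line described above, or '' if no harness is set."""
    packages = harness_packages(ks_meta)
    if not packages:
        return ""
    return "$package_command -y install " + " ".join(packages)


KS_META = "harness='restraint-rhts beakerlib-redhat'"
print(harness_install_line(KS_META))
```

Because both package names end up in one install command, a single unresolvable name is enough to make the whole command fail, which matches the behavior seen in the job.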

veruu commented 2 years ago

Thanks for looking into that! We changed the packages, so let's see how it goes now :crossed_fingers:

dledford commented 1 year ago

@JSpewock I've uploaded a tarball to builder-00 in my home directory. beaker-harness.tgz has the necessary files in it for multiple OSes. Far too big for email.

JSpewock commented 1 year ago

I have added the harness packages, and it looks like RHEL9 and Fedora 36 are provisioning fine now. The provisioning jobs abort, but that's because of the reservesys job. I'll add Fedora 37 as well and verify that it works with the new harness packages.

dledford commented 1 year ago

As of today, rawhide installs are working as part of CKI jobs. I'll close this out.