Kubeinit / kubeinit

Ansible automation to have a KUBErnetes cluster INITialized as soon as possible...
https://www.kubeinit.org
Apache License 2.0

JQ missing if installed on Fedora 35 (minimal) #571

Closed tmlocher closed 2 years ago

tmlocher commented 2 years ago

Describe the bug When installing on Fedora 35 (minimal, server) the install fails when checking the available space on the libvirt directory.

To Reproduce Start with a server running a Fedora 35 minimal install, then install kubeinit (okd).

Expected behavior The install fails with an error: missing bash command "jq".

Screenshots Resolved by installing jq manually
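For reference, the manual workaround is a one-line package install (a sketch using Fedora's dnf; the fix later in this thread adds jq to kubeinit's own dependency list instead):

# install jq on the Fedora 35 hypervisor before running the playbook
sudo dnf install -y jq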

Infrastructure

Deployment command

ansible-playbook -vv --user root \
    -e kubeinit_spec=okd-libvirt-3-6-4 \
    -i ./kubeinit/inventory \
    ./kubeinit/playbook.yml

ccamacho commented 2 years ago

Hello!!

@tmlocher thanks for raising this issue. You are right: if you don't run the deployment from the container we ship, you need to have jq installed manually. We do ask for some pip packages to be installed in the README.md, but we don't list jq. If you could submit a simple PR adding jq to this list ( https://github.com/Kubeinit/kubeinit/blob/main/kubeinit/README.md?plain=1#L109 ) it would help others with the same issue. Just make sure you follow the lint conventions for PRs; the commit title should look like 'fix: add jq to the required packages list' ...

Please feel free to ask any questions you have, and if you find any additional issues, feel free to raise them in the project's repo (https://www.github.com/kubeinit/kubeinit). Also, it would be awesome if you could star the project to keep up with updates and new features.

ccamacho commented 2 years ago

@tmlocher just out of curiosity, how long did it take to deploy the 9 guests on the 4 hosts?

tmlocher commented 2 years ago

Dear Carlos, great to see you pick this up so fast! Currently I am actually having a frustrating experience ... I am not getting there AT ALL!

This is the error that is holding me up:

TASK [kubeinit.kubeinit.kubeinit_services : Wait for connection to "okdcluster-credentials" container] ***

[WARNING]: Reset is not implemented for this connection
fatal: [localhost -> okdcluster-credentials]: FAILED! => {"changed": false, "elapsed": 306,
"msg": "timed out waiting for ping module test: Failed to create temporary directory. In some cases,
you may have been able to authenticate and did not have permissions on the target directory. Consider
changing the remote tmp path in ansible.cfg to a path rooted in \"/tmp\", for more error information
use -vvv. Failed command was: ( umask 77 && mkdir -p \"` echo /tmp/MyAnsible/tmp `\" && mkdir \"` echo
/tmp/MyAnsible/tmp/ansible-tmp-1639934238.251854-73409-253178350507575 `\" && echo
ansible-tmp-1639934238.251854-73409-253178350507575=\"` echo
/tmp/MyAnsible/tmp/ansible-tmp-1639934238.251854-73409-253178350507575 `\" ), exited with result 125"}
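The error itself hints at one workaround: pointing Ansible's remote temporary directory at a path rooted in /tmp. A minimal sketch of that, assuming an ansible.cfg in the directory you run the playbook from gets picked up (as the rest of this thread shows, the actual culprit here turned out to be the SSH key type):

# append a remote_tmp override to the local ansible.cfg (create it if absent)
cat >> ansible.cfg <<'EOF'
[defaults]
remote_tmp = /tmp/.ansible/tmp
EOF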

You may remember I have tried before, and I am coming back with the new installer, but I am still just not getting this down to earth! What I would like to achieve is an install across 4 servers.

From a hardware point of view I should not hit any issues; this is way more than most of your clients will run. At this point, it runs for about 12 minutes until it fails to find the container.

I have now scaled back to using just one machine (srv04) with 1 controller and 3 workers; still I am getting the same error as above.

Maybe you can give me a good hint where to look; it really makes no sense to me. With podman ps I can see only two containers:

0f466878f281  k8s.gcr.io/pause:3.5                                              2 minutes ago       Up 2 minutes ago       d3817eee45f8-infra
7a634d03d557  localhost/kubeinit/okdcluster-credentials:latest  sleep infinity  About a minute ago  Up About a minute ago  okdcluster-credentials

Any good input is welcome; I will return the favour by adding whatever is necessary to update kubeinit.

I am more than willing to be a testbed for the multi-machine setup for you! Regards Thomas


tmlocher commented 2 years ago

Just re-ran the install: 3 controllers and 8 workers, running 6 min 32 sec up to the error. t


gmarcy commented 2 years ago

@tmlocher do you know if the problem occurs on Fedora 34-1.2? I'll update my servers running that to 35 to see if I can reproduce. It also might help to see your kickstart file to make sure I'm setting it up the same.

You might want to use the #kubeinit Slack channel when you need help like this; many of us who have hit these types of problems hang out there to help others.

tmlocher commented 2 years ago

I didn't kickstart it but did this manually, sorry.

Fedora 35, minimal install, no frills (I tend to start with the bare minimum, then work my way up). I am now weeding my way through the install task by task to see where it derails. I will install the resource hog Slack and come to the channel, good tip; so far I was avoiding it! Thomas


gmarcy commented 2 years ago

Are you using the rsa ssh keytype? If so, you need to run update-crypto-policies --set DEFAULT:FEDORA32 or change to something like ed25519 by setting export KUBEINIT_COMMON_SSH_KEYTYPE="ed25519"

more info at https://fedoraproject.org/wiki/Changes/StrongCryptoSettings2
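For reference, the two remediations above as commands (a minimal sketch: the crypto-policies change goes on the Fedora 35 hypervisor, the keytype variable is exported in the shell that launches ansible-playbook):

# option 1: relax the host crypto policy so rsa ssh keys keep working
sudo update-crypto-policies --set DEFAULT:FEDORA32

# option 2: have kubeinit use an ed25519 key instead
export KUBEINIT_COMMON_SSH_KEYTYPE="ed25519"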

tmlocher commented 2 years ago

This I was not aware of! Let me try and come back right after dinner! t


tmlocher commented 2 years ago

Setting export KUBEINIT_COMMON_SSH_KEYTYPE="ed25519" did the trick. Not yet done, but I got as far as TASK [kubeinit.kubeinit.kubeinit_services : Make sure we can execute remote commands on service before we continue]. I expect this is security (SELinux) blocking once again. Will continue tomorrow.


tmlocher commented 2 years ago

I guess I hit another flaw!

srv01 has an ip of 10.10.0.251
srv02 has an ip of 10.10.0.252
srv03 has an ip of 10.10.0.253
srv04 has an ip of 10.10.0.254

But here we are going from localhost to hypervisor-04, which is srv04 and hence 10.10.0.254.

But we are pinging @.

fatal: [localhost -> hypervisor-04(srv04.tmlocher.org)]: FAILED! => {
    "attempts": 30,
    "changed": true,
    "cmd": "set -o pipefail \nssh -i ~/.ssh/mycluster_id_ed25519 -o ConnectTimeout=5 -o BatchMode=yes -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=accept-new -o ProxyCommand= \"ssh -i ~/.ssh/mycluster_id_ed25519 -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=accept-new -W %h:%p -q @.\" @. 'echo connected' || true\n",
    "delta": "0:00:05.019092",
    "end": "2021-12-20 01:53:44.991716",
    "invocation": {"module_args": {"_raw_params": "set -o pipefail\nssh -i ~/.ssh/mycluster_id_ed25519 -o ConnectTimeout=5 -o BatchMode=yes -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=accept-new -o ProxyCommand=\"ssh -i ~/.ssh/mycluster_id_ed25519 -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=accept-new -W %h:%p -q @.\" @.*** 'echo connected' || true\n", "_uses_shell": true, "argv": null, "chdir": null, "creates": null, "executable": "/bin/bash", "removes": null, "stdin": null, "stdin_add_newline": true, "strip_empty_ends": true, "warn": false}},
    "msg": "",
    "rc": 0,
    "start": "2021-12-20 01:53:39.972624",
    "stderr": "Connection timed out during banner exchange\r\nConnection to UNKNOWN port 65535 timed out",
    "stderr_lines": ["Connection timed out during banner exchange", "Connection to UNKNOWN port 65535 timed out"],


gmarcy commented 2 years ago

@tmlocher I'm not sure I understand where port 65535 is coming from there. It might help to run with -vvv to get more details from ssh on what it's doing. It would also help to know what your modifications to the inventory are, as well as any extra command line args you are using.
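For example, re-running the original deployment command with higher verbosity (a sketch that only changes the verbosity flag of the command from the initial report):

ansible-playbook -vvv --user root \
    -e kubeinit_spec=okd-libvirt-3-6-4 \
    -i ./kubeinit/inventory \
    ./kubeinit/playbook.yml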

tmlocher commented 2 years ago

Interim update: the install works if done on a single machine, so I could create a cluster with 3 controllers and 5 compute nodes on one hypervisor (3-5-1); it needs two more SELinux tweaks but works. This install took 33 mins. Now returning to the multi-hypervisor install...


ccamacho commented 2 years ago

@tmlocher awesome!!! Those 33 mins make for quite a fast deployment xD

The multinode deployment should also work as expected.

gmarcy commented 2 years ago

@tmlocher I am curious what the selinux issues would be. I've never needed to do anything with it when using kubeinit.

tmlocher commented 2 years ago

I get three entries that I had to address:

avc: denied { setgid } for pid=116672 comm="sss_cache"
avc: denied { write } for pid=364523 comm="systemd-sysctl" name="kptr_restrict"
avc: denied { write } for pid=94749 comm="udevadm" name="uevent" dev="sysfs"
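One common way to turn denials like these into a local policy module, sketched here on the assumption that the denials show up in the audit log and that allowing them is really what you want (review the generated .te file before loading it):

# build a local SELinux policy module from recent AVC denials
sudo ausearch -m avc -ts recent | audit2allow -M kubeinit_local
# inspect kubeinit_local.te, then load the compiled module
sudo semodule -i kubeinit_local.pp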

Once I permissioned them, the install ran rather swiftly. What I notice, though, is that the install does not seem to make much use of parallel CPU power. I am still too new to Ansible to say whether this is inherent or down to the scripts, but what I can see is that only the first 2-4 CPUs are working, with as many as 20 CPUs sitting idle. This is astonishing, as even the smallest VMs have 8 cores allocated. If we manage to activate these, it should take a chunk out of the time.
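If the serialization comes from Ansible itself, one knob to try is the fork count, which defaults to 5 parallel hosts. A sketch, with the caveat that much of this play is delegated to localhost, so raising forks may not change the picture here:

ansible-playbook -vv --forks 20 --user root \
    -e kubeinit_spec=okd-libvirt-3-6-4 \
    -i ./kubeinit/inventory \
    ./kubeinit/playbook.yml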

While I am at it, let me gripe a little about the documentation: looking through the code I can see lots of good things hidden, but they are hardly documented. I can see you have the structures to run just a partial build (and, I guess, to restart from a defined point), but I cannot find the documentation for that. This could make it much easier to set up. Then, in the inventory, you have two elements that seem to be related, but I have yet to figure out exactly how: there is the "target_order" entry, and there is the target that I can set explicitly.

As I consistently fail to install across more than one physical machine, I would sometimes wish for two ready-made inventory files: the one you already provide, and a second one that sets the scene for 4+ machines, with 3 individual and separate controllers and one "workhorse" where the load goes ... that would be so much easier. Don't get me wrong, I enjoy taking the curve, but really I am interested in a working OKD, not in fiddling with SELinux/network/etc.

Glad at least I can have some of your support and time.

Regards Thomas


gmarcy commented 2 years ago

Thomas,

As you might imagine, we don't see a lot of large, complex topologies, and with size comes complexity and more degrees of freedom. That said, we've certainly given some thought to what might be helpful to provide controls for, and we have a few of those delivered but relatively untested, for lack of a scenario to match. As you've pointed out, we just have the one inventory, which is tweaked in various ways in the CI jobs to create some more complicated deployments.

Would you be able to summarize your deployment objectives so that we might be able to suggest some options worth trying out? You mentioned the target_order and target attributes, which are an excellent example of something we've only added recently and don't really do anything with yet, just to start having some conversations about what makes sense. Since more of our experience is with single hypervisors, they aren't used in a way that would help in your topology.

-Glenn

tmlocher commented 2 years ago

Dear Glenn, let me put this right: I am not here to gripe or throw shade on your work to date. I see a lot of potential in the approach you have taken and I would like to help push this forward, so I am here to help and pick up my part. Most of the people I see picking up OKD here in Switzerland/Germany are either students or interested individuals. I don't see too many small to medium sized companies, which is a shame, as cloud based "dabbling" quickly becomes very expensive. Second hand equipment is pretty easy to come by, so I spent less than 2k USD for approx. 120 CPUs and 1 TB of RAM! But it is a chore to administer. I want to toy with OKD, not with the underlying infra, and your project allows me to do that (yes, I would love to see the same for OpenStack: I add my bare metal to the init and go go go ...). Any small shop could do this, sitting on commercial grade HW, building cloud applications that are truly Kubernetes/OKD native.

How can I best help you? I offer to install, re-install, and install with defined parameters; you tell me what you need and I'll do the testing for you as much as I can! My ideal KubeInit solution would look something like this:

1) Pre-Install:

Controller: minimally 1, ideally 3 physical boxes, (sufficient RAM/HDD/CPU, 1 or better 2 Network Interfaces)

Worker: minimally 1 better more physical boxes, (sufficient RAM/HDD/CPU, 1 or better 2 Network Interfaces)

Services: 0 to n physical boxes.

All are attached to the network and accessible

I run a small ansible script to discover the topology and get a structured output

The script produces one or more inventory files and the command line to trigger the deployment, both of which I can potentially tweak to my needs

2) Install

I run the script with the inventory derived above

3) Result

I have an OKD cluster running, with a defined, easy "external" interface (yes, that cliff I am yet to take!) which I can readily hook into my network with a known, given IP address, for development teams to toy around with, in less than a day!

I believe you share this vision, and it will allow more development to take place at a lower price point, which in turn will fast-forward the adoption of the cloud. You have already done a massive amount of work. If I had to judge, I would argue the level of competence you all have is too high, so you don't see how some of us outsiders struggle with the low-level issues you don't even notice because they are second nature to you! This is where I guess I come into play: testing, giving feedback, testing again. Does that sound right to you?

Regards T


tmlocher commented 2 years ago

Here are some more thoughts: controllers are few, so I would expect to define which hypervisor to use (either fully or at least via an Ansible group), and similarly for services, as this is the "gate". On worker nodes there are a number of angles to consider:

  • I may want to take all and then exclude
  • maybe I have different subsets, which lends itself to what you do: prioritize
  • then again, Ansible offers groups, so why not just rely on that ... for me what matters is being able to differentiate: this is a controller hypervisor, this a worker (not a must, but I would like to be able to!)


tmlocher commented 2 years ago

Here is a new one (will look at this later!):

fatal: [localhost -> service(10.0.0.253)]: FAILED! => {
    "changed": false,
    "cmd": "set -eo pipefail\noc adm release mirror --registry-config=registry-auths.json --from=quay.io/openshift/okd:4.9.0-0.okd-2021-11-28-035710 --to=service.mycluster.kubeinit.local:5000/okd --to-release-image=service.mycluster.kubeinit.local:5000/okd:4.9.0-0.okd-2021-11-28-035710 2>&1 | tee mirror-output.txt\noc adm release extract --registry-config=registry-auths.json --command=openshift-install \"service.mycluster.kubeinit.local:5000/okd:4.9.0-0.okd-2021-11-28-035710\"\noc adm release extract --registry-config=registry-auths.json --command=oc \"service.mycluster.kubeinit.local:5000/okd:4.9.0-0.okd-2021-11-28-035710\"\n# This will override the current client and installer binaries\ncp oc openshift-install /usr/local/bin/\noc version\nopenshift-install version\n",
    "delta": "0:00:00.703257",
    "end": "2021-12-22 13:07:00.937199",
    "msg": "non-zero return code",
    "rc": 1,
    "start": "2021-12-22 13:07:00.233942",
    "stderr": "",
    "stderr_lines": [],
    "stdout": "error: unable to retrieve release image info: unable to read image quay.io/openshift/okd:4.9.0-0.okd-2021-11-28-035710: endpoint \"https://quay.io\" does not support v2 API (got 502 Bad Gateway)",
    "stdout_lines": ["error: unable to retrieve release image info: unable to read image quay.io/openshift/okd:4.9.0-0.okd-2021-11-28-035710: endpoint \"https://quay.io\" does not support v2 API (got 502 Bad Gateway)"]
}


ccamacho commented 2 years ago

@tmlocher I've seen this all day long.. it looks like a quay issue.

tmlocher commented 2 years ago

Great, thanks!


gmarcy commented 2 years ago

ran this tonight:

ansible-playbook -vv --user root -e kubeinit_spec=okd-libvirt-3-6-4 -i kubeinit/inventory kubeinit/playbook.yml

My only changes to what is on the main branch now were to the inventory:

$ git diff
diff --git a/kubeinit/inventory b/kubeinit/inventory
index 960493df..bccb6321 100644
--- a/kubeinit/inventory
+++ b/kubeinit/inventory
@@ -77,6 +77,8 @@ kubeinit_inventory_post_deployment_services="none"
 [hypervisor_hosts]
 hypervisor-01 ansible_host=nyctea
 hypervisor-02 ansible_host=tyto
+hypervisor-03 ansible_host=strix
+hypervisor-04 ansible_host=otus

 # The inventory will have one host identified as the bastion host. By default, this role will
 # be assumed by the first hypervisor, which is the same behavior as the first commented out
@@ -88,6 +90,7 @@ hypervisor-02 ansible_host=tyto
 # bastion target=hypervisor-01
 # bastion target=hypervisor-02
 # bastion ansible_host=bastion
+bastion target=hypervisor-04

 # The inventory will have one host identified as the ovn-central host.  By default, this role
 # will be assumed by the first hypervisor, which is the same behavior as the first commented
@@ -97,6 +100,7 @@ hypervisor-02 ansible_host=tyto
 [ovn_central_host]
 # ovn-central target=hypervisor-01
 # ovn-central target=hypervisor-02
+ovn-central target=hypervisor-04

 #
 # Cluster node definitions
@@ -118,6 +122,9 @@ type=virtual
 target_order=hypervisor-01

 [controller_nodes]
+controller-01 target=hypervisor-01
+controller-02 target=hypervisor-02
+controller-03 target=hypervisor-03

 [compute_nodes:vars]
 os={'cdk': 'ubuntu', 'eks': 'centos', 'k8s': 'centos', 'kid': 'debian', 'okd': 'coreos', 'rke': 'ubuntu'}
@@ -129,6 +136,12 @@ type=virtual
 target_order="hypervisor-02,hypervisor-01"

 [compute_nodes]
+compute-01 target=hypervisor-01
+compute-02 target=hypervisor-04
+compute-03 target=hypervisor-02
+compute-04 target=hypervisor-04
+compute-05 target=hypervisor-03
+compute-06 target=hypervisor-04

 [extra_nodes:vars]
 os={'cdk': 'ubuntu', 'okd': 'coreos'}
@@ -141,7 +154,7 @@ target_order="hypervisor-02,hypervisor-01"

 [extra_nodes]
 juju-controller distro=cdk
-bootstrap distro=okd
+bootstrap target=hypervisor-04 distro=okd

 # Service nodes are a set of service containers sharing the same pod network.
 # There is an implicit 'provision' service container which will use a base os
@@ -152,4 +165,4 @@ os={'cdk': 'ubuntu', 'eks': 'centos', 'k8s': 'centos', 'kid': 'debian', 'okd': '
 target_order=hypervisor-01

 [service_nodes]
-service services="bind,dnsmasq,haproxy,apache,registry" # nexus
+service target=hypervisor-04 services="bind,dnsmasq,haproxy,apache,registry" # nexus

and adding jq to the centos family distribution prereqs list:

diff --git a/kubeinit/roles/kubeinit_libvirt/defaults/main.yml b/kubeinit/roles/kubeinit_libvirt/defaults/main.yml
index 42ae0725..ae655222 100644
--- a/kubeinit/roles/kubeinit_libvirt/defaults/main.yml
+++ b/kubeinit/roles/kubeinit_libvirt/defaults/main.yml
@@ -98,6 +98,7 @@ kubeinit_libvirt_hypervisor_dependencies:
     - net-tools
     - xz
     - perl-XML-XPath
+    - jq
   debian:
     - sudo
     - numad

that ran to completion:

PLAY RECAP *********************************************************************************************************************************************************************************************
hypervisor-01              : ok=17   changed=4    unreachable=0    failed=0    skipped=8    rescued=0    ignored=0   
hypervisor-02              : ok=17   changed=4    unreachable=0    failed=0    skipped=8    rescued=0    ignored=0   
hypervisor-03              : ok=17   changed=4    unreachable=0    failed=0    skipped=8    rescued=0    ignored=0   
hypervisor-04              : ok=17   changed=4    unreachable=0    failed=0    skipped=8    rescued=0    ignored=0   
localhost                  : ok=603  changed=264  unreachable=0    failed=0    skipped=177  rescued=0    ignored=0   

I didn't need to make any changes to the 4 hypervisor hosts, blank slate boot of Fedora 35 with kickstart config of

%packages
@^custom-environment
@standard

%end

server guests and services:

$ ssh root@nyctea virsh list
 Id   Name                       State
------------------------------------------
 2    okdcluster-controller-01   running
 4    okdcluster-compute-01      running

$ ssh root@tyto virsh list
 Id   Name                       State
------------------------------------------
 2    okdcluster-controller-02   running
 4    okdcluster-compute-03      running

$ ssh root@strix virsh list
 Id   Name                       State
------------------------------------------
 2    okdcluster-controller-03   running
 4    okdcluster-compute-05      running

$ ssh root@otus virsh list
 Id   Name                    State
---------------------------------------
 4    okdcluster-compute-02   running
 6    okdcluster-compute-04   running
 8    okdcluster-compute-06   running

$ ssh root@otus podman ps -a
CONTAINER ID  IMAGE                                              COMMAND               CREATED            STATUS                PORTS       NAMES
e7341de8c256  k8s.gcr.io/pause:3.5                                                     About an hour ago  Up About an hour ago              b4d4e7ccc3a9-infra
22f9d4cf7b66  localhost/kubeinit/okdcluster-bind:latest          /usr/sbin/named -...  About an hour ago  Up About an hour ago              okdcluster-bind
a57980b46863  localhost/kubeinit/okdcluster-dnsmasq:latest       -d -q -C /etc/dns...  About an hour ago  Up About an hour ago              okdcluster-dnsmasq
798bcd71b0d0  localhost/kubeinit/okdcluster-haproxy:latest       haproxy -f /usr/l...  About an hour ago  Up 50 minutes ago                 okdcluster-haproxy
f41f84ca030e  localhost/kubeinit/okdcluster-apache:latest        httpd-foreground      About an hour ago  Up About an hour ago              okdcluster-apache
6c3db49acccc  localhost/kubeinit/okdcluster-registry:latest      /etc/docker/regis...  About an hour ago  Up About an hour ago              okdcluster-registry
c08ff8fdb9eb  localhost/kubeinit/okdcluster-provision:latest                           About an hour ago  Up About an hour ago              okdcluster-provision
a4bcdfdbea9a  k8s.gcr.io/pause:3.5                                                     29 minutes ago     Up 29 minutes ago                 a3dd3db856fd-infra
7c86fabea90a  localhost/kubeinit/okdcluster-ingress-bind:latest  /usr/sbin/named -...  29 minutes ago     Up 29 minutes ago                 okdcluster-ingress-bind
# oc get nodes
NAME      STATUS   ROLES    AGE   VERSION
master0   Ready    master   43m   v1.22.3+4dd1b5a
master1   Ready    master   43m   v1.22.3+4dd1b5a
master2   Ready    master   42m   v1.22.3+4dd1b5a
worker0   Ready    worker   27m   v1.22.3+4dd1b5a
worker1   Ready    worker   26m   v1.22.3+4dd1b5a
worker2   Ready    worker   27m   v1.22.3+4dd1b5a
worker3   Ready    worker   26m   v1.22.3+4dd1b5a
worker4   Ready    worker   24m   v1.22.3+4dd1b5a
worker5   Ready    worker   22m   v1.22.3+4dd1b5a

No issues connecting to the openshift console after running the create-external-ingress.sh script with these fixes for fedora -- change the two lines containing centos:

from: if [ "$ID" == "centos" ]; then

to: if [ "$ID" == "centos" -o "$ID" == "fedora" ]; then
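If it helps, the same edit can be applied in one go with sed (a sketch; it assumes the script sits in your current directory and that both lines use exactly this test):

sed -i 's/\[ "$ID" == "centos" \]/[ "$ID" == "centos" -o "$ID" == "fedora" ]/' create-external-ingress.sh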

tmlocher commented 2 years ago

Dear Marcy, I didn't have a lot of time to fiddle today. I pulled a brand new Kubeinit, applied the above changes, and ran it --> it failed as below. I changed the DNS to omit some hardcoded names, still the same issue. I didn't yet have time to dig into the depths of it; I expect some "$" missing ... (this time the failure comes much earlier)


tmlocher commented 2 years ago

Not yet into the details, but if I go back to 19284cb7158edb67c978ce983c60dbaec31952d6 the install gets beyond the above-mentioned error. t
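For anyone following along, pinning a working tree to that commit is just a checkout (a sketch, run inside the kubeinit clone):

git checkout 19284cb7158edb67c978ce983c60dbaec31952d6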


gmarcy commented 2 years ago

@tmlocher It looks like the details didn't make it to the GitHub issue (the "failed as below" and the "above mentioned error" you referred to in your last two comments), so I'm not sure what issue you encountered with the latest code in main.

tmlocher commented 2 years ago

ok so I will try again, ... with head I get the error that the hostnames are not properly set, I expect a missing "$" somewhere: TASK [kubeinit.kubeinit.kubeinit_prepare : Stop before 'task-prepare-hypervisors' when requested] *** task path: /home/manager/.ansible/collections/ansible_collections/kubeinit/kubeinit/roles/kubeinit_prepare/tasks/prepare_hypervisors.yml: 25 skipping: [localhost] => { "changed": false, "skip_reason": "Conditional result was False" }

TASK [kubeinit.kubeinit.kubeinit_prepare : End play]


task path: /home/manager/.ansible/collections/ansible_collections/kubeinit/kubeinit/roles/kubeinit_prepare/tasks/prepare_hypervisors.yml: 27 META: skipping: [localhost] => { "msg": "", "skip_reason": "end_play conditional evaluated to False, continuing play" } META: META: ran handlers META: ran handlers

PLAY [Prepare all hypervisor hosts to deploy service and cluster nodes]


META: ran handlers

TASK [Skip play if playbook_terminated]


task path: /home/manager/repos/kubeinit/kubeinit/playbook.yml:53 META: skipping: [hypervisor-01] => { "msg": "", "skip_reason": "end_play conditional evaluated to False, continuing play" }

TASK [Prepare each hypervisor in the deployment]


task path: /home/manager/repos/kubeinit/kubeinit/playbook.yml:61 fatal: [hypervisor-01]: FAILED! => { "msg": "The conditional check 'inventory_hostname in hostvars['kubeinit-cluster-facts'].hypervisors' failed. The error was: error while evaluating conditional (inventory_hostname in hostvars['kubeinit-cluster-facts'].hypervisors): \"hostvars[ 'kubeinit-cluster-facts']\" is undefined\n\nThe error appears to be in '/home/manager/repos/kubeinit/kubeinit/playbook.yml': line 61, column 11, but may\nbe elsewhere in the file depending on the exact syntax problem.\n\nThe offending line appears to be:\n\n block:\n - name: Prepare each hypervisor in the deployment\n ^ here\n" } fatal: [hypervisor-02]: FAILED! => { "msg": "The conditional check 'inventory_hostname in hostvars['kubeinit-cluster-facts'].hypervisors' failed. The error was: error while evaluating conditional (inventory_hostname in hostvars['kubeinit-cluster-facts'].hypervisors): \"hostvars[ 'kubeinit-cluster-facts']\" is undefined\n\nThe error appears to be in '/home/manager/repos/kubeinit/kubeinit/playbook.yml': line 61, column 11, but may\nbe elsewhere in the file depending on the exact syntax problem.\n\nThe offending line appears to be:\n\n block:\n - name: Prepare each hypervisor in the deployment\n ^ here\n" } fatal: [hypervisor-03]: FAILED! => { "msg": "The conditional check 'inventory_hostname in hostvars['kubeinit-cluster-facts'].hypervisors' failed. The error was: error while evaluating conditional (inventory_hostname in hostvars['kubeinit-cluster-facts'].hypervisors): \"hostvars[ 'kubeinit-cluster-facts']\" is undefined\n\nThe error appears to be in '/home/manager/repos/kubeinit/kubeinit/playbook.yml': line 61, column 11, but may\nbe elsewhere in the file depending on the exact syntax problem.\n\nThe offending line appears to be:\n\n block:\n - name: Prepare each hypervisor in the deployment\n ^ here\n" } fatal: [hypervisor-04]: FAILED! => { "msg": "The conditional check 'inventory_hostname in hostvars['kubeinit-cluster-facts'].hypervisors' failed. The error was: error while evaluating conditional (inventory_hostname in hostvars['kubeinit-cluster-facts'].hypervisors): \"hostvars[ 'kubeinit-cluster-facts']\" is undefined\n\nThe error appears to be in '/home/manager/repos/kubeinit/kubeinit/playbook.yml': line 61, column 11, but may\nbe elsewhere in the file depending on the exact syntax problem.\n\nThe offending line appears to be:\n\n block:\n - name: Prepare each hypervisor in the deployment\n ^ here\n" }

PLAY RECAP


hypervisor-01   : ok=0    changed=0    unreachable=0    failed=1    skipped=0    rescued=0    ignored=0
hypervisor-02   : ok=0    changed=0    unreachable=0    failed=1    skipped=0    rescued=0    ignored=0
hypervisor-03   : ok=0    changed=0    unreachable=0    failed=1    skipped=0    rescued=0    ignored=0
hypervisor-04   : ok=0    changed=0    unreachable=0    failed=1    skipped=0    rescued=0    ignored=0
localhost       : ok=137  changed=15   unreachable=0    failed=0    skipped=77   rescued=0    ignored=0
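Since the failure says hostvars['kubeinit-cluster-facts'] is undefined for every hypervisor, it is worth checking how Ansible parses the inventory before the playbook runs. A small sketch (it assumes ansible-inventory and jq are installed on the controller, and that the group is named hypervisor_hosts as in the inventory posted further down):

ansible-inventory -i ./kubeinit/inventory --graph
ansible-inventory -i ./kubeinit/inventory --list | jq '.hypervisor_hosts.hosts'

If the hypervisors do not show up under the expected group, that matches the inventory-parsing problem suspected later in this thread.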


tmlocher commented 2 years ago

So I used commit 19284cb7158edb67c978ce983c60dbaec31952d6 to install. I get as far as the output below before it seems to die; it has been sitting on this for almost half an hour now.

TASK [kubeinit.kubeinit.kubeinit_libvirt : Create VM definition for controller-01] ***
task path: /home/manager/.ansible/collections/ansible_collections/kubeinit/kubeinit/roles/kubeinit_libvirt/tasks/deploy_coreos_guest.yml:31
Using module file /home/manager/.local/lib/python3.9/site-packages/ansible/modules/command.py
Pipelining is enabled.

ESTABLISH SSH CONNECTION FOR USER: root
SSH: EXEC ssh -C -o ControlMaster=auto -o ControlPersist=60s -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o 'User="root"' -o ConnectTimeout=10 -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=accept-new -o ControlPath=/home/manager/.ansible/cp/fafae98888 srv01 '/bin/sh -c '"'"'/usr/bin/python3 && sleep 0'"'"''
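One way to tell whether the hang is in the SSH connection itself or in the task it runs is to reuse the same multiplexed connection by hand; a sketch, with srv01 and the ControlPath value copied from the EXEC line above:

ssh -o ControlMaster=auto -o ControlPersist=60s \
    -o ControlPath=/home/manager/.ansible/cp/fafae98888 \
    -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=accept-new \
    root@srv01 'echo connection ok'

If this also hangs, the problem is the connection or a stale control socket rather than the module being executed.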
gmarcy commented 2 years ago

Could you please attach your command line, the inventory file, and the log from the first failure with the main branch? Not sure what you mean by a missing '$', since that's not a thing in Ansible. The error seems to indicate that it couldn't find any hypervisor hosts in your inventory, which I suspect is an error in parsing your inventory. If you could also use -vvv on the command line, that should provide some additional details (more verbose output).

tmlocher commented 2 years ago

One frustrating experience! I do now understand why so many mid-sized companies move away from OpenShift, it is a PAIN. Current status:

  • Reinstalled all machines virgin to Fedora 35 (srv01-srv04)
  • Pulled head (Carlos's commit of 24th Dec)
  • merged the above inventory
  • amended hostnames
  • ran the install

Result: it runs to an error creating controller-01:

TASK [kubeinit.kubeinit.kubeinit_libvirt : Set guest images facts]


task path: /home/manager/.ansible/collections/ansible_collections/kubeinit/kubeinit/roles/kubeinit_libvirt/tasks/deploy_coreos_guest.yml:25
ok: [localhost -> hypervisor-01(srv01)] => {
    "ansible_facts": {
        "kubeinit_coreos_initrd": "fedora-coreos-34.20210904.3.0-live-initramfs.x86_64.img",
        "kubeinit_coreos_raw": "fedora-coreos-34.20210904.3.0-metal.x86_64.raw.xz",
        "kubeinit_coreos_rootfs": "fedora-coreos-34.20210904.3.0-live-rootfs.x86_64.img"
    },
    "changed": false
}

TASK [kubeinit.kubeinit.kubeinit_libvirt : Create VM definition for controller-01] ***
task path: /home/manager/.ansible/collections/ansible_collections/kubeinit/kubeinit/roles/kubeinit_libvirt/tasks/deploy_coreos_guest.yml:31
Loading collection ansible.netcommon from /home/manager/.ansible/collections/ansible_collections/ansible/netcommon
Using module file /home/manager/.local/lib/python3.9/site-packages/ansible/modules/command.py
Pipelining is enabled.

ESTABLISH SSH CONNECTION FOR USER: root
SSH: EXEC ssh -vvv -C -o ControlMaster=auto -o ControlPersist=60s -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o 'User="root"' -o ConnectTimeout=10 -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=accept-new -o ControlPath=/home/manager/.ansible/cp/fafae98888 srv01 '/bin/sh -c '"'"'/usr/bin/python3 && sleep 0'"'"''

On this it hangs; it neither comes back nor progresses. At this point controller-01 is created on srv01 (hypervisor-01). It puzzles me, as it does not give me an error as it usually would.

This is the inventory:

#
# Common variables for the inventory
#
[all:vars]
#
# Internal variables
#
ansible_python_interpreter=/usr/bin/python3
ansible_ssh_pipelining=True
ansible_ssh_common_args='-o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=accept-new'

#
# Inventory variables
#
#
# The default for the cluster name is {{ kubeinit_cluster_distro + 'cluster' }}
# You can override this by setting a specific value in kubeinit_inventory_cluster_name
# kubeinit_inventory_cluster_name=mycluster
kubeinit_inventory_cluster_domain=kubeinit.local
kubeinit_inventory_network_name=kimgtnet0
kubeinit_inventory_network=10.0.0.0/24
kubeinit_inventory_gateway_offset=-2
kubeinit_inventory_nameserver_offset=-3
kubeinit_inventory_dhcp_start_offset=1
kubeinit_inventory_dhcp_end_offset=-4
kubeinit_inventory_controller_name_pattern=controller-%02d
kubeinit_inventory_compute_name_pattern=compute-%02d
kubeinit_inventory_post_deployment_services="none"

#
# Cluster definitions
#
# The networks you will use for your kubeinit clusters. The network name will be used
# to create a libvirt network for the cluster guest vms. The network cidr will set
# the range of addresses reserved for the cluster nodes. The gateway offset will be
# used to select the gateway address within the range, a negative offset starts at the
# end of the range, so for network=10.0.0.0/24, gateway_offset=-2 will select 10.0.0.254
# and gateway_offset=1 will select 10.0.0.1 as the address. Other offset attributes
# follow the same convention.
[kubeinit_networks]
# kimgtnet0 network=10.0.0.0/24 gateway_offset=-2 nameserver_offset=-3 dhcp_start_offset=1 dhcp_end_offset=-4
# kimgtnet1 network=10.0.1.0/24 gateway_offset=-2 nameserver_offset=-3 dhcp_start_offset=1 dhcp_end_offset=-4

# The clusters you are deploying using kubeinit. If there are no clusters defined here
# then kubeinit will assume you are only using one cluster at a time and will use the
# network defined by kubeinit_inventory_network.
[kubeinit_clusters]
# cluster0 network_name=kimgtnet0
# cluster1 network_name=kimgtnet1
#
# If variables are defined in this section, they will have precedence when setting
# kubeinit_inventory_post_deployment_services and kubeinit_inventory_network_name
#
# clusterXXX network_name=kimgtnetXXX post_deployment_services="none"
# clusterYYY network_name=kimgtnetYYY post_deployment_services="none"

#
# Hosts definitions
#
# The cluster's guest machines can be distributed across multiple hosts. By default they
# will be deployed in the first Hypervisor. These hypervisors are activated and used
# depending on how they are referenced in the kubeinit spec string.
[hypervisor_hosts]
hypervisor-01 ansible_host=srv01
hypervisor-02 ansible_host=srv02
hypervisor-03 ansible_host=srv03
hypervisor-04 ansible_host=srv04

# The inventory will have one host identified as the bastion host. By default, this role will
# be assumed by the first hypervisor, which is the same behavior as the first commented out
# line. The second commented out line would set the second hypervisor to be the bastion host.
# The final commented out line would set the bastion host to be a different host that is not
# being used as a hypervisor for the guests VMs for the clusters using this inventory.
[bastion_host]
# bastion target=hypervisor-01
# bastion target=hypervisor-02
# bastion ansible_host=bastion
bastion target=hypervisor-04

# The inventory will have one host identified as the ovn-central host. By default, this role
# will be assumed by the first hypervisor, which is the same behavior as the first commented
# out line. The second commented out line would set the second hypervisor to be the ovn-central
# host.
[ovn_central_host]
# ovn-central target=hypervisor-01
# ovn-central target=hypervisor-02
ovn-central target=hypervisor-04

#
# Cluster node definitions
#
# Controller, compute, and extra nodes can be configured as virtual machines or using the
# manually provisioned baremetal machines for the deployment.
# Only use an odd number configuration, this means enabling only 1, 3, or 5 controller nodes
# at a time.
[controller_nodes:vars]
os={'cdk': 'ubuntu', 'eks': 'centos', 'k8s': 'centos', 'kid': 'debian', 'okd': 'coreos', 'rke': 'ubuntu'}
disk=25G
ram=25165824
vcpus=8
maxvcpus=16
type=virtual
target_order=hypervisor-01

[controller_nodes]
controller-01 target=hypervisor-01
controller-02 target=hypervisor-02
controller-03 target=hypervisor-03

[compute_nodes:vars]
os={'cdk': 'ubuntu', 'eks': 'centos', 'k8s': 'centos', 'kid': 'debian', 'okd': 'coreos', 'rke': 'ubuntu'}
disk=30G
ram=16777216
vcpus=8
maxvcpus=16
type=virtual
target_order="hypervisor-02,hypervisor-01"

[compute_nodes]
compute-01 target=hypervisor-01
compute-02 target=hypervisor-04
compute-03 target=hypervisor-02
compute-04 target=hypervisor-04
compute-05 target=hypervisor-03
compute-06 target=hypervisor-04

[extra_nodes:vars]
os={'cdk': 'ubuntu', 'okd': 'coreos'}
disk=20G
ram={'cdk': '8388608', 'okd': '16777216'}
vcpus=8
maxvcpus=16
type=virtual
target_order="hypervisor-02,hypervisor-01"

[extra_nodes]
juju-controller distro=cdk
bootstrap target=hypervisor-04 distro=okd

# Service nodes are a set of service containers sharing the same pod network.
# There is an implicit 'provision' service container which will use a base os
# container image based upon the service_nodes:vars os attribute.
[service_nodes:vars]
os={'cdk': 'ubuntu', 'eks': 'centos', 'k8s': 'centos', 'kid': 'debian', 'okd': 'centos', 'rke': 'ubuntu'}
target_order=hypervisor-01

[service_nodes]
service target=hypervisor-04 services="bind,dnsmasq,haproxy,apache,registry"
# nexus

Open to any suggestions! BTW, how can I join the Slack channel? I somehow could not access it; maybe that is an easier way than the ticket?

t
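Before rerunning the whole deployment against this inventory, a quick connectivity check of all four hypervisors with an ad-hoc ping can rule out basic SSH or Python problems; a sketch, with the hypervisor_hosts group name and the root user taken from the inventory and logs above:

ansible -i ./kubeinit/inventory hypervisor_hosts -u root -m ping -o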
tmlocher commented 2 years ago

This part I don't quite understand: set -o pipefail

If you use the kernel args to deploy the machine

Is not possible to render the template as --print-xml > vm.xml

If so then is good to have it like 'virsh define vm.xml'

kernel_args=$(echo "$kernel_args_aux" | paste -sd "" -)
kernel_args_aux='initrd=http://{{ kubeinit_apache_service_address }}:8080/kubeinit/okd4/{{ kubeinit_coreos_initrd }} ip=dhcp nameserver={{ kubeinit_bind_service_address }} rd.neednet=1 console=tty0 console=ttyS0 coreos.inst=yes coreos.inst.insecure=yes coreos.inst.install_dev=/dev/vda coreos.inst.image_url=http://{{ kubeinit_apache_service_address }}:8080/kubeinit/okd4/{{ kubeinit_coreos_raw }} coreos.inst.ignition_url=http://{{ kubeinit_apache_service_address }}:8080/kubeinit/okd4/{{ kubeinit_ignition_name }}.ign coreos.live.rootfs_url=http://{{ kubeinit_apache_service_address }}:8080/kubeinit/okd4/{{ kubeinit_coreos_rootfs }}'
kernel_args='initrd=http://{{ kubeinit_apache_service_address }}:8080/kubeinit/okd4/{{ kubeinit_coreos_initrd }} ip=dhcp nameserver={{ kubeinit_bind_service_address }} rd.neednet=1 console=tty0 console=ttyS0 coreos.inst=yes coreos.inst.insecure=yes coreos.inst.install_dev=/dev/vda coreos.inst.image_url=http://{{ kubeinit_apache_service_address }}:8080/kubeinit/okd4/{{ kubeinit_coreos_raw }} coreos.inst.ignition_url=http://{{ kubeinit_apache_service_address }}:8080/kubeinit/okd4/{{ kubeinit_ignition_name }}.ign coreos.live.rootfs_url=http://{{ kubeinit_apache_service_address }}:8080/kubeinit/okd4/{{ kubeinit_coreos_rootfs }}'

So you set kernel_args to the one-line content of kernel_args_aux, only to reset it to the same value again ... happy to learn what is behind this. For the time being I will fiddle with it. t
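For reference, paste -s with a delimiter simply joins the lines of its stdin into a single line, so the pattern above presumably exists to let the role keep the long kernel argument string readable while still handing a one-line value to the VM definition. A standalone sketch of the behaviour (the values are made up, and a space delimiter is used here for legibility; the role uses an empty delimiter string):

kernel_args_aux='initrd=http://example:8080/initrd.img
ip=dhcp
rd.neednet=1'
kernel_args=$(echo "$kernel_args_aux" | paste -sd ' ' -)
echo "$kernel_args"
# prints: initrd=http://example:8080/initrd.img ip=dhcp rd.neednet=1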


tmlocher commented 2 years ago

here also the log on SRV01 (that's the one that never returns) @.*** qemu]# cat okdcluster-controller-01.log 2022-01-02 11:28:17.465+0000: starting up libvirt version: 7.6.0, package: 5.fc35 (Fedora Project, 2021-12-16-17:57:31, ), qemu version: 6.1.0qemu-6.1.0-10.fc35, kernel: 5.15.11-200.fc35.x86_64, hostname: srv01.tmlocher.org LC_ALL=C \ PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin \ HOME=/var/lib/libvirt/qemu/domain-1-okdcluster-controlle \ XDG_DATA_HOME=/var/lib/libvirt/qemu/domain-1-okdcluster-controlle/.local/share \ XDG_CACHE_HOME=/var/lib/libvirt/qemu/domain-1-okdcluster-controlle/.cache \ XDG_CONFIG_HOME=/var/lib/libvirt/qemu/domain-1-okdcluster-controlle/.config \ /usr/bin/qemu-system-x86_64 \ -name guest=okdcluster-controller-01,debug-threads=on \ -S \ -object '{"qom-type":"secret","id":"masterKey0","format":"raw","file":"/var/lib/libvirt/qemu/domain-1-okdcluster-controlle/master-key.aes"}' \ -machine pc-q35-6.1,accel=kvm,usb=off,vmport=off,dump-guest-core=off,memory-backend=pc.ram \ -cpu SandyBridge-IBRS,vme=on,ss=on,vmx=on,pdcm=on,pcid=on,hypervisor=on,arat=on,tsc-adjust=on,umip=on,md-clear=on,stibp=on,arch-capabilities=on,ssbd=on,xsaveopt=on,pdpe1gb=on,ibpb=on,ibrs=on,amd-stibp=on,amd-ssbd=on,skip-l1dfl-vmentry=on,pschange-mc-no=on \ -m 24576 \ -object '{"qom-type":"memory-backend-ram","id":"pc.ram","size":25769803776,"host-nodes":[0,1],"policy":"bind"}' \ -overcommit mem-lock=off \ -smp 8,maxcpus=16,sockets=16,cores=1,threads=1 \ -uuid 3aff815b-9a03-4c02-bec5-d4d5b73dcd80 \ -no-user-config \ -nodefaults \ -chardev socket,id=charmonitor,fd=30,server=on,wait=off \ -mon chardev=charmonitor,id=monitor,mode=control \ -rtc base=utc,driftfix=slew \ -global kvm-pit.lost_tick_policy=delay \ -no-hpet \ -no-reboot \ -global ICH9-LPC.disable_s3=1 \ -global ICH9-LPC.disable_s4=1 \ -boot strict=on \ -kernel /var/lib/libvirt/boot/virtinst-q4i_we_u-fedora-coreos-34.20210904.3.0-live-kernel-x86_64 \ -initrd /var/lib/libvirt/boot/virtinst-odh2itgq-fedora-coreos-34.20210904.3.0-live-initramfs.x86_64.img \ -append 'initrd= http://10.0.0.253:8080/kubeinit/okd4/fedora-coreos-34.20210904.3.0-live-initramfs.x86_64.img ip=dhcp nameserver=10.0.0.253 rd.neednet=1 console=tty0 console=ttyS0 coreos.inst=yes coreos.inst.insecure=yes coreos.inst.install_dev=/dev/vda coreos.inst.image_url= http://10.0.0.253:8080/kubeinit/okd4/fedora-coreos-34.20210904.3.0-metal.x86_64.raw.xz coreos.inst.ignition_url=http://10.0.0.253:8080/kubeinit/okd4/master.ign coreos.live.rootfs_url= http://10.0.0.253:8080/kubeinit/okd4/fedora-coreos-34.20210904.3.0-live-rootfs.x86_64.img' \ -device pcie-root-port,port=0x10,chassis=1,id=pci.1,bus=pcie.0,multifunction=on,addr=0x2 \ -device pcie-root-port,port=0x11,chassis=2,id=pci.2,bus=pcie.0,addr=0x2.0x1 \ -device pcie-root-port,port=0x12,chassis=3,id=pci.3,bus=pcie.0,addr=0x2.0x2 \ -device pcie-root-port,port=0x13,chassis=4,id=pci.4,bus=pcie.0,addr=0x2.0x3 \ -device pcie-root-port,port=0x14,chassis=5,id=pci.5,bus=pcie.0,addr=0x2.0x4 \ -device pcie-root-port,port=0x15,chassis=6,id=pci.6,bus=pcie.0,addr=0x2.0x5 \ -device pcie-root-port,port=0x16,chassis=7,id=pci.7,bus=pcie.0,addr=0x2.0x6 \ -device qemu-xhci,p2=15,p3=15,id=usb,bus=pci.2,addr=0x0 \ -device virtio-serial-pci,id=virtio-serial0,bus=pci.3,addr=0x0 \ -blockdev '{"driver":"file","filename":"/var/lib/libvirt/images/okdcluster-controller-01.qcow2","node-name":"libvirt-1-storage","auto-read-only":true,"discard":"unmap"}' \ -blockdev 
'{"node-name":"libvirt-1-format","read-only":false,"driver":"qcow2","file":"libvirt-1-storage","backing":null}' \ -device virtio-blk-pci,bus=pci.4,addr=0x0,drive=libvirt-1-format,id=virtio-disk0,bootindex=1 \ -netdev tap,fd=35,id=hostnet0,vhost=on,vhostfd=36 \ -device virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:64:9e:79,bus=pci.1,addr=0x0 \ -chardev pty,id=charserial0 \ -device isa-serial,chardev=charserial0,id=serial0 \ -chardev socket,id=charchannel0,fd=37,server=on,wait=off \ -device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=org.qemu.guest_agent.0 \ -chardev spicevmc,id=charchannel1,name=vdagent \ -device virtserialport,bus=virtio-serial0.0,nr=2,chardev=charchannel1,id=channel1,name=com.redhat.spice.0 \ -device usb-tablet,id=input0,bus=usb.0,port=1 \ -audiodev id=audio1,driver=spice \ -spice port=5900,addr=127.0.0.1,disable-ticketing=on,image-compression=off,seamless-migration=on \ -device qxl-vga,id=video0,ram_size=67108864,vram_size=67108864,vram64_size_mb=0,vgamem_mb=16,max_outputs=1,bus=pcie.0,addr=0x1 \ -device ich9-intel-hda,id=sound0,bus=pcie.0,addr=0x1b \ -device hda-duplex,id=sound0-codec0,bus=sound0.0,cad=0,audiodev=audio1 \ -chardev spicevmc,id=charredir0,name=usbredir \ -device usb-redir,chardev=charredir0,id=redir0,bus=usb.0,port=2 \ -chardev spicevmc,id=charredir1,name=usbredir \ -device usb-redir,chardev=charredir1,id=redir1,bus=usb.0,port=3 \ -device virtio-balloon-pci,id=balloon0,bus=pci.5,addr=0x0 \ -object '{"qom-type":"rng-random","id":"objrng0","filename":"/dev/urandom"}' \ -device virtio-rng-pci,rng=objrng0,id=rng0,bus=pci.6,addr=0x0 \ -sandbox on,obsolete=deny,elevateprivileges=deny,spawn=deny,resourcecontrol=deny \ -msg timestamp=on char device redirected to /dev/pts/0 (label charserial0) qxl_send_events: spice-server bug: guest stopped, ignoring

@.*** qemu]# virsh list --all
 Id   Name                       State
---------------------------------------------
 1    okdcluster-controller-01   running

BUT this I can see:

[ 2103.350542] coreos-livepxe-rootfs[634]: Retrying in 5s...
[ 2108.356771] coreos-livepxe-rootfs[1462]: curl: (7) Couldn't connect to server
[ 2108.362334] coreos-livepxe-rootfs[634]: Couldn't establish connectivity with the server specified by coreos.live.rootfs_url=

So the URL seems unset; let me see if I can find that. t
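A quick check from the hypervisor (or anything else attached to the cluster network) is to request the rootfs image directly from the services container, reusing the URL from the -append line in the qemu log above:

curl -I http://10.0.0.253:8080/kubeinit/okd4/fedora-coreos-34.20210904.3.0-live-rootfs.x86_64.img

Roughly speaking, a 200 response would mean the file is being served and the guest simply cannot reach 10.0.0.253, while a connection error would point at the apache service container or the cluster network itself.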


ccamacho commented 2 years ago

@tmlocher it is a pain that your environment is not deploying correctly; once in a while I also have issues with the hypervisors getting stuck.

Try to clean up the environment by passing -e kubeinit_stop_after_task=task-cleanup-hypervisors to your deployment command.
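For example, keeping whatever spec, user and inventory you normally deploy with and only adding the extra variable (a sketch, not the exact command for this environment):

ansible-playbook -vv --user root \
    -e kubeinit_spec=<your-spec> \
    -e kubeinit_stop_after_task=task-cleanup-hypervisors \
    -i ./kubeinit/inventory \
    ./kubeinit/playbook.yml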

tmlocher commented 2 years ago

Thank you Carlos, running it just now. Please accept my apologies for being a pain; I am just interested in getting this flying (without becoming a master of all the underlying techs). What is frustrating is less that I am now so long into getting an OKD4 cluster set up, but that I am missing some crucial documentation: you have built-in breaks, and hence I could get to certain points and move on, or restart from them, but I will have to read through many, many files to figure that out. You reference the docs for details, but there is little there. If you give me bullet points I will happily work through them and write up the docs (1st phase: get the environment into shape; 2nd: set up the network; 3rd: add the hypervisors; 4th: check all is in place; 5th: kick off the OKD install, or something thereabouts); it would maybe simplify figuring out where this is derailed. Is there a similar "restart" command? One of the pains I have is that it needs to mirror the repository time and time again (about 25 mins); any chance to have a separate instance that "caches" these files so they need not be pulled from external sources time and time again?

Regards thomas
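As an aside on the built-in breaks mentioned above: the stop points are keyed by task names such as task-cleanup-hypervisors and task-prepare-hypervisors (both appear earlier in this thread), so one way to see which ones exist without reading every role is to list the playbook's tasks; a sketch, where the grep pattern is only a guess at the naming convention:

ansible-playbook -i ./kubeinit/inventory ./kubeinit/playbook.yml --list-tasks | grep 'task-'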


tmlocher commented 2 years ago

digging into this again:

This points to the temp directory on the machine where the playbook is run, but for some reason it is a socket that goes nowhere: srw------- 1 manager manager 0 Jan 6 21:00 fafae98888

As a result it dies, ...
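If the multiplexing socket under ~/.ansible/cp is stale, one option is to tear it down and let Ansible re-establish it on the next run; a sketch, with the path taken from the listing above:

ssh -O exit -o ControlPath=/home/manager/.ansible/cp/fafae98888 srv01   # ask the master process to exit
rm -f /home/manager/.ansible/cp/fafae98888                              # or simply remove the stale socket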

On Mon, 3 Jan 2022 at 08:44, Thomas Locher @.***> wrote:

Thank you Carlos, running it just now. Please take my apologies being a pain, I am just interested to get this flying (without becoming a master of all underlying techs) What is frustrating is less that I am now so long into getting a OKD4 cluster set up but that I am missing some crucial documentation: you have built in breaks and hence I could get to certain points and move on, or restart off these. but I will have to read through many, many files to figure out. you reference the docs for deals but there is little, ... if you give me bullet points I will happily work it through and write up the docs (1st phase: get environment into shape; 2nd set up network; 3rd add hypervisors 4th; check all is in place; 5th kick off okd install or something thereabout) it would maybe simplify figuring out where this is derailed) Is there a similar "restart" command? One of the pains I have is that it needs to mirror the repository time and time again (about 25mins) any chance to have a separate instance that "caches" these files so they need not be pulled from externally time and time again?

Regards thomas

On Mon, 3 Jan 2022 at 08:08, Carlos Camacho @.***> wrote:

@tmlocher https://github.com/tmlocher is a pain your environment is not being deployed correctly, once in a while I have issues with the hypervisors stuck.

Try to cleanup the environment passing to your deployment command -e kubeinit_stop_after_task=task-cleanup-hypervisors

— Reply to this email directly, view it on GitHub https://github.com/Kubeinit/kubeinit/issues/571#issuecomment-1003904119, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAAYH6KDDKB2EXVIVKCCO5TUUFDP3ANCNFSM5KMBH2HQ . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

You are receiving this because you were mentioned.Message ID: @.***>

-- Thomas Locher Felsenburgstrasse 7 8712 Stäfa Schweiz/Switzerland

+41 79 175 23 53 Skype: tmlocher; LinkedIn Member

-- Thomas Locher Felsenburgstrasse 7 8712 Stäfa Schweiz/Switzerland

+41 79 175 23 53 Skype: tmlocher; LinkedIn Member

gmarcy commented 2 years ago

That would explain some of the confusion, since I haven't been able to reproduce this with a similar backend configuration. So it's not a problem with the machines you are deploying to, but with the Ansible controller where you are running the playbook to perform the deployment. One thing that you could try is podman build -t kubeinit/kubeinit . on the controller to build a container, and then run the cluster deployment from that container.
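A rough sketch of that flow, run from the root of the cloned kubeinit repo. The SSH key mount and the playbook arguments are assumptions here (they are not confirmed in this thread, and they assume the image entrypoint wraps ansible-playbook), so adjust them to match your usual invocation:

podman build -t kubeinit/kubeinit .

podman run --rm -it \
    -v ~/.ssh/id_rsa:/root/.ssh/id_rsa:z \
    kubeinit/kubeinit \
        --user root \
        -e kubeinit_spec=<your-spec> \
        -i ./kubeinit/inventory \
        ./kubeinit/playbook.yml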

github-actions[bot] commented 2 years ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days