cgruver / okd4-upi-lab-setup

Building an OKD 4 Home Lab
GNU General Public License v3.0

Wow! Just a starter question. #4

Closed: disposab1e closed this issue 4 years ago

disposab1e commented 4 years ago

I'm on my way to building my own lab...

First of all! Great work!

I will start with a linux root server I have rented at a public hosting provider.

The specs: AMD Ryzen™ 9 3900 12-core with Simultaneous Multithreading, 2 x 1.92 TB NVMe SSD (Datacenter Edition), 128 GB DDR4 ECC RAM, CentOS 7.8

Because this server faces the internet, only an SSH port is open to the outside world, and I'm not willing to open any other ports :-)

Do you think it makes sense to have the bastion host (DHCP, LB, DNS, PXE, TFTP, RPMs) together with 3x control plane and 3x worker nodes as libvirt guests on the same host?

cgruver commented 4 years ago

Yes, you should be fine. 12 SMT cores == 24 vCPUs. You should not have an issue with oversubscription.

You will need to modify the provided guest_inventory file to reflect that everything is running on the same kvm-host.

./Provisioning/guest_inventory/okd4_lab

# KVM_HOST_NODE,GUEST_HOSTNAME,MEMORY,CPU,ROOT_VOL,DATA_VOL,NUM_OF_NICS,ROLE,VBMC_PORT
kvm-host01,okd4-bootstrap,12288,4,50,0,1,BOOTSTRAP,6229
kvm-host01,okd4-master-0,16384,4,100,0,1,MASTER,6230
kvm-host01,okd4-master-1,16384,4,100,0,1,MASTER,6231
kvm-host01,okd4-master-2,16384,4,100,0,1,MASTER,6232
kvm-host01,okd4-worker-0,16384,4,100,0,1,WORKER,6233
kvm-host01,okd4-worker-1,16384,4,100,0,1,WORKER,6234
kvm-host01,okd4-worker-2,16384,4,100,0,1,WORKER,6235

Replace kvm-host01 with the hostname of your rented server. I also adjusted the amount of RAM to accommodate your specs and leave plenty of RAM for the bastion host and HA Proxy host.
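Something like this would do it (assuming your server's hostname is my-kvm-host — substitute your own):

sed -i 's/kvm-host01/my-kvm-host/g' ./Provisioning/guest_inventory/okd4_lab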

I recently made a significant refactor of these instructions, so there are probably some bugs or missing steps. Let me know about any issues you encounter and I'll make corrections to the documentation.

Have fun!

cgruver commented 4 years ago

I've also created a similar guide for setting up a single-node cluster. It is similar to CodeReady Containers, but this one is installed more like a real multi-node cluster and you can access it from your local network. https://github.com/cgruver/okd4-single-node-cluster

disposab1e commented 4 years ago

Great! Thank you very much for your detailed reply! I will walk through all of your very detailed guides! Again... great work! I like the picture of your home lab :-)

I will give you feedback!

Yes, I saw your guide about the single-node cluster. I personally struggled building my own crc with snc (not with your guide), but I think I will give it a try later on.

I was reading your work on Tekton pipelines. I have some experience with Tekton on OpenShift/Kubernetes, especially in combination with Argo CD and Sealed Secrets. The last few days I was experimenting with crc and Red Hat Advanced Cluster Management, managing Minishift and Minikube. I'm happy to share what I know if this is of any interest to you.

disposab1e commented 4 years ago

Playing around with my thoughts... I don't think it's a good idea to bind all services to the host IP facing the internet. Since I'm using firewalld, I think this will be challenging. Isn't it better from a security perspective to implement the bastion host as a KVM guest too?

What do you think?

cgruver commented 4 years ago

Yes. I would let the machine that you rented just be a kvm-host.

Your guest VMs will be:

bastion
okd4-lb01 (HA Proxy load balancer)
okd4-bootstrap (temporary)
okd4-master-0
okd4-master-1
okd4-master-2
okd4-worker-0
okd4-worker-1
okd4-worker-2

The bastion server only needs 4-8 GB of RAM; the HA Proxy server would be fine with 1-2 GB of RAM.

Give the okd4 master and worker nodes 16GB each.

disposab1e commented 4 years ago

@cgruver thx again!! Is there any reason why you don't go with Terraform and the KVM provider?

cgruver commented 4 years ago

Not really, just using the tools in my box that I'm most familiar with. ;-)

The OpenShift installer uses terraform when building the nodes. For me to build out the underlying infrastructure, it was just easier to go with plain old Libvirt.

disposab1e commented 4 years ago

This was my guess :-) Terraform is also not in my toolbox, so I will give it a try to provision the bastion VM: starting with packer.io for kickstart and handing over to Terraform. Works like a charm so far.

disposab1e commented 4 years ago

Problems with iPXE... why do you start tftp with xinetd? I have problems downloading the kickstart files via tftp. vmlinuz and initrd.img download without problems, but then there's no chance to download ks.cfg and LiveOS/squashfs.img. I'm going the tftp way, not the http way like you... any ideas?

I can tftp these files from the bastion host without any problem, but from a different host, no chance :-(

The firewall seems OK, SELinux is set to permissive, and up/down is allowed.

cgruver commented 4 years ago

I used xinetd just because that's how I set it up the last time I used it. Definitely open to a better way. As long as your files are under /var/lib/tftpboot/... with 644 permissions, you should not have any issues serving with tftp. Maybe double check your file permissions?
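For reference, the xinetd service definition for tftp looks something like this (from memory, based on the stock CentOS 7 /etc/xinetd.d/tftp, so double check against your own file):

service tftp
{
        socket_type     = dgram
        protocol        = udp
        wait            = yes
        user            = root
        server          = /usr/sbin/in.tftpd
        server_args     = -s /var/lib/tftpboot
        disable         = no
}

Also make sure firewalld allows tftp on the serving host, e.g. firewall-cmd --add-service=tftp --permanent && firewall-cmd --reload.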

Honestly, I'm using the OpenWRT router that I mentioned in the docs. It was a breeze to set up. I wrote the TFTP/DHCP instructions for a friend of mine who does not have an OpenWRT router.

He is in the middle of building his own lab as well, so I'm fixing issues with the documentation as he completes his build. I really appreciate any feedback on the docs, I'm sure there are still some issues.

disposab1e commented 4 years ago

Bootstrap phase:
bootstrap coming up: fine!
masters coming up: fine!
workers coming up: first getting api-int with "Internal Server Error", then switching to "Network is unreachable" :-( They completely lose their network configuration... no ping possible.
Forced the worker VMs off and started them again... the network is back, but no worker nodes are provisioned. The masters come up fine and ready... no workers to be seen :-(
Yes: the bootstrap is removed from haproxy, with both the latest and the previous beta.

cgruver commented 4 years ago

I have a theory.

Are the masters still up?

If so, start the workers. Wait for them to start trying to join the cluster.

Run oc get csr

If you see CSRs with a Pending state, that is the problem. You will need to approve the CSRs for the workers.

oc get csr -ojson | jq -r '.items[] | select(.status == {} ) | .metadata.name' | xargs oc adm certificate approve

You may have to run this several times to approve all of the CSRs for the workers.

After completing that, you need to follow the instructions here: https://github.com/cgruver/okd4-upi-lab-setup/blob/master/docs/pages/InfraNodes.md

Otherwise, your Ingress routers may end up on Worker nodes and you will lose connectivity to your cluster.
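In a nutshell (the linked doc has the real steps), the idea is to label the nodes that should carry the infrastructure workloads and then pin the default IngressController to that label. Roughly like this, with made-up node names:

oc label node okd4-infra-node.my.domain.org node-role.kubernetes.io/infra=""
oc patch ingresscontroller default -n openshift-ingress-operator --type merge -p '{"spec":{"nodePlacement":{"nodeSelector":{"matchLabels":{"node-role.kubernetes.io/infra":""}}}}}'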

disposab1e commented 4 years ago

Thx for the reply!! I did this one time... will repeat! :-)

disposab1e commented 4 years ago

btw... tftp with xinetd is totally ok! I will give more feedback when I reach the final state.

disposab1e commented 4 years ago

But why do the workers completely forget their network configuration? I've never seen that before on Linux :-)

disposab1e commented 4 years ago

One thing with lb node:

You can set selinux to permissive in the Kickstart file:

selinux --permissive

This is all I did to get haproxy running.

cgruver commented 4 years ago

I have not seen the workers forget their network config. Which version of Fedora CoreOS are you using? There was an issue in older releases.

cgruver commented 4 years ago

Hey,

selinux for the LB node should have been configured by the firstboot script.

setenforce 0                                                      # temporarily put SELinux into permissive mode
systemctl start haproxy
systemctl enable haproxy
grep haproxy /var/log/audit/audit.log | audit2allow -M haproxy    # build a local policy module from the logged denials
semodule -i haproxy.pp                                            # install the module so haproxy works with SELinux enforcing

Did that not work for you?

disposab1e commented 4 years ago

Yes, it was working. I saw your solution... but why not go with the shortcut in Kickstart?

disposab1e commented 4 years ago

I use the latest from the stable stream: 31.20200505.3.0. What do you use?

cgruver commented 4 years ago

Ah, good question on selinux. The solution that I put in the firstboot leaves selinux in an enforcing mode. It just configures it to allow HA Proxy to use ports that are otherwise restricted.

I don't want selinux to be permanently permissive.

The reason that I disable selinux between the kickstart and the firstboot is so that the firstboot script can run as root. Then I put selinux back into enforcing mode.

I did it by manipulating the config file vs. CLI. CLI would work as well. I just wanted to ensure that the original selinux configuration is restored after the firstboot process completes.
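In other words, the flow is roughly this pattern (just a sketch of the idea, not the literal firstboot script):

# during kickstart, SELinux is relaxed so the firstboot script can do its work
# at the end of firstboot, the enforcing configuration is restored:
sed -i 's/^SELINUX=.*/SELINUX=enforcing/' /etc/selinux/config
setenforce 1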

disposab1e commented 4 years ago

This again was my guess :-) It's cleaner, I totally agree! For my first attempt I tried to remove as much complexity as possible to get started, since most of this is not really in my toolbox yet :-) but I'm learning :-)

disposab1e commented 4 years ago

A few minutes before new install... I will see with csr...

disposab1e commented 4 years ago

One question... is it possible to first start only the bootstrap and master VMs... and then, once the bootstrap is deleted, the workers?

cgruver commented 4 years ago

Yes! That is actually how I deploy these clusters.

I have been meaning to modify the documentation to reflect that.

It is a smoother install.

disposab1e commented 4 years ago

One step further:

start: bootstrap
start: masters
wait for bootstrapping to complete
remove bootstrap from haproxy
start: workers
monitor and approve CSRs :-) approve, approve...

now... 2 workers ready... one worker missing :-)

disposab1e commented 4 years ago

Restarted the missing worker... whoosh, a new CSR... quickly approved... worker saying hello to me :-) Cluster up!

BUT!!! I have one worker node with the name "localhost"... oops...

disposab1e commented 4 years ago

The localhost name appeared after restarting the missing worker.

disposab1e commented 4 years ago

Now a good question... how do I automate these CSR approvals?? :-))

disposab1e commented 4 years ago

In 3.11 I know there was a possibility to configure the cluster for automated approval of CSRs... is there no way in 4.4?

disposab1e commented 4 years ago

I tried this but without luck... manual approval is still needed.

disposab1e commented 4 years ago

Now I was quick enough... together with your hint regarding routers all is fine :-)

Now I will look how to automate....

disposab1e commented 4 years ago

Do you use ceph for your registry?

disposab1e commented 4 years ago

Better, and without the need for jq:

oc get csr -o go-template='{{range .items}}{{if not .status}}{{.metadata.name}}{{"\n"}}{{end}}{{end}}' | xargs oc adm certificate approve
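And to automate it while the workers join, I can just wrap that in a loop (quick sketch, run from a shell logged in as cluster-admin):

# approve any pending CSRs every 30 seconds until stopped with Ctrl-C
while true; do
  oc get csr -o go-template='{{range .items}}{{if not .status}}{{.metadata.name}}{{"\n"}}{{end}}{{end}}' | xargs --no-run-if-empty oc adm certificate approve
  sleep 30
done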

cgruver commented 4 years ago

The manual approval of CSRs is an artifact of UPI installations. If it's IPI (Installer Provisioned Infrastructure), then it is automated, because the installer kicked off the build and therefore trusts the machine trying to join the cluster.

The manual approval is to prevent rogue machines from joining your cluster.

I do use Ceph for my registry.

Appreciate the csr command without jq! nice!

cgruver commented 4 years ago

Hey! The localhost thing is an issue with FCOS. It's fixed in the testing stream, but may not be in stable yet. You can fix it by doing this across your whole cluster:

for i in 0 1 2 ; do ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null core@okd4-master-${i}.${LAB_DOMAIN} "sudo hostnamectl set-hostname okd4-master-${i}.my.domain.org && sudo shutdown -r now"; done
for i in 0 1 2 ; do ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null core@okd4-worker-${i}.${LAB_DOMAIN} "sudo hostnamectl set-hostname okd4-worker-${i}.my.domain.org && sudo shutdown -r now"; done

Substitute appropriate variables and hostnames for your cluster.

disposab1e commented 4 years ago

So if I understand you correctly... with UPI (our method) we HAVE to manually approve... correct?

disposab1e commented 4 years ago

cool! thx! You're so helpful! Hope I can give something back in the future!

disposab1e commented 4 years ago

Wrong Link to Ceph at the bottom of this page

cgruver commented 4 years ago

Oops... fixed it.

cgruver commented 4 years ago

Hey, you should join this Slack channel to get help from folks who know a LOT more than me.

https://kubernetes.slack.com/archives/C6AD6JM17

You can also join us at the OKD working group:

https://github.com/openshift/okd

disposab1e commented 4 years ago

@slack joined :-)

cgruver commented 4 years ago

Let me know if you run into any other issues.

Good luck!