burmilla / os

Tiny Linux distro that runs the entire OS as Docker containers
https://burmillaos.org
Apache License 2.0

VMware disk won't properly come online in the OS #149

Closed ArgonV closed 1 year ago

ArgonV commented 1 year ago

BurmillaOS Version: (ros os version)

v1.9.6

Where are you running BurmillaOS? (docker-machine, AWS, GCE, baremetal, etc.)

VMware vSphere datacenter

Which processor architecture you are using?

Intel Xeon

Do you use some extra hardware? (GPU, etc)?

No

Which console you use (default, ubuntu, centos, etc..)

Default

Do you use some service(s) which are not enabled by default?

No

Have you installed some extra tools to console?

VMware Tools

Do you use some other customizations?

Network config DHCP on boot

I am using the VMware ISO and cannot get the disk drive to properly come up. The VM in vSphere shows that the data disk is attached to the VM, but when I go to the console for BurmillaOS and do a df -h, I don't see the 20GB disk. I am using a node template to boot-strap the node into VMware. The networking, CPU and memory config options are properly being set for the VM.

olljanat commented 1 year ago

So you have multiple disks connected to the VM? Please share your cloud-init as the issue template asks, because data disks of course do not work without proper configuration: https://burmillaos.org/docs/storage/additional-mounts/

You might also want to check the real-world examples in https://github.com/burmilla/os/issues/6

ArgonV commented 1 year ago

I just have the one hard disk, along with the ISO mounted as a CD/DVD drive.

Apologies, I missed the cloud-init part. There is no way to copy text from the console, so here is a screenshot:

cloud-init

And as a reference, here is my cloud-init from a RancherOS host with the disk drive properly mounted:

cloud-init-ros

Notice at the end there are the disks, however I am not adding those to my cloud-init. They are supposed to be getting that from the Rancher boot-strap process in the node template. So my guess is the STATE is not being picked up somehow?

olljanat commented 1 year ago

Did you try to do the installation? If I remember right, df -h only prints info about mounted volumes, not about empty disks.
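
To illustrate the point about df: it reports only mounted filesystems, so a disk that is attached but unformatted or unmounted shows up in fdisk -l and not in df -h. A minimal sketch of that check follows. The helper name is_mounted and its optional second argument (a mounts table to read instead of /proc/mounts) are made up for illustration and are not part of BurmillaOS.

```shell
# is_mounted DEVICE [MOUNTS_FILE]
# Succeeds if DEVICE appears as a mount source in the mounts table.
# MOUNTS_FILE defaults to /proc/mounts; the override exists only so the
# check can be tried against a sample file. (Hypothetical helper.)
is_mounted() {
    dev=$1
    mounts=${2:-/proc/mounts}
    # The first field of every line in the mounts table is the source device.
    awk -v d="$dev" '$1 == d { found = 1 } END { exit !found }' "$mounts"
}

# Example on a live system (commented out because it needs the real device):
# is_mounted /dev/sda && echo "mounted" || echo "not mounted, check fdisk -l"
```

If the device is listed by fdisk -l but is_mounted fails, the disk is present and simply has no mounted filesystem yet.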

ArgonV commented 1 year ago

I have been trying to get the disk just to show up, formatted and mounted at this point with no success, using cloud-init yaml at a url:

#cloud-config
mounts:
- ["/dev/sda", "/mnt/test", "ext4", ""]
rancher:
  sysctl:
    vm.max_map_count: 262144
  state:
    autoformat:
    - /dev/sda
    - /dev/vda

I can see that /mnt/test is there, but it is not persistent. fdisk -l shows me the device, but df -h doesn't show that it's formatted.

ArgonV commented 1 year ago

If I run mkfs.ext4 /dev/sda manually and reboot, I see that it's there. So why isn't auto-formatting working on the initial boot with my cloud-config above?

olljanat commented 1 year ago

RancherOS shipped a huge number of ready-made installation media. It sounds like you have been using their rancheros-vmware-autoformat.iso version.

Here we purposely keep the number of media to a minimum, based on the feedback we got in #6, and the autoformat media are among those that were dropped. You can of course still add your use case there, and if we find others who need it we can consider re-adding it.

The way it works now is that if you want to automate installation on VMware, you can use the guestinfo field cloud-init.config.data for that. More about this in https://burmillaos.org/docs/installation/cloud/vmware-esxi/

And here is a real-world example of how to configure it with Terraform:

guestinfo.cloud-init.config.data = <<EOD
#!/bin/bash
(cat << EOF
#cloud-init
runcmd:
- ["mount", "-t", "ext4", "/dev/sdb", "/var/lib/docker"]
rancher:
  sysctl:
    vm.max_map_count: 262144
ssh_authorized_keys:
  - ${var.rancher_public_key}
EOF
)> cloud-init.yml
if ! blkid | grep -q "RANCHER_STATE"; then
 sudo ros install -d /dev/sda --no-reboot -c cloud-init.yml
 if ! blkid | grep -q "USER_DOCKER"; then
  sudo mkfs.ext4 /dev/sdb -L USER_DOCKER
 fi
 sudo reboot
else
 echo "already installed"
fi
EOD

Alternatively, you can create a VMware template by doing the installation like this:

#!/bin/bash
echo "Installing to disk" > /dev/tty1
ros install -f -d /dev/sda --no-reboot --debug --append "console=tty1 console=ttyS0,115200n8 printk.devkmsg=on rancher.autologin=ttyS0"
halt -P

Then mark that first VM as a template and create the other VMs from it.
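
The `blkid | grep -q "RANCHER_STATE"` guard in the Terraform example is what makes that script safe to run on every boot: the install only happens while no filesystem labelled RANCHER_STATE exists. A slightly stricter version of the guard might look like the sketch below. The helper names has_label and install_needed are hypothetical, and the sketch assumes blkid prints the usual `LABEL="..."` fields.

```shell
# has_label LABEL
# Reads blkid-style output on stdin and succeeds only if some device
# carries LABEL="<LABEL>". The leading space keeps PARTLABEL="..." from
# matching. (Hypothetical helper, not part of ros or blkid.)
has_label() {
    grep -q " LABEL=\"$1\""
}

# install_needed
# Succeeds when the blkid output on stdin shows no RANCHER_STATE filesystem.
install_needed() {
    ! has_label RANCHER_STATE
}

# On a real node this would be driven by blkid, for example:
# if blkid | install_needed; then echo "would install"; else echo "already installed"; fi
```

Reading from stdin keeps the check easy to try against saved blkid output before running it on a node.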

ArgonV commented 1 year ago

Thank you for the update.

I don't currently utilize Terraform, so I'm trying to pass that in via guestinfo.cloud-init.config.data and guestinfo.cloud-init.data.encoding with no luck so far:

Screen Shot 2023-01-17 at 11 05 07 AM

olljanat commented 1 year ago

Cloud-init should write its log to /var/log. The syntax you are using looks correct, provided that it is a valid base64 string and its content is valid (a correctly formed script, using LF instead of CRLF, etc.).
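
For anyone who prefers to produce the base64 string locally rather than through a website, a minimal sketch follows. It assumes only the standard tr and base64 tools; the function name encode_userdata is made up for illustration.

```shell
# encode_userdata FILE
# Prints FILE as a single-line base64 string after stripping carriage
# returns, so a script saved with CRLF endings still reaches the guest
# with LF endings. (Hypothetical helper; only tr and base64 are used.)
encode_userdata() {
    tr -d '\r' < "$1" | base64 | tr -d '\n'
}

# Example: encode_userdata install.sh
# The result would go into guestinfo.cloud-init.config.data, with
# guestinfo.cloud-init.data.encoding set to base64.
```

Decoding the output again with base64 -d is a quick way to confirm that the script survived the round trip unchanged.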

ArgonV commented 1 year ago

I used your script above, everything after guestinfo.cloud-init.config.data = and pasted it into https://www.base64encode.org with the LF option set.

Not seeing a cloud init logfile, is it named something non-obvious?

ArgonV commented 1 year ago

Ah, found it under the boot directory in /var/log, named cloud-init-execute/save.log. Checking those...

ArgonV commented 1 year ago

Here we go: Screen Shot 2023-01-17 at 12 17 52 PM

"Unrecognized user-data" Trying to see what's up there.

olljanat commented 1 year ago

You need to skip those "EOD" lines. They are just Terraform syntax for defining a multi-line string.
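
If the Terraform snippet has already been copied into a file, the wrapper lines can also be dropped mechanically. This is only a sketch: strip_heredoc_markers is a made-up name, and it assumes the marker is spelled EOD exactly as in the example above.

```shell
# strip_heredoc_markers
# Filters stdin, dropping any line that ends with "<<EOD" (the Terraform
# assignment line) and any line that is exactly "EOD" (the terminator).
# Lines containing EOF are left alone because they belong to the script.
strip_heredoc_markers() {
    sed -e '/<<EOD[[:space:]]*$/d' -e '/^EOD[[:space:]]*$/d'
}

# Example: strip_heredoc_markers < terraform-snippet.txt > install.sh
```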

ArgonV commented 1 year ago

Ah I see, thanks. So my config is:

#!/bin/bash
(cat << EOF
#cloud-init
runcmd:
- ["mount", "-t", "ext4", "/dev/sda", "/var/lib/docker"]
rancher:
  sysctl:
    vm.max_map_count: 262144
EOF) > cloud-init.yml
if ! blkid | grep -q "RANCHER_STATE"; then
 sudo ros install -d /dev/sda --no-reboot -c cloud-init.yml
 if ! blkid | grep -q "USER_DOCKER"; then
  sudo mkfs.ext4 /dev/vda -L USER_DOCKER
 fi
 sudo reboot
else
 echo "already installed"
fi

That got me past the "Unrecognized user-data" error. Now at the end of the cloud-init-save.log file I see this error msg: "Failed to run command [wpa_cli term

olljanat commented 1 year ago

wpa_cli is related to WLAN configuration. Should be safe to ignore in VMware.

ArgonV commented 1 year ago

Thanks, sadly I'm still not seeing the drive. Do I need to keep my initial cloud-init yaml at a URL also?

olljanat commented 1 year ago

Two things to check.

  1. Your cloud-init is invalid because you cannot use the same disk as both the mount and the install target. For simplicity, use this instead:
    #!/bin/bash
    (cat << EOF
    #cloud-init
    rancher:
      sysctl:
        vm.max_map_count: 262144
    EOF
    ) > cloud-init.yml
    if ! blkid | grep -q "RANCHER_STATE"; then
      sudo ros install -d /dev/sda --no-reboot -c cloud-init.yml
      sudo reboot
    else
      echo "already installed"
    fi
  2. Make sure the boot order on the VM is set so that it boots from the hard disk first, because only the first boot should happen from the ISO file.

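
One shell detail matters for scripts like the ones in this thread: a heredoc only ends at a line that consists of the terminator alone, so writing `EOF) > cloud-init.yml` on a single line does not close it. The closing parenthesis and redirect need their own line, or the redirect can sit on the cat line. A minimal sketch, with write_config as a made-up helper name and the sysctl value taken from the configs above:

```shell
# write_config PATH
# Writes a small cloud-config file. The redirect sits on the cat line and
# the terminator EOF stands alone on its line, so the heredoc ends there.
# Quoting 'EOF' stops the shell from expanding $variables in the body.
write_config() {
    cat > "$1" << 'EOF'
#cloud-config
rancher:
  sysctl:
    vm.max_map_count: 262144
EOF
}

# Example: write_config cloud-init.yml
```

Running bash -n on the finished script before encoding it should catch an unterminated heredoc early.
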
ArgonV commented 1 year ago

Thanks for all of your help, I've not had to config the VM boot order in the past. Usually with the VMware autoformat feature, it boots the OS from the ISO, does the install and from then on the ISO still needs to be attached to boot - but the overlay (/var/lib/docker) persists on the attached vmdk disk.

I tried the above, and still do not see it mounting. But fdisk -l still lists it at /dev/sda (it's a 20GB disk)

I'm wondering if I need to add in the format command (mkfs.ext4 /dev/sda) to the above code, before the ros install line?

ArgonV commented 1 year ago

Using this cloud-config set to run on boot now:

#cloud-config
runcmd:
- ["sudo", "mkfs.ext4", "/dev/sda"]
- ["sudo", "mount", "-t", "ext4", "/dev/sda", "/var/lib/docker"]
- ["sudo", "ros", "install", "-d", "/dev/sda", "--no-reboot", "-c", "cloud-init.yml"]
- ["sudo", "reboot"]
rancher:
  sysctl:
    vm.max_map_count: 2621444

The disk formats and it mounted at /var/lib/docker! However after some time when I try a docker stats or df -h I see this error:

cannot read table of mounted file systems: No such file or directory

Should I try a mount point somewhere else? Perhaps just the overlay2 subfolder? As that's what's filling up...

ArgonV commented 1 year ago

Earlier you said: Your cloud-init is invalid as you cannot use same disk as mount and install target.

So I tried this cloud-config, with no luck:

#cloud-config
runcmd:
- ["sudo", "mkfs.ext4", "/dev/sda"]
- ["sudo", "ros", "install", "-d", "/dev/sda", "--no-reboot", "-c", "cloud-init.yml"]
- ["sudo", "reboot"]
rancher:
  sysctl:
    vm.max_map_count: 2621444

It just boots as normal, without installing or restarting.

ArgonV commented 1 year ago

Ah, I had the path wrong to the cloud file and so now it installs using this:

#cloud-config
runcmd:
- ["sudo", "mkfs.ext4", "/dev/sda"]
- ["sudo", "ros", "install", "-d", "/dev/sda", "--no-reboot", "-c", "/var/lib/rancher/conf/cloud-config.yml"]
- ["sudo", "reboot"]
rancher:
  sysctl:
    vm.max_map_count: 2621444

This lives in a cloud-config.yaml file at a URL that Rancher passes along via the cloud-init URL in the node template. Sadly, I've lost automatic console login, and I'm using the SSH keys that Rancher dynamically generates. Is there any way to re-enable automatic console login?

olljanat commented 1 year ago

Did I understand correctly that you are still using Rancher Server 1.6? (That would have been useful info earlier, as it does things its own way.)

Then you might want to use something like this https://github.com/rancher/os/issues/723#issue-125226171

ArgonV commented 1 year ago

Oh no, I'm on Rancher Server 2.5.x, and 2.6.x. Does that make a difference here?

olljanat commented 1 year ago

> Oh no, I'm on Rancher Server 2.5.x, and 2.6.x.

I'm quite sure that only the 1.x versions were called Rancher Server. The 2.x versions are called just Rancher (or, as the latest documentation appears to say, Rancher Manager).

> Does that make a difference here?

Yes. Rancher 2.4.18 was the last version which supported RancherOS; see https://www.suse.com/suse-rancher/support-matrix/all-supported-versions/rancher-v2-4-18/ and https://www.suse.com/suse-rancher/support-matrix/all-supported-versions/rancher-v2-5-0/. Because BurmillaOS is based on RancherOS, that means they don't support us, and to honor that decision we do not support Rancher.

In addition, Rancher is a Kubernetes cluster management tool and we do not support Kubernetes at all (see #47).

So if you want to use Rancher, it is highly recommended to use one of the Linux distributions which they support.

ArgonV commented 1 year ago

Ah, I have been running RancherOS for Rancher downstream K8s cluster nodes in v2.5.x and 2.6.x for some time now. I do have paid support on our Prod environment and they (SUSE/Rancher) do honor that deployment. For the actual Rancher "Server" VM's OS I am using Oracle Linux 7 however (here I am talking about the pane of glass that is Rancher, and not the RancherOS or Rancher K8s cluster).

Re: K8s support for BurmillaOS - In our environment, Rancher Server Kubernetes Engine just deploys Docker containers on the downstream cluster nodes in the user docker (not system docker) space. So really there is nothing special with K8s going on here? The cluster provisioner/node provisioner uses boot-to-docker on the VM (ISO), and the Docker engine to bootstrap the VM, deploy the RKE Docker containers that run Kubernetes, and bring it into the downstream cluster with what ever role I assign it on the cluster template. Am I missing something here that makes this not a standard use-case for BurmillaOS? It's all just running Docker containers on the node VM.

olljanat commented 1 year ago

> Am I missing something here that makes this not a standard use-case for BurmillaOS? It's all just running Docker containers on the node VM.

What works and what is supported are two different things. We do not test new BurmillaOS versions with Rancher, which is why, for example, we don't release those autoformat ISO files.

> Ah, I have been running RancherOS for Rancher downstream K8s cluster nodes in v2.5.x and 2.6.x for some time now. I do have paid support on our Prod environment and they (SUSE/Rancher) do honor that deployment.

That is interesting. Perhaps you should ask them how we can get BurmillaOS listed as a supported OS in the RKE1 list? (RKE2 does not use Docker at all, so we cannot support it without bigger changes.) If they are willing to do that, then I'm ready to add RKE1 to our testing set.

ArgonV commented 1 year ago

I will indeed ask SUSE/Rancher that, thanks for all of your help and feedback. This has been most helpful in my quest to find a decent RancherOS replacement for RKE1 clusters, without me having to maintain my own VM templates and updates, or distro.

I am hoping the autoformat feature for VMware may be included back in BurmillaOS - I think the use-case is small, but might help along in bringing BurmillaOS to the Rancher sphere of consideration.

olljanat commented 1 year ago

Most likely Longhorn will be the hardest part to get working on BurmillaOS. Found two related issues https://github.com/longhorn/longhorn/issues/828 and https://github.com/longhorn/longhorn/issues/3744

However, it should be a little bit easier than on RancherOS because we switched to a Debian-based console and included open-iscsi by default https://github.com/burmilla/os/issues/9

olljanat commented 1 year ago

I assume that there is no good news from Suse because of all this silence so closing.

ArgonV commented 1 year ago

> I assume that there is no good news from Suse because of all this silence so closing.

Sadly nothing. Apologies!

olljanat commented 1 year ago

No worries. Are you planning to keep using BurmillaOS in the future? If so, it would be a good idea to test that your use case still works on v2.0.0-rc1.

ArgonV commented 1 year ago

Yes I currently have 2 K8s clusters (one in testing and one in pre-production) that are provisioned via Rancher Server that I'm using to test various deployments.

Rancher/SUSE just announced Elemental - but they don't have any pre-built ISOs for vSphere yet.