flatcar / Flatcar

Flatcar project repository for issue tracking, project documentation, etc.
https://www.flatcar.org/
Apache License 2.0

Beta 3185.1.0: ignition fails to create partition on second disk (vmware) #729

Open defo89 opened 2 years ago

defo89 commented 2 years ago

Description

With Beta 3185.1.0 and Ignition v3 we observe failures when a vSphere VM has more than one disk.

Impact

Cannot deploy VM.

Environment and steps to reproduce

  1. Set-up: Flatcar VM deployed in vSphere 7 using terraform-provider-vsphere v2.0.2
  2. Task: Deploy Flatcar Beta 3185.1.0 OVA using Ignition v3 spec file (as vapp)
  3. Error: Ignition fails with: create partitions failed: Failed to pretend to create partitions: exit status 4. Stderr: Could not create partition 1 from 4194304 to 20975714303. Sometimes Ignition fails without any error message. In both cases the emergency shell cannot be entered because the VM is stuck in a reboot loop.

[Screenshot: ignition-v3-disk-error]

Expected behavior

The VM deploys successfully, as it does with the Flatcar Stable 3139.2.0 OVA and an Ignition v2 spec file.

Additional information

To narrow the problem down to the Beta release, the same Ignition JSON is used in both cases; only the few lines that differ between the v2 and v3 spec were edited. Both files are attached to the issue.

VM config to reproduce:

provider "vsphere" {
  user                 = "user"
  password             = var.password
  vsphere_server       = "vc-server-url"
  persist_session      = true
  client_debug         = true
}

data "vsphere_datacenter" "dc" {
  name = "DC"
}

data "vsphere_datastore_cluster" "datastore" {
  name          = "datastore"
  datacenter_id = "${data.vsphere_datacenter.dc.id}"
}

data "vsphere_compute_cluster" "cluster" {
  name          = "cluster"
  datacenter_id = "${data.vsphere_datacenter.dc.id}"
}

data "vsphere_virtual_machine" "template" {
  name          = "flatcar_production_vmware_beta"
  datacenter_id = "${data.vsphere_datacenter.dc.id}"
}

data "vsphere_network" "network" {
  name          = "network"
  datacenter_id = "${data.vsphere_datacenter.dc.id}"
}

data "local_file" "ignitions" {
  filename = "ignition.json"
}

resource "vsphere_virtual_machine" "vm" {
  name             = "beta-ignition-v3"
  resource_pool_id = "${data.vsphere_compute_cluster.cluster.resource_pool_id}"
  datastore_cluster_id = "${data.vsphere_datastore_cluster.datastore.id}"

  num_cpus = 2
  memory   = 1024
  guest_id = "${data.vsphere_virtual_machine.template.guest_id}"
  scsi_type = "${data.vsphere_virtual_machine.template.scsi_type}"

  network_interface {
    network_id   = "${data.vsphere_network.network.id}"
    adapter_type = "${data.vsphere_virtual_machine.template.network_interface_types[0]}"
  }

  disk {
    label            = "disk0"
    size             = "64"
    unit_number      = "0"
    eagerly_scrub    = false
    thin_provisioned = true
  }

  disk {
    label            = "disk1"
    size             = "64"
    unit_number      = "1"
    eagerly_scrub    = false
    thin_provisioned = true
  }

  clone {
    template_uuid = "${data.vsphere_virtual_machine.template.id}"
  }

  vapp {
    properties = {
      "guestinfo.ignition.config.data"          = base64gzip(data.local_file.ignitions.content)
      "guestinfo.ignition.config.data.encoding" = "gz+base64"
    }
  }
}
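
When debugging, it can help to check what the VM actually receives in guestinfo.ignition.config.data. Terraform's base64gzip gzip-compresses a string and then Base64-encodes it, which is what the "gz+base64" encoding value announces. Below is a minimal Python sketch of that round trip. The sample config is a hypothetical placeholder, not the ignition.json attached to this issue, and the helper names are made up for illustration.

```python
import base64
import gzip
import json


def encode_ignition(config_text: str) -> str:
    """Gzip-compress the config, then Base64-encode it ("gz+base64")."""
    compressed = gzip.compress(config_text.encode("utf-8"))
    return base64.b64encode(compressed).decode("ascii")


def decode_ignition(encoded: str) -> str:
    """Reverse the encoding to inspect the config that was passed in."""
    return gzip.decompress(base64.b64decode(encoded)).decode("utf-8")


# Hypothetical minimal config standing in for ignition.json.
sample = json.dumps({"ignition": {"version": "3.3.0"}})
encoded = encode_ignition(sample)

# The round trip should give back exactly the text that went in.
print(decode_ignition(encoded) == sample)
```

Pasting the value of the vApp property into decode_ignition shows the exact JSON Ignition will parse, which makes it easier to spot wrong partition sizes or device paths before booting.
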
defo89 commented 2 years ago

Ignition file for Flatcar Beta 3185.1.0 (failing) ignition-v3-example.json.txt

Ignition file for Flatcar Stable 3139.2.0 (working) ignition-v2-example.json.txt

defo89 commented 2 years ago

Hi @pothos, I have stumbled across your PR https://github.com/coreos/ignition/pull/1319 which is not merged yet and is planned for coreos/ignition release 2.14.0. I was wondering if this could be related. Although I am not sure if Flatcar Beta 3185.1.0 (ignition 2.13.0) is already using the updated code.

jepio commented 2 years ago

What's the value of data.vsphere_virtual_machine.template.scsi_type? Can you paste the yaml you use to create the ignition json (both for v2 and v3)?

defo89 commented 2 years ago

Ignition v3 file (sorry, I had to add .txt to be able to upload it): ignition.tf.txt. I am using this provider to create the v3 spec file: https://github.com/community-terraform-providers/terraform-provider-ignition

To avoid messing with v2, I just edit the v3 file to turn it into a v2 file.

And for scsi_type:

output "template" {
  value = data.vsphere_virtual_machine.template.scsi_type
}

Outputs:
template = pvscsi

I forgot to include the device paths as they appear when the VM comes up (with the disk attached but without the ignition_disk part):

# ls -la /dev/disk/by-path
total 0
drwxr-xr-x. 2 root root 220 May  5 14:40 .
drwxr-xr-x. 9 root root 180 May  5 14:39 ..
lrwxrwxrwx. 1 root root   9 May  5 14:40 pci-0000:03:00.0-scsi-0:0:0:0 -> ../../sda
lrwxrwxrwx. 1 root root  10 May  5 14:40 pci-0000:03:00.0-scsi-0:0:0:0-part1 -> ../../sda1
lrwxrwxrwx. 1 root root  10 May  5 14:40 pci-0000:03:00.0-scsi-0:0:0:0-part2 -> ../../sda2
lrwxrwxrwx. 1 root root  10 May  5 14:40 pci-0000:03:00.0-scsi-0:0:0:0-part3 -> ../../sda3
lrwxrwxrwx. 1 root root  10 May  5 14:40 pci-0000:03:00.0-scsi-0:0:0:0-part4 -> ../../sda4
lrwxrwxrwx. 1 root root  10 May  5 14:40 pci-0000:03:00.0-scsi-0:0:0:0-part6 -> ../../sda6
lrwxrwxrwx. 1 root root  10 May  5 14:40 pci-0000:03:00.0-scsi-0:0:0:0-part7 -> ../../sda7
lrwxrwxrwx. 1 root root  10 May  5 14:40 pci-0000:03:00.0-scsi-0:0:0:0-part9 -> ../../sda9
lrwxrwxrwx. 1 root root   9 May  5 14:39 pci-0000:03:00.0-scsi-0:0:1:0 -> ../../sdb

Hope this helps.

pothos commented 2 years ago

Hi @pothos, I have stumbled across your PR coreos/ignition#1319 which is not merged yet and is planned for coreos/ignition release 2.14.0. I was wondering if this could be related. Although I am not sure if Flatcar Beta 3185.1.0 (ignition 2.13.0) is already using the updated code.

The fix is already part of our Flatcar release.

Can you try the same v2 config on 3185.1.0? It will be translated to v3 on the fly and I wonder if it could make a difference.

defo89 commented 2 years ago

Hi @pothos, I have stumbled across your PR coreos/ignition#1319 which is not merged yet and is planned for coreos/ignition release 2.14.0. I was wondering if this could be related. Although I am not sure if Flatcar Beta 3185.1.0 (ignition 2.13.0) is already using the updated code.

The fix is already part of our Flatcar release.

Can you try the same v2 config on 3185.1.0? It will be translated to v3 on the fly and I wonder if it could make a difference.

Thanks for confirming. I have tried the same v2 config JSON on 3185.1.0 and I am getting the same error.

defo89 commented 2 years ago

Just confirmed that the same happens with the latest Beta, 3227.1.0.

jepio commented 2 years ago

Hi @defo89, I looked into this. Right now the Ignition conversion does not handle Ignition spec version 2.1.0, which is why ignition-v2.json fails on newer Flatcar releases. You can make it work by manually editing it in the following way:

--- a/ignition-v2-example.json.txt
+++ b/ignition-v2-example.json.txt
@@ -2,7 +2,7 @@
     "ignition": {
         "config": {},
         "timeouts": {},
-        "version": "2.1.0"
+        "version": "2.3.0"
     },
     "passwd": {
         "users": [
@@ -15,13 +15,13 @@
     "storage": {
         "disks": [
             {
-                "device": "/dev/disk/by-path/pci-0000:00:07.0",
+                "device": "/dev/disk/by-path/pci-0000:03:00.0-scsi-0:0:1:0",
                 "partitions": [
                     {
                         "label": "etc-test",
                         "number": 1,
-                        "size": 10240000,
-                        "start": 2048,
+                        "sizeMiB": 5000,
+                        "startMiB": 1,
                         "typeGuid": ""
                     }
                 ]

The older "size" and "start" properties are expressed in sectors, which are usually 512 bytes each.
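
To make the unit change concrete, here is a small Python sketch of the sector-to-MiB conversion. It assumes 512-byte logical sectors, and the helper name is made up for illustration. Under that assumption the original v2 values (size 10240000, start 2048) correspond to 5000 MiB and 1 MiB.

```python
SECTOR_BYTES = 512   # assumed logical sector size; some disks use 4096
MIB = 1024 * 1024


def sectors_to_mib(sectors: int, sector_bytes: int = SECTOR_BYTES) -> int:
    """Convert a sector count to whole MiB, refusing inexact conversions."""
    total_bytes = sectors * sector_bytes
    if total_bytes % MIB != 0:
        raise ValueError(f"{sectors} sectors is not a whole number of MiB")
    return total_bytes // MIB


print(sectors_to_mib(10240000))  # v2 "size"  -> 5000
print(sectors_to_mib(2048))      # v2 "start" -> 1
```

Copying the sector counts into the MiB fields unchanged requests a partition 2048 times larger than intended, so the values are worth rechecking whenever a config moves from size/start to sizeMiB/startMiB.
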

As for ignition-v3.json not working: are you sure your disk is 10 TB in size? It is also possible that things are failing because the disks are getting reordered (/dev/sda swapped with /dev/sdb). Things might be better if you attach the second disk to a separate SCSI controller instead of having both disks under the same one. Actually, you're already using stable device paths, so never mind. If the v2 JSON file works after runtime conversion by Ignition, then v3.json should also work (it does in my testing).
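
The 10 TB figure can be checked against the sgdisk error from the original report ("Could not create partition 1 from 4194304 to 20975714303"). The Python sketch below assumes the two numbers are first and last sector (inclusive) and that sectors are 512 bytes. The 64 GiB disk size is taken from the Terraform config above, and the helper name is hypothetical.

```python
SECTOR_BYTES = 512   # assumed logical sector size
MIB = 1024 * 1024


def describe_range(first_sector: int, last_sector: int,
                   sector_bytes: int = SECTOR_BYTES):
    """Return (start, size) in MiB for an inclusive sector range."""
    start_mib = first_sector * sector_bytes // MIB
    size_mib = (last_sector - first_sector + 1) * sector_bytes // MIB
    return start_mib, size_mib


# Sector range copied from the error message in the issue description.
start_mib, size_mib = describe_range(4194304, 20975714303)

disk_mib = 64 * 1024  # disk1 is 64 GiB in the Terraform config

print(start_mib)            # 2048
print(size_mib)             # 10240000
print(size_mib > disk_mib)  # True: the request does not fit on the disk
```

Under these assumptions the requested partition starts at 2048 MiB and spans 10240000 MiB (roughly 10 TB), far more than the 64 GiB disk can hold, which would explain the failure.
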

defo89 commented 2 years ago

Thanks for looking at this @jepio. For now I have worked around this by switching to a single vSphere disk for the affected VMs.

~On the related note, is there an ETA for bringing ignition-v3 to stable release (in other words, when >=3185.0.0 will become stable)?~ nvm, it's now in stable

TimoKramer commented 5 months ago

Seeing this quite often when updating and replacing Flatcar with an attached durable disk:

Ignition finished successfully
Ignition 2.15.0
Stage: kargs
no configs at "/usr/lib/ignition/base.d"
no config dir at "/usr/lib/ignition/base.platform.d/azure"
kargs: kargs passed
Ignition finished successfully
Ignition 2.15.0
Stage: disks
no configs at "/usr/lib/ignition/base.d"
no config dir at "/usr/lib/ignition/base.platform.d/azure"
disks: createPartitions: op(1): [started]  waiting for devices [/dev/disk/azure/scsi1/lun1]
disks: createPartitions: op(1): [finished] waiting for devices [/dev/disk/azure/scsi1/lun1]
disks: createPartitions: created device alias for "/dev/disk/azure/scsi1/lun1": "/run/ignition/dev_aliases/dev/disk/azure/scsi1/lun1" -> "/dev/sda"
disks: createPartitions: op(2): [started]  partitioning "/run/ignition/dev_aliases/dev/disk/azure/scsi1/lun1"
disks: createPartitions: op(2): op(3): [started]  reading partition table of "/run/ignition/dev_aliases/dev/disk/azure/scsi1/lun1"
disks: createPartitions: op(2): op(3): [finished] reading partition table of "/run/ignition/dev_aliases/dev/disk/azure/scsi1/lun1"
disks: createPartitions: op(2): running sgdisk with options: [--pretend --new=0:0:+0 /run/ignition/dev_aliases/dev/disk/azure/scsi1/lun1]
disks: createPartitions: op(2): [failed]   partitioning "/run/ignition/dev_aliases/dev/disk/azure/scsi1/lun1": Failed to pretend to create partitions. Err: exit status 4. Stderr: Could not create partition 3 from 0 to 33
Error encountered; not saving changes.
disks failed
Full config:
{
  "ignition": {
    "config": {
      "replace": {
        "verification": {}
      }
    },
    "proxy": {},
    "security": {
      "tls": {}
    },
    "timeouts": {},
    "version": "3.5.0-experimental"
  },...

Flatcar version: 3815.2.0
Butane version: 0.19.0

When this happens, the only way forward is to delete the disk. It does not happen every time, though.

This is the disk setup I am using in the butane template:

variant: flatcar
version: 1.0.0

storage:
  disks:
    - device: /dev/disk/azure/scsi1/lun1
      partitions:
        - label: portal
  filesystems:
    - device: /dev/disk/by-partlabel/portal
      format: ext4
      wipe_filesystem: true
      label: portal

jepio commented 5 months ago

Isn't that a different issue, related to terraform: https://github.com/flatcar/flatcar-website/pull/296 ?

TimoKramer commented 5 months ago

Isn't that a different issue, related to terraform

No, this is not related. This is a problem with an already existing disk when recreating the flatcar VM.

pothos commented 5 months ago

So there is some race involved and it doesn't always happen? The same error message was reported in https://github.com/coreos/bugs/issues/2100#issuecomment-499003464

Edit: answer from there says the same as Jeremi below

jepio commented 5 months ago

@TimoKramer: your partition is missing an explicit number: 1. You're running into this behavior:

partitions (list of objects): the list of partitions and their configuration for this particular disk. Every partition must have a unique number, or if 0 is specified, a unique label.
  number (integer): the partition number, which dictates its position in the partition table (one-indexed). If zero, use the next available partition slot.

So I understand that you would expect the match to happen on the label field, but Ignition tries to create a new partition on every rerun. After the first provisioning the disk has no free space left.
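
The difference between number 0 and an explicit number can be illustrated with a toy Python model. This is a simplified reading of the explanation above, not Ignition's actual code. The function name, the has_free_space flag and the returned messages are all made up for illustration, and the "different label" branch is an assumption.

```python
from typing import Dict


def plan_partition(existing: Dict[int, str], number: int, label: str,
                   has_free_space: bool) -> str:
    """Toy model of the behavior described above; NOT Ignition's real logic."""
    if number == 0:
        # 0 means "next available slot": a new partition is always requested,
        # even if one with the same label is already on the disk.
        if not has_free_space:
            return "fail: no free space for a new partition"
        slot = 1
        while slot in existing:
            slot += 1
        return f"create partition {slot}"
    if existing.get(number) == label:
        # An explicit number lets the existing partition be matched and kept.
        return f"keep partition {number}"
    if number in existing:
        # Assumption for this sketch: a mismatching partition is an error.
        return f"fail: partition {number} exists with a different label"
    if not has_free_space:
        return "fail: no free space for a new partition"
    return f"create partition {number}"


# First provisioning: the data disk is empty.
print(plan_partition({}, 0, "portal", has_free_space=True))

# VM recreated, data disk kept: partition 1 already fills the disk.
used = {1: "portal"}
print(plan_partition(used, 0, "portal", has_free_space=False))
print(plan_partition(used, 1, "portal", has_free_space=False))
```

In this model the first boot succeeds either way, but on a rebuilt VM only the explicit number keeps the existing partition. That corresponds to adding number: 1 to the partition entry in the Butane template.
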