coreos / fedora-coreos-tracker

Issue tracker for Fedora CoreOS
https://fedoraproject.org/coreos/

platforms: investigate support for Azure Stack Hub (azurestack) #476

Closed cfBrianMiller closed 3 years ago

cfBrianMiller commented 4 years ago

Hello,

When trying to boot this image, it fails with the following boot diagnostics.

[ 0.000000] Command line: BOOT_IMAGE=(hd0,gpt1)/ostree/rhcos-acecfdfafb8976a8675311239f88ec5442a47472a131f1d0c9113e11a8d2ac13/vmlinuz-4.18.0-147.5.1.el8_1.x86_64 ignition.firstboot rd.neednet=1 ip=dhcp,dhcp6 rhcos.root=crypt_rootfs console=tty0 console=ttyS0,115200n8 rd.luks.options=discard ostree=/ostree/boot.1/rhcos/acecfdfafb8976a8675311239f88ec5442a47472a131f1d0c9113e11a8d2ac13/0 ignition.platform.id=azure
...
[ 23.398199] ignition-setup[627]: File /usr/lib/ignition/platform/azure/base.ign does not exist.. Skipping copy
...
[ 29.878870] systemd[1]: ignition-fetch.service: Main process exited, code=exited, status=1/FAILURE
[ 29.878901] ignition[670]: drive status: OK
[ 29.878918] systemd[1]: ignition-fetch.service: Failed with result 'exit-code'.
[ 29.878957] systemd[1]: Failed to start Ignition (fetch).
[ 29.878986] ignition[670]: op(1): [started] mounting "/dev/disk/by-id/ata-Virtual_CD" at "/tmp/ignition-azure771149153"
[ 29.879035] systemd[1]: Dependency failed for Ignition Complete.
[ 29.879166] ignition[670]: op(1): [failed] mounting "/dev/disk/by-id/ata-Virtual_CD" at "/tmp/ignition-azure771149153": invalid argument
[ 29.879190] systemd[1]: Dependency failed for Initrd Default Target.
[ 29.879226] ignition[670]: failed to fetch config: failed to mount device "/dev/disk/by-id/ata-Virtual_CD" at "/tmp/ignition-azure771149153": invalid argument
[ 29.879246] systemd[1]: initrd.target: Job initrd.target/start failed with result 'dependency'.
[ 29.879269] ignition[670]: failed to acquire config: failed to mount device "/dev/disk/by-id/ata-Virtual_CD" at "/tmp/ignition-azure771149153": invalid argument
[ 29.879285] systemd[1]: initrd.target: Triggering OnFailure= dependencies.
[ 29.879321] ignition[670]: Ignition failed: failed to mount device "/dev/disk/by-id/ata-Virtual_CD" at "/tmp/ignition-azure771149153": invalid argument

I am deploying to Azure Stack with this custom data string (with a valid URL in place of the placeholder):

eyJpZ25pdGlvbiI6eyJ2ZXJzaW9uIjoiMi4yLjAiLCJjb25maWciOnsicmVwbGFjZSI6eyJzb3VyY2UiOiI8dmFsaWRfdXJsPiJ9fX19Cg==
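For readers who want to see what that string contains, here is a minimal sketch that base64-decodes it and parses the result as JSON. The helper name `decode_custom_data` is made up for illustration; it is not part of Ignition or of any Azure tooling.

```python
import base64
import json

# Custom data string copied verbatim from the report above.
CUSTOM_DATA = (
    "eyJpZ25pdGlvbiI6eyJ2ZXJzaW9uIjoiMi4yLjAiLCJjb25maWciOnsicmVwbGFjZSI6"
    "eyJzb3VyY2UiOiI8dmFsaWRfdXJsPiJ9fX19Cg=="
)


def decode_custom_data(encoded: str) -> dict:
    """Base64-decode a custom data string and parse it as JSON.

    Hypothetical helper, for illustration only.
    """
    raw = base64.b64decode(encoded)
    return json.loads(raw)


config = decode_custom_data(CUSTOM_DATA)
print(json.dumps(config, indent=2))
```

The decoded document is an Ignition spec 2.2.0 config whose only job is to replace itself with the config at the given source URL (shown here as the `<valid_url>` placeholder).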

I am unable to exec into the box to determine the exact error; however, this is what Microsoft support believes is the issue after troubleshooting CoreOS issues.

It seems the rhcos deployment vhd has Provisioning.DecodeCustomData set to n for Azure Stack. This property needs to be set to y during image preparation.

I am able to test fixes for this problem against an up-to-date Azure Stack.

Thank you.

lucab commented 4 years ago

Thanks for the report.

A bunch of things to unpack here:

cgwalters commented 4 years ago

As far as I know we haven't done any investigation of Azure Stack; we have https://github.com/coreos/fedora-coreos-tracker/issues/148 which is for the "main" Azure but we should probably break out a separate tracker for Azure Stack.

darkmuggle commented 4 years ago

It seems the rhcos deployment vhd has Provisioning.DecodeCustomData set to n for Azure Stack. This property needs to be set to y during image preparation.

@cfBrianMiller per https://docs.microsoft.com/en-us/azure-stack/operator/azure-stack-linux?view=azs-2002#step-2-reference-cloud-inittxt-during-the-linux-vm-deployment it would appear that this is a question of how you are deploying the image. I tested the 4.3 and 4.5 images on Azure proper and they worked.

To @lucab:

the underlying issue seems to be that the usual Azure "Virtual CD" is not available on the node. Does Provisioning.DecodeCustomData control that, or how is it related here?

Per https://docs.microsoft.com/en-us/azure/virtual-machines/custom-data:

On Linux OS's, custom data is passed to the VM via the ovf-env.xml file, which is copied to the /var/lib/waagent directory during provisioning. Newer versions of the Microsoft Azure Linux Agent will also copy the base64-encoded data to /var/lib/waagent/CustomData as well for convenience.

Provisioning.DecodeCustomData is an instruction to WALinuxAgent (https://github.com/Azure/WALinuxAgent#provisioningdecodecustomdata; see also https://github.com/Azure/WALinuxAgent/blob/11d0881cd01e1bc5ff4f918c33701b60274c6e40/bin/waagent2.0#L4579-L4587) and is not relevant here. Ignition parses CustomData only for Ignition config and has no understanding of the Provisioning values.

I concur with @lucab that the Virtual CD is not being found and hence provisioning fails. See https://github.com/coreos/ignition/blob/master/internal/providers/azure/azure.go#L70, where it's looking for /dev/disk/by-id/ata-Virtual_CD (https://github.com/coreos/ignition/blob/master/internal/providers/azure/azure.go#L37).

@cfBrianMiller if you have indeed deployed the VM with custom data properly, we would need at the very least:

On Azure proper I see the Virtual CD-ROM come up in the console logs:

Jun 09 17:16:01 localhost kernel: sd 3:0:1:0: [sdb] Attached SCSI disk
Jun 09 17:16:02 localhost kernel: ata2.00: ATAPI: Virtual CD, , max MWDMA2
Jun 09 17:16:02 localhost kernel: scsi 1:0:0:0: CD-ROM            Msft     Virtual CD/ROM   1.0  PQ: 0 ANSI: 5
Jun 09 17:16:02 localhost kernel: scsi 1:0:0:0: Attached scsi generic sg2 type 5
Jun 09 17:16:02 localhost kernel:  sda: sda1 sda2 sda3 sda4
Jun 09 17:16:02 localhost kernel: sd 2:0:0:0: [sda] Attached SCSI disk
Jun 09 17:16:02 localhost kernel: sr 1:0:0:0: [sr0] scsi3-mmc drive: 0x/0x tray
Jun 09 17:16:02 localhost kernel: cdrom: Uniform CD-ROM driver Revision: 3.20
Jun 09 17:16:02 localhost kernel: sr 1:0:0:0: Attached scsi CD-ROM sr

If no Virtual CD is showing up, then the question for Azure Stack is how do we access it and, more importantly, where is it documented? I did a deep dive into the documentation and the WALinuxAgent code, and from what I was able to glean, the device should be there.

darkmuggle commented 4 years ago

From the Azure Documentation, an image prepared for Azure proper should work. Microsoft, in an email thread, had indicated that Azure proper images should work on Azure Stack: https://docs.microsoft.com/en-us/azure/virtual-machines/linux/create-upload-generic

darkmuggle commented 4 years ago

Over at https://bugzilla.redhat.com/attachment.cgi?id=1696551 a console log was provided that gave a whole lot more information:

[   14.795315] UDF-fs: warning (device sr0): udf_load_vrs: No VRS found
[   14.821361] UDF-fs: Scanning with blocksize 2048 failed
[   14.845200] UDF-fs: warning (device sr0): udf_load_vrs: No VRS found
[   14.870806] UDF-fs: Scanning with blocksize 4096 failed
[   14.893620] ignition[813]: op(1): [failed]   mounting "/dev/disk/by-id/ata-Virtual_CD" at "/tmp/ignition-azure549490584": invalid argument

These error messages are NOT found on Azure proper.

Based on the UDF source [1] the kernel is NOT locating the UDF VRS (Volume Recognition Sequence) and so the mount is returning EINVAL. In other words, the kernel is saying that Ignition asked for a UDF mount but whatever is on /dev/sr0 is not a UDF volume.

There are three potential cases:

Can you attach a copy of the UDF?

Looking at how WALinuxAgent does the mount [3], it does a blind mount without specifying the filesystem; Ignition is more precise [4].

[1] https://git.kernel.org/pub/scm/linux/kernel/git/jwboyer/fedora.git/tree/fs/udf/super.c#n1970
[2] https://docs.microsoft.com/en-us/azure-stack/operator/azure-stack-redhat-create-upload-vhd?view=azs-2002
[3] https://github.com/Azure/WALinuxAgent/blob/develop/bin/waagent2.0#L581-L582
[4] https://github.com/coreos/ignition/blob/master/internal/providers/azure/azure.go#L69-L74

darkmuggle commented 4 years ago

We have confirmation from Microsoft that the volume is, in fact, not a UDF volume: it's a generic ISO9660. I have a draft fix proposed in Ignition that should allow Ignition to work on either Azure or Azure Stack.
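To make the UDF versus ISO9660 distinction concrete, here is a rough sketch of telling the two apart by reading the volume descriptors that start at sector 16. This is not the code from the Ignition fix. The function names, the scan limit, and the synthetic test images are all assumptions made for illustration.

```python
SECTOR_SIZE = 2048       # sector size assumed for optical-style media
FIRST_DESCRIPTOR = 16    # volume descriptors begin at sector 16
MAX_DESCRIPTORS = 64     # arbitrary scan limit chosen for this sketch

# Standard identifiers that may appear in the volume recognition area.
KNOWN_IDS = (b"CD001", b"BEA01", b"NSR02", b"NSR03", b"TEA01", b"BOOT2", b"CDW02")


def detect_filesystem(image: bytes) -> str:
    """Return "udf", "iso9660", or "unknown" for a raw volume image."""
    seen = set()
    for sector in range(FIRST_DESCRIPTOR, FIRST_DESCRIPTOR + MAX_DESCRIPTORS):
        start = sector * SECTOR_SIZE
        descriptor = image[start:start + SECTOR_SIZE]
        if len(descriptor) < 6:
            break  # ran off the end of the image
        ident = descriptor[1:6]  # bytes 1..5 hold the standard identifier
        if ident not in KNOWN_IDS:
            break  # end of the recognition sequence
        seen.add(ident)
    # An NSR descriptor marks UDF, even when ISO9660 descriptors are also present.
    if seen & {b"NSR02", b"NSR03"}:
        return "udf"
    if b"CD001" in seen:
        return "iso9660"
    return "unknown"


def make_image(identifiers) -> bytes:
    """Build a synthetic image with one descriptor per identifier (test helper)."""
    image = bytearray(FIRST_DESCRIPTOR * SECTOR_SIZE)
    for ident in identifiers:
        sector = bytearray(SECTOR_SIZE)
        sector[1:6] = ident
        image += sector
    return bytes(image)


print(detect_filesystem(make_image([b"CD001"])))
print(detect_filesystem(make_image([b"BEA01", b"NSR02", b"TEA01"])))
```

Against a real device one would read the first few hundred KiB of the block device instead of a synthetic buffer; that part is left out because it needs root and actual hardware.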

dustymabe commented 4 years ago

Any chance they'll fix the documentation now?

darkmuggle commented 4 years ago

Any chance they'll fix the documentation now?

We can ask.

darkmuggle commented 4 years ago

A completely different issue in Afterburn has come up:

s)...[   32.768570] NetworkManager[568]: <info>  [1593115470.6525] dhcp4 (eth0): option private_245          => 'a8:3f:81:10'

And then:

[   64.908820] afterburn[658]: Jun 25 19:57:52.985 WARN Failed to get fabric address from DHCP: maximum number of retries (60) reached
[   64.988395] afterburn[658]: Jun 25 19:57:52.986 INFO Using fallback address
[   65.033307] afterburn[658]: Jun 25 19:57:52.986 INFO Fetching http://168.63.129.16/?comp=versions: Attempt #1
[     *] A start job is running for Afterburn Hostname (52s / no limit)
[   65.566088] afterburn[658]: Jun 25 19:57:53.643 INFO Fetch successful
[   65.621959] afterburn[658]: Jun 25 19:57:53.643 INFO Fetching http://168.63.129.16/machine/?comp=goalstate: Attempt #1
[   65.698749] afterburn[658]: Jun 25 19:57:53.651 INFO Fetch successful
[   65.747770] afterburn[658]: Jun 25 19:57:53.659 INFO Fetching http://169.254.169.254/metadata/instance/compute/name?api-version=2017-08-01&format=text: Attempt #1
[   65.942651] afterburn[658]: Jun 25 19:57:53.674 INFO Failed to fetch: 500 Internal Server Error
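As a side note on the logs above: the `private_245` value in the NetworkManager line looks like four hex octets. Assuming it carries a raw IPv4 address (an assumption on my part, not something confirmed in this thread), a small sketch to decode it follows. The function name is made up.

```python
import ipaddress


def decode_option_245(value: str) -> ipaddress.IPv4Address:
    """Turn a colon-separated hex string such as 'a8:3f:81:10' into an IPv4 address.

    Hypothetical helper; assumes the option holds exactly four raw address octets.
    """
    octets = bytes.fromhex(value.replace(":", ""))
    return ipaddress.IPv4Address(octets)


address = decode_option_245("a8:3f:81:10")
print(address)
```

Under that assumption the lease value decodes to 168.63.129.16, the same address Afterburn reports using as its fallback.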

And ending with:

Displaying logs from failed units: afterburn-hostname.service
-- Logs begin at Thu 2020-06-25 20:04:16 UTC, end at Thu 2020-06-25 20:05:59 UTC. --
Jun 25 20:05:51 afterburn[655]: Jun 25 20:05:51.338 INFO Failed to fetch: 500 Internal Server Error
Jun 25 20:05:51 afterburn[655]: Error: failed to run
Jun 25 20:05:51 afterburn[655]: Caused by: writing hostname
Jun 25 20:05:51 afterburn[655]: Caused by: failed to get hostname
Jun 25 20:05:51 afterburn[655]: Caused by: maximum number of retries (10) reached
Jun 25 20:05:51 afterburn[655]: Caused by: failed to fetch: 500 Internal Server Error
Jun 25 20:05:51 systemd[1]: afterburn-hostname.service: Main process exited, code=exited, status=1/FAILURE
Jun 25 20:05:51 systemd[1]: afterburn-hostname.service: Failed with result 'exit-code'.
Jun 25 20:05:51 systemd[1]: Failed to start Afterburn Hostname.

The Ignition issue and now the Afterburn issue are two distinct differences, which raises the question of what other differences exist. In my opinion, we should reconsider whether Azure and AzureStack should be treated as the same platform.

dustymabe commented 4 years ago

@darkmuggle - maybe a new issue for the afterburn bits? or if you want to go wide - an issue to discuss our approach to Azure vs AzureStack

cfBrianMiller commented 4 years ago

The problem child is definitely this line:

[   65.747770] afterburn[658]: Jun 25 19:57:53.659 INFO Fetching http://169.254.169.254/metadata/instance/compute/name?api-version=2017-08-01&format=text: Attempt #1

Azure Stack has different API versions; for compute it is 2017-12-01.
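To illustrate the difference, here is a minimal sketch that builds the hostname metadata URL with a per-platform `api-version`. The table and function names are hypothetical, and this is not how Afterburn is implemented. The only inputs taken from this thread are the logged URL and the two version strings.

```python
from urllib.parse import urlencode

# api-version per platform: "azure" from the Afterburn log above,
# "azurestack" from the comment above. The mapping itself is illustrative.
COMPUTE_API_VERSIONS = {
    "azure": "2017-08-01",
    "azurestack": "2017-12-01",
}

METADATA_BASE = "http://169.254.169.254/metadata/instance/compute/name"


def hostname_url(platform: str) -> str:
    """Build the hostname metadata URL for a platform (hypothetical helper)."""
    query = urlencode({
        "api-version": COMPUTE_API_VERSIONS[platform],
        "format": "text",
    })
    return f"{METADATA_BASE}?{query}"


print(hostname_url("azure"))
print(hostname_url("azurestack"))
```

For "azure" this reproduces the URL from the failing log line; for "azurestack" only the version string changes.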

lucab commented 4 years ago

Things we have discovered so far on Azure Stack:

Things still to discover:

jlebon commented 4 years ago

Should https://github.com/coreos/ignition/pull/1007 be reverted for the time being?

bgilbert commented 4 years ago

I'd say so, yes.

darkmuggle commented 4 years ago

Should coreos/ignition#1007 be reverted for the time being?

Is there any harm in leaving the code? We know the code works. And having this code will make the enablement easier.

Conceivably, enabling AzureStack as a separate platform from the Ignition side would look something akin to:

diff --git a/internal/platform/platform.go b/internal/platform/platform.go
index a5a4844..a9674a4 100644
--- a/internal/platform/platform.go
+++ b/internal/platform/platform.go
@@ -91,6 +91,10 @@ func init() {
                name:  "azure",
                fetch: azure.FetchConfig,
        })
+       configs.Register(Config{
+               name:  "azurestack",
+               fetch: azure.FetchConfig,
+       })
        configs.Register(Config{
                name:  "brightbox",
                fetch: openstack.FetchConfig,

The difficulties with AzureStack in Afterburn may be handled differently.

Also, the code now checks to see if the volume is either a UDF or ISO9660 before blindly attempting to mount it as a UDF volume.

bgilbert commented 4 years ago

Is there any harm in leaving the code? We know the code works. And having this code will make the enablement easier.

If we release the code and later roll it back, we'll be making shipped code stricter, which in principle could break someone.

Why not just go ahead and add the separate platform ID to Ignition now? It should be straightforward to add a wrapper which enables ISO9660 only on Azure Stack.
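A rough sketch of what such a wrapper could look like, written in Python for brevity rather than in Ignition's Go. All names here are hypothetical. The idea is one shared fetch routine, parameterized by the filesystems each platform ID is allowed to mount.

```python
from typing import Callable, Dict, Tuple


def make_fetcher(allowed: Tuple[str, ...]) -> Callable[[str], str]:
    """Return a fetch function that only accepts the given filesystem types.

    Hypothetical wrapper: the real fetch would mount the device and read the
    config; here it just returns the filesystem type it would mount with.
    """
    def fetch(detected_fs: str) -> str:
        if detected_fs not in allowed:
            raise ValueError(f"filesystem {detected_fs!r} not allowed on this platform")
        return detected_fs
    return fetch


# Same underlying logic for both platform IDs; only the allowed set differs.
PLATFORMS: Dict[str, Callable[[str], str]] = {
    "azure": make_fetcher(("udf",)),
    "azurestack": make_fetcher(("udf", "iso9660")),
}

print(PLATFORMS["azurestack"]("iso9660"))
```

With this shape, the "azure" platform stays exactly as strict as before, which avoids the concern about later having to tighten shipped behavior.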

cgwalters commented 4 years ago

Also https://github.com/coreos/coreos-assembler/pull/1566

darkmuggle commented 4 years ago

AzureStack is now a distinct platform for both Ignition and COSA. The word from back-channels is that we might have what we need towards August/September on the Afterburn side.

lucab commented 4 years ago

"Azure Stack" is a whole product family, which spans a few verticals. My understanding is that here we are targeting "Azure Stack Hub" only for its computing on-demand capabilities. Re-titled accordingly.

jlebon commented 3 years ago

This is done now in https://github.com/coreos/afterburn/pull/561 which is part of the v5.0.0 release.