Closed cfBrianMiller closed 3 years ago
Thanks for the report.
A bunch of things to unpack here:
"Does Provisioning.DecodeCustomData control that, or how is it related here?"
As far as I know we haven't done any investigation of Azure Stack; we have https://github.com/coreos/fedora-coreos-tracker/issues/148, which is for the "main" Azure, but we should probably break out a separate tracker for Azure Stack.
It seems the RHCOS deployment VHD has Provisioning.DecodeCustomData set to n for Azure Stack. This property needs to be set to y during image preparation.
@cfBrianMiller per https://docs.microsoft.com/en-us/azure-stack/operator/azure-stack-linux?view=azs-2002#step-2-reference-cloud-inittxt-during-the-linux-vm-deployment it would appear that this is a question of how you are deploying the image. I tested the 4.3 and 4.5 images on Azure proper and they worked.
To @lucab:
the underlying issue seems to be that the usual Azure "Virtual CD" is not available on the node. Does Provisioning.DecodeCustomData control that, or how is it related here?
Per https://docs.microsoft.com/en-us/azure/virtual-machines/custom-data:
On Linux OS's, custom data is passed to the VM via the ovf-env.xml file, which is copied to the /var/lib/waagent directory during provisioning. Newer versions of the Microsoft Azure Linux Agent will also copy the base64-encoded data to /var/lib/waagent/CustomData as well for convenience.
Provisioning.DecodeCustomData is an instruction to WALinuxAgent (https://github.com/Azure/WALinuxAgent#provisioningdecodecustomdata, cf. https://github.com/Azure/WALinuxAgent/blob/11d0881cd01e1bc5ff4f918c33701b60274c6e40/bin/waagent2.0#L4579-L4587) and is not relevant here. Ignition parses CustomData for Ignition data only and has no understanding of the Provisioning values.
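For illustration, the decoding Ignition does is just base64: a minimal Go sketch, where the sample payload is hypothetical (an empty Ignition config), not taken from this issue:

```go
package main

import (
	"encoding/base64"
	"fmt"
)

// decodeCustomData decodes the base64 payload that Azure delivers in
// ovf-env.xml (newer agents also copy it to /var/lib/waagent/CustomData).
func decodeCustomData(encoded string) (string, error) {
	decoded, err := base64.StdEncoding.DecodeString(encoded)
	if err != nil {
		return "", err
	}
	return string(decoded), nil
}

func main() {
	// Hypothetical sample payload: an empty Ignition config.
	out, err := decodeCustomData("eyJpZ25pdGlvbiI6e319")
	if err != nil {
		panic(err)
	}
	fmt.Println(out) // {"ignition":{}}
}
```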
I concur with @lucab that the Virtual CD is not being found and hence provisioning fails. See https://github.com/coreos/ignition/blob/master/internal/providers/azure/azure.go#L70, where it's looking for /dev/disk/by-id/ata-Virtual_CD, defined at https://github.com/coreos/ignition/blob/master/internal/providers/azure/azure.go#L37.
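That lookup can be sketched as a scan of the /dev/disk/by-id entries; findVirtualCD below is a hypothetical helper for illustration (Ignition itself hardcodes the full device path rather than scanning):

```go
package main

import (
	"fmt"
	"path/filepath"
)

// findVirtualCD scans a list of /dev/disk/by-id entries for the alias the
// Azure "Virtual CD" provisioning device is expected to have.
// Hypothetical helper, not Ignition's actual code.
func findVirtualCD(entries []string) (string, bool) {
	for _, e := range entries {
		if filepath.Base(e) == "ata-Virtual_CD" {
			return e, true
		}
	}
	return "", false
}

func main() {
	// Fake device listing for demonstration.
	devices := []string{
		"/dev/disk/by-id/wwn-0x60022480abcdef",
		"/dev/disk/by-id/ata-Virtual_CD",
	}
	if dev, ok := findVirtualCD(devices); ok {
		fmt.Println("found provisioning device:", dev)
	} else {
		fmt.Println("no Virtual CD present")
	}
}
```

On Azure Stack, per the logs below, that second entry simply never appears.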
@cfBrianMiller if you have indeed deployed the VM with custom data properly, we would need at the very least:
journalctl --system
/run/ignition*
find /dev/disk
On Azure proper I see the Virtual CD-ROM come up in the console logs:
Jun 09 17:16:01 localhost kernel: sd 3:0:1:0: [sdb] Attached SCSI disk
Jun 09 17:16:02 localhost kernel: ata2.00: ATAPI: Virtual CD, , max MWDMA2
Jun 09 17:16:02 localhost kernel: scsi 1:0:0:0: CD-ROM Msft Virtual CD/ROM 1.0 PQ: 0 ANSI: 5
Jun 09 17:16:02 localhost kernel: scsi 1:0:0:0: Attached scsi generic sg2 type 5
Jun 09 17:16:02 localhost kernel: sda: sda1 sda2 sda3 sda4
Jun 09 17:16:02 localhost kernel: sd 2:0:0:0: [sda] Attached SCSI disk
Jun 09 17:16:02 localhost kernel: sr 1:0:0:0: [sr0] scsi3-mmc drive: 0x/0x tray
Jun 09 17:16:02 localhost kernel: cdrom: Uniform CD-ROM driver Revision: 3.20
Jun 09 17:16:02 localhost kernel: sr 1:0:0:0: Attached scsi CD-ROM sr
If there is no Virtual CD showing up, then the question for Azure Stack is: how do we access it and, more importantly, where is it documented? I did a deep dive into the documentation and the WALinuxAgent code, and from what I was able to glean the device should be there.
Per the Azure documentation, an image prepared for Azure proper should work, and Microsoft, in an email thread, had indicated that Azure proper images should work on Azure Stack: https://docs.microsoft.com/en-us/azure/virtual-machines/linux/create-upload-generic
Over at https://bugzilla.redhat.com/attachment.cgi?id=1696551 a console log was provided that gave a whole lot more information:
[ 14.795315] UDF-fs: warning (device sr0): udf_load_vrs: No VRS found
[ 14.821361] UDF-fs: Scanning with blocksize 2048 failed
[ 14.845200] UDF-fs: warning (device sr0): udf_load_vrs: No VRS found
[ 14.870806] UDF-fs: Scanning with blocksize 4096 failed
[ 14.893620] ignition[813]: op(1): [failed] mounting "/dev/disk/by-id/ata-Virtual_CD" at "/tmp/ignition-azure549490584": invalid argument
These error messages are NOT found on Azure proper.
Based on the UDF source [1] the kernel is NOT locating the UDF VRS (Volume Recognition Sequence) and so the mount is returning EINVAL. In other words, the kernel is saying that Ignition asked for a UDF mount but whatever is on /dev/sr0
is not a UDF volume.
There are three potential cases:
Can you attach a copy of the UDF?
Looking at how WALinuxAgent does the mount [3], it does a blind mount without specifying the filesystem; Ignition is more precise [4].
[1] https://git.kernel.org/pub/scm/linux/kernel/git/jwboyer/fedora.git/tree/fs/udf/super.c#n1970
[2] https://docs.microsoft.com/en-us/azure-stack/operator/azure-stack-redhat-create-upload-vhd?view=azs-2002
[3] https://github.com/Azure/WALinuxAgent/blob/develop/bin/waagent2.0#L581-L582
[4] https://github.com/coreos/ignition/blob/master/internal/providers/azure/azure.go#L69-L74
We have confirmation from Microsoft that the UDF volume is, in fact, not a UDF volume: it's a generic iso9660. I have a draft fix proposed in Ignition that should allow Ignition to work on either Azure or Azure Stack.
Any chance they'll fix the documentation now?
Any chance they'll fix the documentation now?
We can ask.
A completely different issue in Afterburn has come up:
[ 32.768570] NetworkManager[568]: <info> [1593115470.6525] dhcp4 (eth0): option private_245 => 'a8:3f:81:10'
And then:
[ 64.908820] afterburn[658]: Jun 25 19:57:52.985 WARN Failed to get fabric address from DHCP: maximum number of retries (60) reached
[ 64.988395] afterburn[658]: Jun 25 19:57:52.986 INFO Using fallback address
[ 65.033307] afterburn[658]: Jun 25 19:57:52.986 INFO Fetching http://168.63.129.16/?comp=versions: Attempt #1
^M[ *] A start job is running for Afterburn Hostname (52s / no limit)
[ 65.566088] afterburn[658]: Jun 25 19:57:53.643 INFO Fetch successful
[ 65.621959] afterburn[658]: Jun 25 19:57:53.643 INFO Fetching http://168.63.129.16/machine/?comp=goalstate: Attempt #1
[ 65.698749] afterburn[658]: Jun 25 19:57:53.651 INFO Fetch successful
[ 65.747770] afterburn[658]: Jun 25 19:57:53.659 INFO Fetching http://169.254.169.254/metadata/instance/compute/name?api-version=2017-08-01&format=text: Attempt #1
[ 65.942651] afterburn[658]: Jun 25 19:57:53.674 INFO Failed to fetch: 500 Internal Server Error
And ending with:
Displaying logs from failed units: afterburn-hostname.service
-- Logs begin at Thu 2020-06-25 20:04:16 UTC, end at Thu 2020-06-25 20:05:59 UTC. --
Jun 25 20:05:51 afterburn[655]: Jun 25 20:05:51.338 INFO Failed to fetch: 500 Internal Server Error
Jun 25 20:05:51 afterburn[655]: Error: failed to run
Jun 25 20:05:51 afterburn[655]: Caused by: writing hostname
Jun 25 20:05:51 afterburn[655]: Caused by: failed to get hostname
Jun 25 20:05:51 afterburn[655]: Caused by: maximum number of retries (10) reached
Jun 25 20:05:51 afterburn[655]: Caused by: failed to fetch: 500 Internal Server Error
Jun 25 20:05:51 systemd[1]: afterburn-hostname.service: Main process exited, code=exited, status=1/FAILURE
Jun 25 20:05:51 systemd[1]: afterburn-hostname.service: Failed with result 'exit-code'.
Jun 25 20:05:51 systemd[1]: Failed to start Afterburn Hostname.
The Ignition issue and now the Afterburn issue are two distinct differences, which raises the question of what other differences exist. In my opinion, we should consider whether Azure and Azure Stack should be treated as the same platform.
@darkmuggle - maybe a new issue for the afterburn bits? or if you want to go wide - an issue to discuss our approach to Azure vs AzureStack
The problem child is definitely this line:
[ 65.747770] afterburn[658]: Jun 25 19:57:53.659 INFO Fetching http://169.254.169.254/metadata/instance/compute/name?api-version=2017-08-01&format=text: Attempt #1
Azure Stack has different API versions; for compute it is 2017-12-01.
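The version mismatch can be sketched as follows; the 2017-12-01 value is from this thread, while the helper itself (computeNameURL) is hypothetical and not Afterburn's actual code:

```go
package main

import "fmt"

// computeNameURL builds the instance-metadata URL for the VM name, using the
// api-version each platform accepts. Hypothetical helper for illustration;
// the azurestack value 2017-12-01 is the one reported in this thread.
func computeNameURL(platform string) string {
	apiVersion := "2017-08-01" // Azure proper (what Afterburn requests today)
	if platform == "azurestack" {
		apiVersion = "2017-12-01" // Azure Stack compute profile
	}
	return fmt.Sprintf(
		"http://169.254.169.254/metadata/instance/compute/name?api-version=%s&format=text",
		apiVersion)
}

func main() {
	fmt.Println(computeNameURL("azure"))
	fmt.Println(computeNameURL("azurestack"))
}
```

Requesting the 2017-08-01 version against an endpoint that only serves 2017-12-01 would explain the 500 Internal Server Error seen in the logs.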
Things we have discovered so far on Azure Stack:
- the provisioning volume may be iso9660 OR udf (ref: this ticket)

Things still to discover:
Should https://github.com/coreos/ignition/pull/1007 be reverted for the time being?
I'd say so, yes.
Should coreos/ignition#1007 be reverted for the time being?
Is there any harm in leaving the code? We know the code works. And having this code will make the enablement easier.
Conceivably, enabling AzureStack as a separate platform from the Ignition side would look something akin to:
diff --git a/internal/platform/platform.go b/internal/platform/platform.go
index a5a4844..a9674a4 100644
--- a/internal/platform/platform.go
+++ b/internal/platform/platform.go
@@ -91,6 +91,10 @@ func init() {
 	configs.Register(Config{
 		name:  "azure",
 		fetch: azure.FetchConfig,
 	})
+	configs.Register(Config{
+		name:  "azurestack",
+		fetch: azure.FetchConfig,
+	})
 	configs.Register(Config{
 		name:  "brightbox",
 		fetch: openstack.FetchConfig,
The difficulties with AzureStack in Afterburn may be handled differently.
Also, the code now checks to see if the volume is either a UDF or ISO9660 before blindly attempting to mount it as a UDF volume.
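The UDF-vs-ISO9660 check comes down to on-disk magic bytes: both formats place their descriptors starting at sector 16 (byte offset 32768) of a 2048-byte-sector volume, with an identifier at byte 1 of each descriptor ("CD001" for ISO9660, "NSR02"/"NSR03" for a UDF volume recognition sequence). A simplified Go sketch of the idea, not Ignition's actual probing code:

```go
package main

import (
	"bytes"
	"fmt"
)

const sectorSize = 2048
const vrsStart = 16 * sectorSize // descriptors start at sector 16 for both formats

// detectFilesystem inspects the volume recognition area of an image and
// reports "udf", "iso9660", or "unknown". Simplified sketch: a real prober
// would bound the scan and validate descriptor types too.
func detectFilesystem(img []byte) string {
	sawISO := false
	for off := vrsStart; off+sectorSize <= len(img); off += sectorSize {
		id := img[off+1 : off+6] // identifier sits at byte 1 of each descriptor
		switch {
		case bytes.Equal(id, []byte("NSR02")), bytes.Equal(id, []byte("NSR03")):
			return "udf" // an NSR descriptor means the volume carries UDF structures
		case bytes.Equal(id, []byte("CD001")):
			sawISO = true // ISO9660 descriptor; keep scanning in case UDF follows
		}
	}
	if sawISO {
		return "iso9660"
	}
	return "unknown"
}

func main() {
	// Build a minimal fake image carrying only an ISO9660 volume descriptor.
	img := make([]byte, 18*sectorSize)
	copy(img[vrsStart+1:], "CD001")
	fmt.Println(detectFilesystem(img)) // iso9660
}
```

Mounting with the wrong filesystem type is exactly what produces the EINVAL ("invalid argument") seen in the earlier console log.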
Is there any harm in leaving the code? We know the code works. And having this code will make the enablement easier.
If we release the code and later roll it back, we'll be making shipped code stricter, which in principle could break someone.
Why not just go ahead and add the separate platform ID to Ignition now? It should be straightforward to add a wrapper which enables ISO9660 only on Azure Stack.
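Such a wrapper could be as small as a per-platform allow-list; a hypothetical sketch (allowedFilesystems is an invented name, not Ignition's API), keeping Azure proper strict and accepting iso9660 only on the new azurestack platform ID:

```go
package main

import "fmt"

// allowedFilesystems returns which filesystem types the provisioning volume
// may use on a given platform ID. Hypothetical sketch of the wrapper idea:
// Azure proper stays udf-only, Azure Stack additionally accepts iso9660.
func allowedFilesystems(platform string) []string {
	if platform == "azurestack" {
		return []string{"udf", "iso9660"}
	}
	return []string{"udf"}
}

func main() {
	fmt.Println(allowedFilesystems("azure"))      // [udf]
	fmt.Println(allowedFilesystems("azurestack")) // [udf iso9660]
}
```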
AzureStack is now a distinct platform for both Ignition and COSA. The word from back-channels is that we might have what we need towards August/September on the Afterburn side.
"Azure Stack" is a whole product family, which spans a few verticals. My understanding is that here we are targeting "Azure Stack Hub" only for its computing on-demand capabilities. Re-titled accordingly.
This is done now in https://github.com/coreos/afterburn/pull/561 which is part of the v5.0.0 release.
Hello,
When trying to boot this image, it fails with the following boot diagnostics.
I am deploying to Azure Stack with this custom data string (with a valid URL):
eyJpZ25pdGlvbiI6eyJ2ZXJzaW9uIjoiMi4yLjAiLCJjb25maWciOnsicmVwbGFjZSI6eyJzb3VyY2UiOiI8dmFsaWRfdXJsPiJ9fX19Cg==
I am unable to exec into the box to determine the exact error however this is what Microsoft support believes is the issue after troubleshooting CoreOS issues.
I am capable of testing fixes to this problem against an up-to-date Azure Stack.
Thank you.