DragonFlyBSD / dragonfly-packer

Packer templates for DragonFly BSD

hyperv-iso #1

Open ladar opened 3 years ago

ladar commented 3 years ago

I've been trying to get dfly running on Hyper-V for the last 2 or perhaps 3 years, and it keeps failing. I hoped the 6.0.0 release would fix things, but it only got worse.

The short version: some time ago dfly stopped working properly with the virtual hard disk interface on Hyper-V. I think the last time I was able to build an image, I actually had to swap the VHD out for a physical disk. But it seems that even that trick isn't working anymore.

I've tried switching to an older virtual hardware configuration, but the oldest my test Win 10 system supports is v8.0, which is relatively recent, so I don't know if going back further would help.

I've spent countless hours trying various combinations of boot flags to get this fixed, but nothing has worked. I just can't seem to read from or write to the disk device.

Disabling DMA and write caching with hw.ata.ata_wc=0 and hw.ata.ata_dma=0 will eliminate most of the kernel errors, but it doesn't fix the problem. I'll still see the error message "ad0: timeout waiting for DRQ" ... note that with DMA disabled the drive is forced into PIO4 mode. But neither PIO4 nor WDMA2 mode appears to work, and those appear to be the only modes that are supported.
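For anyone following along, those two tunables go in /boot/loader.conf (they can also be set one at a time from the loader prompt):

```
# /boot/loader.conf -- quiets most of the ATA kernel errors on Hyper-V,
# though the "ad0: timeout waiting for DRQ" failure remains.
hw.ata.ata_wc=0     # disable write caching
hw.ata.ata_dma=0    # disable DMA (forces the drive into PIO4 mode)
```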

Gen 1 VMs allow IDE or SCSI buses, and legacy network adapters. Moving the virtual disk to a SCSI controller on a gen 1 VM doesn't help, since dfly doesn't find it at all. In fact it doesn't appear to find the SCSI bus at all (no scbus). Switching to a gen 2 VM doesn't help, because it requires disks to be on the SCSI bus, and thus doesn't see them. Gen 2 VMs also require the newer virtualized network adapter, which dfly doesn't support.

Can someone tell me what magic combination of boot loader params is needed to fix this? I've played around with the following flags (as well as using the natacontrol utility to force a mode manually; see the sketch after the list), but nothing seems to work:

hint.acpi.0.disabled=1
hint.ahci.disabled=1
hint.ahci.force150=1
hint.ahci.nofeatures=1
hint.ata.0.disabled=1
hint.atapci.0.msi=0
hint.ehci.0.disabled=1
hint.xhci.0.disabled=1
hw.ahci.force=1
hw.ahci.msi.enable=0
hw.ata.ata_dma=0
hw.ata.ata_dma_check_80pin=0
hw.ata.ata_wc=0
hw.ata.atapi_dma=0
hw.ata.disk_enable=1
hw.ata.wc=0
hw.bwn.usedma=0
hw.clflush_disable=1
kern.cam.ada.write_cache=0
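For completeness, the manual mode forcing looked roughly like this (a sketch from memory; the device name ad0 and the exact natacontrol invocation are assumptions):

```
# Show the current transfer mode of the first ATA disk (device name assumed):
natacontrol mode ad0
# Force the disk into PIO4 mode by hand:
natacontrol mode ad0 PIO4
```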
ladar commented 3 years ago

I forgot to mention that starting with the 6.0.0 release, the legacy network adapter also stopped working. It did, however, work in previous releases.

And as I mentioned above, dfly doesn't have the drivers necessary to use a non-legacy network adapter.

ladar commented 3 years ago

My server still had support for older virtual hardware revisions, so I tried setting up a VM using the rather old v5.0, but it didn't appear to make a difference.

I did, however, find an article with a note suggesting that write caching functioned differently on older versions of Hyper-V. Perhaps that is why it's now broken. From the article:

But if you cannot disable write caching for this virtual hard disk then why was it possible in earlier versions of Hyper-V to disable write caching in a virtual machine using the above Policies property sheet? The answer (as I've been told by a Hyper-V expert at Microsoft) is simply that there was a bug in the Windows ataport and Hyper-V storage stack in earlier versions of Hyper-V that allowed you to change the disk write caching setting of the system drive of a virtual machine if that system drive was backed by a virtual hard disk that used virtual IDE (vIDE). This bug gave users the impression they could disable write caching to improve data integrity for write operations to the virtual hard disk, but in reality all it was really doing was creating the potential for data loss and corruption of the virtual hard disk should the underlying Hyper-V host experience a power outage or an unplanned restart (see KB2853952 for details). Microsoft released a fix for this issue as described in that KB article, but the point is that write caching isn't configurable for virtual hard disks on virtual machines--and nor should it be.

ladar commented 3 years ago

And more from the Hyper-V docs:

Guest virtual disk cache. The virtualized IDE (emulated or synthetic) or SCSI device will report the write cache state that is returned by the lower stack. Virtual disks will report that their write cache is enabled, and they refuse to let the guest turn off the write cache. Disabling the cache will fail and will always respond that the cache is enabled. This behavior is necessary for the following reasons:

Hyper-V can't make an assumption that all the VMs that are running on the same disk will have to have the disk cache settings be the same.

The underlying storage might have an always-on write cache that can't be turned off. This is emphasized by the fact that the virtual disk might be migrated to a different disk on the same host (live storage migration) or to a different host (live migration).

Because applications won't be able to turn off disk cache, any application in the guest that has to make sure of data integrity across a power failure will have to use either option 1 or option 2 to make sure that writes bypass the disk cache.

liweitianux commented 3 years ago

Thank you for the detailed report. I'll redirect this issue to relevant developers and try to figure it out.

ladar commented 2 years ago

@liweitianux any progress on the bug, or workarounds?

liweitianux commented 2 years ago

Hi @ladar, sorry for the delay. I asked the developers back then but didn't get a reply, and then I forgot about this.

Now I asked again and got some info this time:

dfly doesn't work on hyper-v anymore, it stopped working at some point (before it worked only with the oldest version of vm's and tweaking some stuff, so even then it took some config to make it work). why that is so, i have no clue and it is above my head.

I really don't know. we haven't touched the ATA driver for ages. hyper-v probably doesn't have a complete implementation, causing some incompatibility. he could try turning on the emergency interrupt polling to see if it is interrupt related (both the network adapter and the IDE). e.g. something like: kern.emergency_intr_enable=1 kern.emergency_intr_freq=50 just to see if it helps
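For reference, that suggestion would look something like this at the loader prompt (a sketch; I'm assuming these tunables are accepted at boot time, otherwise they can be set with sysctl(8) after boot):

```
# At the loader "OK" prompt, enable emergency interrupt polling to test
# whether the disk and network hangs are interrupt-related:
set kern.emergency_intr_enable=1    # poll devices instead of waiting on interrupts
set kern.emergency_intr_freq=50     # polling frequency, in Hz
boot
```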

One more thing about ATA: it's been problematic on dfly for years, and we never tried to fix it, because SATA (AHCI) or VirtIO should always be used instead. Moreover, Matt Dillon rewrote the callout API last year (e.g., commit fac0eb3cf4c969cfb8eab610321bd0b712266d62), and that might break the ATA driver further.

Regards.

liweitianux commented 2 years ago

Ok, a bit more info from a developer:

The ATA/IDE driver in DragonFly is not in a good state, but it's not totally broken. This issue seems to be on the Hyper-V side. Old DragonFly versions that worked in Hyper-V years ago now don't boot with current Hyper-V anymore.

So we think it may need significant work to port Hyper-V support from FreeBSD. However, I personally don't see much interest in doing this, due to the limited manpower.

Regards.

ladar commented 2 years ago

@liweitianux I think the ATA driver is calling the ATA equivalent of the sync command, to flush queued writes to the physical medium, and that is what Microsoft removed from Hyper-V. As I recall, in the past, the Hyper-V server would always confirm a write had been flushed to the physical medium, even when it didn't know (possibly because the physical medium was on another server, or a network). That would lead to situations where a crash could cause corruption. So the Microsoft fix was to remove support for those ATA calls and indicate they aren't supported. My guess is that the functionality in question is so ubiquitous when working with real ATA drives that the driver just doesn't know what to do when it can't use those commands. So it sits around waiting for the virtual disk to acknowledge a request that the virtual disk doesn't support.

I think the other BSD drivers must have added logic to handle this, possibly by ignoring, or not using, the functionality in question. I'm guessing, but I bet if I pull out an old enough system that hasn't received the hotfix/service pack, then Dfly will boot just fine. I think that was how I created the image that I've been recycling for the last couple of years on the Vagrant cloud, since I can't update it (it's the only one out of 700+ that I haven't found a solution for).

The hotfix was linked to in the article above, and I think it was this hotfix/change that broke Dfly:

https://support.microsoft.com/en-us/topic/loss-of-consistency-with-ide-attached-virtual-hard-disks-when-a-hyper-v-host-server-experiences-an-unplanned-restart-e0f0bc5b-bf04-2a75-4360-06ae11a13aa6

To quote the hotfix:

This issue occurs because the Hyper-V virtual IDE controller erroneously reports success if an operating system on the guest requests to disable the disk cache. This might result in an application issuing I/O operations that it believes are persisted to disk that are actually being allowed to reside in the disk cache, which would not be persisted across power failures of the Hyper-V host.

Note After you install this update, requests to disable the disk cache in the Hyper-V virtual IDE controller will fail.

I tried poking around the kernel code to see if I could pinpoint the precise place this was happening, but the error message was too generic, and I couldn't find it. I'm also not an expert on the low-level ATA commands/protocol, so I don't know how to fix it.

liweitianux commented 2 years ago

@ladar, thank you for the detailed analysis and information. I'll redirect the info to other developers. I'll also look into it a bit, but I'm not an expert either and don't have a Hyper-V environment at the moment.

ladar commented 2 years ago

If you were in Texas I could probably help you out, but if your profile is right, it would probably be easier and cheaper to find a spare notebook somewhere to test with than to make the trip here.

You might be able to set up Windows inside a VM, and then run Dfly as a nested guest. I've done nested virtualization in the past, with mixed success, but haven't tried to use Hyper-V in this manner yet.
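If the outer hypervisor is also Hyper-V, I believe nested virtualization has to be switched on per-VM from the host first; something like this, where the VM name is a placeholder:

```
# Run on the outer Hyper-V host while the guest VM is powered off:
Set-VMProcessor -VMName "WindowsGuest" -ExposeVirtualizationExtensions $true
```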

As I mentioned initially, I've tried a large number of possible kernel options to fix this, and while some altered how the problem manifested, none of them worked.

I'm starting to wonder if the last time I was able to run Dfly and build an image, my solution was to replace the virtual disk with a pass-through to a physical one. It occurred to me to try that again recently, but I only had a flash drive at the time, and that didn't work. I'm wondering if a regular hard drive attached via USB would be a different story. Not sure when I'll get a chance to test the theory though, since I've been building my Hyper-V images on a blade server at our datacenter as of late, and I don't know when I'll be able to dig out the Windows laptop and try it.