TritonDataCenter / smartos-live

For more information, please see http://smartos.org/. For any questions that aren't answered there, please join the SmartOS discussion list: https://smartos.topicbox.com/groups/smartos-discuss

Crash on boot in Hyper-V #976

Open justledbetter opened 3 years ago

justledbetter commented 3 years ago

I'm seeing a crash on boot when trying to start up under Hyper-V. This is a Generation 2 VM; virtualization extensions are enabled on the CPU, and the system is configured with two vCPUs.

Booting...
illumos Version joyent_20210114T163038Z 64-bit

panic[cpu0]/thread=fffffffffbc4a0c0: microfind: could not calibrate delay loop

Warning - stack not written to the dump buffer
fffffffffbc4a220 unix:microfind+b2 ()
fffffffffbc4a2a0 unix:startup_modules+20 ()
fffffffffbc4a2b0 unix:startup+55 ()
fffffffffbc4a2f0 genunix:main+36 ()
fffffffffbc4a300 unix:_locore_start+90 ()

skipping system dump - no dump device configured
rebooting...
<system hangs here>

The same crash reproduces on OmniOS CE r151036.

jasonbking commented 3 years ago

Would you be willing to try a test PI (platform image) w/ a fix? The issue is that Hyper-V Generation 2 VMs don't emulate the i8254 PIT timer.

justledbetter commented 3 years ago

Very happy to try anything you suggest. I would just need instructions, as I'm new to debugging illumos.

I had also tried booting up in a Generation 1 VM, and recall it hanging and not producing any output. Is there a way to enable more verbose output, to maybe see where it is hanging?

jasonbking commented 3 years ago

In the boot loader, under boot options, there should be an option to enable verbose booting. Let me build an image w/ the fix (will take a bit) and I'll include a temporary link you can use to download it.

justledbetter commented 3 years ago

I'll be standing by!

jasonbking commented 3 years ago

What format media do you prefer? .tgz, .iso, or .usb?

justledbetter commented 3 years ago

.iso works best for me, thanks!

jasonbking commented 3 years ago

Try https://us-east.manta.joyent.com/jbk/public/tmp/platform-20210224T212002Z.iso -- that should use the HPET instead of the PIT to calibrate things.
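
For context, calibrating the TSC against a reference timer generally means counting TSC ticks and reference-timer ticks over the same interval, then scaling by the reference timer's known frequency. The following is only a minimal sketch of that arithmetic, not the code in the test image or in illumos: the function name and parameters are hypothetical, and the caller is assumed to have sampled both counters already.

```c
#include <stdint.h>

/*
 * Hypothetical helper (not from the illumos source): estimate the TSC
 * frequency in Hz from tick counts taken over the same interval.
 *
 *   tsc_delta - TSC ticks elapsed during the interval
 *   ref_delta - reference-timer ticks elapsed during the interval
 *   ref_hz    - reference-timer frequency in Hz
 *
 * Returns 0 when the reference timer did not advance or its frequency
 * is unknown, so the caller can treat calibration as failed.
 */
uint64_t
estimate_tsc_hz(uint64_t tsc_delta, uint64_t ref_delta, uint64_t ref_hz)
{
	if (ref_delta == 0 || ref_hz == 0)
		return (0);

	/*
	 * Split the division so that tsc_delta * ref_hz is never formed
	 * directly. This assumes (tsc_delta % ref_delta) * ref_hz still
	 * fits in 64 bits, which holds for short sampling intervals.
	 */
	return ((tsc_delta / ref_delta) * ref_hz +
	    ((tsc_delta % ref_delta) * ref_hz) / ref_delta);
}
```

For example, 30,000,000 TSC ticks counted across 10 ticks of a 1000 Hz reference would give an estimate of 3,000,000,000 Hz. A reference timer that never advances yields 0, which a caller could report as a calibration failure.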

justledbetter commented 3 years ago

This one crashes in a loop with messages repeating:

panic[cpu0]/thread=fffffffffbc4a0c0: bad DTrace trap

panic: entering debugger (continue to reboot)

...and then it eventually hangs.

jasonbking commented 3 years ago

That is interesting... Can you go into the boot options menu in the boot loader and enable verbose boot as well as kmdb? That should drop you to the kmdb prompt, where you can use $C to get a stack trace.

justledbetter commented 3 years ago

OK enabling the debugger, I get the following output:

panic[cpu0]/thread=fffffffffbc4a0c0: Failed to calibrate TSC
fffffffffbc8a280 unix:tsc_calibrate+16f ()
fffffffffbc8a2a0 unix:startup_tsc+18 ()
... startup+4a ()
etc

(Sorry, have to copy paste with my eyes :) )

Note I cannot input any text in the debugger after the crash.

jasonbking commented 3 years ago

I think I see what happened.. I can do a quick incremental build (though it'll take a bit to re-upload)..

jasonbking commented 3 years ago

I've uploaded a new ISO image (same path). Let me know how that one works -- if nothing else, you should get a new error :) though hopefully it actually works.

justledbetter commented 3 years ago

It gets past that error now, but still hangs later. The output of your module reads:

TSC calibrated using hyperv; freq is 0 MHzSMBIOS v3.1 loaded (961 bytes)initialized model-specific module 'cpu_ms.GenuineIntel' on chip 0 core 0 strand 0

(lack of newlines as in the original)

Now the startup hangs at:

ramdisk0 at root
ramdisk0 is /ramdisk
WARNING: Last shutdown is later than time on time-of-day chip; check date.
root on /ramdisk:a fstype ufs
/cpus (cpunex0) online
pseudo-device dld0
dld0 is /pseudo/dld@0
<hang>

Could it be hanging due to the assumed 0 MHz clock? Perhaps a div-zero trap that's not presenting itself as a crash?
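
To illustrate the concern: if a frequency of 0 reached later timing code, any tick-to-time conversion that divides by it would be invalid. The sketch below shows one way such a value could be rejected up front. The names are hypothetical and not taken from the illumos source, and it assumes the frequency arrives in Hz.

```c
#include <stdbool.h>
#include <stdint.h>

#define	HZ_PER_MHZ	1000000ULL

/*
 * Hypothetical check (not from the illumos source): convert a reported
 * TSC frequency from Hz to whole MHz and refuse values that round down
 * to 0 MHz, since later code could not safely divide by them.
 */
bool
tsc_freq_usable(uint64_t freq_hz, uint64_t *mhz_out)
{
	uint64_t mhz = freq_hz / HZ_PER_MHZ;

	if (mhz == 0)
		return (false);	/* caller should fall back or panic early */

	*mhz_out = mhz;
	return (true);
}
```

With a check like this, a reported frequency of 0 Hz would be caught at calibration time instead of surfacing later as a hang.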

I am trying to hit F1+A, but it's not doing anything. Not sure if this is similar to before (unable to enter text) or if there's something else going on. Is there a way to interrupt boot and set a breakpoint before it gets to this point in order to trace further?

Update: When booting in non-verbose mode, I get the following additional hint:

WARNING: Last shutdown is later than time on time-of-day chip; check date.
WARNING: Time of Day clock error: reason [Stalled]. -- Stopped tracking Time Of Day clock.

Also: While the system is hung, Hyper-V reports a constant 2% CPU usage (this VM has 2 vCPUs out of a 12-core system, iirc)

jasonbking commented 3 years ago

Ok.. I'm trying one more thing and have updated the link again.. see if that works any better..

justledbetter commented 3 years ago

It's back to crashing at tsc_calibrate+16f () (with Failed to calibrate TSC as listed out above)

justledbetter commented 3 years ago

Standing by to continue testing any time you need me to -- Thanks very much for all the help so far!