Closed Andy2244 closed 6 years ago
@bwarden could you take a look to see if this is a Clear issue vs a hyper-v bug.
From what I can see, everything should be in place. Please make sure you've installed the os-cloudguest-azure bundle for the userspace utilities, and follow this guide to verify that the kernel modules are installed properly: https://docs.microsoft.com/en-us/windows-server/virtualization/hyper-v/manage/manage-hyper-v-integration-services#start-and-stop-an-integration-service-from-a-linux-guest
Please also provide your client kernel version (uname -r) so I can make sure I'm looking at the right kernel build.
@bwarden oki i did not have the "os-cloudguest-azure bundle" installed, is this bundle needed to get basic integration services working? (shutdown/time)
After installing the bundle and rebooting the vm i get this:
uname -r
4.14.21-123.hyperv
lsmod | grep hv_utils
= nothing
lsmod | grep hv_
hv_netvsc 49152 0
compgen -c hv_
hv_fcopy_daemon
hv_kvp_daemon
hv_vss_daemon
ps -ef | grep hv
root 42 2 0 11:08 ? 00:00:00 [hv_vmbus_con]
root 95 2 0 11:08 ? 00:00:00 [hv_balloon]
root 1867 1 0 11:10 ? 00:00:00 hv_kvp_daemon
root 2429 1711 0 11:16 pts/0 00:00:00 grep hv
On the windows side i get this:
Get-Service -Name vm*
Status Name DisplayName
------ ---- -----------
Running vmcompute Hyper-V Host Compute Service
Stopped vmicguestinterface Hyper-V Guest Service Interface
Stopped vmicheartbeat Hyper-V Heartbeat Service
Stopped vmickvpexchange Hyper-V Data Exchange Service
Stopped vmicrdv Hyper-V Remote Desktop Virtualizati...
Stopped vmicshutdown Hyper-V Guest Shutdown Service
Stopped vmictimesync Hyper-V Time Synchronization Service
Stopped vmicvmsession Hyper-V PowerShell Direct Service
Stopped vmicvss Hyper-V Volume Shadow Copy Requestor
Running vmms Hyper-V Virtual Machine Management
Get-VMIntegrationService -VMName "clear linux"
VMName Name Enabled PrimaryStatusDescription SecondaryStatusDescription
------ ---- ------- ------------------------ --------------------------
clear linux Guest Service Interface True OK
clear linux Heartbeat True OK
clear linux Key-Value Pair Exchange True OK The protocol version of the component installed in the virtual machine does not match the version expec...
clear linux Shutdown True OK
clear linux Time Synchronization True OK
clear linux VSS False OK
So it seems the "hv_kvp_daemon" is running, but the integration services are not supported/started for the VM? Manually trying to start the services fails, with the message that the service has to be supported by both the host and the VM, or that the service is not needed and is therefore started/stopped automatically.
The lsmod output is OK -- we actually build hv_utils into the kernel statically. The os-cloudguest-azure bundle provides the user-space tools (hv_fcopy_daemon, etc).
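One way to tell a built-in driver from a missing one is the kernel build configuration. The sketch below assumes the kernel exposes its config at /proc/config.gz or /boot/config-$(uname -r), which not every distribution does, and relies on CONFIG_HYPERV_UTILS being the Kconfig symbol for hv_utils. The function names are made up for illustration.

```shell
#!/bin/sh
# Classify a single kernel config line for the hv_utils driver.
classify_hv_utils() {
  case "$1" in
    CONFIG_HYPERV_UTILS=y) echo "builtin" ;;
    CONFIG_HYPERV_UTILS=m) echo "module" ;;
    *)                     echo "unknown" ;;
  esac
}

# Print the config line from whichever location this kernel provides, if any.
# Both paths are assumptions; some distributions keep the config elsewhere.
hv_utils_config_line() {
  if [ -r /proc/config.gz ]; then
    zcat /proc/config.gz | grep '^CONFIG_HYPERV_UTILS='
  elif [ -r "/boot/config-$(uname -r)" ]; then
    grep '^CONFIG_HYPERV_UTILS=' "/boot/config-$(uname -r)"
  fi
}

echo "hv_utils: $(classify_hv_utils "$(hv_utils_config_line)")"
```

A result of "builtin" would explain an empty `lsmod | grep hv_utils`. "unknown" only means the config could not be read from those two paths.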
Looks like the services are running in the Clear Linux guest. The Get-Service step above applies to Windows guest VMs. Did you follow this section (from the Hyper-V manager) to ensure they're enabled on the host? Probably equivalent to Get-VMIntegrationService, but I'm not very familiar with Hyper-V.
With the 22010 image, and those enabled on my system, I see time update automatically whenever I resume the VM.
You can also try "sudo journalctl | grep hv_" to see the status of the daemons.
@bwarden So the "Stopped vmictimesync" is only used for windows guests? Yet in the documentation they don't actually provide which service/daemon is responsible for time/shutdown on the linux side?
Here is the journalctl and the date still lists the last shutdown time (1:14), while i typed this command at 10:18 CEST. So i only get a time update via NTP once it kicks in at its normal update interval, without it i wont get any valid time at all.
Wed Apr 25 01:14:24 CEST 2018
root@clear-vm ~ # sudo journalctl | grep hv_
Apr 24 10:54:30 clear-vm kernel: hv_utils: Shutdown request received - graceful shutdown initiated
Apr 24 10:54:54 clear-vm kernel: calling netvsc_drv_init+0x0/0x1000 [hv_netvsc] @ 163
Apr 24 10:54:54 clear-vm kernel: hv_vmbus: registering driver hv_netvsc
Apr 24 10:54:54 clear-vm kernel: initcall netvsc_drv_init+0x0/0x1000 [hv_netvsc] returned 0 after 65 usecs
Apr 24 10:54:55 clear-vm hv_vss_daemon[210]: Hyper-V VSS: VSS starting; pid is:210
Apr 24 10:54:55 clear-vm hv_vss_daemon[210]: Hyper-V VSS: open /dev/vmbus/hv_vss failed; error: 2 No such file or directory
Apr 24 10:54:55 clear-vm kernel: hv_utils: KVP IC version 4.0
Apr 24 10:55:39 clear-vm kernel: hv_balloon: Max. dynamic memory size: 2560 MB
Apr 24 10:58:50 clear-vm kernel: hv_utils: Shutdown IC version 3.0
Apr 24 10:58:52 clear-vm kernel: hv_utils: TimeSync IC version 4.0
Apr 24 10:58:50 clear-vm kernel: hv_utils: Heartbeat IC version 3.0
Apr 24 10:58:50 clear-vm kernel: hv_utils: FCopy IC version 1.1
Apr 24 11:05:28 clear-vm kernel: hv_utils: Shutdown request received - graceful shutdown initiated
Apr 24 11:05:54 clear-vm kernel: calling netvsc_drv_init+0x0/0x1000 [hv_netvsc] @ 151
Apr 24 11:05:54 clear-vm kernel: hv_vmbus: registering driver hv_netvsc
Apr 24 11:05:54 clear-vm kernel: initcall netvsc_drv_init+0x0/0x1000 [hv_netvsc] returned 0 after 382 usecs
Apr 24 11:05:55 clear-vm hv_vss_daemon[214]: Hyper-V VSS: VSS starting; pid is:214
Apr 24 11:05:55 clear-vm hv_vss_daemon[214]: Hyper-V VSS: open /dev/vmbus/hv_vss failed; error: 2 No such file or directory
Apr 24 11:05:55 clear-vm kernel: hv_utils: KVP IC version 4.0
Apr 24 11:06:39 clear-vm kernel: hv_balloon: Max. dynamic memory size: 2560 MB
Apr 24 11:08:25 clear-vm kernel: hv_utils: Shutdown request received - graceful shutdown initiated
Apr 24 11:08:48 clear-vm kernel: calling netvsc_drv_init+0x0/0x1000 [hv_netvsc] @ 157
Apr 24 11:08:48 clear-vm kernel: hv_vmbus: registering driver hv_netvsc
Apr 24 11:08:48 clear-vm kernel: initcall netvsc_drv_init+0x0/0x1000 [hv_netvsc] returned 0 after 42 usecs
Apr 24 11:08:49 clear-vm kernel: hv_utils: KVP IC version 4.0
Apr 24 11:08:49 clear-vm hv_vss_daemon[214]: Hyper-V VSS: VSS starting; pid is:214
Apr 24 11:08:49 clear-vm hv_vss_daemon[214]: Hyper-V VSS: open /dev/vmbus/hv_vss failed; error: 2 No such file or directory
Apr 24 11:09:33 clear-vm kernel: hv_balloon: Max. dynamic memory size: 2560 MB
The "clear-vm kernel: hv_utils: TimeSync IC version 4.0" line would indicate that it has support for it, yet i get no update. "Get-VMIntegrationService" shows that its offered/enabled to the VM from the host, so what else can i try? Is there some specific systemd service that is responsible for hyperv time/shutdown handling or is this some kernel only thing?
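To see at a glance which integration components the guest kernel has negotiated, the "IC version" lines can be pulled out of the journal. This is a small sketch that only assumes the message format shown in the log above; list_ic_versions is a hypothetical helper name.

```shell
#!/bin/sh
# Read journal text on stdin and print each "<service> <version>" pair that
# hv_utils reported, one per line, with duplicates removed.
list_ic_versions() {
  sed -n 's/.*hv_utils: \(.*\) IC version \(.*\)$/\1 \2/p' | sort -u
}

# Usage on a live guest (kernel messages of the current boot):
#   journalctl -k | list_ic_versions
```

A "TimeSync" entry in the output only shows that the guest negotiated the protocol. It does not show that the host is actually sending time samples.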
PS: I will try to compare this with a ubuntu VM on the same host, so i can at least figure out if its a clearlinux or hyperv host problem.
I get the same behavior for ubuntu 17.10 on a different host, so it seems this is a general problem utilizing the LIS time service under linux.
I dug a little deeper, and going by this pdf for the latest LIS, the hyperv Time-Service needs to be properly configured before it can be used as a time source.
Going by the docs the timesource is installed and working via kernel.
ls /sys/class/ptp
ptp0
cat /sys/class/ptp/ptp0/clock_name
hyperv
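The same check can be scripted so it also works when the Hyper-V clock is not ptp0. A minimal sketch, assuming the sysfs layout shown above; find_hyperv_ptp is a hypothetical name, and the optional argument exists only so the function can be pointed at a different directory.

```shell
#!/bin/sh
# Print the /dev/ptpN path of the first PTP clock whose clock_name is
# "hyperv". Returns non-zero if no such clock exists.
find_hyperv_ptp() {
  root="${1:-/sys/class/ptp}"
  for dir in "$root"/ptp*; do
    # Skip entries without a readable clock_name (including an unmatched glob).
    [ -r "$dir/clock_name" ] || continue
    if [ "$(cat "$dir/clock_name")" = "hyperv" ]; then
      echo "/dev/$(basename "$dir")"
      return 0
    fi
  done
  return 1
}

find_hyperv_ptp || echo "no hyperv PTP clock found" >&2
```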
The simplest option suggested is to disable ntp and switch to the timesync source via:
echo Y > /sys/module/hv_utils/parameters/timesync_mode
Yet this does not work on clear or ubuntu, since the "parameters" is not present under ubuntu (/sys/module/hv_utils/parameters), while in clear "/sys/module/hv_utils" is not present at all?
The more complex option suggested is to use chronyd instead of ntpd, since the latter does not support the ptp source.
/etc/chrony.conf:
refclock PHC /dev/ptp0 poll 3 dpoll -2 offset 0
I tried this under ubuntu and clear, enabled the service and restarted it. Yet under clear i'm still unsure how simple config changes are handled. Do i copy /usr/share/defaults/chrony/chrony.conf to /etc/chrony.conf and add the extra parameter, or do i just create /etc/chrony.conf and somehow it's merged with the default?
I tried both and on ubuntu/clear i get:
chronyc sources
210 Number of sources = 5
MS Name/IP address Stratum Poll Reach LastRx Last sample
===============================================================================
#~ PHC0 0 3 7 9 -145.5s[-145.5s] +/- 349ns
Yet in both systems i get no time updates after a sleep/resume cycle. I would prefer the "echo Y > /sys/module/hv_utils/parameters/timesync_mode" option, yet i have no clue how to enable this. At this point i also have no idea why the chrony approach does not work, and i would have to test all of this under the officially supported CentOS using the latest LIS as well, to verify/cross-reference both solutions.
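The two characters in the first column of the `chronyc sources` output carry the answer here: going by the chronyc man page, `#` marks a locally connected reference clock and `~` marks a source whose time appears too variable, so PHC0 above is seen but not selected. The sketch below decodes that column; describe_source_state is a hypothetical helper, and the wording of each description is my own paraphrase of the man page.

```shell
#!/bin/sh
# Decode the two-character "MS" column of `chronyc sources`.
describe_source_state() {
  mode=$(printf '%s' "$1" | cut -c1)
  state=$(printf '%s' "$1" | cut -c2)
  case "$mode" in
    '^') m="server" ;;
    '=') m="peer" ;;
    '#') m="local reference clock" ;;
    *)   m="unknown mode" ;;
  esac
  case "$state" in
    '*') s="currently selected" ;;
    '+') s="combined with the selected source" ;;
    '-') s="not combined" ;;
    '?') s="unusable" ;;
    'x') s="falseticker" ;;
    '~') s="too variable" ;;
    *)   s="unknown state" ;;
  esac
  echo "$m, $s"
}

# Usage on a live system (the number of header lines to skip depends on
# the chrony version, so adjust NR as needed):
#   chronyc sources | awk 'NR>3 {print $1}' | while read -r ms; do
#     describe_source_state "$ms"
#   done
```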
Btw what LIS version is clear using atm? (LIS 4.2.4-1 seems to-be the latest version)
Maybe someone with a better understanding of this can take a look, since this all would suggest that time synchronization is "broken" for all hyperv images regarding sleep/resume or save/restore states.
PS: Maybe as a work around systemd needs to update via ntp immediately after detecting a resume state, in addition to the default ntp poll rate. I guess this could at least work for a resume state, not sure about a hyperv restore/save operation, i guess those are transparent to systemd and only LIS understands those?
Let me rephrase. In addition to checking on the host via "Get-VMIntegrationService", could you please try from the Hyper-V Manager GUI, under Settings for the VM, verifying that Time synchronization is checked under Integration Services? I know it should be the same, but something's not right.
I've tried replicating this on my own system, and I left date running in a loop, then saved the VM yesterday afternoon. When I resumed it this morning, the time rolled over exactly as expected. Given that you're having trouble with multiple client VMs, it's more likely that something's not quite right on your host.
As I mentioned, hv_utils is built into the kernel, not as a standalone module. This is why you don't see it in /sys/module. Time sync should work out of the box, without having to pass an extra parameter or configure anything in userspace.
I do have one other idea. By default, we include a user-space SNTP client, systemd-timesyncd. In my environment, it can't reach any NTP servers, so it does nothing. If it could reach a server, maybe it could interfere with LIS. You can check its status with timedatectl, and if it shows Network Time or NTP synchronized as yes, we could try some additional actions. Also, if you have manually configured any other time services, please make sure they're disabled. Having multiple time services trying to set the clock (with their own heuristics for slewing vs. stepping) can cause a lot of problems.
@bwarden Yes the service is set enabled (checked) from the management gui. I also have a working NTP aka:
timedatectl:
Network time on: yes
NTP synchronized: yes
RTC in local TZ: no
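A check like this can be scripted when comparing several guests. The sketch below reads `timedatectl` output on stdin and succeeds if any network-time field reports yes. The first two labels come from the output above; "System clock synchronized" and "NTP service" are the labels I believe newer systemd releases print, so treat those as assumptions. network_time_active is a hypothetical name.

```shell
#!/bin/sh
# Succeed (exit 0) if timedatectl output on stdin shows network time in use.
network_time_active() {
  grep -Eiq '^[[:space:]]*(Network time on|NTP synchronized|System clock synchronized|NTP service):[[:space:]]*(yes|active)[[:space:]]*$'
}

# Usage on a live guest:
#   if timedatectl | network_time_active; then
#     echo "an NTP/SNTP client is setting the clock"
#   fi
```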
I just noticed something:
then saved the VM yesterday afternoon. When I resumed it this morning, the time rolled over exactly as expected.
I'm not manually saving the VM, i let the host windows system go into its normal sleep/suspend/hibernate state while the VM is running, and then just resume the host. Maybe i was just too naive assuming such a scenario is covered by hyperv + LIS? I assumed that LIS would detect the difference between the resumed host system clock and the restored linux VM and force an update.
Ah, fantastic data point. I've been running this:
while true; do date; dmesg -c | grep timesync; sleep 1; done
...on the Clear Linux VM to show time discontinuities and the logs from hv_utils indicating receipt of messages from the TimeSync service.
When I save/suspend the VM, I can see the time jump on resume. When I suspend the host, I reproduce your problem -- the time continues from where it left off, and most importantly, there are no messages from TimeSync. At this point it looks like this might be a problem with Hyper-V itself not sending the messages, since we know the guest VM can receive them. I'll look into whether this is a known issue with Hyper-V or LIS.
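A variant of that loop that prints only when the wall clock moves by more than expected makes long test runs easier to read. This is a sketch, not the loop used above: clock_jump and watch_clock are hypothetical names, and the default tolerance of 2 seconds is an arbitrary choice.

```shell
#!/bin/sh
# Unexpected jump in seconds between two epoch samples ($1 then $2) that
# were taken about $3 seconds apart.
clock_jump() {
  echo $(( $2 - $1 - $3 ))
}

# Sample the wall clock every $1 seconds and report jumps larger than $2.
watch_clock() {
  interval="${1:-1}"
  tolerance="${2:-2}"
  prev=$(date +%s)
  while sleep "$interval"; do
    now=$(date +%s)
    jump=$(clock_jump "$prev" "$now" "$interval")
    if [ "$jump" -gt "$tolerance" ] || [ "$jump" -lt "-$tolerance" ]; then
      echo "$(date): clock jumped by ${jump}s"
    fi
    prev=$now
  done
}

# Usage: watch_clock 1 2    (runs until interrupted)
```

If the guest clock simply continues from where it left off after a host suspend, this prints nothing, which matches the failure described above.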
As an interesting data point, I setup chrony to track the PTP clock device exposed by LIS. This is merely a workaround, but it's less clunky than running an NTP client in a guest VM.
I put this in /etc/chrony.conf (which completely overrides the system defaults from /usr/share/defaults/chrony/chrony.conf):
refclock PHC /dev/ptp0
I disabled systemd-timesyncd (the SNTP client):
systemctl mask --now systemd-timesyncd
I enabled chrony:
systemctl enable --now chronyd
With chrony running, I suspended, waited, and resumed my host. While the time initially didn't match, chrony noticed the disparity and gracefully accelerated the clock to make up lost time within a couple of minutes, as verified in the logs, via:
journalctl -u chronyd
which reported:
System clock wrong by 16.715141 seconds, adjustment started
You could also add this to chrony.conf to make it step immediately on errors larger than one second:
makestep 1 -1
I found anecdotes like this from people with similar experiences, but I haven't found any official documentation. I would guess that it's just assumed you would suspend a guest VM properly before suspending or shutting down a host.
@bwarden Thanks for looking into this issue, i could reproduce your fix and was apparently just missing the "makestep 1 -1" option to allow the large changes. I now get:
System clock wrong by 31939.769192 seconds, adjustment started
System clock was stepped by 31939.769192 seconds
I guess i will close this issue, since it's a LIS/Hyperv specific "oddity". I still think it's strange that the shutdown LIS service transparently handles host restarts/shutdowns without any manual interaction, and can even spin up the VM again on a restart if configured, yet seems to handle sleep/suspend wrongly. It seems to me that having LIS and a direct communication channel to the host should be enough to handle those scenarios as well, but what do i know 😄
Thanks again for the time diagnosing this and the fix.
I left my host hibernated overnight, and with an error of 54409 seconds, chrony didn't believe the reference clock was accurate, so it didn't update. I'd recommend adding "trust" to the refclock statement so that chrony always believes this clock.
@bwarden Just a quick followup, i also had to add ntp servers, since after 3+ days of having the vm suspended the clock would not move forward again. Chrony seemed unable to use/find the refclock device for whatever reason.
Here is my current config that seems to work, for anyone having this issue.
/etc/chrony.conf
refclock PHC /dev/ptp0 trust poll 2
makestep 1 -1
maxdistance 16.0
pool pool.ntp.org iburst
driftfile /var/lib/chrony/drift
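For reference, the steps from this thread can be collected into one script. This is a sketch under the assumptions already stated above: the Hyper-V PTP clock is /dev/ptp0 unless another device is passed in, and /etc/chrony.conf fully overrides the distribution default. render_chrony_conf is a hypothetical helper, and the comments describe each directive as I understand the chrony documentation.

```shell
#!/bin/sh
# Print a chrony.conf equivalent to the one above, with one comment per line.
render_chrony_conf() {
  ptp_dev="${1:-/dev/ptp0}"
  cat <<EOF
# Use the Hyper-V PTP hardware clock and always trust it.
refclock PHC $ptp_dev trust poll 2
# Step the clock whenever the offset exceeds 1 second, on any update.
makestep 1 -1
# Accept sources with a root distance of up to 16 seconds.
maxdistance 16.0
# Fall back to public NTP servers.
pool pool.ntp.org iburst
# Persist the measured drift rate across restarts.
driftfile /var/lib/chrony/drift
EOF
}

# Usage (as root), following the order used earlier in the thread:
#   systemctl mask --now systemd-timesyncd
#   render_chrony_conf > /etc/chrony.conf
#   systemctl enable --now chronyd
#   journalctl -u chronyd
```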
Is there any useful information in dmesg or the journal? I wonder if there's an issue with the virtual device.
@bwarden Forgot to check those sorry.
Hello. If only the RTC/PHC is used as the source in chrony.conf, what is the influence on the time cycle in the VM / Hyper-V?
I noticed that even if i enable "time synchronization" in the hyper-v integration host options, after i wake my VM host from a sleep state, the date/time is not updated for the VM and the VM has the date/time from when the host was put into sleep mode. At some point the network time synchronization kicks in, but this often takes too long and leaves the system in a "bad" date state. So i have to manually restart the network-time service to quickly get the date/time updated.
As i understand it the "time synchronization" integration feature should take care of this even without any network-time service. So it seems the hyper-v image is not correctly using or has support for this feature?