QubesOS / qubes-issues

The Qubes OS Project issue tracker
https://www.qubes-os.org/doc/issue-tracking/

Shutting down domain takes a long time #5426

Open heinrich-ulbricht opened 4 years ago

heinrich-ulbricht commented 4 years ago

Qubes OS version R4.0

Affected component(s) or functionality Shutting down a domain.

Brief summary Shutting down one of my domains takes ~40 seconds. Other domains only take ~6 seconds.

To Reproduce Don't know how to reproduce. Works (not) on my machine™.

Expected behavior Domain shuts down fast.

Actual behavior Domain takes ages.

Additional context My domain has 600 GB of private storage. I recently increased the size by ~60 GB, ignoring the advice to increase in steps of 10 GB - I just added 60 GB at once. Now I don't know whether the slow shutdown started then, or when I switched the template from Fedora 28 to Fedora 30, or whether it is not related at all. It might also be normal for such a size. Somebody familiar with the inner workings might know whether any of those things could matter.

Solutions you've tried Don't know where to start, which logs to consult or which tools to use.

Any help to get started diagnosing this is appreciated.

marmarek commented 4 years ago

Take a look at /var/log/xen/console/guest-VMNAME.log (also available in the Qubes Manager right-click menu) for that domain. It should contain the individual service stops during domain shutdown, including timestamps - that will give a hint about what takes so long.
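For example, to follow it live from dom0 while triggering the shutdown (substitute the real qube name for VMNAME):

```shell
# dom0: stream the guest console log during shutdown
sudo tail -f /var/log/xen/console/guest-VMNAME.log
```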

heinrich-ulbricht commented 4 years ago

@marmarek Looking at the most recent log file for the domain I can see that the log messages report successful shutdown (so it seems) almost immediately after clicking the "Shutdown" button in the Qubes menu. When I click Shutdown and look at the log I immediately see this:

[  OK  ] Reached target Shutdown.
[  OK  ] Reached target Final Step.
[  OK  ] Started Power-Off.
[  OK  ] Reached target Power-Off.
[   35.852900] systemd-shutdown[1]: Syncing filesystems and block devices.
[   35.915401] systemd-shutdown[1]: Sending SIGTERM to remaining processes...
[   35.922197] systemd-journald[210]: Received SIGTERM from PID 1 (systemd-shutdow).
[   35.968467] systemd-shutdown[1]: Sending SIGKILL to remaining processes...
[   35.976329] systemd-shutdown[1]: Unmounting file systems.
[   35.980770] [1619]: Remounting '/usr/lib/modules' read-only in with options 'lowerdir=/tmp/modules,upperdir=/sysroot/lib/modules,workdir=/sysroot/lib/.modules_work'.
[   35.985098] [1620]: Unmounting '/usr/lib/modules'.
[   35.995188] [1621]: Remounting '/' read-only in with options 'discard'.
[   36.017008] EXT4-fs (xvda3): re-mounted. Opts: discard
[   36.027040] systemd-shutdown[1]: All filesystems unmounted.
[   36.027075] systemd-shutdown[1]: Deactivating swaps.
[   36.027192] systemd-shutdown[1]: All swaps deactivated.
[   36.027219] systemd-shutdown[1]: Detaching loop devices.
[   36.027512] systemd-shutdown[1]: All loop devices detached.
[   36.027542] systemd-shutdown[1]: Detaching DM devices.
[   36.172314] reboot: System halted

And then it takes ~40 seconds. No further log entries are added, so the log does not really help. Is this ok? Where would I look further?

brendanhoar commented 4 years ago
  1. What physical storage are you using, SSD or HDD?
  2. Are you using the standard LVM-thin layout, btrfs, or standard file-based containers?
  3. What was the initial version of Qubes installed?
  4. Are you running standard or testing repos?
heinrich-ulbricht commented 4 years ago

@brendanhoar

  1. classic HDD
  2. standard LVM-thin in a separate pool
  3. Qubes R4.0
    • did a migration lately, but from 4.0 to 4.0 to increase the size of my main-pool SSD; re-created the large pool on another HDD that SHOULD be the same model as the first one
    • I upgraded the underlying Fedora 28 TemplateVM to Fedora 30 lately
  4. standard Qubes and Fedora repos

Your questions lead me to do some more testing:

Shutdown time currently is ~50 seconds in all cases.

The first time I noticed the long shutdown was after increasing the private storage by a large amount. But it might be that I simply didn't notice the long shutdown time before during regular shutdowns. Still, I'm certain it used to be a lot faster. If I could pinpoint it to a slow HDD or something, that would be fine. I'd just like to see some log where the delay can be seen. Maybe it's a timeout or something.

heinrich-ulbricht commented 4 years ago

Just had a look at the oldest Xen log file I have, which is from Oct 06. Its shutdown sequence looks different:

         Starting Power-Off...
[25187.655862] systemd-shutdow: 25 output lines suppressed due to ratelimiting
[25187.759906] systemd-shutdown[1]: Syncing filesystems and block devices.
[25187.839439] systemd-shutdown[1]: Sending SIGTERM to remaining processes...
[25187.842755] systemd-journald[203]: Received SIGTERM from PID 1 (systemd-shutdow).
[25187.874552] systemd-shutdown[1]: Sending SIGKILL to remaining processes...
[25187.877111] systemd-shutdown[1]: Unmounting file systems.
[25187.879350] [2493]: Remounting '/usr/lib/modules' read-only in with options 'lowerdir=/tmp/modules,upperdir=/sysroot/lib/modules,workdir=/sysroot/lib/.modules_work'.
[25187.887041] [2494]: Unmounting '/usr/lib/modules'.
[25187.896060] [2495]: Remounting '/' read-only in with options 'discard'.
[25187.923818] EXT4-fs (xvda3): re-mounted. Opts: discard
[25187.927815] systemd-shutdown[1]: All filesystems unmounted.
[25187.927830] systemd-shutdown[1]: Deactivating swaps.
[25187.927869] systemd-shutdown[1]: All swaps deactivated.
[25187.927880] systemd-shutdown[1]: Detaching loop devices.
[25187.928048] systemd-shutdown[1]: All loop devices detached.
[25187.928060] systemd-shutdown[1]: Detaching DM devices.
[25187.954533] xenbus: xenbus_dev_shutdown: device/vbd/51840: Closed != Connected, skipping
[25187.993300] reboot: System halted

The xenbus line is gone in the current log. But then again, those changes might be due to kernel or Xen updates. And the delay is not covered in the log.

heinrich-ulbricht commented 4 years ago

guid.domain.log differs from other domains. It shows:

Icon size: 128x128
libvchan_is_eof
Icon size: 128x128
domain dead
Failed to connect to gui-agent

These log entries seem to be unique to this domain:

domain dead
Failed to connect to gui-agent

Maybe it's just a reporting issue? Some component not noticing that the shutdown already happened?

marmarek commented 4 years ago

Those particular messages are probably unrelated, but I have another idea: during that long shutdown, observe the output of xl list, specifically the state column. Is it -b---- or something else while that long delay happens? If something else, check which version of qubes-libvchan-xen you have - specifically, whether it is 4.0.7 or older; a related change that could improve this was included there.
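For example, with a simple polling loop in dom0 (VMNAME is a placeholder):

```shell
# dom0: print the xl state line once per second until the domain disappears
while sudo xl list VMNAME 2>/dev/null; do sleep 1; done
```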

heinrich-ulbricht commented 4 years ago

@marmarek During shutdown I can observe the state in xl list change like this over time (domain being the name of my domain):

Note how the domain name at some point is replaced by (null) and the time stops.

It stays like (null): --ps-d (21.0) for about 8 seconds, then the entry disappears while the domain is still shutting down for many more seconds according to the Qubes Domains list.

In my VM I have version 4.0.7 of qubes-libvchan-xen.

[user@domain ~]$ dnf list installed | grep qubes-libvchan-xen
qubes-libvchan-xen.x86_64                   4.0.7-1.fc30                          @qubes-vm-r4.0-current

Does this shed some light?

brendanhoar commented 4 years ago
  1. What model of HDD is this? Seagate or Western Digital? Does it report it supports ... trim or unmap? SMR-based HDDs now sometimes do, and it's possible they take more time to process a blkdiscard than SSDs do (which may be occurring when removing volatile/snapshot volumes during VM shutdown).
  2. By any chance, did you change the qubes-wide default pool from the primary (SSD) pool to the secondary (HDD) pool?
  3. Where is the template for this VM located, SSD or HDD?
  4. When you grep the output of sudo lvs for the name of your running VM and the name of the template it is based on...in which pools do the standard, volatile and snapshot volumes reside?
heinrich-ulbricht commented 4 years ago

@brendanhoar

  1. It's a Western Digital WD10EAVS from years ago. So SMR, trim etc. shouldn't be things to consider (right?)
  2. Yes, I temporarily changed the qubes-wide default pool from the primary (SSD) pool to the secondary pool to restore the large domain to the HDD pool. Then I changed it back to the SSD pool.
  3. The template is located on the SSD.
  4. Find the grep of the running VM below; pool00 is on the SSD, vg_contoso/lv_contoso on the HDD
[heinrich@dom0 console]$ sudo lvs | grep contoso
  vm-contoso-root-snap                               qubes_dom0 Vwi-aotz--  21.72g pool00    vm-fedora-30                         56.84                                  
  lv_contoso                                         vg_contoso  twi-aotz-- 931.27g                                                             61.82  39.53                           
  vm-contoso-private                                 vg_contoso  Vwi-a-tz-- 585.94g lv_contoso vm-contoso-private-1572293911-back                 98.26                                  
  vm-contoso-private-1572268964-back                 vg_contoso  Vwi-a-tz-- 585.94g lv_contoso                                                   98.26                                  
  vm-contoso-private-1572293911-back                 vg_contoso  Vwi-a-tz-- 585.94g lv_contoso vm-contoso-private-1572268964-back                 98.26                                  
  vm-contoso-private-snap                            vg_contoso  Vwi-aotz-- 585.94g lv_contoso vm-contoso-private                                 98.26                                  
  vm-contoso-volatile                                vg_contoso  Vwi-aotz--  10.00g lv_contoso 

Do you see anything in there that looks odd?

brendanhoar commented 4 years ago

Thanks.

I had a (now seemingly incorrect) hunch the issue was that blkdiscard was issuing discards* against thin-volumes hosted on the HDD...but that WD drive is around 10 years old, so it definitely does not support TRIM, so that's not the issue.

-Brendan

* Qubes issues blkdiscard before lvremove post-shutdown for volatile/snapshot volumes and/or post-VM-delete for deleted VMs. It's good practice, for several reasons, for upper layers to notify lower layers that storage is being de-allocated, and LVM does not currently do so explicitly when lvremove is executed. If all layers have pass-down enabled (by default, Qubes doesn't, but many of us enable it), this leads to TRIM requests being issued to the physical storage layer.
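In other words, roughly this sequence (a sketch of the behavior described above, not Qubes' actual implementation; the volume name is borrowed from the lvs output earlier in this thread):

```shell
# sketch: discard the old revision's blocks, then remove the LV
sudo blkdiscard /dev/vg_contoso/vm-contoso-private-1572268964-back
sudo lvremove -y vg_contoso/vm-contoso-private-1572268964-back
```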

heinrich-ulbricht commented 4 years ago

@brendanhoar By explicitly stating the age of this drive you make me wanna do backups more often... Thanks for the theory and information. The mystery continues.

heinrich-ulbricht commented 4 years ago

I made another observation today while trying to reduce revisions_to_keep for the pool, to see if this changes something. (Not sure if this works; I'm still trying to make sense of #3256.)

Calling sudo lvs | grep contoso repeatedly while the large domain is shutting down shows that one value in the Data% column decreases steadily until it reaches zero. The "domain has shut down" message pops up at the exact moment the value reaches zero.

Here is the log of calling sudo lvs | grep contoso three times while the domain is doing its ~50-second shutdown (Data% being 37.36, then 15.23, then 0.01). By the way, there is a noticeable delay of about 3 seconds until each of those calls returns.

[heinrich@dom0 ~]$ sudo lvs | grep contoso
  lv_contoso                                         vg_contoso  twi-aotz-- 931.27g                                                             61.82  55.72                           
  vm-contoso-private                                 vg_contoso  Vwi-a-tz-- 585.94g lv_contoso vm-contoso-private-1572532134-back                 98.26                                  
  vm-contoso-private-1572522261-back                 vg_contoso  Vwi-aotz-- 585.94g lv_contoso                                                   37.36                                  
  vm-contoso-private-1572532035-back                 vg_contoso  Vwi-a-tz-- 585.94g lv_contoso vm-contoso-private-1572522261-back                 98.26                                  
  vm-contoso-private-1572532134-back                 vg_contoso  Vwi-a-tz-- 585.94g lv_contoso vm-contoso-private-1572532035-back                 98.26                                  
[heinrich@dom0 ~]$ sudo lvs | grep contoso
  lv_contoso                                         vg_contoso  twi-aotz-- 931.27g                                                             61.82  62.31                           
  vm-contoso-private                                 vg_contoso  Vwi-a-tz-- 585.94g lv_contoso vm-contoso-private-1572532134-back                 98.26                                  
  vm-contoso-private-1572522261-back                 vg_contoso  Vwi-aotz-- 585.94g lv_contoso                                                   15.23                                  
  vm-contoso-private-1572532035-back                 vg_contoso  Vwi-a-tz-- 585.94g lv_contoso vm-contoso-private-1572522261-back                 98.26                                  
  vm-contoso-private-1572532134-back                 vg_contoso  Vwi-a-tz-- 585.94g lv_contoso vm-contoso-private-1572532035-back                 98.26                                  
[heinrich@dom0 ~]$ sudo lvs | grep contoso
  lv_contoso                                         vg_contoso  twi-aotz-- 931.27g                                                             61.82  64.77                           
  vm-contoso-private                                 vg_contoso  Vwi-a-tz-- 585.94g lv_contoso vm-contoso-private-1572532134-back                 98.26                                  
  vm-contoso-private-1572522261-back                 vg_contoso  Vwi-aotz-- 585.94g lv_contoso                                                   0.01                                   
  vm-contoso-private-1572532035-back                 vg_contoso  Vwi-a-tz-- 585.94g lv_contoso vm-contoso-private-1572522261-back                 98.26                                  
  vm-contoso-private-1572532134-back                 vg_contoso  Vwi-a-tz-- 585.94g lv_contoso vm-contoso-private-1572532035-back                 98.26                                  

@marmarek @brendanhoar How would you interpret this? What exactly is happening there? Some kind of diff being calculated, which maybe just takes time due to the large amount of data on a slow HDD?

heinrich-ulbricht commented 4 years ago

It looks like discarding the oldest snapshot takes its time.

Additionally, I just set revisions_to_keep to 0, and when shutting down the domain 3 snapshots were discarded (two from the old setting revisions_to_keep=2, one current) - each of them taking ages. I could monitor the progress by calling qvm-volume info contoso:private repeatedly and watching the snapshots disappear one after another. Now, with 0 revisions, it still takes time since one snapshot is still discarded on shutdown (leaving 0).
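The polling can be done from dom0 with, e.g. (watch is just a convenience; repeated manual calls work the same):

```shell
# dom0: re-list the volume's revisions every 2 seconds during shutdown
watch -n 2 'qvm-volume info contoso:private'
```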

Increasing revisions_to_keep again to a value of 2 gives me a nice fast shutdown, since the snapshot does not need to be discarded - of course only until there are 3 of them again, at which point one is discarded on shutdown.

@marmarek Does this sound right? Is this supposed to take such a long time? Is there a way to speed this up?

marmarek commented 4 years ago

That makes sense. Issuing DISCARD for the whole 580 GB volume can indeed take time. LVM handles it internally too, even if the underlying block device doesn't support it - specifically, it releases blocks from the volume and updates internal metadata about those blocks (reference counters etc.). I see three options for how to proceed here:

  1. Don't do anything and accept the long shutdown - I think the only things slowed down during this time concern the very same qube (for example, if you want to start it again). Other qubes should be unaffected.
  2. Don't issue DISCARD if the underlying device doesn't support it. Issuing DISCARD here is mostly useful for an underlying SSD, not so much for an HDD. On the other hand, detecting the support may be tricky - there are multiple layers to consider, including LUKS, LVM itself etc. (see the probe sketch after this list).
  3. Make it more asynchronous, in the background - don't block further operations on the qube while it's discarding the old volume. That volume wouldn't be used in any subsequent operations anyway, so it should be doable, but introducing more concurrency into the process may be risky - for example, we'd need to ensure the operation is completed when shutting down the system or removing the qube.
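Regarding the detection problem in the second option, a quick (though incomplete) probe is to inspect the discard limits of every layer of the block stack; all-zero DISC-GRAN/DISC-MAX values at a layer mean discards stop there:

```shell
# dom0: show discard alignment/granularity/limits per block device layer
lsblk --discard
```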

Do you have a specific use case that conflicts with the assumptions in the first point?

brendanhoar commented 4 years ago

[Edit - I will amend the below after catching up on the previous several posts.]

Some interplay of the HDD, /etc/lvm/lvm.conf (and cryptsetup options, if used) may make it possible that the blkdiscard issued before lvremove is zeroing out the container files instead of issuing discards, since the backing store is not trim-capable.

  1. What is the setting of issue_discards= in /etc/lvm/lvm.conf?
  2. What is the setting of thin_pool_discards in /etc/lvm/lvm.conf?
  3. If cryptsetup is being used in an intervening layer between the HDD and the PV/VG/thin volume, is the --allow-discards option being used with the cryptsetup?

Lastly, I wonder if blkdiscard automatically falls back to using zeroout if the volume (real or virtual) does not indicate trim support. That would not be optimal behavior on a thin volume. I may have to go dig out some spinning rust from my closet to do some tests.

Brendan

PS - my one-ish liner on monitoring discards as they happen:

#!/bin/bash
# Columns 12-15 of /sys/block/*/stat are the discard counters (kernels >= 4.18).
watch -n 1 -d "if [ -d /sys/block/sda ] ; then pat1=sd pat2=loop ; else pat1=xvd pat2=loop ; fi ; sync ; echo --DISCARD TOTALS-- ; cat /sys/block/\$pat1*/stat /sys/block/\$pat2*/stat 2>/dev/null | awk 'BEGIN {print \"DRIVE IOS QMERGES SECTORS MSWAITS\"} {printf \"%5i %-8s %s %15s %11s\n\",NR,\$12,\$13,\$14,\$15}'"

In dom0, drive 1 will be /dev/sda, etc.; it'll also show discards on loop devices. In a domU, it uses xvd* devices instead.

heinrich-ulbricht commented 4 years ago

@brendanhoar

  1. issue_discards in /etc/lvm/lvm.conf is 0
  2. thin_pool_discards in /etc/lvm/lvm.conf is commented out, so no explicit value is set
  3. Where can I check this? I set up the second pool, including cryptsetup, by following these instructions: https://www.qubes-os.org/doc/secondary-storage/ - no mention of discard options there, and I configured nothing else

(Just as a side note, and this could be irrelevant: before I can start the (long-shutdown) AppVM in the second pool, I have to start another VM (from the SSD pool) first, OR I have to call sudo service qubesd restart. Otherwise I get an error that the pool could not be found (I'd have to reboot to get the exact error message). I think I commented somewhere else a long time ago about this, but have since ignored it as the workaround was OK for me. Just wanted to mention it since we are taking a deeper look into the specifics of this pool.)

Thanks to both of you for helping me out here.

heinrich-ulbricht commented 4 years ago

@marmarek First and foremost I want to know if something is broken. If it is broken I want to repair it, if possible. If it is not broken then I want to know the options and the effort it takes (and for whom) to implement those. If it's too much hassle I can live with it. I'm currently on a dual-boot(*) machine and unfortunately have to switch between operating systems quite often. So every time I shut down Qubes I have this minute-long wait :-/ It's just annoying.

(*) with a physical SATA switch, so no "real" dual-boot that could be of relevance to the issue

marmarek commented 4 years ago

First and foremost I want to know if something is broken.

I wouldn't call it broken. Rather sub-optimal performance.

So every time I shut down Qubes I have this minute-long wait :-/

This rules out my third idea, as you'd still need to wait during shutdown.

What is the setting of issue_discards= in /etc/lvm/lvm.conf?

I believe this setting applies to non-thin volumes only.

What is the setting of thin_pool_discards in /etc/lvm/lvm.conf?

This indeed is relevant for thin volumes; possible values: ignore, nopassdown, passdown. If the lower layer doesn't support discard, nopassdown and passdown are effectively equivalent. The default is passdown.
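For reference, the mode currently in effect for a pool can be checked from dom0 with something like:

```shell
# dom0: list LVs together with their thin-pool discard mode
sudo lvs -o lv_name,vg_name,lv_layout,discards
```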

3. If cryptsetup is being used in an intervening layer between the HDD and the PV/VG/thin volume, is the --allow-discards option being used with cryptsetup?

See https://www.qubes-os.org/doc/disk-trim/#luks

johnnyboy-3 commented 4 years ago

I have the very same problem, discussed here: https://groups.google.com/forum/m/#!topic/qubes-users/-ZnZM6rS4hs
The problem exists even when setting revisions_to_keep to 0. The workaround for me is to use btrfs.

Btrfs also works great. I noticed that dom0 really hogs the CPU after shutdown of huge AppVMs. For some reason Qubes still creates a <timestamp>-back revision just to delete it again afterwards, even though revisions_to_keep is set to 0. Using btrfs I don't even need to set that config to 0; it shuts down in like 1-2 seconds.

iamahuman commented 3 years ago

While shutting down a VM, I observe freezing in other VMs on the same pool as well. During the freeze, a kworker/*-dm-thin kernel thread takes 100% of a CPU. I suppose this is the same issue?

noskb commented 3 years ago

I'm using an SSD, but I had the same problem; it only occurred on a SanDisk SDSSDXPS480G and not on a Samsung SSD 860 EVO. Of course, my domain has a large private storage.

In conclusion, it was solved by changing the discard mode of the LVM thin pool from passdown to nopassdown.

Run sudo lvchange --discards nopassdown qubes_dom0/pool00 in a dom0 terminal.

This will still return freed space to the thin pool as before, but no TRIM command will be issued to the physical device.

from the lvmthin manpage

Discard

   The discard behavior of a thin pool LV determines how discard
   requests are handled.  Enabling discard under a file system may
   adversely affect the file system performance (see the section on
   fstrim for an alternative.)  Possible discard behaviors:

   ignore: Ignore any discards that are received.

   nopassdown: Process any discards in the thin pool itself and allow
   the no longer needed extents to be overwritten by new data.

   passdown: Process discards in the thin pool (as with nopassdown), and
   pass the discards down to the underlying device.  This is the
   default mode.

The default "passdown" may cause performance degradation in some cases.

marmarek commented 3 years ago

This sounds like your SSD has a very poor implementation of the TRIM operation. "nopassdown" indeed may improve performance in this case but, depending on the SSD model, may reduce the disk's lifetime. It is also possible that this disk's firmware is stupid enough that it won't matter at all besides performance. Hard to tell without doing some extensive tests or analyzing the disk firmware.

brendanhoar commented 3 years ago

At a minimum, Qubes should document possible performance issues under LVM on VM shutdown (during volatile volume/snapshot removals) with both SMR HDDs generally and SSDs with poor firmware. Docs should recommend “nopassdown” in lvm.conf as the workaround, noting the caveats for SSDs above.

I was going to recommend exposing this option via the Qubes global settings, but thinking it through, with all the custom storage layouts out there it might get messy, so that's probably not a great idea.

[I say this all as the instigator (or at least one of them) of adding the blkdiscard invocation to lvm volume removals. I still think the benefits outweigh the other issues.]

B

andrewdavidwong commented 3 years ago

At a minimum, Qubes should document possible performance issues under LVM on VM shutdown (during volatile volume/snapshot removals) with both SMR HDDs generally and SSDs with poor firmware. Docs should recommend “nopassdown” in lvm.conf as the workaround, noting the caveats for SSDs above.

In case you (or any others reading this) are not already aware, the documentation is a community effort, and everyone is welcome to contribute. (That's how things like this get updated!) So, if you'd like to get involved with the project, this is a great way to do it. You can read more about how to submit documentation changes here:

https://www.qubes-os.org/doc/doc-guidelines/

iamahuman commented 3 years ago

dm-thin deletes the mapping nodes one at a time, and ftrace tells me the kernel is spending all the time rebalancing the btree and locking/unlocking nodes.

It certainly needs to get smarter and just delete the common ancestors--which in turn would automatically decrement refcounts (in spacemap?) of all the blocks allocated for child nodes via callbacks, if I read the kernel code correctly.

The most annoying thing about this bug is the total I/O freeze of other volumes in the same pool, resulting in other VMs potentially ceasing to respond for a few seconds.

iamahuman commented 3 years ago

After reading the device-mapper implementation myself, plus per various online sources, I believe the solution (not workaround) is to just use Btrfs or XFS. Both have been around for 10+ years.

Or patch dm-thin, but I think it might be a waste of time...

iamahuman commented 3 years ago

Related: #5077
Introduced in: QubesOS/qubes-core-admin#267 (QubesOS/qubes-core-admin@62e3d571206f134adf6989345962fd92f103e2f3)

iamahuman commented 3 years ago

How about performing the blkdiscard asynchronously? A possible solution would be to rename the LV to be removed to vm-*-stale, put it on a reclaim queue, do the blkdiscard, and finally run lvremove on it.
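A very rough dom0 sketch of that idea (the volume and VG names are illustrative; a real implementation would live in the storage driver, not in a shell script):

```shell
# sketch: rename the doomed LV out of Qubes' view, then discard and
# remove it in the background so the qube isn't blocked meanwhile
lv=vm-contoso-private-1572522261-back
sudo lvrename vg_contoso "$lv" "$lv-stale"
(
  sudo blkdiscard "/dev/vg_contoso/$lv-stale"
  sudo lvremove -y "vg_contoso/$lv-stale"
) &
```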

iamahuman commented 3 years ago

#6476 will resolve this.

DemiMarie commented 3 years ago

After reading the device-mapper implementation myself, plus per various online sources, I believe the solution (not workaround) is to just use Btrfs or XFS. Both have been around for 10+ years.

Agreed, although I am wondering why you say this.

Or patch dm-thin, but I think it might be a waste of time...

Why would it be a waste of time?

iamahuman commented 3 years ago

After reading the device-mapper implementation myself, plus per various online sources, I believe the solution (not workaround) is to just use Btrfs or XFS. Both have been around for 10+ years.

Agreed, although I am wondering why you say this.

Generally, the apparent lack of interest from the maintainers. I might be mistaken, though.

Or patch dm-thin, but I think it might be a waste of time...

Why would it be a waste of time?

Perhaps not completely worthless; I merely meant that it might not be worth the time in the face of an alternative.

DemiMarie commented 3 years ago

After reading the device-mapper implementation myself, plus per various online sources, I believe the solution (not workaround) is to just use Btrfs or XFS. Both have been around for 10+ years.

Agreed, although I am wondering why you say this.

Generally, the apparent lack of interest from the maintainers. I might be mistaken, though.

In what areas? dm-thin is a key part of LVM2, which is at the core of Red Hat’s storage offerings. But I strongly suspect that their needs and ours do not match up very well, especially when it comes to startup and shutdown performance.

iamahuman commented 3 years ago

After reading the device-mapper implementation myself, plus per various online sources, I believe the solution (not workaround) is to just use Btrfs or XFS. Both have been around for 10+ years.

Agreed, although I am wondering why you say this.

Generally, the apparent lack of interest from the maintainers. I might be mistaken, though.

In what areas? dm-thin is a key part of LVM2, which is at the core of Red Hat’s storage offerings. But I strongly suspect that their needs and ours do not match up very well, especially when it comes to startup and shutdown performance.

Yes, that part exactly. IIRC this issue has been brought up on the dm-devel ML a few times, with no response. EDIT: Turns out my memory served me wrong; I'm not sure where I got that impression. My apologies.

DemiMarie commented 3 years ago

After reading the device-mapper implementation myself, plus per various online sources, I believe the solution (not workaround) is to just use Btrfs or XFS. Both have been around for 10+ years.

Agreed, although I am wondering why you say this.

Generally, the apparent lack of interest from the maintainers. I might be mistaken, though.

In what areas? dm-thin is a key part of LVM2, which is at the core of Red Hat’s storage offerings. But I strongly suspect that their needs and ours do not match up very well, especially when it comes to startup and shutdown performance.

Yes, that part exactly. IIRC this issue has been brought up in dm-devel ML a few times, with no response.

I wonder how much of the problem is dm-thin and how much is LVM2. LVM2’s metadata format is textual and particularly inefficient. We could probably do much better if we used dm-thin directly.

DemiMarie commented 3 years ago

Yes, that part exactly. IIRC this issue has been brought up in dm-devel ML a few times, with no response.

Do you have any links?

DemiMarie commented 3 years ago

On further thought: according to https://lore.kernel.org/linux-lvm/540a8738-e09d-b80d-ba1b-bd91b3d6c3d5@redhat.com/ LVM will automatically perform the equivalent of blkdiscard if issue_discards = 1 is set in lvm.conf.
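That is, the relevant lvm.conf stanza would look like this (an excerpt sketch; see the follow-ups below about whether this applies to thin volumes):

```
# /etc/lvm/lvm.conf (dom0) - excerpt
devices {
    issue_discards = 1    # discard an LV's blocks when it is removed
}
```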

marmarek commented 3 years ago

LVM2’s metadata format is textual and particularly inefficient.

There is the lvmetad daemon that caches the parsed metadata, though. I haven't analyzed how well it performs, but I believe it should avoid most of the performance issues caused by loading metadata each time (whatever the format). My understanding from very basic performance analysis is that the biggest performance impact is rather in a) various locking and b) synchronization with udev. There are options to configure or even bypass either of them, but I haven't come to any solution that is both faster and still safe.

iamahuman commented 3 years ago

I wonder how much of the problem is dm-thin

See https://github.com/QubesOS/qubes-issues/issues/5426#issuecomment-761595524.

and how much is LVM2.

Not very relevant to performance here, I guess.

LVM2’s metadata format is textual and particularly inefficient.

You mean the XML format used by thin_dump / thin_restore?

We could probably do much better if we used dm-thin directly.

The actual sector mappings are stored as a B-tree in the meta LV, which the lvm tool doesn't really care much about.

marmarek commented 3 years ago

LVM will automatically perform the equivalent of blkdiscard if issue_discards = 1 is set in lvm.conf.

I believe it affects only fully provisioned LVs, not thin provisioned ones. But let me double check.

DemiMarie commented 3 years ago

LVM2’s metadata format is textual and particularly inefficient.

There is the lvmetad daemon that caches the parsed metadata, though. I haven't analyzed how well it performs, but I believe it should avoid most of the performance issues caused by loading metadata each time (whatever the format). My understanding from very basic performance analysis is that the biggest performance impact is rather in a) various locking and b) synchronization with udev. There are options to configure or even bypass either of them, but I haven't come to any solution that is both faster and still safe.

Can the locking be removed if we wrap a global lock around all LVM calls? Also we can use the LVM shell (with JSON output).

marmarek commented 3 years ago

I believe it affects only fully provisioned LVs, not thin provisioned ones. But let me double check.

Yup, it is the case.

```shell_session
bash-5.0# truncate -s 500M pv.img
bash-5.0# losetup /dev/loop0 ./pv.img
bash-5.0# vgcreate vg0 /dev/loop0
  Physical volume "/dev/loop0" successfully created.
  Volume group "vg0" successfully created
bash-5.0# lvcreate --type thin-pool -n pool0 -l 100%FREE vg0
  Thin pool volume with chunk size 64.00 KiB can address at most 15.81 TiB of data.
  Logical volume "pool0" created.
bash-5.0# lvcreate --thin-pool=pool0 -V 300M -n vol1 vg0
  Logical volume "vol1" created.
bash-5.0# ls -ls pv.img
8320 -rw-r--r-- 1 root root 524288000 Mar 31 17:10 pv.img
bash-5.0# cat /dev/zero > /dev/vg0/vol1
cat: write error: No space left on device
bash-5.0# ls -lsh pv.img
309M -rw-r--r-- 1 root root 500M Mar 31 17:11 pv.img
bash-5.0# lvs
  LV    VG  Attr       LSize   Pool  Origin Data%  Meta%  Move Log Cpy%Sync Convert
  pool0 vg0 twi-aotz-- 488.00m              61.48  14.65
  vol1  vg0 Vwi-a-tz-- 300.00m pool0        100.00
bash-5.0# blkdiscard /dev/vg0/vol1
bash-5.0# ls -lsh pv.img
8.2M -rw-r--r-- 1 root root 500M Mar 31 17:11 pv.img
bash-5.0# cat /dev/zero > /dev/vg0/vol1
cat: write error: No space left on device
bash-5.0# ls -lsh pv.img
309M -rw-r--r-- 1 root root 500M Mar 31 17:11 pv.img
bash-5.0# lvmconfig devices/issue_discards
issue_discards=1
bash-5.0# ls -lsh pv.img
309M -rw-r--r-- 1 root root 500M Mar 31 17:12 pv.img
bash-5.0# lvremove vg0/vol1
Do you really want to remove and DISCARD active logical volume vg0/vol1? [y/n]: y
  Logical volume "vol1" successfully removed
bash-5.0# ls -lsh pv.img
309M -rw-r--r-- 1 root root 500M Mar 31 17:13 pv.img
```

marmarek commented 3 years ago

Can the locking be removed if we wrap a global lock around all LVM calls? Also we can use the LVM shell (with JSON output).

As I've said, I've tried various options, but it didn't result in any significant performance improvement. If you find a specific set of options that do improve performance, I'm all ears.

brendanhoar commented 3 years ago

Due to lvm-thin not issuing discards down the device chain when removing a volume/snapshot, an explicit blkdiscard invocation was added to the volume removal process during the Qubes 4.0 lifecycle.

Is the performance issue related to this explicit blkdiscard?

If so, perhaps a global (or per-pool? or per-VM?) flag could be supported for those who want to disable blkdiscard for various reasons.

B

DemiMarie commented 3 years ago

dm-thin deletes the mapping nodes one at a time, and ftrace tells me the kernel is spending all the time rebalancing the btree and locking/unlocking nodes.

It certainly needs to get smarter and just delete the common ancestors--which in turn would automatically decrement refcounts (in spacemap?) of all the blocks allocated for child nodes via callbacks, if I read the kernel code correctly.

The most annoying thing about this bug is the total I/O freeze of other volumes in the same pool, resulting in other VMs potentially ceasing to respond for a few seconds.

I wonder how hard a kernel patch would be.

iamahuman commented 3 years ago

Yes, that part exactly. IIRC this issue has been brought up in dm-devel ML a few times, with no response.

Do you have any links?

Turns out my memory served me wrong, although I'm not sure where I got that impression. My apologies.

That said, here are some links I've been exploring:

iamahuman commented 3 years ago

dm-thin deletes the mapping nodes one at a time, and ftrace tells me the kernel is spending all the time rebalancing the btree and locking/unlocking nodes. It certainly needs to get smarter and just delete the common ancestors--which in turn would automatically decrement refcounts (in spacemap?) of all the blocks allocated for child nodes via callbacks, if I read the kernel code correctly. The most annoying thing about this bug is the total I/O freeze of other volumes in the same pool, resulting in other VMs potentially ceasing to respond for a few seconds.

I wonder how hard a kernel patch would be.

One way that comes to my mind is to locate the "dominators" of all the leaf nodes in the range, and call dm_btree_del on them. This way, we can save much of the re-balancing work while preserving refcount book-keeping.

Another issue that bothers me is that dm-thin performs discards on a "best effort" basis, i.e. whenever there is OOM or heavy contention, it may abandon certain discard BIOs without remembering to do them later. It also won't retry discard operations that were aborted due to e.g. a reboot or crash. Btrfs is apparently more reliable in this respect.

rustybird commented 3 years ago

To the people who experience these shutdown delays: If you shut down many VMs in parallel with qvm-shutdown --wait vm1 vm2 vm3..., is the total delay the same as the sum of the individual delays, less, or more?
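For example, timed from dom0 (the VM names are placeholders):

```shell
# dom0: measure the wall-clock time of a parallel shutdown
time qvm-shutdown --wait vm1 vm2 vm3
```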

DemiMarie commented 3 years ago

dm-thin deletes the mapping nodes one at a time, and ftrace tells me the kernel is spending all the time rebalancing the btree and locking/unlocking nodes. It certainly needs to get smarter and just delete the common ancestors--which in turn would automatically decrement refcounts (in spacemap?) of all the blocks allocated for child nodes via callbacks, if I read the kernel code correctly. The most annoying thing about this bug is the total I/O freeze of other volumes in the same pool, resulting in other VMs potentially ceasing to respond for a few seconds.

I wonder how hard a kernel patch would be.

One way that comes to my mind is to locate the "dominators" of all the leaf nodes in the range, and call dm_btree_del on them. This way, we can save much of the re-balancing work while preserving refcount book-keeping.

Would you be interested in creating a kernel patch for this?

It is also possible (using thin_trim) to perform the equivalent of fstrim for a thin pool. This can only be done when the pool is not in use, but Qubes OS could do it from the initramfs.
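A rough sketch of such an initramfs step, assuming the default qubes_dom0/pool00 layout and that thin-provisioning-tools is available (the pool itself must not be active, only its hidden data/metadata sub-LVs):

```shell
# sketch: discard all unprovisioned chunks of an inactive thin pool
thin_trim --data-dev /dev/mapper/qubes_dom0-pool00_tdata \
          --metadata-dev /dev/mapper/qubes_dom0-pool00_tmeta
```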

ejose19 commented 3 years ago

In case this is useful: following @iamahuman's https://github.com/QubesOS/qubes-issues/issues/3639#issuecomment-787459236, I've inspected what causes near-100% single-core usage when shutting down the problematic VM. Here's the resulting perf report after running sudo perf record -g -a -- qvm-shutdown --wait slow-vm:

Samples: 2M of event 'cpu-clock', Event count (approx.): 675294250000
  Children      Self  Command          Shared Object                                     Symbol
+   91.85%    91.84%  swapper          [kernel.kallsyms]                                 [k] xen_hypercall_sched_op
+    5.93%     0.00%  kworker/u24:8-k  [kernel.kallsyms]                                 [k] ret_from_fork
+    5.93%     0.00%  kworker/u24:8-k  [kernel.kallsyms]                                 [k] kthread
+    5.93%     0.00%  kworker/u24:8-k  [kernel.kallsyms]                                 [k] worker_thread
+    5.93%     0.00%  kworker/u24:8-k  [kernel.kallsyms]                                 [k] process_one_work
+    5.93%     0.00%  kworker/u24:8-k  [dm_thin_pool]                                    [k] do_worker
+    5.15%     0.00%  kworker/u24:8-k  [dm_thin_pool]                                    [k] process_discard_bio
+    5.15%     0.00%  kworker/u24:8-k  [dm_thin_pool]                                    [k] process_discard_cell_passdown
+    4.89%     0.00%  kworker/u24:8-k  [dm_thin_pool]                                    [k] process_prepared_discard_passdown_pt1
+    3.87%     0.00%  kworker/u24:8-k  [dm_thin_pool]                                    [k] dm_thin_remove_range
+    3.72%     0.02%  kworker/u24:8-k  [dm_persistent_data]                              [k] dm_btree_remove_leaves
+    3.69%     1.17%  kworker/u24:8-k  [dm_persistent_data]                              [k] dm_tm_shadow_block
+    3.18%     0.03%  kworker/u24:8-k  [dm_persistent_data]                              [k] sm_ll_mutate
+    2.14%     0.02%  kworker/u24:8-k  [dm_persistent_data]                              [k] shadow_step
+    2.08%     0.02%  kworker/u24:8-k  [dm_persistent_data]                              [k] rebalance_children
+    1.93%     0.09%  kworker/u24:8-k  [kernel.kallsyms]                                 [k] new_read
+    1.85%     0.02%  kworker/u24:8-k  [dm_persistent_data]                              [k] init_child.isra.4
+    1.81%     0.03%  kworker/u24:8-k  [dm_persistent_data]                              [k] insert
+    1.71%     0.01%  kworker/u24:8-k  [dm_persistent_data]                              [k] disk_ll_save_ie
+    1.61%     0.01%  kworker/u24:8-k  [dm_persistent_data]                              [k] sm_disk_inc_block
+    1.58%     0.08%  kworker/u24:8-k  [dm_persistent_data]                              [k] btree_insert_raw
+    1.57%     0.10%  kworker/u24:8-k  [dm_persistent_data]                              [k] dm_bm_read_lock
+    1.54%     0.01%  kworker/u24:8-k  [dm_persistent_data]                              [k] sm_disk_dec_block
+    1.50%     0.06%  kworker/u24:8-k  [dm_persistent_data]                              [k] sm_ll_lookup_bitmap
+    1.46%     1.23%  kworker/u24:8-k  [kernel.kallsyms]                                 [k] __bufio_new
+    1.40%     0.09%  kworker/u24:8-k  [dm_persistent_data]                              [k] sm_metadata_count_is_more_than_one
+    1.12%     0.02%  kworker/u24:8-k  [dm_persistent_data]                              [k] dm_btree_lookup
+    1.00%     0.04%  kworker/u24:8-k  [dm_persistent_data]                              [k] dm_bm_write_lock
+    0.94%     0.05%  kworker/u24:8-k  [dm_persistent_data]                              [k] btree_lookup_raw.constprop.7
+    0.82%     0.00%  kworker/u24:8-k  [dm_thin_pool]                                    [k] dm_pool_inc_data_range
+    0.79%     0.00%  kworker/u24:8-k  [dm_persistent_data]                              [k] inc_children
+    0.76%     0.00%  kworker/u24:8-k  [dm_thin_pool]                                    [k] process_prepared
+    0.71%     0.01%  kworker/u24:8-k  [dm_persistent_data]                              [k] ro_step
+    0.67%     0.01%  kworker/u24:8-k  [dm_persistent_data]                              [k] disk_ll_load_ie
+    0.64%     0.00%  kworker/u24:8-k  [dm_persistent_data]                              [k] rebalance2
+    0.63%     0.02%  kworker/u24:8-k  [dm_persistent_data]                              [k] dm_bm_unlock
+    0.54%     0.00%  kworker/u24:8-k  [dm_thin_pool]                                    [k] process_prepared_discard_passdown_pt2
+    0.54%     0.00%  kworker/u24:8-k  [dm_thin_pool]                                    [k] dm_pool_dec_data_range
     0.49%     0.49%  kworker/u24:8-k  [kernel.kallsyms]                                 [k] xen_hypercall_xen_version
     0.47%     0.00%  kworker/u24:8-k  [dm_thin_pool]                                    [k] dm_thin_find_mapped_range
     0.43%     0.00%  kworker/u24:8-k  [dm_thin_pool]                                    [k] __find_block
     0.34%     0.34%  kworker/u24:8-k  [kernel.kallsyms]                                 [k] __raw_callee_save___pv_queued_spin_unlock
     0.32%     0.24%  kworker/u24:8-k  [kernel.kallsyms]                                 [k] mutex_lock
     0.27%     0.02%  kworker/u24:8-k  [dm_persistent_data]                              [k] bl_up_read
     0.23%     0.04%  kworker/u24:8-k  [kernel.kallsyms]                                 [k] dm_bufio_release
     0.23%     0.00%  kworker/u24:8-k  [dm_persistent_data]                              [k] dm_btree_remove
     0.23%     0.00%  kworker/u24:8-k  [dm_persistent_data]                              [k] exit_shadow_spine
     0.22%     0.19%  kworker/u24:8-k  [kernel.kallsyms]                                 [k] __relink_lru
     0.22%     0.22%  kworker/u24:8-k  [kernel.kallsyms]                                 [k] mutex_unlock
     0.21%     0.21%  kworker/u24:8-k  [kernel.kallsyms]                                 [k] _raw_spin_lock
     0.20%     0.00%  kworker/u24:8-k  [dm_thin_pool]                                    [k] passdown_double_checking_shared_status
     0.19%     0.19%  kworker/u24:8-k  [dm_persistent_data]                              [k] __add_holder
     0.19%     0.00%  kworker/u24:8-k  [dm_thin_pool]                                    [k] dm_pool_block_is_shared
     0.19%     0.00%  kworker/u24:8-k  [dm_persistent_data]                              [k] sm_ll_lookup
     0.19%     0.00%  kworker/u24:8-k  [dm_persistent_data]                              [k] remove_raw
     0.18%     0.18%  kworker/u24:8-k  [kernel.kallsyms]                                 [k] xen_restore_fl_direct
     0.18%     0.03%  kworker/u24:8-k  [dm_persistent_data]                              [k] bl_down_write
     0.17%     0.01%  kworker/u24:8-k  [dm_persistent_data]                              [k] exit_ro_spine
     0.16%     0.01%  kworker/u24:8-k  [dm_persistent_data]                              [k] bl_up_write
     0.15%     0.01%  Xorg             libc-2.24.so                                      [.] __GI___ioctl
     0.15%     0.15%  kworker/u24:8-k  [dm_persistent_data]                              [k] __del_holder
     0.14%     0.12%  kworker/u24:8-k  [kernel.kallsyms]                                 [k] _cond_resched
     0.13%     0.07%  kworker/u24:8-k  [kernel.kallsyms]                                 [k] __flush_write_list
     0.12%     0.00%  kworker/u24:10-  [kernel.kallsyms]                                 [k] ret_from_fork
     0.12%     0.00%  kworker/u24:10-  [kernel.kallsyms]                                 [k] kthread
     0.12%     0.00%  kworker/u24:10-  [kernel.kallsyms]                                 [k] worker_thread
     0.12%     0.00%  kworker/u24:10-  [kernel.kallsyms]                                 [k] process_one_work
     0.12%     0.00%  libvirtd         [unknown]                                         [.] 0000000000000000
     0.12%     0.00%  libvirtd         libc-2.24.so                                      [.] __GI___ioctl
     0.12%     0.12%  libvirtd         [kernel.kallsyms]                                 [k] xen_hypercall_domctl
     0.12%     0.00%  libvirtd         [unknown]                                         [k] 0x0000000000000016
     0.12%     0.12%  kworker/u24:8-k  [dm_persistent_data]                              [k] lower_bound
     0.11%     0.00%  Xorg             [kernel.kallsyms]                                 [k] entry_SYSCALL_64_after_hwframe
     0.11%     0.00%  Xorg             [kernel.kallsyms]                                 [k] do_syscall_64
     0.11%     0.00%  Xorg             [kernel.kallsyms]                                 [k] __x64_sys_ioctl
     0.11%     0.07%  kworker/u24:8-k  [kernel.kallsyms]                                 [k] dm_bufio_mark_partial_buffer_dirty
     0.11%     0.00%  Xorg             [nouveau]                                         [k] nouveau_drm_ioctl
     0.11%     0.11%  swapper          [kernel.kallsyms]                                 [k] xen_hypercall_xen_version
     0.11%     0.00%  Xorg             [drm]                                             [k] drm_ioctl
     0.10%     0.00%  Xorg             [drm]                                             [k] drm_ioctl_kernel
     0.10%     0.00%  kworker/u24:10-  [dm_thin_pool]                                    [k] do_worker