QubesOS / qubes-issues

The Qubes OS Project issue tracker
https://www.qubes-os.org/doc/issue-tracking/

4.1 VM startup & qrexec performance issues #7075

Closed: 3hhh closed this issue 4 months ago

3hhh commented 2 years ago


Qubes OS release

4.1rc2

Brief summary

Since switching from 4.0 to 4.1rc2 I'm experiencing a significant slowdown of all VM start and shutdown operations.

Shutdowns (especially of dispVMs) may cause the system to freeze entirely for 1-2s even though the CPU load doesn't seem high (I/O only?).

The system is identical except for the hard disk: btrfs/reflink is installed on a newer SanDisk Ultra 3D SSD (ext4/file on some old standard Lenovo SSD).

Steps to reproduce

  1. Install Qubes OS 4.1rc2 from ISO (tested with lvm thin and btrfs/reflink).
  2. Run the following script in dom0 (I tested with debian-11 default template):
    
#!/bin/bash

qvm-shutdown --all --wait
sudo swapoff -a
swap="$(sudo mount|grep swap)"
[ -z "$swap" ] || { echo "Please disable swap first." ; exit 2 ; }

for ((i=0;i<13;i++)) ; do
  vm="tvm-$i"
  qvm-remove -f "$vm"
done

for ((i=0;i<13;i++)) ; do
  vm="tvm-$i"
  echo "Creating $vm..."
  qvm-create -l red "$vm"
  qvm-prefs "$vm" netvm ""
  qvm-prefs "$vm" memory 400
  qvm-prefs "$vm" maxmem 0

  echo "Starting $vm..."

  beg=$SECONDS
  sudo perf record -a --call-graph fp -o "${vm}_perf.data" qvm-run -p "$vm" 'echo hello'
  echo "duration: $(($SECONDS - $beg))"

  time qvm-run -p "$vm" 'echo hello'
  xl info | grep free_memory
done

qvm-shutdown --all --wait

for ((i=0;i<13;i++)) ; do
  vm="tvm-$i"
  qvm-remove -f "$vm"
done



Expected behavior

VM startup times remain constant.

Actual behavior

Startup times increase linearly. During my lvm thin test they increased from 13s to 22s.
DemiMarie commented 2 years ago

The system is identical, except the hard disk: btrfs/reflink is installed on a newer Sandisk Ultra 3D SSD (ext4/file on some old Lenovo standard SSD).

Can you do a file-vs-reflink benchmark on the same hardware? Some new SSDs have quite slow write performance due to using TLC or even QLC.

3hhh commented 2 years ago

Can you do a file-vs-reflink benchmark on the same hardware? Some new SSDs have quite slow write performance due to using TLC or even QLC.

You mean from exactly the same hard disk?

That would require quite a lot of dd'ing & time, but let's see... Anyway I'd do it at most once...

3hhh commented 2 years ago

Hm so I decided to try a btrfs/file pool first, but that one had the same performance issues on the newer SSD. Around 24-32s for time qvm-run -p vm 'echo hello' on 4.1rc2.

hdparm reports write speeds of around 500 MB/s, dd reports 250 MB/s to a btrfs file on that disk. Read speed is 5-8 GB/s according to hdparm.

rustybird commented 2 years ago

Could it be a system-wide R4.1 performance issue unrelated to storage (or more precisely, only correlated to storage via dm-crypt)? On an early prealpha I was affected by probably some variant of #4604. Everything was dog slow because my CPU was stuck at the lowest frequency.

It might be worth it to try an ext4/file R4.1 installation for comparison if you have a spare drive.

Grasping for straws: If it's truly a storage issue, maybe the drive is completely starved for available blocks so that blkdiscard and then reinstalling would help.
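
For reference, a minimal sketch of that last resort, assuming the disk is /dev/sdX - blkdiscard irreversibly discards every block on the device, so it only makes sense immediately before a planned reinstall:

# DESTRUCTIVE: wipes the entire device
sudo blkdiscard /dev/sdX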

DemiMarie commented 2 years ago

XFS/reflink is another stack to try. XFS is known for high performance.

rustybird commented 2 years ago

maybe the drive is completely starved for available blocks

On second thought, I think this can be ruled out as you're still getting 250 MB/s writes. That number seems low (if the dd benchmark is in dom0 and with a large enough bs= argument) - but not terrible.
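
For reference, a dom0 dd benchmark along those lines might look like this (a sketch; the target path and sizes are placeholders, and conv=fdatasync keeps the page cache from inflating the number):

dd if=/dev/zero of=/var/lib/qubes/ddtest.img bs=1M count=4096 conv=fdatasync
rm -f /var/lib/qubes/ddtest.img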

ghost commented 2 years ago

x220 with Samsung EVO 860

btrfs

3hhh commented 2 years ago

So I tested the fresh 4.1rc2 btrfs/reflink on the old SSD:

time qvm-run -p vm 'echo hello': 24-30s
dd write speed to a btrfs file: 200 MB/s

So now that's two 100% identical systems, one with 4.1 btrfs/reflink and one with 4.0 ext4/file. I'll try 4.0 ext4/file on the new SSD as well.

3hhh commented 2 years ago

@0xdd7fq Do you run reflink or lvm? Please check your qvm-pool info varlibqubes driver output...

ghost commented 2 years ago

@3hhh

default btrfs install on x220 with 1.5G dom0 mem, imagine speed on thinkpad nvme mod...

driver : file-reflink
btrfs fi sh

Label: 'qubes_dom0'  uuid: xxx
    Total devices 1 FS bytes used 25.34GiB
    devid    1 size 926.55GiB used 28.02GiB path /dev/mapper/luks-xxx
DemiMarie commented 2 years ago

What about XFS/reflink?

marmarek commented 2 years ago

the old SSD

can it be painfully slow DISCARD op handling by the disk?

3hhh commented 2 years ago

So I now tried 4.0 ext4/file on the new disk:
time qvm-run -p vm 'echo hello': 10-11s
dd write speed to an ext4 file: 220 MB/s

So the type of disk doesn't matter / it's unlikely a hardware issue.

I'll try upgrading the 4.0 ext4/file install to 4.1 to see what happens.

Is there any way to test discards?
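
One rough way I can think of to gauge DISCARD latency from dom0 (a sketch; fstrim only exercises TRIM on mounted filesystems that support discard):

time sudo fstrim -v /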

3hhh commented 2 years ago

I'll try upgrading the 4.0 ext4/file install to 4.1 to see what happens.

Forget about that. It currently fails quite spectacularly after a second on ext4/file pools:

qubes-dist-upgrade --all
WARNING: /!\ ENSURE TO HAVE MADE A BACKUP OF ALL YOUR VMs AND dom0 DATA /!\
-> Launch upgrade process? [y/N] y
  Failed to find logical volume "qubes_dom0/_tmeta"
  Failed to find logical volume "qubes_dom0/"
/sbin/qubes-dist-upgrade: line 417: [: : integer expression expected
/sbin/qubes-dist-upgrade: line 576: [: : integer expression expected
---> (STAGE 0) Adjusting LVM thin pool metadata size...
  Failed to find logical volume "qubes_dom0/"
/sbin/qubes-dist-upgrade: line 417: [: : integer expression expected
  --size may not be zero.
  Run `lvextend --help' for more information.
-> Launch restoration...

Apparently qubes-dist-upgrade silently assumes lvm only.

EDIT: Ah, it seems to be stage0, will try to skip that one...

3hhh commented 2 years ago

Giving up on the upgrade approach as qubes-dist-upgrade broke the system. I'll mention the issues at #5685 .

3hhh commented 2 years ago

What about XFS/reflink?

Does it work from the installer now?

Last time I tried, I had a really hard time getting the anaconda custom partitioning right for btrfs/reflink. I recall my first attempt resulted in a btrfs/lvm install without encryption, so I finally chose the default btrfs/reflink option...

3hhh commented 2 years ago

Anyway, in total my previous results indicate it's either a btrfs issue, a reflink driver issue or a generic Qubes 4.1 issue.

Apparently however only a certain subset of users is affected.

marmarek commented 2 years ago

As for the start time itself, there may be a fallout from the #7023 fix, affecting start time regardless of storage. Try starting a VM with memory==maxmem (for example set both to 600).
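
A quick sketch of that test, reusing the tvm naming from the script above:

qvm-prefs tvm-0 memory 600
qvm-prefs tvm-0 maxmem 600
time qvm-run -p tvm-0 'echo hello'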

rustybird commented 2 years ago

What about XFS/reflink?

Does it work from the installer now?

Yes, since R4.0.4 and R4.1.0-beta1 the installer will create /var/lib/qubes on XFS when you switch to the "Standard Partition" scheme + "Click here to create them automatically." (It's also still possible to then change the /var/lib/qubes partition to ext4 under "File System:" if you want to test with a legacy 'file' driver pool.)

Last time I tried I had a really hard time to get the anaconda custom partitioning right for btrfs/reflink.

Anything except the automatic schemes is a minefield for sure.

3hhh commented 2 years ago

As for the start time itself, there may be a fallout from the #7023 fix, affecting start time regardless of storage. Try starting a VM with memory==maxmem (for example set both to 600).

No, unfortunately that's not it. All of my tests incl. 4.0 were with mem=1500, maxmem=0. I just tested with maxmem=mem=1500 on 4.1 btrfs/reflink and the results are identical (~30s startup).

3hhh commented 2 years ago

Hmmm so I masochistically decided to test all pool types.

I installed plain Qubes 4.1rc2 four times on the same machine with only the debian-11 and fedora-34 templates and then ran time qvm-run -p debian-11/fedora-34: all behaved similarly, with 11-14s startup times at ~260 MB/s dd write speeds (xfs/reflink and ext4/file were on average a second or so better than btrfs/reflink and lvm thin).

So it's not a btrfs/reflink specific issue. On the contrary it indicates that my usage pattern must cause the issue.

I also recalled that I had noticed slight swapping in dom0 of 50-150 MB even though my dom0 memory usage was at 30% of 3 GB (I had changed that from the default 4 GB), and thus disabled swapping (why does it swap at 30%?).

That now reduced startup times to 19-25s, but still not the 11-14s of an empty installation (on the same machine, older SSD though). I'll test whether startup times are better for the first VM to be started...
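
For reference, a quick way to confirm swap really stays off in dom0 (swapon --show prints nothing when no swap is active):

sudo swapoff -a
swapon --show
sysctl vm.swappiness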

3hhh commented 2 years ago

I'll test whether startup times are better for the first VM to be started...

Bullseye!

After qvm-shutdown --all I get 14-15s on btrfs/reflink on my fully installed machine as well. Usually, however, I had ~12 VMs running when I tested. I guess @0xdd7fq tested with fewer.

So Qubes 4.1rc2 btrfs/reflink startup performance degrades with a higher number of running VMs. That didn't happen in 4.0 ext4/file. I'm still not sure whether it affects btrfs/reflink only or is a generic issue independent of the storage driver. I still have the empty lvm thin install and will check on that...

DemiMarie commented 2 years ago

@3hhh can you check with xfs/reflink too?

3hhh commented 2 years ago

lvm thin is affected: I created and ran 13 VMs one after another on the empty 4.1rc2 install and startup times slowed down from 13s to 22s for the last VM.

So this should be a generic issue. I'll update title & description accordingly and leave the Xen fixing to you guys. ;-)

ghost commented 2 years ago

@3hhh @DemiMarie

ok, I have a second machine with a Xeon and 64 GB RAM, 5.14 kernel, btrfs, SSD 860 EVO, not NVMe. 10 identical standalone Gentoo GNOME VMs (vCpu=2 memory=400 maxmem=4096). A typical SSD is good only for mouseless dev; for Xen we need NVMe - replace it with a 960/970 NVMe and results will be 2x faster. Best performance is 5-7s, not 13-15s.

[toothless@dom0 ~]$ time qvm-run -a test1 gnome-terminal
Running 'gnome-terminal' on test1

real    0m13.703s
user    0m0.061s
sys 0m0.055s
[toothless@dom0 ~]$ time qvm-run -a test2 gnome-terminal
Running 'gnome-terminal' on test2

real    0m13.705s
user    0m0.103s
sys 0m0.052s
[toothless@dom0 ~]$ time qvm-run -a test3 gnome-terminal
Running 'gnome-terminal' on test3

real    0m14.368s
user    0m0.080s
sys 0m0.060s
[toothless@dom0 ~]$ time qvm-run -a test4 gnome-terminal
Running 'gnome-terminal' on test4

real    0m14.369s
user    0m0.080s
sys 0m0.054s
[toothless@dom0 ~]$ time qvm-run -a test5 gnome-terminal
Running 'gnome-terminal' on test5

real    0m14.276s
user    0m0.058s
sys 0m0.053s
[toothless@dom0 ~]$ time qvm-run -a test6 gnome-terminal
Running 'gnome-terminal' on test6

real    0m14.531s
user    0m0.064s
sys 0m0.059s
[toothless@dom0 ~]$ time qvm-run -a test7 gnome-terminal
Running 'gnome-terminal' on test7

real    0m14.635s
user    0m0.061s
sys 0m0.060s
[toothless@dom0 ~]$ time qvm-run -a test8 gnome-terminal
Running 'gnome-terminal' on test8

real    0m14.773s
user    0m0.066s
sys 0m0.055s
[toothless@dom0 ~]$ time qvm-run -a test9 gnome-terminal
Running 'gnome-terminal' on test9

real    0m15.812s
user    0m0.054s
sys 0m0.059s
[toothless@dom0 ~]$ time qvm-run -a test10 gnome-terminal
Running 'gnome-terminal' on test10

real    0m15.352s
user    0m0.067s
sys 0m0.053s
[toothless@dom0 ~]$ time qvm-shutdown test10

real    0m0.196s
user    0m0.090s
sys 0m0.044s
[toothless@dom0 ~]$ time qvm-shutdown test9

real    0m0.378s
user    0m0.081s
sys 0m0.015s
[toothless@dom0 ~]$ time qvm-shutdown test8

real    0m0.434s
user    0m0.076s
sys 0m0.030s
[toothless@dom0 ~]$ time qvm-shutdown test7

real    0m0.334s
user    0m0.066s
sys 0m0.030s
[toothless@dom0 ~]$ time qvm-shutdown test6

real    0m0.150s
user    0m0.066s
sys 0m0.030s
[toothless@dom0 ~]$ time qvm-shutdown test5

real    0m0.138s
user    0m0.068s
sys 0m0.024s
[toothless@dom0 ~]$ time qvm-shutdown test4

real    0m0.343s
user    0m0.072s
sys 0m0.022s
[toothless@dom0 ~]$ time qvm-shutdown test3

real    0m0.167s
user    0m0.078s
sys 0m0.032s
[toothless@dom0 ~]$ time qvm-shutdown test2

real    0m0.187s
user    0m0.089s
sys 0m0.038s
[toothless@dom0 ~]$ time qvm-shutdown test1

real    0m0.182s
user    0m0.074s
sys 0m0.045s
[toothless@dom0 ~]$ qvm-volume extend test10:root 500G
[toothless@dom0 ~]$ time qvm-run -a test10 gnome-terminal
Running 'gnome-terminal' on test10

real    0m16.769s
user    0m0.061s
sys 0m0.053s
[toothless@dom0 ~]$ time qvm-shutdown test10

real    0m0.157s
user    0m0.080s
sys 0m0.033s
3hhh commented 2 years ago

@0xdd7fq Ty for testing. You're apparently also affected by this bug, with 13.7s --> 15.3s linearly increasing on each new VM, albeit significantly less than my 13s --> 22s. Btw shutdown timings are pointless without --wait.
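
For the numbers to mean anything, time the full shutdown, e.g.:

time qvm-shutdown --wait test10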

@DemiMarie It's all the same on xfs/reflink and ext4/file. This doesn't appear to be related to storage drivers.

Output from 4.0 for comparison after a qvm-shutdown --all:

Testing tvm-1...
hello

real    0m13.664s
user    0m0.070s
sys 0m0.052s
Testing tvm-2...
hello

real    0m12.578s
user    0m0.067s
sys 0m0.068s
Testing tvm-3...
hello

real    0m13.391s
user    0m0.068s
sys 0m0.062s
Testing tvm-4...
hello

real    0m14.169s
user    0m0.071s
sys 0m0.057s
Testing tvm-5...
hello

real    0m12.710s
user    0m0.063s
sys 0m0.060s
Testing tvm-6...
hello

real    0m13.313s
user    0m0.072s
sys 0m0.069s
Testing tvm-7...
hello

real    0m13.165s
user    0m0.084s
sys 0m0.053s
Testing tvm-8...
hello

real    0m13.449s
user    0m0.070s
sys 0m0.058s
Testing tvm-9...
hello

real    0m15.725s
user    0m0.070s
sys 0m0.069s
Testing tvm-10...
hello

real    0m16.589s
user    0m0.080s
sys 0m0.054s
Testing tvm-11...
hello

real    0m16.229s
user    0m0.067s
sys 0m0.059s
Testing tvm-12...
hello

real    0m13.500s
user    0m0.067s
sys 0m0.063s
Testing tvm-13...
hello

real    0m16.592s
user    0m0.065s
sys 0m0.047s

(Doesn't look fully linear to me & at least doesn't increase to 22s).

3hhh commented 2 years ago

perf diff on cpu-clock:pppH doesn't show any increase. What could it be then? Any ideas for perf events to monitor?

marmarek commented 2 years ago

I'd still guess this is related to memory pressure in some way. Maybe obtaining initial memory for the new VM to start? This can be checked by ensuring sum(maxmem) for running VMs is less than total physical memory. If in such conditions start time is the same regardless of the number of VMs, then that's it. Otherwise we need to dig deeper.
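
A quick dom0 sketch for that check (my own sketch; maxmem=0 means static memory, so the memory pref is used instead):

total=0
for vm in $(qvm-ls --running --raw-list); do
  [ "$vm" = "dom0" ] && continue
  mm="$(qvm-prefs "$vm" maxmem)"
  [ "$mm" -eq 0 ] && mm="$(qvm-prefs "$vm" memory)"
  total=$((total + mm))
done
echo "sum(maxmem) of running VMs: ${total} MiB"
xl info | grep total_memory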

3hhh commented 2 years ago

You were somewhat right - going over the maximum physical memory has an effect, albeit only a small one.

Once I statically set the newly created Qubes to 400M, 4.0 started them up in a rather constant 10 seconds, whereas 4.1 started at 21 seconds (!) and went up to 33 seconds. So the issue became even graver for 4.1, and now I'm not even sure whether it's a single issue anymore, because

a) Timings shouldn't start so badly. b) Timings shouldn't increase linearly.

Maybe some locks were introduced in 4.1 that were not in place in 4.0? Unfortunately perf lock record complains about some missing tracepoints. Is the 4.0 Xen version 4.8.5, kernel 5.4.143 available for 4.1? If yes, one could rule out an upstream Xen issue.
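
For reference, one can check which lock tracepoints the dom0 kernel actually exposes (perf lock needs the lock:* tracepoints, which usually require a lockdep-enabled kernel build):

sudo perf list 'lock:*'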

Hardware is a T530 with i7-3720QM and 16 GB RAM btw.

I attached the different outputs of the following script (below for mem=maxmem=400):

#!/bin/bash

qvm-shutdown --all --wait
sudo swapoff -a
swap="$(sudo mount|grep swap)"
[ -z "$swap" ] || { echo "Please disable swap first." ; exit 2 ; }

for ((i=0;i<13;i++)) ; do
  vm="tvm-$i"
  qvm-remove -f "$vm"
done

for ((i=0;i<13;i++)) ; do
  vm="tvm-$i"
  echo "Creating $vm..."
  qvm-create -l red "$vm"
  qvm-prefs "$vm" netvm ""
  qvm-prefs "$vm" memory 400
  qvm-prefs "$vm" maxmem 0

  echo "Starting $vm..."
  #beg=$SECONDS
  #sudo perf record -a --call-graph fp -o "${vm}_perf.data" qvm-run -p "$vm" 'echo hello'
  #echo "duration: $(($SECONDS - $beg))"
  time qvm-run -p "$vm" 'echo hello'
  xl info | grep free_memory
done

qvm-shutdown --all --wait

for ((i=0;i<13;i++)) ; do
  vm="tvm-$i"
  qvm-remove -f "$vm"
done

4.1, maxmem=400:

Creating tvm-0...
Starting tvm-0...
hello

real    0m21.511s
user    0m0.075s
sys 0m0.069s
free_memory            : 12662
Creating tvm-1...
Starting tvm-1...
hello

real    0m21.298s
user    0m0.066s
sys 0m0.072s
free_memory            : 12255
Creating tvm-2...
Starting tvm-2...
hello

real    0m21.324s
user    0m0.071s
sys 0m0.072s
free_memory            : 11848
Creating tvm-3...
Starting tvm-3...
hello

real    0m22.316s
user    0m0.071s
sys 0m0.064s
free_memory            : 11442
Creating tvm-4...
Starting tvm-4...
hello

real    0m23.233s
user    0m0.065s
sys 0m0.075s
free_memory            : 11035
Creating tvm-5...
Starting tvm-5...
hello

real    0m23.769s
user    0m0.075s
sys 0m0.069s
free_memory            : 10628
Creating tvm-6...
Starting tvm-6...
hello

real    0m25.669s
user    0m0.090s
sys 0m0.073s
free_memory            : 10222
Creating tvm-7...
Starting tvm-7...
hello

real    0m28.193s
user    0m0.086s
sys 0m0.061s
free_memory            : 9815
Creating tvm-8...
Starting tvm-8...
hello

real    0m31.331s
user    0m0.082s
sys 0m0.079s
free_memory            : 9408
Creating tvm-9...
Starting tvm-9...
hello

real    0m30.918s
user    0m0.083s
sys 0m0.064s
free_memory            : 9002
Creating tvm-10...
Starting tvm-10...
hello

real    0m29.939s
user    0m0.070s
sys 0m0.070s
free_memory            : 8595
Creating tvm-11...
Starting tvm-11...
hello

real    0m33.455s
user    0m0.068s
sys 0m0.067s
free_memory            : 8188
Creating tvm-12...
Starting tvm-12...
hello

real    0m33.632s
user    0m0.082s
sys 0m0.067s

4.1, maxmem=2000:

Creating tvm-0...
Starting tvm-0...
hello

real    0m20.814s
user    0m0.068s
sys 0m0.070s
free_memory            : 11066
Creating tvm-1...
Starting tvm-1...
hello

real    0m20.765s
user    0m0.074s
sys 0m0.065s
free_memory            : 9063
Creating tvm-2...
Starting tvm-2...
hello

real    0m21.296s
user    0m0.075s
sys 0m0.060s
free_memory            : 7060
Creating tvm-3...
Starting tvm-3...
hello

real    0m21.325s
user    0m0.079s
sys 0m0.058s
free_memory            : 5058
Creating tvm-4...
Starting tvm-4...
hello

real    0m22.905s
user    0m0.069s
sys 0m0.072s
free_memory            : 3055
Creating tvm-5...
Starting tvm-5...
hello

real    0m25.284s
user    0m0.083s
sys 0m0.066s
free_memory            : 1052
Creating tvm-6...
Starting tvm-6...
hello

real    0m25.772s
user    0m0.083s
sys 0m0.079s
free_memory            : 196
Creating tvm-7...
Starting tvm-7...
hello

real    0m25.290s
user    0m0.073s
sys 0m0.068s
free_memory            : 212
Creating tvm-8...
Starting tvm-8...
hello

real    0m25.617s
user    0m0.081s
sys 0m0.059s
free_memory            : 228
Creating tvm-9...
Starting tvm-9...
hello

real    0m25.769s
user    0m0.073s
sys 0m0.072s
free_memory            : 265
Creating tvm-10...
Starting tvm-10...
hello

real    0m26.931s
user    0m0.078s
sys 0m0.064s
free_memory            : 259
Creating tvm-11...
Starting tvm-11...
hello

real    0m30.857s
user    0m0.071s
sys 0m0.069s
free_memory            : 275
Creating tvm-12...
Starting tvm-12...
hello

real    0m30.719s
user    0m0.074s
sys 0m0.066s
free_memory            : 291

4.0, maxmem=400:

Creating tvm-0...
Starting tvm-0...
hello

real    0m10.026s
user    0m0.077s
sys 0m0.058s
free_memory            : 12660
Creating tvm-1...
Starting tvm-1...
hello

real    0m10.188s
user    0m0.072s
sys 0m0.059s
free_memory            : 12254
Creating tvm-2...
Starting tvm-2...
hello

real    0m10.096s
user    0m0.070s
sys 0m0.055s
free_memory            : 11856
Creating tvm-3...
Starting tvm-3...
hello

real    0m9.550s
user    0m0.074s
sys 0m0.055s
free_memory            : 11459
Creating tvm-4...
Starting tvm-4...
hello

real    0m10.192s
user    0m0.074s
sys 0m0.052s
free_memory            : 11062
Creating tvm-5...
Starting tvm-5...
hello

real    0m10.472s
user    0m0.067s
sys 0m0.055s
free_memory            : 10663
Creating tvm-6...
Starting tvm-6...
hello

real    0m10.618s
user    0m0.077s
sys 0m0.059s
free_memory            : 10266
Creating tvm-7...
Starting tvm-7...
hello

real    0m10.520s
user    0m0.068s
sys 0m0.063s
free_memory            : 9869
Creating tvm-8...
Starting tvm-8...
hello

real    0m10.793s
user    0m0.070s
sys 0m0.049s
free_memory            : 9472
Creating tvm-9...
Starting tvm-9...
hello

real    0m10.521s
user    0m0.074s
sys 0m0.063s
free_memory            : 9075
Creating tvm-10...
Starting tvm-10...
hello

real    0m10.311s
user    0m0.069s
sys 0m0.058s
free_memory            : 8676
Creating tvm-11...
Starting tvm-11...
hello

real    0m10.808s
user    0m0.071s
sys 0m0.059s
free_memory            : 8280
Creating tvm-12...
Starting tvm-12...
hello

real    0m11.281s
user    0m0.064s
sys 0m0.060s
free_memory            : 7883

4.0, maxmem=2000:

Creating tvm-0...
Starting tvm-0...
hello

real    0m11.550s
user    0m0.070s
sys 0m0.057s
free_memory            : 11064
Creating tvm-1...
Starting tvm-1...
hello

real    0m12.858s
user    0m0.076s
sys 0m0.056s
free_memory            : 9063
Creating tvm-2...
Starting tvm-2...
hello

real    0m11.649s
user    0m0.077s
sys 0m0.052s
free_memory            : 7069
Creating tvm-3...
Starting tvm-3...
hello

real    0m11.646s
user    0m0.068s
sys 0m0.061s
free_memory            : 5073
Creating tvm-4...
Starting tvm-4...
hello

real    0m11.609s
user    0m0.071s
sys 0m0.055s
free_memory            : 3079
Creating tvm-5...
Starting tvm-5...
hello

real    0m11.459s
user    0m0.079s
sys 0m0.048s
free_memory            : 1087
Creating tvm-6...
Starting tvm-6...
hello

real    0m11.967s
user    0m0.072s
sys 0m0.059s
free_memory            : 242
Creating tvm-7...
Starting tvm-7...
hello

real    0m12.352s
user    0m0.074s
sys 0m0.049s
free_memory            : 267
Creating tvm-8...
Starting tvm-8...
hello

real    0m12.316s
user    0m0.080s
sys 0m0.049s
free_memory            : 291
Creating tvm-9...
Starting tvm-9...
hello

real    0m12.589s
user    0m0.091s
sys 0m0.064s
free_memory            : 317
Creating tvm-10...
Starting tvm-10...
hello

real    0m12.423s
user    0m0.081s
sys 0m0.050s
free_memory            : 342
Creating tvm-11...
Starting tvm-11...
hello

real    0m12.544s
user    0m0.080s
sys 0m0.054s
free_memory            : 368
Creating tvm-12...
Starting tvm-12...
hello

real    0m12.802s
user    0m0.079s
sys 0m0.058s
free_memory            : 392
3hhh commented 2 years ago

I did some more testing with a small script I wrote on a fresh 4.1rc2 xfs/reflink install in comparison to an old 4.0 ext4/file install on the same machine.

Key takeaways:

The raw data:

testbug_4.0_maxmem400.txt testbug_4.0_maxmem2000.txt testbug_4.1_maxmem400_debian-11.txt testbug_4.1_maxmem400_fedora-34.txt testbug_4.1_maxmem2000_debian-11.txt testbug_4.1_maxmem2000_fedora-34.txt

3hhh commented 2 years ago
  • qrexec performance on 4.0 is ~2.5s on 400 MB VMs and 3-4.5s on machines with maxmem=2000. qrexec performance on 4.1 is rather terrible in comparison: 6.5-8s on 400MB VMs and 7-10s with maxmem=2000. This should definitely be fixed, if possible.

To me it looks like in 4.1 all qvm-run calls to a VM result in a blocking qubes.GetDate qrexec call according to the journal. In 4.0 such a call doesn't seem to be required and when it happens, it appears to be non-blocking. This might be one of the reasons why qrexec calls after VM startup are slow.
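
A sketch of how to spot those calls in the dom0 journal (the exact log wording differs between releases, so the grep pattern is an assumption):

journalctl -b | grep -i 'qubes.GetDate'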

  • Overall the performance on the machine has significantly improved over my last test. Were any performance related patches merged recently?

It got worse again on more recent runs incl. 7s VM handovers and even slower qrexec, not sure why (Xen upgrade maybe?). EDIT: I believe I accidentally had a few sys- VMs running during the test. So that explains the differences.

3hhh commented 2 years ago

Something else I noticed: Time spent in the VM kernel in 4.1 is ~5s, compared to 1-2s in 4.0. The reason appears to be a long delay in 4.1 before mounting the file systems. See the below dmesg outputs:

4.0 (nothing special):

[    1.604135] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x256490cd5f3, max_idle_ns: 440795204886 ns
[    1.616510] blkfront: xvdc: flush diskcache: enabled; persistent grants: enabled; indirect descriptors: enabled;
[    1.625344] blkfront: xvdd: flush diskcache: enabled; persistent grants: enabled; indirect descriptors: enabled;
[    1.725277] random: sfdisk: uninitialized urandom read (4 bytes read)
[    1.725840]  xvdc: xvdc1 xvdc3
[    1.733342] random: mkswap: uninitialized urandom read (16 bytes read)
[    1.789406] EXT4-fs (xvda3): mounted filesystem with ordered data mode. Opts: (null)
[    1.795578] EXT4-fs (xvdd): mounting ext3 file system using the ext4 subsystem
[    1.806028] EXT4-fs (xvdd): mounted filesystem with ordered data mode. Opts: (null)
[    1.871203] EXT4-fs (xvda3): re-mounted. Opts: (null)
[    1.962792] systemd[1]: systemd 241 running in system mode. (+PAM +AUDIT +SELINUX +IMA +APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 +SECCOMP +BLKID +ELFUTILS +KMOD -IDN2 +IDN -PCRE2 default-hierarchy=hybrid)

4.1 (1.4s gap after blkfront):

[    1.292877] Freeing unused kernel image (rodata/data gap) memory: 764K
[    1.292905] rodata_test: all tests were successful
[    1.292938] Run /init as init process
[    1.292949]   with arguments:
[    1.292950]     /init
[    1.292951]     rd_NO_PLYMOUTH
[    1.292951]   with environment:
[    1.292952]     HOME=/
[    1.292953]     TERM=linux
[    1.302005] Invalid max_queues (4), will use default max: 1.
[    1.549193] blkfront: xvda: flush diskcache: enabled; persistent grants: enabled; indirect descriptors: enabled;
[    1.602238]  xvda: xvda1 xvda2 xvda3
[    1.620927] blkfront: xvdb: flush diskcache: enabled; persistent grants: enabled; indirect descriptors: enabled;
[    1.667150] blkfront: xvdc: flush diskcache: enabled; persistent grants: enabled; indirect descriptors: enabled;
[    1.674387] blkfront: xvdd: barrier or flush: disabled; persistent grants: enabled; indirect descriptors: enabled;
[    3.037417]  xvdc: xvdc1 xvdc3
[    4.134550] EXT4-fs (xvda3): mounted filesystem with ordered data mode. Opts: (null)
[    4.148753] EXT4-fs (xvdd): mounting ext3 file system using the ext4 subsystem
[    4.150726] EXT4-fs (xvdd): mounted filesystem with ordered data mode. Opts: (null)
[    4.288062] EXT4-fs (xvda3): re-mounted. Opts: (null)
[    4.450156] systemd[1]: systemd 247.3-6 running in system mode. (+PAM +AUDIT +SELINUX +IMA +APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 +ZSTD +SECCOMP +BLKID +ELFUTILS +KMOD +IDN2 -IDN +PCRE2 default-hierarchy=unified)

EDIT: Full 4.1 dmesg

Side Note: I also noticed many of the following entries in 4.1 dmesg that were not there in 4.0: [ 0.500583] acpi LNXCPU:54: Still not present
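
For anyone trying to reproduce this, a small sketch to surface such gaps automatically - it prints every dmesg line preceded by more than a second of silence, assuming the default [seconds.micros] timestamp format:

dmesg | awk -F'[][]' '{ t = $2 + 0; if (t - p > 1) print; p = t }'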

brendanhoar commented 2 years ago

Hmm, I see the magic word "random" showing up in the 4.0 log (with an uninitialized urandom read), with only an unexplained delay around the same set of actions under 4.1.

Is it possible this is, again, the kernel waiting for some time for the appropriate random pool size threshold?

B

3hhh commented 2 years ago

Is it possible this is, again, the kernel waiting for some time for the appropriate random pool size threshold?

The same 4.1 dmesg had [ 6.756485] random: crng init done much later so I guess it's not related to random not being initialised.

Anyway I updated my tool to get more precise numbers (ideally numbers not correlated to others) and received the attached results (T530, i7).

Short summary:

  1. Xen activates VMs ~2s later in 4.1 than in 4.0 (VM handover). That handover time also linearly increases with the number of running VMs.
  2. VM kernel setup on 4.1 takes 4-5s vs 1-2s in 4.0 (cf. dmesg).
  3. Systemd system startup times in 4.1 are slightly better (by ~500ms).
  4. qrexec performance on 4.1 is terrible (4-5s vs 1-2s on 4.0).
  5. qubes.getDate is executed 1-2s earlier in 4.1, qubes.WindowIconUpdater ~4s later (not sure if that's relevant).

So in total there are at least three regressions in comparison to 4.0 from my point of view (1, 2 & 4): slower Xen, slower kernel & slower qrexec. I can't say whether they affect other hardware setups as well, but I guess at least the qrexec implementation shouldn't be too hardware dependent.

testbug_4.0_maxmem400_debian-10.txt testbug_4.0_maxmem2000_debian-10.txt testbug_4.1_maxmem400_debian-11.txt testbug_4.1_maxmem2000_debian-11.txt

ghost commented 2 years ago

@3hhh agree.

https://github.com/QubesOS/qubes-issues/issues/6617

3hhh commented 2 years ago

Btw subsequent qrexec calls on running VMs return within 300-500ms on both 4.1 and 4.0 even though 4.0 tends to be slightly faster on average.

I also tried

None of the options brought back 4.0 performance.

So far only Sandy & Ivy Bridge CPUs are affected, maybe Ryzen?

ghost commented 2 years ago

I can still try to make a raid0 on nvme and xeon.

3hhh commented 2 years ago

Btw I re-ran the 4.0 vs 4.1 dmesg with the kernelopts xen_blkfront.dyndbg=+flp dynamic_debug.verbose=1 for some additional debug info.
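
For reference, those kernelopts can be set per VM from dom0, e.g.:

qvm-prefs tvm-0 kernelopts 'xen_blkfront.dyndbg=+flp dynamic_debug.verbose=1'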

4.0:

[    2.048229] Invalid max_queues (4), will use default max: 2.
[    2.051054] blkback_changed:2466: vbd vbd-51712: blkfront:blkback_changed to state 2.
[    2.141573] blkback_changed:2466: vbd vbd-51728: blkfront:blkback_changed to state 2.
[    2.262393] blkback_changed:2466: vbd vbd-51744: blkfront:blkback_changed to state 2.
[    2.376785] blkback_changed:2466: vbd vbd-51760: blkfront:blkback_changed to state 2.
[    2.449277] blkback_changed:2466: vbd vbd-51712: blkfront:blkback_changed to state 4.
[    2.449283] blkfront_connect:2394: vbd vbd-51712: blkfront_connect:/local/domain/0/backend/vbd/15/51712.
[    2.513042] blkfront: xvda: flush diskcache: enabled; persistent grants: enabled; indirect descriptors: enabled;
[    2.544954]  xvda: xvda1 xvda2 xvda3
[    2.605883] blkback_changed:2466: vbd vbd-51728: blkfront:blkback_changed to state 4.
[    2.605886] blkfront_connect:2394: vbd vbd-51728: blkfront_connect:/local/domain/0/backend/vbd/15/51728.
[    2.608934] blkfront: xvdb: flush diskcache: enabled; persistent grants: enabled; indirect descriptors: enabled;
[    2.620726] blkback_changed:2466: vbd vbd-51744: blkfront:blkback_changed to state 4.
[    2.620728] blkfront_connect:2394: vbd vbd-51744: blkfront_connect:/local/domain/0/backend/vbd/15/51744.
[    2.623037] blkfront: xvdc: flush diskcache: enabled; persistent grants: enabled; indirect descriptors: enabled;
[    2.631496] blkback_changed:2466: vbd vbd-51760: blkfront:blkback_changed to state 4.
[    2.631502] blkfront_connect:2394: vbd vbd-51760: blkfront_connect:/local/domain/0/backend/vbd/15/51760.
[    2.633915] blkfront: xvdd: flush diskcache: enabled; persistent grants: enabled; indirect descriptors: enabled;
[    2.674652] blkif_ioctl:500: vbd vbd-51744: command: 0x5331, argument: 0x0
[    2.675534] blkif_ioctl:500: vbd vbd-51744: command: 0x5331, argument: 0x0
[    2.688424] random: sfdisk: uninitialized urandom read (4 bytes read)
[    2.689337]  xvdc: xvdc1 xvdc3
[    2.700853] random: mkswap: uninitialized urandom read (16 bytes read)
[    2.723988] blkif_ioctl:500: vbd vbd-51712: command: 0x5331, argument: 0x0
[    2.732615] EXT4-fs (xvda3): mounted filesystem with ordered data mode. Opts: (null)
[    2.738321] EXT4-fs (xvdd): mounting ext3 file system using the ext4 subsystem
[    2.740320] EXT4-fs (xvdd): mounted filesystem with ordered data mode. Opts: (null)
[    2.744195] dynamic_debug:ddebug_add_module: 23 debug prints in module overlay
[    2.783727] EXT4-fs (xvda3): re-mounted. Opts: (null)

4.1:

[    2.299636] Invalid max_queues (4), will use default max: 2.
[    2.302851] blkback_changed:2474: vbd vbd-51712: blkfront:blkback_changed to state 2.
[    2.389328] blkback_changed:2474: vbd vbd-51728: blkfront:blkback_changed to state 2.
[    2.628404] blkback_changed:2474: vbd vbd-51744: blkfront:blkback_changed to state 2.
[    2.801922] blkback_changed:2474: vbd vbd-51760: blkfront:blkback_changed to state 2.
[    2.955371] blkback_changed:2474: vbd vbd-51712: blkfront:blkback_changed to state 4.
[    2.955375] blkfront_connect:2401: vbd vbd-51712: blkfront_connect:/local/domain/0/backend/vbd/24/51712.
[    2.957324] blkfront: xvda: flush diskcache: enabled; persistent grants: enabled; indirect descriptors: enabled;
[    2.961239]  xvda: xvda1 xvda2 xvda3
[    3.101208] blkback_changed:2474: vbd vbd-51728: blkfront:blkback_changed to state 4.
[    3.101211] blkfront_connect:2401: vbd vbd-51728: blkfront_connect:/local/domain/0/backend/vbd/24/51728.
[    3.103325] blkfront: xvdb: flush diskcache: enabled; persistent grants: enabled; indirect descriptors: enabled;
[    3.128767] blkback_changed:2474: vbd vbd-51744: blkfront:blkback_changed to state 4.
[    3.128770] blkfront_connect:2401: vbd vbd-51744: blkfront_connect:/local/domain/0/backend/vbd/24/51744.
[    3.207748] blkfront: xvdc: flush diskcache: enabled; persistent grants: enabled; indirect descriptors: enabled;
[    3.215940] blkback_changed:2474: vbd vbd-51760: blkfront:blkback_changed to state 4.
[    3.215947] blkfront_connect:2401: vbd vbd-51760: blkfront_connect:/local/domain/0/backend/vbd/24/51760.
[    3.219995] blkfront: xvdd: barrier or flush: disabled; persistent grants: enabled; indirect descriptors: enabled;
[    3.342172] blkif_ioctl:509: vbd vbd-51744: command: 0x5331, argument: 0x0
[    3.342698] blkif_ioctl:509: vbd vbd-51744: command: 0x5331, argument: 0x0
[    3.355123] blkif_ioctl:509: vbd vbd-51744: command: 0x5331, argument: 0x0
[    4.367889] blkif_ioctl:509: vbd vbd-51744: command: 0x5331, argument: 0x0
[    4.381678] blkif_ioctl:509: vbd vbd-51744: command: 0x5331, argument: 0x0
[    4.640679]  xvdc: xvdc1 xvdc3
[    5.723486] blkif_ioctl:509: vbd vbd-51712: command: 0x5331, argument: 0x0
[    5.753137] EXT4-fs (xvda3): mounted filesystem with ordered data mode. Opts: (null)
[    5.776391] EXT4-fs (xvdd): mounting ext3 file system using the ext4 subsystem
[    5.781462] EXT4-fs (xvdd): mounted filesystem with ordered data mode. Opts: (null)
[    5.949733] EXT4-fs (xvda3): re-mounted. Opts: (null)

To me it looks like the Xen storage backend is slower in 4.1 (xen_blkfront is waiting more). Also not sure why it checks the CDROM_GET_CAPABILITY (0x5331) so often and why xvdd has flushing disabled on 4.1, but enabled on 4.0.

marmarek commented 2 years ago

Is the VM kernel version the same in both cases? The easiest way to get the same kernel version is to use kernel-latest-qubes-vm pkg.
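
A sketch of that in dom0 (the <version> placeholder is whatever appears under /var/lib/qubes/vm-kernels/ after installation):

sudo qubes-dom0-update kernel-latest-qubes-vm
ls /var/lib/qubes/vm-kernels/
qvm-prefs tvm-0 kernel <version>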

3hhh commented 2 years ago

Is the VM kernel version the same in both cases?

No, but good idea to test various kernel versions!

I did that in 4.0 and found that the last good kernel wrt this blkfront delay is 5.10.8-1 and the first definitely bad 5.11.8-1. The 4.1 example from above is 5.10.76-1, i.e. probably the issue is between 5.10.8-1 and 5.10.76-1 (I didn't test 4.1 so far). 5.11.4-1 on 4.0 was exposing weird behaviour in general, but not that specific regression though...

This might also slightly affect the qrexec issue. I at least noticed a ~1s additional delay with a bad kernel there (still ~3s better than in 4.1). VM handover times in 4.0 however remain at ~2s less than the 4.1 times even with those kernels.

3hhh commented 2 years ago

I tested a few more kernels on 4.0 and found that 5.10.16-1 is good as well. Double checking on 5.11.4-1 revealed that it is likely the last good kernel wrt this blkfront delay (5.4.156-1 dom0 kernel, Xen 4.8.5).

On 4.1 (5.10.76-1 dom0 kernel, Xen 4.14.3), surprisingly, I couldn't find a single "good" VM kernel wrt this blkfront delay, not even 5.4.55-1. Actually newer kernels made things even slower... I failed to get in-VM kernels going.

Another comparison from /var/log/xen/ which appears to have a bit more information:

4.0:

[    0.000000] Linux version 5.10.16-1.fc25.qubes.x86_64 (user@build-fedora4) (gcc (GCC) 6.4.1 20170727 (Red Hat 6.4.1-1), GNU ld version 2.26.1-1.fc25) #1 
[    0.000000] Command line: root=/dev/mapper/dmroot ro nomodeset console=hvc0 rd_NO_PLYMOUTH rd.plymouth.enable=0 plymouth.enable=0 xen_scrub_pages=0 xen_blkfront.dyndbg=+flp dynamic_debug.verbose=1
...
[    0.875938] Invalid max_queues (4), will use default max: 2.
[    1.274497] blkfront: xvda: flush diskcache: enabled; persistent grants: enabled; indirect descriptors: enabled;
[    1.293832]  xvda: xvda1 xvda2 xvda3
[    1.344247] blkfront: xvdb: flush diskcache: enabled; persistent grants: enabled; indirect descriptors: enabled;
[    1.404975] blkfront: xvdc: flush diskcache: enabled; persistent grants: enabled; indirect descriptors: enabled;
[    1.411100] blkfront: xvdd: flush diskcache: enabled; persistent grants: enabled; indirect descriptors: enabled;
Waiting for /dev/xvda* devices...
Qubes: Doing R/W setup for TemplateVM...
[    1.477060] random: sfdisk: uninitialized urandom read (4 bytes read)
[    1.477566]  xvdc: xvdc1 xvdc3
[    1.486487] random: mkswap: uninitialized urandom read (16 bytes read)
Setting up swapspace version 1, size = 1024 MiB (1073737728 bytes)
no label, UUID=b2151064-3fe6-49f3-bf59-b9dcf4a28c81
Qubes: done.
[    1.526014] EXT4-fs (xvda3): mounted filesystem with ordered data mode. Opts: (null)
Waiting for /dev/xvdd device...
mount: /dev/xvdd is write-protected, mounting read-only
[    1.533772] EXT4-fs (xvdd): mounting ext3 file system using the ext4 subsystem
[    1.542292] EXT4-fs (xvdd): mounted filesystem with ordered data mode. Opts: (null)
[    1.594591] EXT4-fs (xvda3): re-mounted. Opts: (null)
switch_root: failed to mount moving /dev to /sysroot/dev: Invalid argument
switch_root: forcing unmount of /dev
switch_root: failed to mount moving /proc to /sysroot/proc: Invalid argument
switch_root: forcing unmount of /proc
switch_root: failed to mount moving /sys to /sysroot/sys: Invalid argument
switch_root: forcing unmount of /sys
switch_root: failed to mount moving /run to /sysroot/run: Invalid argument
switch_root: forcing unmount of /run
[    1.688390] systemd[1]: systemd 241 running in system mode. (+PAM +AUDIT +SELINUX +IMA +APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 +SECCOMP +BLKID +ELFUTILS +KMOD -IDN2 +IDN -PCRE2 default-hierarchy=hybrid)

4.1:

[2022-01-17 23:20:46] [    0.000000] Linux version 5.10.76-1.fc32.qubes.x86_64 (mockbuild@build-fedora4) (gcc (GCC) 10.3.1 20210422 (Red Hat 10.3.1-1), GNU ld version 2.34-6.fc32) #1 SMP Fri Oct 29 03:58:58 CEST 2021
[2022-01-17 23:20:46] [    0.000000] Command line: root=/dev/mapper/dmroot ro nomodeset console=hvc0 rd_NO_PLYMOUTH rd.plymouth.enable=0 plymouth.enable=0 xen_scrub_pages=0 nopat
...
[2022-01-17 23:23:22] [    2.656119] Run /init as init process
[2022-01-17 23:23:22] Qubes initramfs script here:
[2022-01-17 23:23:22] [    2.669616] Invalid max_queues (4), will use default max: 2.
[2022-01-17 23:23:23] [    3.079762] blkfront: xvda: flush diskcache: enabled; persistent grants: enabled; indirect descriptors: enabled;
[2022-01-17 23:23:23] [    3.107136]  xvda: xvda1 xvda2 xvda3
[2022-01-17 23:23:23] [    3.200808] blkfront: xvdb: flush diskcache: enabled; persistent grants: enabled; indirect descriptors: enabled;
[2022-01-17 23:23:23] [    3.266375] blkfront: xvdc: flush diskcache: enabled; persistent grants: enabled; indirect descriptors: enabled;
[2022-01-17 23:23:23] [    3.279484] blkfront: xvdd: barrier or flush: disabled; persistent grants: enabled; indirect descriptors: enabled;
[2022-01-17 23:23:23] Waiting for /dev/xvda* devices...
[2022-01-17 23:23:23] Qubes: Doing R/W setup for TemplateVM...
[2022-01-17 23:23:24] [    4.651077]  xvdc: xvdc1 xvdc3
[2022-01-17 23:23:25] Setting up swapspace version 1, size = 1024 MiB (1073737728 bytes)
[2022-01-17 23:23:25] no label, UUID=06b63771-96e5-4b48-8777-c99380528e34
[2022-01-17 23:23:25] Qubes: done.
[2022-01-17 23:23:25] [    5.774377] EXT4-fs (xvda3): mounted filesystem with ordered data mode. Opts: (null)
[2022-01-17 23:23:25] Waiting for /dev/xvdd device...
[2022-01-17 23:23:25] [    5.797393] EXT4-fs (xvdd): mounting ext3 file system using the ext4 subsystem
[2022-01-17 23:23:25] [    5.802029] EXT4-fs (xvdd): mounted filesystem with ordered data mode. Opts: (null)
[2022-01-17 23:23:25] mount: /tmp/modules: WARNING: device write-protected, mounted read-only.
[2022-01-17 23:23:26] [    5.984659] EXT4-fs (xvda3): re-mounted. Opts: (null)
[2022-01-17 23:23:26] switch_root: failed to mount moving /dev to /sysroot/dev: Invalid argument
[2022-01-17 23:23:26] switch_root: forcing unmount of /dev
[2022-01-17 23:23:26] switch_root: failed to mount moving /proc to /sysroot/proc: Invalid argument
[2022-01-17 23:23:26] switch_root: forcing unmount of /proc
[2022-01-17 23:23:26] switch_root: failed to mount moving /sys to /sysroot/sys: Invalid argument
[2022-01-17 23:23:26] switch_root: forcing unmount of /sys
[2022-01-17 23:23:26] switch_root: failed to mount moving /run to /sysroot/run: Invalid argument
[2022-01-17 23:23:26] switch_root: forcing unmount of /run
[2022-01-17 23:23:26] libbpf: failed to find valid kernel BTF
[2022-01-17 23:23:26] libbpf: Error loading vmlinux BTF: -3
[2022-01-17 23:23:26] libbpf: failed to load object 'iterators_bpf'
[2022-01-17 23:23:26] libbpf: failed to load BPF skeleton 'iterators_bpf': -3
[2022-01-17 23:23:26] Failed load could be due to wrong endianness
[2022-01-17 23:23:26] [    6.297525] systemd[1]: systemd 247.3-6 running in system mode. (+PAM +AUDIT +SELINUX +IMA +APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 +ZSTD +SECCOMP +BLKID +ELFUTILS +KMOD +IDN2 -IDN +PCRE2 default-hierarchy=unified)
[2022-01-17 23:23:26] [    6.297733] systemd[1]: Detected virtualization xen.
[2022-01-17 23:23:26] [    6.297758] systemd[1]: Detected architecture x86-64.

Btw this xen:balloon in 4.0 and 4.1 sometimes eats an additional ~1.3s of kernel boot time (oddly sometimes only 0.1s). Is this really necessary and if so: Why does the boot process have to wait for it? Why run it for fixed memory VMs?

marmarek commented 2 years ago

Btw this xen:balloon in 4.0 and 4.1 sometimes eats an additional ~1.3s of kernel boot time (oddly sometimes only 0.1s). Is this really necessary and if so: Why does the boot process have to wait for it? Why run it for fixed memory VMs?

For fixed memory VMs it should be no-op. If it takes some time, most likely it is not a fixed memory VM. And yes, it is necessary, otherwise it risks crashing the VM with rather obscure effects.

3hhh commented 2 years ago

Thanks for the clarification!

How can one debug Xen for the Xen VM handover time issue? I.e. get some more debug output with timestamps before Xen actually activates a VM?

marmarek commented 2 years ago

How can one debug Xen for the Xen VM handover time issue? I.e. get some more debug output with timestamps before Xen actually activates a VM?

You can do that by setting log_level=1 in /etc/libvirt/libvirtd.conf (and restarting libvirtd service). It will produce a lot of output into system journal, but you are interested in something else: /var/log/libvirt/libxl/libxl-driver.log. After setting debug log level, it will contain a lot more info on VM startup, with timestamps.
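
A sketch of those steps in dom0:

sudo nano /etc/libvirt/libvirtd.conf   # set: log_level = 1
sudo systemctl restart libvirtd
less /var/log/libvirt/libxl/libxl-driver.log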

3hhh commented 2 years ago

you are interested in something else: /var/log/libvirt/libxl/libxl-driver.log

Now that was helpful as it shows

  1. Time in Xen is 2-4s depending on the number of running VMs (aforementioned linear regression, ~2s at 0 VMs and ~4s at around 10 VMs). That linear regression likely already existed in 4.0 Xen.
  2. file-reflink eats a rather constant 3s of the VM boot time in 4.1.
  3. file in 4.0 appears to hand over to Xen almost instantaneously (unlikely: or the Starting vm.[vm-name] log entry is written at a different point in time in 4.0). Not sure why it's so blazingly fast.

So in total the ~3s regression in VM handover times with 4.1 results from me migrating from the file pool driver in 4.0 to file-reflink in 4.1.

@rustybird Would it be possible to prepare these snapshots or whatever file-reflink is doing before passing over to Xen at another point in time to speed up the VM boot process? E.g. right after the last VM exit?

@marmarek Adding a log entry such as vm.[vm name]: Storage initialized. before Xen handover would be useful to more easily identify such issues.

Now only the qrexec time regression remains to be identified, but I guess it's either related to the previously analyzed VM kernel issue or the new 4.1 qrexec socket setup code.

EDIT: For explanation: The "Xen handover time" measured by my tool = storage pool setup time / whatever Qubes does + Xen init time.

marmarek commented 2 years ago

3. file in 4.0 appears to hand over to Xen almost instantaneously (unlikely: or the Starting vm.[vm-name] log entry is written at a different point in time in 4.0). Not sure why it's so blazingly fast.

There was a major change how "file" storage works between R4.0 and R4.1 - in R4.0 it is initialized later (called from libxl when creating blkback), while in R4.1 it's initialized before telling libxl anything.

DemiMarie commented 2 years ago
  1. file in 4.0 appears to hand over to Xen almost instantaneously (unlikely: or the Starting vm.[vm-name] log entry is written at a different point in time in 4.0). Not sure why it's so blazingly fast.

There was a major change how "file" storage works between R4.0 and R4.1 - in R4.0 it is initialized later (called from libxl when creating blkback), while in R4.1 it's initialized before telling libxl anything.

Does that mean that QubesOS/qubes-core-admin#399 was the cause of the regression?

rustybird commented 2 years ago

@3hhh

Would it be possible to prepare these snapshots or whatever file-reflink is doing before passing over to Xen at another point in time to speed up the VM boot process? E.g. right after the last VM exit?

Interesting idea. At least for save_on_stop volumes (e.g. the AppVM private volume) yes the reflink cost could be shifted to the volume stop phase by creating a private-precache.img, which would then be renamed to private-dirty.img on volume start. Seems like mostly a matter of getting cache invalidation right for revision handling, importing, etc.
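
For illustration, the underlying operation would be a cheap copy-on-write clone plus a rename (hypothetical file names, mirroring the private-precache.img idea above):

# at volume stop: pre-create the next image as a CoW clone (cheap on btrfs/XFS)
cp --reflink=always private.img private-precache.img
# at volume start: an instant rename replaces the reflink copy
mv private-precache.img private-dirty.img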

For snap_on_start volumes (e.g. the AppVM root volume) it's not feasible EDIT: but there's an opportunity to replace two reflink operations with just one.

-> https://github.com/QubesOS/qubes-core-admin/pull/449

rustybird commented 2 years ago

@DemiMarie

  1. file in 4.0 appears to hand over to Xen almost instantaneously (unlikely: or the Starting vm.[vm-name] log entry is written at a different point in time in 4.0). Not sure why it's so blazingly fast.

There was a major change how "file" storage works between R4.0 and R4.1 - in R4.0 it is initialized later (called from libxl when creating blkback), while in R4.1 it's initialized before telling libxl anything.

Does that mean that QubesOS/qubes-core-admin#399 was the cause of the regression?

No just a change in accounting. Before your PR all the dmsetup invocations etc. were done implicitly after handover to Xen.

3hhh commented 2 years ago

5. file in 4.0 appears to hand over to Xen almost instantaneously (unlikely: or the Starting vm.[vm-name] log entry is written at a different point in time in 4.0). Not sure why it's so blazingly fast.

There was a major change how "file" storage works between R4.0 and R4.1 - in R4.0 it is initialized later (called from libxl when creating blkback), while in R4.1 it's initialized before telling libxl anything.

Does that mean that QubesOS/qubes-core-admin#399 was the cause of the regression?

No just a change in accounting. Before your PR all the dmsetup invocations etc. were done implicitly after handover to Xen.

I'd rather suspect that letting libxl know about storage init allows it to run storage init and Xen init code in parallel.

So to me it sounds more like 4.0 file pool users were the rare birds that had unexpectedly good performance due to the late block script init, whereas everyone else was already synchronously initialized, and QubesOS/qubes-core-admin#399 finally grounded those rare birds in 4.1. ^^ At least that explains why my attempts in 4.1 with file pools didn't provide any performance gain.

Argh... At least the blkfront wait should be unrelated as it also happens in 4.0 after kernel 5.11.4-1 with file pools.

Maybe I'll look into the qrexec issue next...