Open alex-ab opened 3 months ago
I enabled AVX support for the NOVA kernel by porting relevant earlier work to our version, and got it working with the Seoul VMM on AMD and Intel machines. If all works out, Linux reports something along these lines:
[init -> seoul] VMM: # [ 0.000000] x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
[init -> seoul] VMM: # [ 0.000000] x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
[init -> seoul] VMM: # [ 0.000000] x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
[init -> seoul] VMM: # [ 0.000000] x86/fpu: Supporting XSAVE feature 0x020: 'AVX-512 opmask'
[init -> seoul] VMM: # [ 0.000000] x86/fpu: Supporting XSAVE feature 0x040: 'AVX-512 Hi256'
[init -> seoul] VMM: # [ 0.000000] x86/fpu: Supporting XSAVE feature 0x080: 'AVX-512 ZMM_Hi256'
[init -> seoul] VMM: # [ 0.000000] x86/fpu: xstate_offset[2]: 576, xstate_sizes[2]: 256
[init -> seoul] VMM: # [ 0.000000] x86/fpu: xstate_offset[5]: 832, xstate_sizes[5]: 64
[init -> seoul] VMM: # [ 0.000000] x86/fpu: xstate_offset[6]: 896, xstate_sizes[6]: 512
[init -> seoul] VMM: # [ 0.000000] x86/fpu: xstate_offset[7]: 1408, xstate_sizes[7]: 1024
[init -> seoul] VMM: # [ 0.000000] x86/fpu: Enabled xstate features 0xe7, context size is 2432 bytes, using 'compacted' format.
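The numbers in this log can be cross-checked by hand. The following is only a minimal sketch, not kernel or VMM code: it rebuilds the feature mask from the component bits listed in the log and sums the logged component sizes on top of the 512-byte legacy area and the 64-byte XSAVE header. The names `Component`, `feature_mask` and `context_size` are made up for illustration, the sizes are copied from the `xstate_sizes[]` lines, and the sketch assumes the components are packed back to back, as the logged offsets suggest.

```cpp
#include <cstdint>

// One XSAVE state component as printed in the log (illustrative type).
struct Component {
    unsigned      bit;   // bit index in the feature mask
    std::uint32_t size;  // extra bytes beyond the legacy area
};

// Sizes copied from the xstate_sizes[] lines of the log.
static const Component components[] = {
    {0, 0},     // x87, part of the 512-byte legacy area
    {1, 0},     // SSE, part of the 512-byte legacy area
    {2, 256},   // AVX registers
    {5, 64},    // AVX-512 opmask
    {6, 512},   // AVX-512 Hi256
    {7, 1024},  // AVX-512 ZMM_Hi256
};

// Legacy FXSAVE area (512 bytes) plus XSAVE header (64 bytes).
static const std::uint32_t legacy_and_header = 512 + 64;

// Build the feature mask from the listed component bits.
std::uint64_t feature_mask()
{
    std::uint64_t mask = 0;
    for (Component const &c : components)
        mask |= std::uint64_t(1) << c.bit;
    return mask;
}

// Sum the sizes of all components enabled in 'mask', packed back to back.
std::uint32_t context_size(std::uint64_t mask)
{
    std::uint32_t size = legacy_and_header;
    for (Component const &c : components)
        if (mask & (std::uint64_t(1) << c.bit))
            size += c.size;
    return size;
}
```

Calling `feature_mask()` and `context_size(0xe7)` should reproduce the mask and the context size that Linux prints, which is a quick sanity check that no component went missing on the way into the guest.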
During testing, I found the tool https://github.com/travisdowns/avx-turbo.git very helpful for checking that all variants of AVX are indeed enabled and working. It also measures the maximum number of operations per second that can be achieved.
The tool output from within a VM without AVX support reports:
CPUID highest leaf : [ dh]
Running as root : [NO ]
MSR reads supported : [NO ]
CPU pinning enabled : [YES]
CPU supports zeroupper: [NO ]
CPU supports AVX2 : [NO ]
CPU supports AVX-512F : [NO ]
CPU supports AVX-512VL: [NO ]
CPU supports AVX-512BW: [NO ]
CPU supports AVX-512CD: [NO ]
CPUID doesn't support leaf 0x15, falling back to manual TSC calibration.
tsc_freq = 2995.2 MHz (from calibration loop)
CPU brand string: 11th Gen Intel(R) Core(TM) i7-1185G7 @ 3.00GHz
2 available CPUs: [0, 1]
Can't use cpuid leaf 0xb to filter out hyperthreads, CPU too old or AMD
2 physical cores: [0, 1]
Will test up to 2 CPUs
Cores | ID | Description | OVRLP3 | Mops
1 | pause_only | pause instruction | 1.000 | 1649
1 | scalar_iadd | Scalar integer adds | 1.000 | 4290
Cores | ID | Description | OVRLP3 | Mops
2 | pause_only | pause instruction | 1.000 | 2829, 2840
2 | scalar_iadd | Scalar integer adds | 1.000 | 3884, 3873
And with AVX enabled:
CPUID highest leaf : [ dh]
Running as root : [NO ]
MSR reads supported : [NO ]
CPU pinning enabled : [YES]
CPU supports zeroupper: [YES]
CPU supports AVX2 : [YES]
CPU supports AVX-512F : [YES]
CPU supports AVX-512VL: [YES]
CPU supports AVX-512BW: [YES]
CPU supports AVX-512CD: [YES]
CPUID doesn't support leaf 0x15, falling back to manual TSC calibration.
tsc_freq = 2995.2 MHz (from calibration loop)
CPU brand string: 11th Gen Intel(R) Core(TM) i7-1185G7 @ 3.00GHz
2 available CPUs: [0, 1]
Can't use cpuid leaf 0xb to filter out hyperthreads, CPU too old or AMD
2 physical cores: [0, 1]
Will test up to 2 CPUs
Cores | ID | Description | OVRLP3 | Mops
1 | pause_only | pause instruction | 1.000 | 1649
1 | ucomis_clean | scalar ucomis (w/ vzeroupper) | 1.000 | 1065
1 | ucomis_dirty | scalar ucomis (no vzeroupper) | 1.000 | 1065
1 | scalar_iadd | Scalar integer adds | 1.000 | 4290
1 | avx128_iadd | 128-bit integer serial adds | 1.000 | 4290
1 | avx256_iadd | 256-bit integer serial adds | 1.000 | 4290
1 | avx512_iadd | 512-bit integer serial adds | 1.000 | 4290
1 | avx128_iadd16 | 128-bit integer serial adds zmm16 | 1.000 | 4290
1 | avx256_iadd16 | 256-bit integer serial adds zmm16 | 1.000 | 4291
1 | avx512_iadd16 | 512-bit integer serial adds zmm16 | 1.000 | 4290
1 | avx128_iadd_t | 128-bit integer parallel adds | 1.000 | 12870
1 | avx256_iadd_t | 256-bit integer parallel adds | 1.000 | 12870
1 | avx128_xor_zero | 128-bit zeroing xor | 1.000 | 21236
1 | avx256_xor_zero | 256-bit zeroing xor | 1.000 | 21240
1 | avx512_xor_zero | 512-bit zeroing xord | 1.000 | 21231
1 | avx128_mov_sparse | 128-bit reg-reg mov | 1.000 | 4290
1 | avx256_mov_sparse | 256-bit reg-reg mov | 1.000 | 4290
1 | avx512_mov_sparse | 512-bit reg-reg mov | 1.000 | 4291
1 | avx128_merge_sparse | 128-bit reg-reg merge mov | 1.000 | 4290
1 | avx256_merge_sparse | 256-bit reg-reg merge mov | 1.000 | 4290
1 | avx512_merge_sparse | 512-bit reg-reg merge mov | 1.000 | 4290
1 | avx128_vshift | 128-bit variable shift (vpsrlvd) | 1.000 | 4290
1 | avx256_vshift | 256-bit variable shift (vpsrlvd) | 1.000 | 4290
1 | avx512_vshift | 512-bit variable shift (vpsrlvd) | 1.000 | 4290
1 | avx128_vshift_t | 128-bit variable shift (vpsrlvd) | 1.000 | 8580
1 | avx256_vshift_t | 256-bit variable shift (vpsrlvd) | 1.000 | 8579
1 | avx512_vshift_t | 512-bit variable shift (vpsrlvd) | 1.000 | 4290
1 | avx128_vlzcnt | 128-bit lzcnt (vplzcntd) | 1.000 | 1073
1 | avx256_vlzcnt | 256-bit lzcnt (vplzcntd) | 1.000 | 1073
1 | avx512_vlzcnt | 512-bit lzcnt (vplzcntd) | 1.000 | 1073
1 | avx128_vlzcnt_t | 128-bit lzcnt (vplzcntd) | 1.000 | 8581
1 | avx256_vlzcnt_t | 256-bit lzcnt (vplzcntd) | 1.000 | 8579
1 | avx512_vlzcnt_t | 512-bit lzcnt (vplzcntd) | 1.000 | 4290
1 | avx128_imul | 128-bit integer muls (vpmuldq) | 1.000 | 858
1 | avx256_imul | 256-bit integer muls (vpmuldq) | 1.000 | 858
1 | avx512_imul | 512-bit integer muls (vpmuldq) | 1.000 | 858
1 | avx128_fma_sparse | 128-bit 64-bit sparse FMAs | 1.000 | 4290
1 | avx256_fma_sparse | 256-bit 64-bit sparse FMAs | 1.000 | 4290
1 | avx512_fma_sparse | 512-bit 64-bit sparse FMAs | 1.000 | 4290
1 | avx128_fma | 128-bit serial DP FMAs | 1.000 | 1073
1 | avx256_fma | 256-bit serial DP FMAs | 1.000 | 1073
1 | avx512_fma | 512-bit serial DP FMAs | 1.000 | 1073
1 | avx128_fma_t | 128-bit parallel DP FMAs | 1.000 | 8579
1 | avx256_fma_t | 256-bit parallel DP FMAs | 1.000 | 8580
1 | avx512_fma_t | 512-bit parallel DP FMAs | 1.000 | 4290
1 | avx512_vpermw | 512-bit serial WORD permute | 1.000 | 1073
1 | avx512_vpermw_t | 512-bit parallel WORD permute | 1.000 | 4290
1 | avx512_vpermd | 512-bit serial DWORD permute | 1.000 | 1430
1 | avx512_vpermd_t | 512-bit parallel DWORD permute | 1.000 | 4290
Cores | ID | Description | OVRLP3 | Mops
2 | pause_only | pause instruction | 1.000 | 2830, 2862
2 | ucomis_clean | scalar ucomis (w/ vzeroupper) | 1.000 | 1047, 1047
2 | ucomis_dirty | scalar ucomis (no vzeroupper) | 1.000 | 1047, 1046
2 | scalar_iadd | Scalar integer adds | 1.000 | 3878, 3884
2 | avx128_iadd | 128-bit integer serial adds | 1.000 | 3737, 3742
2 | avx256_iadd | 256-bit integer serial adds | 1.000 | 3737, 3746
2 | avx512_iadd | 512-bit integer serial adds | 1.000 | 3900, 3900
2 | avx128_iadd16 | 128-bit integer serial adds zmm16 | 1.000 | 3746, 3738
2 | avx256_iadd16 | 256-bit integer serial adds zmm16 | 1.000 | 3735, 3744
2 | avx512_iadd16 | 512-bit integer serial adds zmm16 | 1.000 | 3922, 3919
2 | avx128_iadd_t | 128-bit integer parallel adds | 1.000 | 6433, 6434
2 | avx256_iadd_t | 256-bit integer parallel adds | 1.000 | 6446, 6440
2 | avx128_xor_zero | 128-bit zeroing xor | 1.000 | 10619, 10615
2 | avx256_xor_zero | 256-bit zeroing xor | 1.000 | 10608, 10619
2 | avx512_xor_zero | 512-bit zeroing xord | 1.000 | 10597, 10613
2 | avx128_mov_sparse | 128-bit reg-reg mov | 1.000 | 3873, 3878
2 | avx256_mov_sparse | 256-bit reg-reg mov | 1.000 | 3871, 3884
2 | avx512_mov_sparse | 512-bit reg-reg mov | 1.000 | 3879, 3874
2 | avx128_merge_sparse | 128-bit reg-reg merge mov | 1.000 | 3877, 3879
2 | avx256_merge_sparse | 256-bit reg-reg merge mov | 1.000 | 3878, 3877
2 | avx512_merge_sparse | 512-bit reg-reg merge mov | 1.000 | 3879, 3878
2 | avx128_vshift | 128-bit variable shift (vpsrlvd) | 1.000 | 3914, 3915
2 | avx256_vshift | 256-bit variable shift (vpsrlvd) | 1.000 | 3915, 3917
2 | avx512_vshift | 512-bit variable shift (vpsrlvd) | 1.000 | 2095, 2095
2 | avx128_vshift_t | 128-bit variable shift (vpsrlvd) | 1.000 | 4292, 4293
2 | avx256_vshift_t | 256-bit variable shift (vpsrlvd) | 1.000 | 4284, 4291
2 | avx512_vshift_t | 512-bit variable shift (vpsrlvd) | 1.000 | 2090, 2091
2 | avx128_vlzcnt | 128-bit lzcnt (vplzcntd) | 1.000 | 1072, 1072
2 | avx256_vlzcnt | 256-bit lzcnt (vplzcntd) | 1.000 | 1072, 1072
2 | avx512_vlzcnt | 512-bit lzcnt (vplzcntd) | 1.000 | 1072, 1072
2 | avx128_vlzcnt_t | 128-bit lzcnt (vplzcntd) | 1.000 | 4299, 4295
2 | avx256_vlzcnt_t | 256-bit lzcnt (vplzcntd) | 1.000 | 4287, 4307
2 | avx512_vlzcnt_t | 512-bit lzcnt (vplzcntd) | 1.000 | 2089, 2092
2 | avx128_imul | 128-bit integer muls (vpmuldq) | 1.000 | 858, 858
2 | avx256_imul | 256-bit integer muls (vpmuldq) | 1.000 | 858, 858
2 | avx512_imul | 512-bit integer muls (vpmuldq) | 1.000 | 858, 858
2 | avx128_fma_sparse | 128-bit 64-bit sparse FMAs | 1.000 | 3877, 3877
2 | avx256_fma_sparse | 256-bit 64-bit sparse FMAs | 1.000 | 3880, 3878
2 | avx512_fma_sparse | 512-bit 64-bit sparse FMAs | 1.000 | 3877, 3874
2 | avx128_fma | 128-bit serial DP FMAs | 1.000 | 1072, 1072
2 | avx256_fma | 256-bit serial DP FMAs | 1.000 | 1072, 1072
2 | avx512_fma | 512-bit serial DP FMAs | 1.000 | 1072, 1072
2 | avx128_fma_t | 128-bit parallel DP FMAs | 1.000 | 4293, 4280
2 | avx256_fma_t | 256-bit parallel DP FMAs | 1.000 | 4285, 4294
2 | avx512_fma_t | 512-bit parallel DP FMAs | 1.000 | 2089, 2091
2 | avx512_vpermw | 512-bit serial WORD permute | 1.000 | 1069, 1069
2 | avx512_vpermw_t | 512-bit parallel WORD permute | 1.000 | 2145, 2146
2 | avx512_vpermd | 512-bit serial DWORD permute | 1.000 | 1430, 1430
2 | avx512_vpermd_t | 512-bit parallel DWORD permute | 1.000 | 2149, 2142
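The [YES]/[NO] lines at the top of the tool output reflect CPUID feature bits that the VMM exposes to the guest. As a rough sketch (this is not avx-turbo's detection code, and `AvxSupport` and `decode_leaf7_ebx` are hypothetical names), the AVX2 and AVX-512 flags can be decoded from the EBX value of CPUID leaf 7, subleaf 0. The function takes the raw register value as a parameter so it can be tried with arbitrary inputs.

```cpp
#include <cstdint>

// Feature bits in EBX of CPUID leaf 7, subleaf 0 (per the Intel SDM).
static const std::uint32_t BIT_AVX2     = 1u << 5;
static const std::uint32_t BIT_AVX512F  = 1u << 16;
static const std::uint32_t BIT_AVX512CD = 1u << 28;
static const std::uint32_t BIT_AVX512BW = 1u << 30;
static const std::uint32_t BIT_AVX512VL = 1u << 31;

// Decoded view of the flags the tool reports (illustrative type).
struct AvxSupport {
    bool avx2;
    bool avx512f;
    bool avx512cd;
    bool avx512bw;
    bool avx512vl;
};

// Translate a raw EBX value into individual feature flags.
AvxSupport decode_leaf7_ebx(std::uint32_t ebx)
{
    AvxSupport s;
    s.avx2     = (ebx & BIT_AVX2)     != 0;
    s.avx512f  = (ebx & BIT_AVX512F)  != 0;
    s.avx512cd = (ebx & BIT_AVX512CD) != 0;
    s.avx512bw = (ebx & BIT_AVX512BW) != 0;
    s.avx512vl = (ebx & BIT_AVX512VL) != 0;
    return s;
}
```

Inside a guest built with GCC or Clang, the raw value could be obtained with `__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx)` from `<cpuid.h>` and passed to `decode_leaf7_ebx`. Note that a set CPUID bit alone is not sufficient: the OS must also have enabled the corresponding XSAVE state, which is what the Linux boot log checks.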
Additionally, on a U4711 notebook, I ran MotionMark 1.3.1 from browserbench.org in Firefox inside a Debian 12 VM, as already used by @jschlatow during his browser performance analysis on Genodians.org. Even though the results are not very stable and fluctuate, AVX seems to have a positive effect. Best results:
w/o AVX, but with SSE*: 23.74 @ 60fps +- 224.84 %
with AVX commits : 167.28 @ 60fps +- 29.97 %
Additionally, I downloaded a Jellyfish test video, https://repo.jellyfin.org/jellyfish/jellyfish-30-mbps-hd-h264.mkv, and transcoded it with ffmpeg to see some impact. Both files are attached, and the diff of the output is below. Some improvements are visible.
ffmpeg -benchmark -i jellyfish-30-mbps-hd-h264.mkv -c:v libx265 -preset medium -crf 20 -c:a copy jellyfish-30-mbps-hd-h265-crf20.mkv
x265 [info]: using cpu capabilities: MMX2 SSE2Fast SSSE3 SSE4.2
encoded 900 frames in 300.06s (3.00 fps), 11242.09 kb/s, Avg QP:24.48
x265 [info]: using cpu capabilities: MMX2 SSE2Fast SSSE3 SSE4.2 AVX FMA3
encoded 900 frames in 243.94s (3.69 fps), 11242.09 kb/s, Avg QP:24.48
Another test from the Phoronix Test Suite, Bosphorus, executed manually (so not via the test suite), shows the following results. The traces and command invocations are part of the attached log files.
x265 [info]: using cpu capabilities: MMX2 SSE2Fast SSSE3 SSE4.2
encoded 600 frames in 69.54s (8.63 fps), 1271.47 kb/s, Avg QP:33.68
x265 [info]: using cpu capabilities: MMX2 SSE2Fast SSSE3 SSE4.2 AVX FMA3
encoded 600 frames in 52.36s (11.46 fps), 1271.47 kb/s, Avg QP:33.68
sse_x265_Bosphorus_1920x1080.txt avx_x265_Bosphorus_1920x1080.txt
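To put the two x265 runs into perspective, the wall-clock times above can be turned into speedup factors. This is plain arithmetic on the numbers from the logs; the helper names `speedup` and `frames_per_second` are made up for this sketch. With the times above, the jellyfish transcode comes out at roughly 1.23x and Bosphorus at roughly 1.33x with AVX/FMA3 enabled.

```cpp
// Speedup of the second run relative to the first, from wall-clock seconds.
double speedup(double seconds_before, double seconds_after)
{
    return seconds_before / seconds_after;
}

// Frames per second as reported by x265 (frames divided by elapsed time).
double frames_per_second(unsigned frames, double seconds)
{
    return frames / seconds;
}

// Values copied from the logs above.
double jellyfish_speedup() { return speedup(300.06, 243.94); }
double bosphorus_speedup() { return speedup(69.54, 52.36); }
```

The bitrate and average QP are identical in both runs of each test, so the encoder produced the same output and only the elapsed time differs.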
depot_autopilot/test-pthread failed last night with #UD on x86_64.
[2024-08-13 03:30:11] [init -> depot_autopilot] 1.308 [init -> test-pthread] main thread: start PTHREAD_MUTEX_NORMAL stress test
[2024-08-13 03:30:11] Warning: unresolvable exception 6, pd 'init -> dynamic -> test-pthread -> test-pthread', thread 'pthread.305', cpu 2, ip=0x78a53 sp=0x405fed80 bp=0x898e0 no signal handler
[2024-08-13 03:30:11] Warning: unresolvable exception 6, pd 'init -> dynamic -> test-pthread -> test-pthread', thread 'pthread.310', cpu 5, ip=0x78a53 sp=0x405fed80 bp=0x898e0 no signal handler
[2024-08-13 03:30:11] Warning: unresolvable exception 6, pd 'init -> dynamic -> test-pthread -> test-pthread', thread 'pthread.307', cpu 6, ip=0x78a53 sp=0x405fed80 bp=0x898e0 no signal handler
[2024-08-13 03:30:11] Warning: unresolvable exception 6, pd 'init -> dynamic -> test-pthread -> test-pthread', thread 'pthread.311', cpu 7, ip=0x78a53 sp=0x405fed80 bp=0x898e0 no signal handler
[2024-08-13 03:30:11] Warning: unresolvable exception 6, pd 'init -> dynamic -> test-pthread -> test-pthread', thread 'pthread.309', cpu 3, ip=0x78a53 sp=0x405fed80 bp=0x898e0 no signal handler
[2024-08-13 03:30:11] Warning: unresolvable exception 6, pd 'init -> dynamic -> test-pthread -> test-pthread', thread 'pthread.308', cpu 1, ip=0x78a53 sp=0x405fed80 bp=0x898e0 no signal handler
[2024-08-13 03:30:11] Warning: unresolvable exception 6, pd 'init -> dynamic -> test-pthread -> test-pthread', thread 'pthread.306', cpu 4, ip=0x78a53 sp=0x405fed80 bp=0x898e0 no signal handler
[2024-08-13 03:30:15] Warning: unresolvable exception 6, pd 'init -> dynamic -> test-pthread -> test-pthread', thread 'pthread.303', cpu 7, ip=0x78a53 sp=0x405fed80 bp=0x898e0 no signal handler
[2024-08-13 03:31:40] [init -> depot_autopilot]
[2024-08-13 03:31:40] [init -> depot_autopilot] test-pthread failed 89.987 timeout 90 sec
The same occurred with the AVX patches from 2024-08-06, at 2024-08-07 03:51:53.
I added the commits to get AVX working with vbox6, tested with Debian, Ubuntu, and Win10 VMs on a modular Sculpt.
@alex-ab would you mind recording the remaining problems with avx-turbo in this issue? I agree that we don't have to fix them if they are specific to the use of the tool only and don't happen in real scenarios.
I found the issue with the test. When calculating the TSC frequency, it divides by 0, which fails. I added a patch for use with vbox6: instead of reading out the frequency (which vbox6 does not provide), the tool measures it, and then the whole AVX test works.
--- a/tsc-support.cpp
+++ b/tsc-support.cpp
@@ -41,7 +41,8 @@ uint64_t get_tsc_from_cpuid_inner() {
if (family.family == 6) {
- if (family.model == 0x4E || family.model == 0x5E || family.model == 0x8E || family.model == 0x9E) {
+ printf("%s:%u division by %u is not good !!!\n", __func__, __LINE__, cpuid15.eax);
+ if (cpuid15.eax && (family.model == 0x4E || family.model == 0x5E || family.model == 0x8E || family.model == 0x9E)) {
// skylake client or kabylake
return (int64_t)24000000 * cpuid15.ebx / cpuid15.eax; // 24 MHz crystal clock
}
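The idea behind the patch, stripped of the debug output, is to treat a zero denominator in CPUID leaf 0x15 as "frequency not provided" and let the caller fall back to calibration. A minimal standalone sketch of that guard follows; the `Cpuid15` struct and both function names are hypothetical and not taken from avx-turbo, and returning 0 for "unknown" is a choice made for this example.

```cpp
#include <cstdint>

// Raw registers of CPUID leaf 0x15 (illustrative struct).
struct Cpuid15 {
    std::uint32_t eax;  // denominator of the TSC/crystal clock ratio
    std::uint32_t ebx;  // numerator of the TSC/crystal clock ratio
};

// TSC frequency in Hz derived from leaf 0x15, or 0 if the leaf is unusable.
// A zero denominator or numerator means the ratio is not enumerated, which
// is what a VMM reports when it does not provide this leaf.
std::uint64_t tsc_hz_from_cpuid15(Cpuid15 regs, std::uint64_t crystal_hz)
{
    if (regs.eax == 0 || regs.ebx == 0)
        return 0;
    return crystal_hz * regs.ebx / regs.eax;
}

// Prefer the CPUID-derived value, otherwise use a measured one.
std::uint64_t tsc_hz_or_calibrated(Cpuid15 regs, std::uint64_t crystal_hz,
                                   std::uint64_t calibrated_hz)
{
    std::uint64_t const hz = tsc_hz_from_cpuid15(regs, crystal_hz);
    return hz ? hz : calibrated_hz;
}
```

With all-zero registers, as seen under vbox6, `tsc_hz_or_calibrated` returns the calibrated value instead of dividing by zero.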
@chelmuth: please add the fixup and the aes commit to staging from my staging branch
Thanks, merged to staging.
FWIW, 4903595 enables RDRAND and RDSEED. I've been using the commit for some time now w/o any noticeable problems.
Can you report any positive performance (or other) impact?
@chelmuth well, I did not perform any testing so I cannot comment either way (especially as I have not enabled them in isolation, i.e. AVX/AES was already enabled and could skew the results).
The various AVX FPU extensions for x86 CPUs can be used for media-centric and, more generally, math-heavy workloads (besides GPUs). The feature is nowadays common across all relevant CPU vendors in several variants (AVX, AVX2, AVX-512). Especially in the context of VMs, enabling it may improve the runtime and/or CPU usage of guest applications that are capable of using these FPU extensions. Let us enable it.
Steps to work on or to consider: