x86: add support to use AVX* CPU features

alex-ab commented 3 months ago

The AVX FPU extensions of x86 CPUs can speed up media-centric and, more generally, math-heavy workloads (besides using GPUs). Nowadays the feature is common across all relevant CPU vendors in several variants (AVX, AVX2, AVX-512). Especially in the context of VMs, enabling it may improve the runtime and/or CPU usage of guest applications that are capable of using these FPU extensions. Let us enable it.

Steps to work on or to consider:

[x] nova kernel support
[ ] base-hw kernel support
[ ] other kernel support
[x] VM session adaptations, e.g. storing/loading more FPU state, size varies depending on host features
[x] Seoul VMM support
[x] VBox6 VMM support
[ ] extended Genode framework support, e.g. compiler switches, where appropriate store/load more FPU state
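
The point that the FPU state size varies with the host features can be illustrated with CPUID leaf 0xD, sub-leaf 0, which reports the size of the XSAVE area in the standard (non-compacted) format. The following is only a minimal sketch under stated assumptions, not the code from the NOVA or Genode commits: the names Xsave_info, decode_leaf_0xd and query_xsave_info are made up for illustration, and a GCC or Clang toolchain providing <cpuid.h> is assumed.

```cpp
#include <cassert>
#include <cstdint>
#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>   /* GCC/Clang helper header, assumed to be available */
#endif

/* Hypothetical helper type, not taken from the Genode or NOVA sources */
struct Xsave_info
{
	uint64_t supported_features; /* XCR0 bits the CPU supports (EDX:EAX)          */
	uint32_t enabled_size;       /* bytes needed for the enabled features (EBX)   */
	uint32_t max_size;           /* bytes needed for all supported features (ECX) */
};

/* Pure decoding of the four registers returned by CPUID.(EAX=0xD, ECX=0) */
constexpr Xsave_info decode_leaf_0xd(uint32_t eax, uint32_t ebx,
                                     uint32_t ecx, uint32_t edx)
{
	return Xsave_info { (uint64_t(edx) << 32) | eax, ebx, ecx };
}

/* Query the running CPU, returns false if the leaf is not available */
inline bool query_xsave_info(Xsave_info &out)
{
#if defined(__x86_64__) || defined(__i386__)
	unsigned eax = 0, ebx = 0, ecx = 0, edx = 0;
	if (!__get_cpuid_count(0xd, 0, &eax, &ebx, &ecx, &edx))
		return false;
	out = decode_leaf_0xd(eax, ebx, ecx, edx);
	return true;
#else
	(void)out;   /* no CPUID on non-x86 hosts */
	return false;
#endif
}

/* The size field is taken from EBX unchanged */
static_assert(decode_leaf_0xd(0xe7, 2432, 0, 0).enabled_size == 2432,
              "EBX carries the size for the enabled features");
```

A component that stores FPU state on behalf of a VM could call query_xsave_info once at startup and size its buffers from max_size instead of assuming the fixed 512 bytes of the legacy FXSAVE area.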

alex-ab commented 3 months ago

I enabled the support in the NOVA kernel by porting relevant former work to our version, and managed to enable the support in the Seoul VMM on AMD and Intel machines. If all works out, Linux reports something along these lines:

[init -> seoul] VMM: #   [    0.000000] x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
[init -> seoul] VMM: #   [    0.000000] x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
[init -> seoul] VMM: #   [    0.000000] x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
[init -> seoul] VMM: #   [    0.000000] x86/fpu: Supporting XSAVE feature 0x020: 'AVX-512 opmask'
[init -> seoul] VMM: #   [    0.000000] x86/fpu: Supporting XSAVE feature 0x040: 'AVX-512 Hi256'
[init -> seoul] VMM: #   [    0.000000] x86/fpu: Supporting XSAVE feature 0x080: 'AVX-512 ZMM_Hi256'
[init -> seoul] VMM: #   [    0.000000] x86/fpu: xstate_offset[2]:  576, xstate_sizes[2]:  256
[init -> seoul] VMM: #   [    0.000000] x86/fpu: xstate_offset[5]:  832, xstate_sizes[5]:   64
[init -> seoul] VMM: #   [    0.000000] x86/fpu: xstate_offset[6]:  896, xstate_sizes[6]:  512
[init -> seoul] VMM: #   [    0.000000] x86/fpu: xstate_offset[7]: 1408, xstate_sizes[7]: 1024
[init -> seoul] VMM: #   [    0.000000] x86/fpu: Enabled xstate features 0xe7, context size is 2432 bytes, using 'compacted' format.
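
The numbers in this log are consistent with each other, which can be checked with a few lines of arithmetic. The sketch below packs the enabled components back to back after the 512-byte legacy area and the 64-byte XSAVE header. The component sizes are copied from the log above; the names Component, component_offset and context_size are hypothetical, and the optional 64-byte alignment of components in the compacted format is ignored (all sizes here are multiples of 64 anyway).

```cpp
#include <cassert>
#include <cstdint>

/* One XSAVE state component: its feature bit and its size in bytes */
struct Component { unsigned bit; uint32_t size; };

/* 512 bytes legacy x87/SSE area plus 64 bytes XSAVE header */
constexpr uint32_t LEGACY_AND_HEADER = 512 + 64;

/* Sizes as reported by the guest kernel in the log above */
constexpr Component components[] = {
	{ 2,  256 },  /* 0x004 'AVX registers'     */
	{ 5,   64 },  /* 0x020 'AVX-512 opmask'    */
	{ 6,  512 },  /* 0x040 'AVX-512 Hi256'     */
	{ 7, 1024 },  /* 0x080 'AVX-512 ZMM_Hi256' */
};

/* Offset of the component with feature bit 'bit' when packed back to back */
constexpr uint32_t component_offset(uint64_t features, unsigned bit)
{
	uint32_t offset = LEGACY_AND_HEADER;
	for (const Component &c : components) {
		if (c.bit == bit)
			break;
		if (features & (1ull << c.bit))
			offset += c.size;
	}
	return offset;
}

/* Total context size for the given feature mask */
constexpr uint32_t context_size(uint64_t features)
{
	uint32_t size = LEGACY_AND_HEADER;
	for (const Component &c : components)
		if (features & (1ull << c.bit))
			size += c.size;
	return size;
}

/* Matches "Enabled xstate features 0xe7, context size is 2432 bytes" */
static_assert(context_size(0xe7) == 2432, "context size from the log");
static_assert(component_offset(0xe7, 7) == 1408, "xstate_offset[7] from the log");
```

With only x87, SSE and AVX enabled (mask 0x7), the same calculation yields 832 bytes, which shows how much the state to store and load per vCPU grows once AVX-512 is exposed.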

Additionally, during testing I found the following tool very helpful to verify that all variants of AVX are indeed enabled and working: https://github.com/travisdowns/avx-turbo.git. It also measures the maximum number of operations per second that can be achieved.

The tool output from within a VM without AVX support reports:

CPUID highest leaf    : [ dh]
Running as root       : [NO ]
MSR reads supported   : [NO ]
CPU pinning enabled   : [YES]
CPU supports zeroupper: [NO ]
CPU supports AVX2     : [NO ]
CPU supports AVX-512F : [NO ]
CPU supports AVX-512VL: [NO ]
CPU supports AVX-512BW: [NO ]
CPU supports AVX-512CD: [NO ]
CPUID doesn't support leaf 0x15, falling back to manual TSC calibration.
tsc_freq = 2995.2 MHz (from calibration loop)
CPU brand string: 11th Gen Intel(R) Core(TM) i7-1185G7 @ 3.00GHz
2 available CPUs: [0, 1]
Can't use cpuid leaf 0xb to filter out hyperthreads, CPU too old or AMD
2 physical cores: [0, 1]
Will test up to 2 CPUs
Cores | ID          | Description         | OVRLP3 | Mops
1     | pause_only  | pause instruction   |  1.000 | 1649
1     | scalar_iadd | Scalar integer adds |  1.000 | 4290

Cores | ID          | Description         | OVRLP3 |       Mops
2     | pause_only  | pause instruction   |  1.000 | 2829, 2840
2     | scalar_iadd | Scalar integer adds |  1.000 | 3884, 3873

And with AVX enabled:

CPUID highest leaf    : [ dh]
Running as root       : [NO ]
MSR reads supported   : [NO ]
CPU pinning enabled   : [YES]
CPU supports zeroupper: [YES]
CPU supports AVX2     : [YES]
CPU supports AVX-512F : [YES]
CPU supports AVX-512VL: [YES]
CPU supports AVX-512BW: [YES]
CPU supports AVX-512CD: [YES]
CPUID doesn't support leaf 0x15, falling back to manual TSC calibration.
tsc_freq = 2995.2 MHz (from calibration loop)
CPU brand string: 11th Gen Intel(R) Core(TM) i7-1185G7 @ 3.00GHz
2 available CPUs: [0, 1]
Can't use cpuid leaf 0xb to filter out hyperthreads, CPU too old or AMD
2 physical cores: [0, 1]
Will test up to 2 CPUs
Cores | ID                  | Description                       | OVRLP3 |  Mops
1     | pause_only          | pause instruction                 |  1.000 |  1649
1     | ucomis_clean        | scalar ucomis (w/ vzeroupper)     |  1.000 |  1065
1     | ucomis_dirty        | scalar ucomis (no vzeroupper)     |  1.000 |  1065
1     | scalar_iadd         | Scalar integer adds               |  1.000 |  4290
1     | avx128_iadd         | 128-bit integer serial adds       |  1.000 |  4290
1     | avx256_iadd         | 256-bit integer serial adds       |  1.000 |  4290
1     | avx512_iadd         | 512-bit integer serial adds       |  1.000 |  4290
1     | avx128_iadd16       | 128-bit integer serial adds zmm16 |  1.000 |  4290
1     | avx256_iadd16       | 256-bit integer serial adds zmm16 |  1.000 |  4291
1     | avx512_iadd16       | 512-bit integer serial adds zmm16 |  1.000 |  4290
1     | avx128_iadd_t       | 128-bit integer parallel adds     |  1.000 | 12870
1     | avx256_iadd_t       | 256-bit integer parallel adds     |  1.000 | 12870
1     | avx128_xor_zero     | 128-bit zeroing xor               |  1.000 | 21236
1     | avx256_xor_zero     | 256-bit zeroing xor               |  1.000 | 21240
1     | avx512_xor_zero     | 512-bit zeroing xord              |  1.000 | 21231
1     | avx128_mov_sparse   | 128-bit reg-reg mov               |  1.000 |  4290
1     | avx256_mov_sparse   | 256-bit reg-reg mov               |  1.000 |  4290
1     | avx512_mov_sparse   | 512-bit reg-reg mov               |  1.000 |  4291
1     | avx128_merge_sparse | 128-bit reg-reg merge mov         |  1.000 |  4290
1     | avx256_merge_sparse | 256-bit reg-reg merge mov         |  1.000 |  4290
1     | avx512_merge_sparse | 512-bit reg-reg merge mov         |  1.000 |  4290
1     | avx128_vshift       | 128-bit variable shift (vpsrlvd)  |  1.000 |  4290
1     | avx256_vshift       | 256-bit variable shift (vpsrlvd)  |  1.000 |  4290
1     | avx512_vshift       | 512-bit variable shift (vpsrlvd)  |  1.000 |  4290
1     | avx128_vshift_t     | 128-bit variable shift (vpsrlvd)  |  1.000 |  8580
1     | avx256_vshift_t     | 256-bit variable shift (vpsrlvd)  |  1.000 |  8579
1     | avx512_vshift_t     | 512-bit variable shift (vpsrlvd)  |  1.000 |  4290
1     | avx128_vlzcnt       | 128-bit lzcnt (vplzcntd)          |  1.000 |  1073
1     | avx256_vlzcnt       | 256-bit lzcnt (vplzcntd)          |  1.000 |  1073
1     | avx512_vlzcnt       | 512-bit lzcnt (vplzcntd)          |  1.000 |  1073
1     | avx128_vlzcnt_t     | 128-bit lzcnt (vplzcntd)          |  1.000 |  8581
1     | avx256_vlzcnt_t     | 256-bit lzcnt (vplzcntd)          |  1.000 |  8579
1     | avx512_vlzcnt_t     | 512-bit lzcnt (vplzcntd)          |  1.000 |  4290
1     | avx128_imul         | 128-bit integer muls (vpmuldq)    |  1.000 |   858
1     | avx256_imul         | 256-bit integer muls (vpmuldq)    |  1.000 |   858
1     | avx512_imul         | 512-bit integer muls (vpmuldq)    |  1.000 |   858
1     | avx128_fma_sparse   | 128-bit 64-bit sparse FMAs        |  1.000 |  4290
1     | avx256_fma_sparse   | 256-bit 64-bit sparse FMAs        |  1.000 |  4290
1     | avx512_fma_sparse   | 512-bit 64-bit sparse FMAs        |  1.000 |  4290
1     | avx128_fma          | 128-bit serial DP FMAs            |  1.000 |  1073
1     | avx256_fma          | 256-bit serial DP FMAs            |  1.000 |  1073
1     | avx512_fma          | 512-bit serial DP FMAs            |  1.000 |  1073
1     | avx128_fma_t        | 128-bit parallel DP FMAs          |  1.000 |  8579
1     | avx256_fma_t        | 256-bit parallel DP FMAs          |  1.000 |  8580
1     | avx512_fma_t        | 512-bit parallel DP FMAs          |  1.000 |  4290
1     | avx512_vpermw       | 512-bit serial WORD permute       |  1.000 |  1073
1     | avx512_vpermw_t     | 512-bit parallel WORD permute     |  1.000 |  4290
1     | avx512_vpermd       | 512-bit serial DWORD permute      |  1.000 |  1430
1     | avx512_vpermd_t     | 512-bit parallel DWORD permute    |  1.000 |  4290

Cores | ID                  | Description                       | OVRLP3 |         Mops
2     | pause_only          | pause instruction                 |  1.000 |   2830, 2862
2     | ucomis_clean        | scalar ucomis (w/ vzeroupper)     |  1.000 |   1047, 1047
2     | ucomis_dirty        | scalar ucomis (no vzeroupper)     |  1.000 |   1047, 1046
2     | scalar_iadd         | Scalar integer adds               |  1.000 |   3878, 3884
2     | avx128_iadd         | 128-bit integer serial adds       |  1.000 |   3737, 3742
2     | avx256_iadd         | 256-bit integer serial adds       |  1.000 |   3737, 3746
2     | avx512_iadd         | 512-bit integer serial adds       |  1.000 |   3900, 3900
2     | avx128_iadd16       | 128-bit integer serial adds zmm16 |  1.000 |   3746, 3738
2     | avx256_iadd16       | 256-bit integer serial adds zmm16 |  1.000 |   3735, 3744
2     | avx512_iadd16       | 512-bit integer serial adds zmm16 |  1.000 |   3922, 3919
2     | avx128_iadd_t       | 128-bit integer parallel adds     |  1.000 |   6433, 6434
2     | avx256_iadd_t       | 256-bit integer parallel adds     |  1.000 |   6446, 6440
2     | avx128_xor_zero     | 128-bit zeroing xor               |  1.000 | 10619, 10615
2     | avx256_xor_zero     | 256-bit zeroing xor               |  1.000 | 10608, 10619
2     | avx512_xor_zero     | 512-bit zeroing xord              |  1.000 | 10597, 10613
2     | avx128_mov_sparse   | 128-bit reg-reg mov               |  1.000 |   3873, 3878
2     | avx256_mov_sparse   | 256-bit reg-reg mov               |  1.000 |   3871, 3884
2     | avx512_mov_sparse   | 512-bit reg-reg mov               |  1.000 |   3879, 3874
2     | avx128_merge_sparse | 128-bit reg-reg merge mov         |  1.000 |   3877, 3879
2     | avx256_merge_sparse | 256-bit reg-reg merge mov         |  1.000 |   3878, 3877
2     | avx512_merge_sparse | 512-bit reg-reg merge mov         |  1.000 |   3879, 3878
2     | avx128_vshift       | 128-bit variable shift (vpsrlvd)  |  1.000 |   3914, 3915
2     | avx256_vshift       | 256-bit variable shift (vpsrlvd)  |  1.000 |   3915, 3917
2     | avx512_vshift       | 512-bit variable shift (vpsrlvd)  |  1.000 |   2095, 2095
2     | avx128_vshift_t     | 128-bit variable shift (vpsrlvd)  |  1.000 |   4292, 4293
2     | avx256_vshift_t     | 256-bit variable shift (vpsrlvd)  |  1.000 |   4284, 4291
2     | avx512_vshift_t     | 512-bit variable shift (vpsrlvd)  |  1.000 |   2090, 2091
2     | avx128_vlzcnt       | 128-bit lzcnt (vplzcntd)          |  1.000 |   1072, 1072
2     | avx256_vlzcnt       | 256-bit lzcnt (vplzcntd)          |  1.000 |   1072, 1072
2     | avx512_vlzcnt       | 512-bit lzcnt (vplzcntd)          |  1.000 |   1072, 1072
2     | avx128_vlzcnt_t     | 128-bit lzcnt (vplzcntd)          |  1.000 |   4299, 4295
2     | avx256_vlzcnt_t     | 256-bit lzcnt (vplzcntd)          |  1.000 |   4287, 4307
2     | avx512_vlzcnt_t     | 512-bit lzcnt (vplzcntd)          |  1.000 |   2089, 2092
2     | avx128_imul         | 128-bit integer muls (vpmuldq)    |  1.000 |    858,  858
2     | avx256_imul         | 256-bit integer muls (vpmuldq)    |  1.000 |    858,  858
2     | avx512_imul         | 512-bit integer muls (vpmuldq)    |  1.000 |    858,  858
2     | avx128_fma_sparse   | 128-bit 64-bit sparse FMAs        |  1.000 |   3877, 3877
2     | avx256_fma_sparse   | 256-bit 64-bit sparse FMAs        |  1.000 |   3880, 3878
2     | avx512_fma_sparse   | 512-bit 64-bit sparse FMAs        |  1.000 |   3877, 3874
2     | avx128_fma          | 128-bit serial DP FMAs            |  1.000 |   1072, 1072
2     | avx256_fma          | 256-bit serial DP FMAs            |  1.000 |   1072, 1072
2     | avx512_fma          | 512-bit serial DP FMAs            |  1.000 |   1072, 1072
2     | avx128_fma_t        | 128-bit parallel DP FMAs          |  1.000 |   4293, 4280
2     | avx256_fma_t        | 256-bit parallel DP FMAs          |  1.000 |   4285, 4294
2     | avx512_fma_t        | 512-bit parallel DP FMAs          |  1.000 |   2089, 2091
2     | avx512_vpermw       | 512-bit serial WORD permute       |  1.000 |   1069, 1069
2     | avx512_vpermw_t     | 512-bit parallel WORD permute     |  1.000 |   2145, 2146
2     | avx512_vpermd       | 512-bit serial DWORD permute      |  1.000 |   1430, 1430
2     | avx512_vpermd_t     | 512-bit parallel DWORD permute    |  1.000 |   2149, 2142
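
The [YES]/[NO] lines for AVX2 and the AVX-512 subsets at the top of the tool output correspond to feature bits that a guest reads from CPUID leaf 7, sub-leaf 0, register EBX. The following sketch shows such a check; it is not avx-turbo's actual code, the names Avx_features, decode_leaf7_ebx and query_avx_features are hypothetical, and <cpuid.h> from GCC or Clang is assumed. Note that CPUID only reports what the (virtual) CPU offers; the kernel additionally has to enable the corresponding state components before the instructions can be used.

```cpp
#include <cassert>
#include <cstdint>
#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>   /* GCC/Clang helper header, assumed to be available */
#endif

/* Hypothetical result type for illustration */
struct Avx_features { bool avx2, avx512f, avx512cd, avx512bw, avx512vl; };

constexpr bool bit(uint32_t value, unsigned n) { return (value >> n) & 1u; }

/* Decode EBX of CPUID.(EAX=7, ECX=0) */
constexpr Avx_features decode_leaf7_ebx(uint32_t ebx)
{
	return Avx_features {
		bit(ebx,  5),   /* AVX2     */
		bit(ebx, 16),   /* AVX-512F  */
		bit(ebx, 28),   /* AVX-512CD */
		bit(ebx, 30),   /* AVX-512BW */
		bit(ebx, 31)    /* AVX-512VL */
	};
}

/* Query the running CPU, returns false if leaf 7 is not available */
inline bool query_avx_features(Avx_features &out)
{
#if defined(__x86_64__) || defined(__i386__)
	unsigned eax = 0, ebx = 0, ecx = 0, edx = 0;
	if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
		return false;
	out = decode_leaf7_ebx(ebx);
	return true;
#else
	(void)out;   /* no CPUID on non-x86 hosts */
	return false;
#endif
}

static_assert(decode_leaf7_ebx(1u << 5).avx2, "bit 5 is AVX2");
static_assert(!decode_leaf7_ebx(1u << 5).avx512f, "AVX2 alone does not imply AVX-512F");
```

Running query_avx_features inside the VM before and after applying the commits would be expected to flip the fields in the same way as the tool's report above.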

Additionally, on a U4711 notebook I ran MotionMark 1.3.1 from browserbench.org in a Debian 12 VM with Firefox, as already used by @jschlatow during his browser performance analysis on Genodians.org. Even though the results are not very stable and fluctuate, AVX seems to have a positive effect. Best results:

w/o  AVX, but with SSE*:  23.74 @ 60fps +- 224.84 %
with AVX commits       : 167.28 @ 60fps +-  29.97 %

alex-ab commented 3 months ago

Additionally, I downloaded a jellyfish video, https://repo.jellyfin.org/jellyfish/jellyfish-30-mbps-hd-h264.mkv, and used ffmpeg to transcode the file in order to see some impact. Both log files are attached, and the diff of the output is below. With AVX, the encoding time drops from 300.06 s to 243.94 s (3.00 fps vs. 3.69 fps).

ffmpeg -benchmark -i jellyfish-30-mbps-hd-h264.mkv -c:v libx265 -preset medium -crf 20 -c:a copy jellyfish-30-mbps-hd-h265-crf20.mkv

x265 [info]: using cpu capabilities: MMX2 SSE2Fast SSSE3 SSE4.2
encoded 900 frames in 300.06s (3.00 fps), 11242.09 kb/s, Avg QP:24.48

x265 [info]: using cpu capabilities: MMX2 SSE2Fast SSSE3 SSE4.2 AVX FMA3
encoded 900 frames in 243.94s (3.69 fps), 11242.09 kb/s, Avg QP:24.48

sse_ffmpeg_30.txt avx_ffmpeg_30.txt

alex-ab commented 3 months ago

Another test from the Phoronix test suite, e.g. Bosphorus, executed manually (so not using the test suite), shows the following results. With AVX, the encoding time drops from 69.54 s to 52.36 s (8.63 fps vs. 11.46 fps). The traces and command invocation are part of the attached log files.

x265 [info]: using cpu capabilities: MMX2 SSE2Fast SSSE3 SSE4.2
encoded 600 frames in 69.54s (8.63 fps), 1271.47 kb/s, Avg QP:33.68

x265 [info]: using cpu capabilities: MMX2 SSE2Fast SSSE3 SSE4.2 AVX FMA3
encoded 600 frames in 52.36s (11.46 fps), 1271.47 kb/s, Avg QP:33.68

sse_x265_Bosphorus_1920x1080.txt avx_x265_Bosphorus_1920x1080.txt

chelmuth commented 3 months ago

Merged https://github.com/genodelabs/genode/commit/f5a9d5e65f98450dea0604f36c00f20adbff548a and https://github.com/genodelabs/genode-world/commit/eee31d2801f96689f05a6a667aeed08a47852738 to staging.

chelmuth commented 3 months ago

depot_autopilot/test-pthread failed last night with #UD on x86_64.

[2024-08-13 03:30:11] [init -> depot_autopilot] 1.308 [init -> test-pthread] main thread: start PTHREAD_MUTEX_NORMAL stress test
[2024-08-13 03:30:11] Warning: unresolvable exception 6, pd 'init -> dynamic -> test-pthread -> test-pthread', thread 'pthread.305', cpu 2, ip=0x78a53 sp=0x405fed80 bp=0x898e0 no signal handler
[2024-08-13 03:30:11] Warning: unresolvable exception 6, pd 'init -> dynamic -> test-pthread -> test-pthread', thread 'pthread.310', cpu 5, ip=0x78a53 sp=0x405fed80 bp=0x898e0 no signal handler
[2024-08-13 03:30:11] Warning: unresolvable exception 6, pd 'init -> dynamic -> test-pthread -> test-pthread', thread 'pthread.307', cpu 6, ip=0x78a53 sp=0x405fed80 bp=0x898e0 no signal handler
[2024-08-13 03:30:11] Warning: unresolvable exception 6, pd 'init -> dynamic -> test-pthread -> test-pthread', thread 'pthread.311', cpu 7, ip=0x78a53 sp=0x405fed80 bp=0x898e0 no signal handler
[2024-08-13 03:30:11] Warning: unresolvable exception 6, pd 'init -> dynamic -> test-pthread -> test-pthread', thread 'pthread.309', cpu 3, ip=0x78a53 sp=0x405fed80 bp=0x898e0 no signal handler
[2024-08-13 03:30:11] Warning: unresolvable exception 6, pd 'init -> dynamic -> test-pthread -> test-pthread', thread 'pthread.308', cpu 1, ip=0x78a53 sp=0x405fed80 bp=0x898e0 no signal handler
[2024-08-13 03:30:11] Warning: unresolvable exception 6, pd 'init -> dynamic -> test-pthread -> test-pthread', thread 'pthread.306', cpu 4, ip=0x78a53 sp=0x405fed80 bp=0x898e0 no signal handler
[2024-08-13 03:30:15] Warning: unresolvable exception 6, pd 'init -> dynamic -> test-pthread -> test-pthread', thread 'pthread.303', cpu 7, ip=0x78a53 sp=0x405fed80 bp=0x898e0 no signal handler
[2024-08-13 03:31:40] [init -> depot_autopilot] 
[2024-08-13 03:31:40] [init -> depot_autopilot]  test-pthread                    failed    89.987  timeout 90 sec

The same occurred with the AVX patches from 2024-08-06 on 2024-08-07 at 03:51:53.

alex-ab commented 2 months ago

I added the commits to get AVX working with vbox6, tested with Debian, Ubuntu, and Win10 VMs on a modular Sculpt.

chelmuth commented 1 month ago

@alex-ab would you mind recording the remaining problems with avx-turbo in this issue? I agree that we don't have to fix them if they are specific to the use of the tool only and don't happen in real scenarios.

alex-ab commented 1 month ago

> @alex-ab would you mind recording the remaining problems with avx-turbo in this issue? I agree that we don't have to fix them if they are specific to the use of the tool only and don't happen in real scenarios.

I found the issue with the test. Its TSC frequency calculation divides by 0, which fails. I added a patch for use with vbox6: instead of reading out the frequency (which is not provided by vbox6), the tool measures it, and then the whole AVX test works.

avx_turbo_tsc_calc.txt

--- a/tsc-support.cpp
+++ b/tsc-support.cpp
@@ -41,7 +41,8 @@ uint64_t get_tsc_from_cpuid_inner() {

     if (family.family == 6) {
-        if (family.model == 0x4E || family.model == 0x5E || family.model == 0x8E || family.model == 0x9E) {
+        printf("%s:%u division by %u is not good !!!\n", __func__, __LINE__, cpuid15.eax);
+        if (cpuid15.eax && (family.model == 0x4E || family.model == 0x5E || family.model == 0x8E || family.model == 0x9E)) {
             // skylake client or kabylake
             return (int64_t)24000000 * cpuid15.ebx / cpuid15.eax; // 24 MHz crystal clock
         }
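
The guard added by the patch can also be expressed as a small standalone function. This is a sketch under assumptions, not part of avx-turbo: the name tsc_hz_from_cpuid15 and the convention of returning 0 for "not enumerated" are made up for illustration. CPUID leaf 0x15 reports the TSC/crystal ratio as EBX/EAX, so a zero in either register means the frequency cannot be derived and the caller has to fall back to measuring it.

```cpp
#include <cassert>
#include <cstdint>

/*
 * Derive the TSC frequency in Hz from the CPUID leaf 0x15 ratio.
 *
 * eax        - denominator of the TSC/crystal ratio
 * ebx        - numerator of the TSC/crystal ratio
 * crystal_hz - crystal clock, e.g. 24 MHz as assumed by the patched code
 *
 * Returns 0 if the ratio is not enumerated, so the caller can fall back
 * to a calibration loop instead of dividing by zero.
 */
constexpr uint64_t tsc_hz_from_cpuid15(uint32_t eax, uint32_t ebx,
                                       uint64_t crystal_hz)
{
	if (eax == 0 || ebx == 0 || crystal_hz == 0)
		return 0;

	return crystal_hz * ebx / eax;
}

/* A VMM that leaves leaf 0x15 empty must not trigger a division by zero */
static_assert(tsc_hz_from_cpuid15(0, 0, 24000000) == 0, "fallback case");

/* Made-up ratio 250/2 with a 24 MHz crystal yields 3 GHz */
static_assert(tsc_hz_from_cpuid15(2, 250, 24000000) == 3000000000ull, "ratio case");
```

Returning 0 rather than throwing keeps the decision with the caller, which matches the behaviour seen in the tool output earlier in this issue ("falling back to manual TSC calibration").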

alex-ab commented 1 month ago

@chelmuth: please add the fixup and the aes commit to staging from my staging branch

chelmuth commented 1 month ago

Thanks, merged to staging.

cnuke commented 1 month ago

FWIW, 4903595 enables RDRAND and RDSEED. I've been using the commit for some time now w/o any noticeable problems.

chelmuth commented 1 month ago

Can you report any positive performance (or other) impact?

cnuke commented 1 month ago

@chelmuth well, I did not perform any testing so I cannot comment either way (especially as I have not enabled them in isolation, i.e. AVX/AES was already enabled and could skew the results).

genodelabs / genode

x86: add support to use AVX* CPU features #5314