Implement parallel CPU cores bring-up for DRTM launch

BeataZdunczyk commented 1 year ago

Is your feature request related to a problem? Please describe.

Currently, the CPU cores are being woken up in parallel for DRTM launch in Qubes OS AEM, but later they are hacked to be waiting in a queue. If any interrupt would come at that time, it could be a serious danger. This needs to be fixed as soon as possible as required by Intel TXT specification.

Is your feature request related to a new idea or technology that would benefit the project? Please describe.

This task is required to ensure the proper functioning of parallel CPU cores bring-up for DRTM launch in Qubes OS AEM as per Intel TXT specification.

Describe the solution you'd like

Implement parallel CPU cores bring-up for DRTM launch in Qubes OS AEM in compliance with the Intel TXT specification.

Describe alternatives you've considered

N/A

Additional context

This feature request is part of Phase 2 in TrenchBoot as Anti Evil Maid project, as outlined in the documentation: https://docs.dasharo.com/projects/trenchboot-aem-v2/.

Relevant documentation you've consulted

N/A

krystian-hebel commented 1 year ago

Initial work happens on https://github.com/TrenchBoot/xen/tree/smp_cleanup.

Currently assembly part already works in parallel, but it may still require some minor changes around hand-off to C. e.g. Xen CPU number may be passed as argument to start_secondary instead of depending on static variable.

On C side, some of the code (cpu_up in particular) is common for x86 and Arm, which makes it impossible to change the flow of one of these architectures without touching the other. As Arm isn't the target of AEM project, I think it will be best to stick to changes to code specific to x86.

cpu_up in the current form is also used for hot-plugging CPUs on server platforms, a feature that must be preserved. However, the initial INIT-SIPI-SIPI sequence can be broadcast by BSP to all-but-self. For platforms with many cores/threads this can significantly reduce the number of inter-processor calls and delays. Remaining initialization steps (from CPU_STATE_CALLOUT onward) would still be executed one by one to keep the logic implemented by cpu_up - while it could be possible to make those parallel as well, this would add a lot of duplicated code which is hard to maintain.

With parallel execution some timeouts will have to be rethought, most notably APs waiting for CPU_STATE_CALLOUT.

krystian-hebel commented 1 year ago

Opened PR. I also planned to add comparisons of boot times with and without those changes, but we're experiencing some infrastructure issues and I couldn't connect to most interesting platforms.

As said before, there is a bit of code common to x86 and Arm. Between breaking that code into smaller parts or calling notifiers for CPU_UP_PREPARE twice (which in some cases would allocate memory twice, or overwrite structures that could potentially be in use) I decided to let most of the initialization not be parallel. APs start polling cpu_state basically as soon as they enter C world.

This is list of functions/macros called from start_secondary() with their dependencies on notifiers that I managed to find, up to smp_callin():

set_current(idle_vcpu[cpu]);            // common/sched/core.c:cpu_schedule_up(), which in turn needs
                                        // arch/x86/percpu.c:init_percpu_area() and
                                        // common/timer.c:cpu_callback() - should be no-op after BSP calls it
this_cpu(curr_vcpu) = idle_vcpu[cpu];   // arch/x86/percpu.c:init_percpu_area(), idle_vcpu populated by above
rdmsrl(MSR_EFER, this_cpu(efer));       // percpu
load_system_tables();                   // percpu + arch/x86/smpboot.c:cpu_smpboot_alloc()
percpu_traps_init();                    // percpu

percpu is already sufficiently protected, it won't do anything when called multiple times. smpboot is slightly more complicated, it would reallocate memory for per_cpu(compat_gdt, cpu) and per_cpu(root_pgt, cpu). cpu_schedule_up also allocates resources without checking if they already exist. Given enough time, all of those probably could be resolved, which may allow for better parallelization. However, most of those callbacks are declared static, in order to use them outside of their respective compilation units they would have to be made globally visible, which maybe isn't great idea. They can't be executed at the time callbacks are registered, that would break the order in which they must get executed.

krystian-hebel commented 11 months ago

This also seems to be working according to logs linked in https://github.com/TrenchBoot/trenchboot-issues/issues/11#issuecomment-1734465464. Brought up 4 CPUs printed by Xen and smpboot: Allowing 4 CPUs, 0 hotplug CPUs printed by kernel prove that all CPU cores are up and running.

krystian-hebel commented 10 months ago

Testing on OptiPlex with older CPU (IvyBridge, 3rd generation Intel Core) also doesn't show any regression.

lscpu

pietrushnic commented 10 months ago

@krystian-hebel IIUC, do you test Dasharo (coreboot+SeaBIOS) ? It would be great to know what the test target is here. Most likely as a part of the marketing effort, I would like to try that on my own hardware.

TrenchBoot / trenchboot-issues

Implement parallel CPU cores bring-up for DRTM launch #12