patch to support EPYC 9654P cpu

Meandres commented 5 months ago

We are running OSv on a server equipped with an EPYC 9654P CPU. In the process, we encountered several minor issues that we tried to fix with this patch. To the best of our knowledge, these changes do not introduce breaking changes on other systems. I don't exactly know if they are worthy of being merged, but at least it is documented (should I open an issue instead ?)

The first of them is the setting of mmu::max_phys_bits to 51 (here: https://github.com/cloudius-systems/osv/blob/master/arch/x64/arch-mmu.hh#L14). The 9654P supports physical addresses with 52 bits, and I guess cpuid reports it. Removing the assert and simply setting mmu::phys_bits to mmu::max_phys_bits if it exceeds works both on the 9654P as well as older EPYC and Xeons.

The second is a bit more hidden. When running a very memory-heavy benchmark, we noticed that the CPU was stalling in the front end (instead of the expected stalls in the back end). We narrowed down the issue to a strange bug where the micro-op cache seems to be disabled if not setting the MP bit of the CR0 (https://wiki.osdev.org/CPU_Registers_x86#CR0) during boot. So far, we have not been able to understand why this happens, but it does. We used the following program as a minimal example:

#include <cstdint>

int main(){
    for(uint64_t i=0; i<1e12; i++){
        asm volatile("addq $42, %%rax": : : "memory");
    }
    return 0;
}

We gathered the PMC using the following command: sudo perf kvm stat -M op_cache_fetch_miss_ratio -p $(pidof qemu-system-x86_64) -- sleep 1

Running OSv with a single vcpu pinned to a pcpu (SMT disabled), the output of the above command without the MP bit set gives:

Performance counter stats for process id '528565':

     3,381,772,998      op_cache_hit_miss.op_cache_miss  #    100.0 %  op_cache_fetch_miss_ratio
     3,381,772,998      op_cache_hit_miss.all_op_cache_accesses

       1.001941263 seconds time elapsed

and with the bit set:

Performance counter stats for process id '527893':

         3,646,646      op_cache_hit_miss.op_cache_miss  #      0.1 %  op_cache_fetch_miss_ratio
     3,339,928,235      op_cache_hit_miss.all_op_cache_accesses

       1.001967469 seconds time elapsed

wkozaczuk commented 4 months ago

If I understand correctly, this PR addresses two separate issues. I prefer 2 separate PRs. The mmu::phys_bits/ mmu::max_phys_bits seems like a no-brainer to apply. I'm not 100% sure about the MP one, given you do not understand related behavior. @nyh what do you think?

nyh commented 4 months ago

@nyh what do you think?

Wow, I had to look up the MP bit in the documentation, I didn't even remember it was in use 30 years ago with the 80287 math coprocessor ;-)

I looked at the Linux kernel, and they have:

        /* Setup cr0 */
        movl    $CR0_STATE, %eax
        /* Make changes effective */
        movq    %rax, %cr0

#define CR0_STATE       (X86_CR0_PE | X86_CR0_MP | X86_CR0_ET | \
                         X86_CR0_NE | X86_CR0_WP | X86_CR0_AM | \
                         X86_CR0_PG)

In contrary, we (OSv) currently set only PE, WP, PG. I don't remember why we chose this particular choice (the bits were chosen by @avikivity in 74f3715ad and then changed to use symbolic bit names by @penberg in f77a8c320 - and since then weren't touched).

Besides MP, we also don't set ET, NE, and AM. ET and NE are also related to the math coprocessor (I don't any recollection of any details), and AM is about generating an exception on unaligned accesses - I'm not even sure it's relevant in modern processors but I don't think OSv had any problems with it in the past.

Anyway, I think adding MP is safe, and we can merge this patch as-is, but I have to admit I don't fully understand the implications, or why it's even relevant for the EPYC CPU.

avikivity commented 4 months ago

Wow, well done on tracing such an esoteric issue. How did you think of looking at cr0.mp?

Meandres commented 4 months ago

As far as we know the issue is still present with some (desktop) Ryzen CPUs. We also tried on a machine with an EPIC 7713P and no error so it could be some kind of bug somewhere. @avikivity this uop cache is not enabled at linux entry point. The rest is just stepping into the linux early process to find when it gets enabled.

cloudius-systems / osv

patch to support EPYC 9654P cpu #1319