0xAX / linux-insides

A little bit about a linux kernel
http://0xax.gitbooks.io/linux-insides/content/index.html

Why does the kernel need the page table `early_dynamic_pgts`? #544

Open hao-lee opened 6 years ago

hao-lee commented 6 years ago

Hi,

I have finished Kernel initialization Part 1, but I still have some questions. Could you please give me some hints? Many Thanks.

In arch/x86/kernel/head_64.S, several page tables are defined. After reading this part, I think early paging is handled by three tables:

(PGD)early_level4_pgt -> (PUD)level3_kernel_pgt -> (PMD)level2_kernel_pgt

The PMD table level2_kernel_pgt is filled with 256 entries, so it can map 512MB of physical space, i.e. [0, 512MB).

If a virtual address is 0xffffffff81000000, these pagetables can map it to physical address 0x1000000. This is very straightforward. (I hope my understanding is correct)

However, I noticed that two more tables, starting at the label early_dynamic_pgts, are also filled. I think they serve as a PUD and a PMD too and are used to map the kernel from _text to _end. I don't know why these two tables are needed; after all, we already have three tables which can map 512MB of physical space.
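
For illustration, here is the arithmetic behind that mapping as a small user-space sketch (not kernel code; the address and the 512MB/2M constants are just the defaults discussed above):

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        /* Example kernel-text virtual address from above. */
        uint64_t va  = 0xffffffff81000000ULL;

        uint64_t pgd = (va >> 39) & 0x1ff;  /* index into early_level4_pgt  -> 511 */
        uint64_t pud = (va >> 30) & 0x1ff;  /* index into level3_kernel_pgt -> 510 */
        uint64_t pmd = (va >> 21) & 0x1ff;  /* index into level2_kernel_pgt -> 8   */
        uint64_t off = va & 0x1fffff;       /* offset inside the 2M page    -> 0   */

        /* level2_kernel_pgt maps physical [0, 512MB) with 2M pages, so PMD
         * entry n covers physical [n*2M, (n+1)*2M). */
        printf("pgd=%llu pud=%llu pmd=%llu -> phys %#llx\n",
               (unsigned long long)pgd, (unsigned long long)pud,
               (unsigned long long)pmd,
               (unsigned long long)(pmd * 0x200000ULL + off));
        return 0;
    }

Running it prints pgd=511 pud=510 pmd=8 -> phys 0x1000000, which matches the mapping above.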

danix800 commented 6 years ago

On x86_64:

At the early (but not the very first) stages, early_dynamic_pgts is used as a PUD (the first 512 entries, i.e. the first page) and a PMD (the following entries) for mapping __PAGE_OFFSET:

va = ffff880000000000, mode = ia32e, 2M page

entry     shift   size       offset   decimal
pgof      0       0x200000   0x0      0
L2(pmd)   21      0x200      0x0      0
L3(pud)   30      0x200      0x0      0
L4(pgd)   39      0x200      0x110    272

So __PAGE_OFFSET is mapped with early_top_pgt[272], which points to early_dynamic_pgts. If you debug with gdb you can verify this by:

(gdb) x/zg &early_top_pgt[272]

This entry should point to early_dynamic_pgts, and if you follow the paging mechanism you'll get to the PMD level, which maps the 2M pages.

The kernel code is mapped through:

va = ffffffff80000000, mode = ia32e, 2M page

entry     shift   size       offset   decimal
pgof      0       0x200000   0x0      0
L2(pmd)   21      0x200      0x0      0
L3(pud)   30      0x200      0x1FE    510
L4(pgd)   39      0x200      0x1FF    511

Verify:

(gdb) x/zg &early_top_pgt[511]

This should point to level3_kernel_pgt, and

(gdb) x/zg &level3_kernel_pgt[510]

should point to level2_kernel_pgt, which holds the 2M PMD entries.
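
If you want to re-derive the "decimal" column yourself, a tiny user-space helper is enough (nothing kernel-specific, just the shifts from the tables above):

    #include <stdio.h>
    #include <stdint.h>

    /* Print the 4-level paging indices of a virtual address, assuming
     * ia32e mode with 2M pages, as in the tables above. */
    static void split(uint64_t va)
    {
        printf("va=%#018llx pgd=%llu pud=%llu pmd=%llu pgoff=%#llx\n",
               (unsigned long long)va,
               (unsigned long long)((va >> 39) & 0x1ff),
               (unsigned long long)((va >> 30) & 0x1ff),
               (unsigned long long)((va >> 21) & 0x1ff),
               (unsigned long long)(va & 0x1fffff));
    }

    int main(void)
    {
        split(0xffff880000000000ULL);  /* __PAGE_OFFSET      -> pgd 272, pud 0,   pmd 0 */
        split(0xffffffff80000000ULL);  /* kernel text region -> pgd 511, pud 510, pmd 0 */
        return 0;
    }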

danix800 commented 6 years ago

Refer to https://github.com/torvalds/linux/blob/master/Documentation/x86/x86_64/mm.txt

hao-lee commented 6 years ago

@danix800

Thanks for your reply, but you may have misunderstood my question.

early_dynamic_pgts is used to map __PAGE_OFFSET only after the early page fault handler is set up. What I want to know about is the identity mapping.

early_level4_pgt has been renamed to early_top_pgt in recent kernels, but I will still use the former name here.

In the Identity mapping setup, the kernel uses the first two entries of early_level4_pgt and uses two tables taken from early_dynamic_pgts as the PUD and the PMD. As a result, these three tables map the kernel from _text to _end.

                                                              +------------+ _end
                                                              |            |
                                                              |            |
                                                              |            |
                                                              |  kernel    |
                                                              |  text      |
                      ---+--------------+                     |            |
                         |              |                     |            |
                         |              |                     |            |
                         +--------------+                     |            |
                     PUD |  entry 8     +-------------------> +------------+ _text
                         +--------------+
                         |              |
                         |              |
                      ------------------+
                         |              |
                         |              |
                         |              |
                     PMD |              |
                         |              |
                         +--------------+
                         |   entry 0    |
early_dynamic_pgts+---------------------+
                         |              |
                         |              |
                         |              |
                     PGD |              |
                         +--------------+
                         |   entry 0    |
  early_level4_pgt+------+--------------+

I don't know why this mapping is needed. After deleting this code and recompiling my kernel, everything is OK: I can still boot my system normally.
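
For reference, here is the index arithmetic behind the diagram, as a rough user-space sketch rather than the real assembly; it assumes the kernel was loaded at its default physical address 0x1000000 with no relocation, and the image size is made up for illustration:

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        /* Assumed-for-illustration load addresses (no relocation/KASLR). */
        uint64_t text_phys = 0x1000000ULL;  /* physical address of _text */
        uint64_t end_phys  = 0x2000000ULL;  /* somewhere past _end       */

        /* Identity mapping: virtual == physical, so the indices come
         * straight from the physical address of the kernel image. */
        printf("PGD index:       %llu\n", (unsigned long long)((text_phys >> 39) & 0x1ff)); /* 0 */
        printf("PUD index:       %llu\n", (unsigned long long)((text_phys >> 30) & 0x1ff)); /* 0 */
        printf("first PMD index: %llu\n", (unsigned long long)((text_phys >> 21) & 0x1ff)); /* 8 */
        printf("last PMD index:  %llu\n", (unsigned long long)(((end_phys - 1) >> 21) & 0x1ff));
        return 0;
    }

So early_level4_pgt gets an entry at index 0 (the code also fills the next entry, in case the image crosses a boundary), the PUD page taken from early_dynamic_pgts gets an entry at index 0, and the PMD page gets 2M entries starting at index 8, which is where the "entry 8" in the diagram comes from.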

danix800 commented 6 years ago

This identity mapping is for page table switching. Without such a mapping, the instruction pointer would be invalid right after cr3 is set, because the CPU is still fetching instructions from the old (physical) addresses at that point. Can you verify your test and report back to us?

hao-lee commented 6 years ago

Hi, @danix800 Thank you very much! I didn't realize that executing the following two instructions requires a temporary identity mapping.

    /* Ensure I am executing from virtual addresses */
    movq    $1f, %rax
    jmp *%rax

Thanks for your help! I now understand why this mapping is necessary.


I have debugged my kernel step by step in Bochs and found a strange behavior. As I said above, I deleted this code, recompiled my kernel, and ran it in Bochs. After cr3 is set to point to early_level4_pgt, Bochs warns that it cannot display the physical address of the above two instructions because the page tables (i.e. the PUD and PMD) don't exist.

[333497746] ??? (physical address not available)

I ignored these warnings and let the kernel continue running. I found that the kernel can reach movl $0x80000001, %eax successfully.

The following code is copied from here.


/* Setup early boot stage 4 level pagetables. */
addq    phys_base(%rip), %rax
movq    %rax, %cr3  /* pagetable switching */
/* Ensure I am executing from virtual addresses */
movq    $1f, %rax   /* Bochs prompts: physical address not available */
jmp *%rax       /* Bochs prompts: physical address not available */

1:

/* Check if nx is implemented */
movl    $0x80000001, %eax   /* <- reach: Bochs can reach here successfully!!! Everything is OK! */
cpuid
movl    %edx,%edi



I guess that Bochs detects the error and continues fetching instructions from physical memory, even though it doesn't know what will happen. I have tested my kernel with VMware and QEMU: the former can also boot successfully, but QEMU can't. I think this behavior may be related to the CPU.

danix800 commented 6 years ago

I'm investigating this too. For QEMU, when KVM is enabled (--enable-kvm), the kernel can also boot. So I think there is some page fault handling done under the hood by KVM.

arch/x86/kvm/mmu.c contains page fault handling; that might be where the real magic happens. I'm not sure.

hao-lee commented 6 years ago

My Bochs and VMware setups don't involve any KVM mechanism, so things get a little more interesting.

hao-lee commented 6 years ago

Hi, @danix800 I happened to see your question on Stack Overflow. I also sent an email to the linux-mm mailing list, but nobody replied to me.

danix800 commented 6 years ago

Yes, nobody seems to be interested. I think it's all on us now. Currently I'm studying GRUB; I'll dig into this when I have time.

danix800 commented 6 years ago

I actually dug into it a little a few days ago. I've already set up a debugging environment and can break into the KVM code on the exact faulting instruction.

But without a deep understanding of KVM it's difficult to unearth everything that's going on, so I gave up for now.

Nobody replies on the qemu-devel list. On the linux-kernel list, here, there's also no useful info available.

Happy debugging!

hao-lee commented 6 years ago

I will also keep watching this question and hope that we can solve it in the future. :smiley:

fangzhen commented 1 year ago

Years after the last comments, I'm also running into this :-)

I think the behavior is related to the TLB, as the linux-kernel list indicates.

I ran some tests on the kernel v6.2 source with QEMU. The relevant code includes:

  1. set up the identity mapping
  2. flush the TLB

                  no -enable-kvm    -enable-kvm
   delete 1       boot fail         boot fail
   delete 1 & 2   boot fail         boot success

In the -enable-kvm case, if we don't set up the identity mapping and don't flush the TLB, the kernel boots successfully. If we do flush the TLB, the kernel fails to boot. This makes sense: if the TLB still caches translations from the old identity mapping, no page fault occurs; once the TLB is flushed, a page fault occurs.
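
To be concrete about point 2: by "flush TLB" I mean something along the lines of the classic CR3 reload, which drops all non-global TLB entries (this is only an illustration, not the exact v6.2 instruction sequence, and it obviously has to run in ring 0):

    /* Illustration only: reloading CR3 flushes non-global TLB entries. */
    static inline void flush_tlb_by_cr3_reload(void)
    {
        unsigned long cr3;

        asm volatile("mov %%cr3, %0" : "=r"(cr3));
        asm volatile("mov %0, %%cr3" : : "r"(cr3) : "memory");
    }

As long as such a flush is skipped, the stale identity translations can stay in the TLB, which matches the -enable-kvm column above.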

Without -enable-kvm, my guess is that QEMU does not emulate the TLB in the same way as a hardware TLB. As a result, the page fault always occurs.

However, this is more of a rough guess than solid proof.