Harvey-OS / harvey

A distributed operating system
https://harvey-os.org/
MIT License

Pagefault calling parsersdptr #1070

Closed gmacd closed 3 years ago

gmacd commented 3 years ago

Since the new VM has gone in, one of my machines no longer boots.

It seems to be pagefaulting in parsersdptr, when calling sdtmap or thereabouts. The callstack and registers can be seen here. CR2 points to 0xffff800036d8a09f.

[screenshot: regs]

Now the initial memory map obtained from multiboot is: [screenshot: mmap]

If I look at the memory map in Linux covering the ACPI tables I get:

   2f382000-35884fff : System RAM
>> 35885000-36d7efff : Reserved
>> 36d7f000-36db7fff : ACPI Tables
>> 36db8000-37713fff : ACPI Non-volatile Storage
>> 37714000-37ffefff : Reserved
   37fff000-37ffffff : System RAM

So it looks like the ACPI tables should be in the second range of the memory map obtained from multiboot.
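
As a quick cross-check, the faulting address can be matched against the Linux map above. The sketch below assumes the kernel maps physical memory linearly at 0xffff800000000000 (inferred from the CR2 value, not confirmed anywhere in this thread), so CR2 minus that base is treated as the physical address. `ASSUMED_KBASE`, `Region`, `region_for` and `region_for_cr2` are names made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical direct-map base, inferred from the CR2 value; an assumption. */
#define ASSUMED_KBASE 0xffff800000000000ULL

typedef struct {
    uint64_t start, end;   /* inclusive bounds, as printed by Linux */
    const char *name;
} Region;

/* Ranges copied from the Linux memory map quoted above. */
static const Region linux_map[] = {
    { 0x2f382000, 0x35884fff, "System RAM" },
    { 0x35885000, 0x36d7efff, "Reserved" },
    { 0x36d7f000, 0x36db7fff, "ACPI Tables" },
    { 0x36db8000, 0x37713fff, "ACPI Non-volatile Storage" },
    { 0x37714000, 0x37ffefff, "Reserved" },
    { 0x37fff000, 0x37ffffff, "System RAM" },
};

/* Return the name of the region containing physical address pa, or NULL. */
static const char *region_for(uint64_t pa)
{
    for (size_t i = 0; i < sizeof linux_map / sizeof linux_map[0]; i++)
        if (pa >= linux_map[i].start && pa <= linux_map[i].end)
            return linux_map[i].name;
    return NULL;
}

/* Same lookup, starting from a faulting virtual address (CR2). */
static const char *region_for_cr2(uint64_t cr2)
{
    return region_for(cr2 - ASSUMED_KBASE);
}
```

Under that assumption, `region_for_cr2(0xffff800036d8a09fULL)` resolves to physical 0x36d8a09f, which falls in the "ACPI Tables" range.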

The pamap is: [screenshot: pamap]

So it seems that at some point after the pamap is populated from multiboot, the range is broken up and the range covering the ACPI tables is lost.

I've dumped out the pamap calls:

[screenshot: AC54F032-EC69-4F16-850F-D17B0400F1E0]

[screenshot: 432093CE-B80F-4571-8067-4333651761C1]

Using a bit of OCR (hopefully accurately). (mbX) means the call came from multiboot, idx X. (mergeX) means the call came from the pamapmerge function. The arrows and indentation indicate the level the call came from, e.g. insert calls clearrange, which calls new.

[0]  Insert(0x0000000000000000, 641024, MEMORY)  (mb0)
[1]  --> New(0x0000000000000000, 641024, MEMORY)

[2]  Insert(0x0000000000100000, 2125910016, MEMORY)  (mb1)
[3]  --> ClearRange(0x0000000000100000, 2125910016, MEMORY)
[4]  --> New(0x0000000000100000, 2125910016, MEMORY)

[5]  Insert(0x000000007ecb8000, 266240, MEMORY)  (mb2)
[6]  --> ClearRange(0x000000007ecb8000, 266240, MEMORY)
[7]  --> New(0x000000007ecb8000, 266240, MEMORY)

[8]  Insert(0x000000007f382000, 105918464, MEMORY)  (mb3)
[9]  --> ClearRange(0x000000007f382000, 105918464, MEMORY)
[10] --> New(0x000000007f382000, 105918464, MEMORY)

[11] Insert(0x0000000087FFF000, 4096, MEMORY)  (mb4)
[12] --> ClearRange(0x0000000087FFF000, 4096, MEMORY)
[13] --> New(0x0000000087FFF000, 4096, MEMORY)

[14] Insert(0x0000000100000000, 14873001984, MEMORY)  (mb5)
[15] --> ClearRange(0x0000000100000000, 14873001984, MEMORY)
[16] --> New(0x0000000100000000, 14873001984, MEMORY)

[17] Merge()
[18] Insert(0x0000000000080000, 131072, KRDWR)  (merge0)
[19] --> ClearRange(0x0000000000080000, 131072, KRDWR)
[20]     --> New(0x0000000000080000, 116736, MEMORY)
[21] --> New(0x0000000000080000, 131072, KRDWR)

[22] Insert(0x00000000000a0000, 131072, DEV)  (merge1)
[23] --> ClearRange(0x00000000000a0000, 131072, DEV)
[24] --> New(0x00000000000a0000, 131072, DEV)

[25] Insert(0x00000000000c0000, 196608, KRDONLY)  (merge2)
[26] --> ClearRange(0x00000000000c0000, 196608, KRDONLY)
[27] --> New(0x00000000000c0000, 196608, KRDONLY)

[28] Insert(0x00000000000f0000, 65536, KRDONLY)  (merge3)
[29] --> ClearRange(0x00000000000f0000, 65536, KRDONLY)

[30] Insert(0x0000000000101000, 1044480, KRDWR)  (merge4)
[31] --> ClearRange(0x0000000000101000, 1044480, KRDWR)
[32]     --> New(0x0000000000101000, 2125905920, MEMORY)
[33] --> New(0x0000000000101000, 1044480, KRDWR)

[34] Insert(0x0000000000200000, 2097152, KTEXT)  (merge5)
[35] --> ClearRange(0x0000000000200000, 2097152, KTEXT)
[36] --> New(0x0000000000200000, 2097152, KTEXT)

[37] Insert(0x0000000000400000, 2097152, KRDONLY)  (merge6)
[38] --> ClearRange(0x0000000000400000, 2097152, KRDONLY)
[39] --> New(0x0000000000400000, 2097152, KRDONLY)

[40] Insert(0x0000000000600000, 7024664, KRDWR)  (merge7)
[41] --> ClearRange(0x0000000000600000, 7024664, KRDWR)
[42] --> New(0x0000000000600000, 7024664, KRDWR)

[43] Insert(0x0000000000cb3018, 292868072, KRDWR)  (merge7)
[44] --> ClearRange(0x0000000000cb3018, 292868072, KRDWR)

rminnich commented 3 years ago

This is not enough information, unfortunately. What you need to do is get to a full boot and then cp /dev/kmesg /usr/harvey; you'll then have the kmesg file on your server (in other words, whatever system is running centre) and can post it here.

But it's crashing! How can you boot? Disable ACPI :-)

diff --git a/sys/src/9/amd64/devacpi.c b/sys/src/9/amd64/devacpi.c
index 0bffbca36..702fdad99 100644
--- a/sys/src/9/amd64/devacpi.c
+++ b/sys/src/9/amd64/devacpi.c
@@ -1631,6 +1631,7 @@ parsersdptr(void)
         * Search for the data structure signature:
         * 1) in the BIOS ROM between 0xE0000 and 0xFFFFF.
         */
+       return;
        rsd = rsdsearch(KADDR(0xE0000), 0x20000);
        if(rsd == nil){
                print("NO RSDP\n");

Let us know what you see then. I need to add a debug option so we can disable ACPI on boot.
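
For context on what the skipped search does: per the ACPI specification, the RSDP begins with the 8-byte signature "RSD PTR " on a 16-byte boundary, and its first 20 bytes sum to zero modulo 256. The sketch below is a standalone illustration of that check, not Harvey's rsdsearch; `bytesum` and `find_rsdp` are hypothetical names, and it only validates the ACPI 1.0 portion of the structure.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Sum of n bytes modulo 256; the structure is valid when this is 0. */
static uint8_t bytesum(const uint8_t *p, size_t n)
{
    uint8_t s = 0;
    while (n--)
        s += *p++;
    return s;
}

/* Scan [base, base+len) on 16-byte boundaries for a valid ACPI 1.0 RSDP.
 * Returns a pointer to the structure, or NULL if none is found. */
static const uint8_t *find_rsdp(const uint8_t *base, size_t len)
{
    for (size_t off = 0; off + 20 <= len; off += 16) {
        const uint8_t *p = base + off;
        if (memcmp(p, "RSD PTR ", 8) == 0 && bytesum(p, 20) == 0)
            return p;
    }
    return NULL;
}
```

In the kernel the scan would run over the mapped BIOS ROM window (0xE0000, length 0x20000, as in the diff above); here it takes any byte buffer so it can be tried in userspace.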

gmacd commented 3 years ago

Hah, how did I not think of that :)

gmacd commented 3 years ago

Having run through by hand, we should end up with

0x0000000000000000-0x0000000000080000 (1:MEMORY)
0x0000000000080000-0x00000000000a0000 (10:KRDWR)
0x00000000000a0000-0x00000000000c0000 (6:DEV)
0x00000000000c0000-0x0000000000100000 (6:KRDONLY)
0x0000000000100000-0x0000000000101000 (1:MEMORY)
0x0000000000101000-0x0000000000200000 (10:KRDWR)
0x0000000000200000-0x0000000000400000 (8:KTEXT)
0x0000000000400000-0x0000000000600000 (6:KRDONLY)
0x0000000000600000-0x0000000012400000 (10:KRDWR)
0x0000000012400000-0x000000007ec6d000 (1:MEMORY)  (<--- 0x50000000 more than we get with current code)
0x000000007ecb8000-0x000000007ecf9000 (1:MEMORY)
0x000000007f382000-0x0000000085885000 (1:MEMORY)
0x0000000087fff000-0x0000000088000000 (1:MEMORY)
0x0000000100000000-0x0000000476800000 (1:MEMORY)

But we don't. We seem to have lost 0x50000000 probably when merging nodes.
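
One way to sanity-check the hand-computed list is a tiny reference model that is independent of the kernel code. The sketch below is not Harvey's pamap implementation: `Range`, `model_insert`, `model_dump`, `replay_issue_inserts` and the type tags are made up for illustration, and it assumes that an insert overwrites whatever it overlaps and that touching ranges of the same type are coalesced. The sizes are the hex equivalents of the decimal sizes in the trace.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Illustrative type tags; these are not Harvey's Pam* values. */
enum { MEMORY, KRDWR, DEV, KRDONLY, KTEXT };
static const char *tname[] = { "MEMORY", "KRDWR", "DEV", "KRDONLY", "KTEXT" };

typedef struct { uint64_t start, end; int type; } Range;   /* [start, end) */

enum { MAXR = 64 };
static Range map[MAXR];
static int nmap;

static int by_start(const void *a, const void *b)
{
    const Range *x = a, *y = b;
    return (x->start > y->start) - (x->start < y->start);
}

/* Insert [addr, addr+size): trim every range it overlaps, add the new
 * range, then coalesce touching neighbours of the same type. */
static void model_insert(uint64_t addr, uint64_t size, int type)
{
    Range tmp[MAXR + 2];
    uint64_t end = addr + size;
    int n = 0;

    if (nmap + 2 > MAXR)
        abort();                       /* model table is full */
    for (int i = 0; i < nmap; i++) {
        Range r = map[i];
        if (r.end <= addr || r.start >= end) { tmp[n++] = r; continue; }
        if (r.start < addr) tmp[n++] = (Range){ r.start, addr, r.type };
        if (r.end > end)    tmp[n++] = (Range){ end, r.end, r.type };
    }
    tmp[n++] = (Range){ addr, end, type };
    qsort(tmp, (size_t)n, sizeof tmp[0], by_start);

    nmap = 0;
    for (int i = 0; i < n; i++) {
        if (nmap > 0 && map[nmap - 1].end == tmp[i].start &&
            map[nmap - 1].type == tmp[i].type)
            map[nmap - 1].end = tmp[i].end;
        else
            map[nmap++] = tmp[i];
    }
}

static void model_dump(void)
{
    for (int i = 0; i < nmap; i++)
        printf("0x%016llx-0x%016llx (%s)\n",
               (unsigned long long)map[i].start,
               (unsigned long long)map[i].end, tname[map[i].type]);
}

/* Replays the multiboot inserts, then the merge inserts, from the trace. */
static void replay_issue_inserts(void)
{
    nmap = 0;
    model_insert(0x0000000000000000, 0x9C800, MEMORY);
    model_insert(0x0000000000100000, 0x7EB6D000, MEMORY);
    model_insert(0x000000007ecb8000, 0x41000, MEMORY);
    model_insert(0x000000007f382000, 0x6503000, MEMORY);
    model_insert(0x0000000087fff000, 0x1000, MEMORY);
    model_insert(0x0000000100000000, 0x376800000, MEMORY);
    model_insert(0x0000000000080000, 0x20000, KRDWR);
    model_insert(0x00000000000a0000, 0x20000, DEV);
    model_insert(0x00000000000c0000, 0x30000, KRDONLY);
    model_insert(0x00000000000f0000, 0x10000, KRDONLY);
    model_insert(0x0000000000101000, 0xFF000, KRDWR);
    model_insert(0x0000000000200000, 0x200000, KTEXT);
    model_insert(0x0000000000400000, 0x200000, KRDONLY);
    model_insert(0x0000000000600000, 0x6B3018, KRDWR);
    model_insert(0x0000000000cb3018, 0x1174CFE8, KRDWR);
}
```

Calling `replay_issue_inserts()` followed by `model_dump()` lists the same 14 ranges as above (without the numeric prefixes), including MEMORY from 0x12400000 to 0x7ec6d000. Any difference between this and the kernel's dump points at the step where the real code diverges from these assumptions.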

This happens in the last pamapinsert, which:

gmacd commented 3 years ago

Here's the test that didn't replicate the bug:

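#include <stdio.h>

/* pamapinsert, pamapdump and the Pam* constants are assumed to come from
 * the kernel's pamap code, built together with this test. */
int main(int argc, char const *argv[])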
{
    printf("testing pamap\n");

    pamapinsert(0x0000000000000000, 0x9C800, PamMEMORY);
    pamapinsert(0x0000000000100000, 0x7EB6D000, PamMEMORY);
    pamapinsert(0x000000007ecb8000, 0x41000, PamMEMORY);
    pamapinsert(0x000000007f382000, 0x6503000, PamMEMORY);
    pamapinsert(0x0000000087FFF000, 0x1000, PamMEMORY);
    pamapinsert(0x0000000100000000, 0x376800000, PamMEMORY);

    printf("before merge\n");
    pamapdump();

    // merge
    pamapinsert(0x0000000000080000, 0x20000, PamKRDWR);
    pamapinsert(0x00000000000a0000, 0x20000, PamDEV);
    pamapinsert(0x00000000000c0000, 0x30000, PamKRDONLY);
    pamapinsert(0x00000000000f0000, 0x10000, PamKRDONLY);
    pamapinsert(0x0000000000101000, 0xFF000, PamKRDWR);
    pamapinsert(0x0000000000200000, 0x200000, PamKTEXT);
    pamapinsert(0x0000000000400000, 0x200000, PamKRDONLY);
    pamapinsert(0x0000000000600000, 0x6B3018, PamKRDWR);
    pamapinsert(0x0000000000cb3018, 0x1174CFE8, PamKRDWR);

    printf("after merge\n");
    pamapdump();

    printf("done testing pamap\n");
    return 0;
}

dancrossnyc commented 3 years ago

It's kind of weird that it is being treated as type "memory". I'd expected KRDONLY or something. I wonder if it's actually being mapped, but is being zeroed or something before the ACPI data is actually read.

We should, perhaps, read the ACPI data and squirrel it away before tearing down the original memory map.
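
A minimal sketch of what "squirrel it away" could look like, assuming the table is still mapped at the time of the copy. It relies only on the standard ACPI table header layout (a 32-bit little-endian length at byte offset 4, a 36-byte header, and all bytes of the table summing to zero modulo 256). `sdt_length` and `stash_table` are hypothetical names, and `malloc` stands in for whatever allocator the kernel would actually use this early in boot.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Length field of an ACPI system description table header: a 32-bit
 * little-endian value at byte offset 4, covering the whole table. */
static uint32_t sdt_length(const uint8_t *t)
{
    return (uint32_t)t[4] | (uint32_t)t[5] << 8 |
           (uint32_t)t[6] << 16 | (uint32_t)t[7] << 24;
}

/* Copy a mapped table into private memory. Returns NULL on a bad length,
 * a bad checksum, or allocation failure. The caller frees the copy. */
static uint8_t *stash_table(const uint8_t *t)
{
    uint32_t len = sdt_length(t);
    uint8_t sum = 0;

    if (len < 36)                      /* shorter than the standard header */
        return NULL;
    for (uint32_t i = 0; i < len; i++)
        sum += t[i];
    if (sum != 0)                      /* checksum mismatch */
        return NULL;

    uint8_t *copy = malloc(len);
    if (copy != NULL)
        memcpy(copy, t, len);
    return copy;
}
```

Validating the checksum before copying would also catch the case raised above, where the region is mapped but has already been zeroed or overwritten.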

gmacd commented 3 years ago

So it's an indirect problem. Closing.

(Here's the story)

I've been having problems netbooting the machine I'm trying to test with. It worked in the past (several months ago), by changing to legacy mode, but not now.

So I copied ipxe to a USB stick and was able to boot with that. It was a bit annoying though: it would only try to boot after the BIOS had tried to netboot for a minute or two and timed out. No amount of fiddling with BIOS settings got around this. I don't remember it being a problem last time.

(I'm using centre btw)

Looking in Wireshark, I saw a ProxyDHCP packet and a message on my client PC saying 'ProxyDHCP did not reply to request on port 4011'. Googling that, I see 'This problem can occur when the DHCP Class Identifier Option 60 is set on the DHCP server, but there is no proxyDHCP service running on port 4011 on the same machine.'

I remembered @fhs had made a change to centre recently, saying that a change I'd made had stopped it booting on his thinkpad. The change I'd made was to specify option 60 as 'PXEClient'.
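
To check whether a captured DHCP reply actually carries that option, the options area can be scanned directly. DHCP options are encoded as code, length, data, with code 0 as padding and code 255 as the end marker, and option 60 is the class identifier. The sketch below assumes its input is the option bytes that follow the magic cookie; `dhcp_find_option` and `advertises_pxeclient` are hypothetical helpers, not part of centre.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { OPT_PAD = 0, OPT_CLASS_ID = 60, OPT_END = 255 };

/* Find option `code` in a DHCP options area (the bytes after the magic
 * cookie). Returns a pointer to the option data and stores its length
 * in *len, or returns NULL if the option is absent or truncated. */
static const uint8_t *dhcp_find_option(const uint8_t *opts, size_t n,
                                       uint8_t code, uint8_t *len)
{
    size_t i = 0;

    while (i < n) {
        uint8_t c = opts[i];
        if (c == OPT_END)
            break;
        if (c == OPT_PAD) {            /* single byte, no length field */
            i++;
            continue;
        }
        if (i + 2 > n || i + 2 + opts[i + 1] > n)
            break;                     /* truncated option */
        if (c == code) {
            *len = opts[i + 1];
            return &opts[i + 2];
        }
        i += 2 + (size_t)opts[i + 1];
    }
    return NULL;
}

/* True when option 60 is present and begins with "PXEClient". */
static int advertises_pxeclient(const uint8_t *opts, size_t n)
{
    uint8_t len = 0;
    const uint8_t *v = dhcp_find_option(opts, n, OPT_CLASS_ID, &len);

    return v != NULL && len >= 9 && memcmp(v, "PXEClient", 9) == 0;
}
```

Feeding it the options from the server's offer shows whether the server is advertising itself as 'PXEClient', which is the condition the quoted message describes.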

Updating centre, it now netboots immediately, without needing ipxe on the USB stick. So that's that problem solved, but it also seems to have solved the pamap problem....

Harvey now boots up fully, with a much larger set of physical memory regions (including acpi regions) picked up by multiboot.

So somehow netbooting this machine via ipxe on a stick means multiboot doesn't pick up most of the regions.

So the root cause of the page fault when trying to set up ACPI was a misconfiguration in the DHCP settings in centre.