0vercl0k / kdmp-parser

A Windows kernel dump C++ parser library with Python 3 bindings.
MIT License

pip / setup.py #14

Closed 0vercl0k closed 9 months ago

0vercl0k commented 3 years ago

Explore the possibility to use a setup.py and to have kdmp available on pip

hugsy commented 1 year ago

Any update on this by any chance? I can adjust the work I did for udmp-parser to use nanobind for this project too, if this helps. LMK

0vercl0k commented 1 year ago

I don't have any short term plan to look at this - I looked at it a little bit right after your work on udmp-parser but I had forgotten that a custom Python extension already existed for kdmp-parser and I couldn't get the CMakefile / pip building working easily so I kind of dropped it.

It might not be hard - it was more that I don't know those technologies very well and working with them can be frustrating :-D.

But yeah if you want to give it a shot, knock yourself out!

Cheers

hugsy commented 1 year ago

Thanks for the reply. I should be able to prep you something quickly.

neitsa commented 10 months ago

Hey everyone :wave: Thanks a lot for your work Axel, and for your contributions @hugsy. I just wanted to take this opportunity to say that I would really be interested in having kdmp-parser available on pip, if possible. Thanks :)

hugsy commented 10 months ago

Hey @neitsa !

I think

python -m pip install git+https://github.com/hugsy/kdmp-parser.git

Should work out of the box for the time being. It's not precompiled tho, so you'll need all the building stuff

0vercl0k commented 10 months ago

Hey all,

Yep, I am planning to take a look and work on merging crazy hugsy's change in December - I am currently traveling and that is why I haven't gotten a chance to do anything about it :)

But I have not forgotten and I'll get on it as soon as I get back.

Cheers and thanks for dropping us a line Neitsa!

hugsy commented 10 months ago

Enjoy your travels buddy 😀

@neitsa if you happen to try the not-yet-merged version, let me know how it goes! I can use that time to fix bugs to make the review faster for @0vercl0k

🍻

neitsa commented 10 months ago

No worries Axel! Enjoy :)

@hugsy : compilation and usage went without any hassle 👍 . Fought a bit with pip to get it to compile from the src/python directory (otherwise it expects a setup.py or .toml file at the root of the repo) until I read the fine manual.

Works like a charm :) Thanks a lot guys!

(.venv) PS G:\git\python> python -m pip install -e "git+https://github.com/hugsy/kdmp-parser.git/#egg=kdmp-parser&subdirectory=src/python"
Obtaining kdmp-parser from git+https://github.com/hugsy/kdmp-parser.git/#egg=kdmp-parser&subdirectory=src/python
  Cloning https://github.com/hugsy/kdmp-parser.git/ to G:\git\python\.venv\src\kdmp-parser
  Running command git clone --filter=blob:none --quiet https://github.com/hugsy/kdmp-parser.git/ 'G:\git\python\.venv\src\kdmp-parser'
  Resolved https://github.com/hugsy/kdmp-parser.git/ to commit c48d801d6fcb296b028e620b1d0bdf37f4c1da34
  Installing build dependencies ... done
  Checking if build backend supports build_editable ... done
  Getting requirements to build editable ... done
  Installing backend dependencies ... done
  Preparing editable metadata (pyproject.toml) ... done
Building wheels for collected packages: kdmp-parser
  Building editable for kdmp-parser (pyproject.toml) ... done
  Created wheel for kdmp-parser: filename=kdmp_parser-0.5.0-cp311-cp311-win_amd64.whl size=77896 sha256=28e24a04301fa2382224fd2771fb8d254c37c3d149c4146898dfdc1b9f3bb550
  Stored in directory: C:\Users\neitsa\AppData\Local\Temp\pip-ephem-wheel-cache-jl97kngr\wheels\78\08\93\918a183b38d3dfb1b3beb8d66733a5f307ca03f2b1442c8a3e
Successfully built kdmp-parser
Installing collected packages: kdmp-parser
Successfully installed kdmp-parser-0.5.0

(.venv) PS G:\git\python> python
Python 3.11.6 (tags/v3.11.6:8b6ee5b, Oct  2 2023, 14:57:12) [MSC v.1935 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import kdmp_parser
>>> import pathlib
>>> p = pathlib.Path(r"d:\tmp\system_dump.dmp")
>>> fdmp = kdmp_parser.KernelDumpParser(p)
>>> fdmp.type
<DumpType.FullDump: 1>
>>> dir(fdmp)
['_KernelDumpParser__dump', '__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', 'context', 'directory_table_base', 'filepath', 'read_physical_page', 'read_virtual_page', 'translate_virtual', 'type']
>>> hex(fdmp.directory_table_base)
'0xd99cd000'

Only thing to report atm is, I guess, that the readme could possibly get a bit of a refresh since class names changed a bit. I'll continue to test in the upcoming days and report any problems.

hugsy commented 10 months ago

> Only thing to report atm is, I guess, that the readme could possibly get a bit of a refresh since class names changed a bit.

It's important: most people only read the README (if anything 😀), so it should be correct.

> I'll continue to test in the upcoming days and report any problems

Much appreciated

neitsa commented 10 months ago

Hi guys!

Not sure if this warrants a new issue or not. Let me know.

It's probably just a minor nitpick, but the code (I haven't tested whether it comes from Python or C++) fails when reading from a physical address that is in a large page and is not aligned on a page boundary. If that doesn't make sense, here's a simple example from a full kernel dump.

Note: testing with KdDebuggerDataBlock as it has an easily recognizable signature.

PS G:\reverse\utils\kdmp-parser> .\parser.exe -a D:\_tmp\system_dump.dmp
--------------------------------------------------------------------------------
Dump structures:
  HEADER64
    +0x0000: Signature                : 0x45474150.
    +0x0004: ValidDump                : 0x34365544.
    +0x0008: MajorVersion             : 0x0000000f.
    +0x000c: MinorVersion             : 0x000023f0.
    +0x0010: DirectoryTableBase       : 0x00000000d99cd002.
    +0x0018: PfnDatabase              : 0xfffff90000000000.
    +0x0020: PsLoadedModuleList       : 0xfffff8017fe2a190.
    +0x0028: PsActiveProcessHead      : 0xfffff8017fe1df80.
    +0x0030: MachineImageType         : 0x00008664.
    +0x0034: NumberProcessors         : 0x00000004.
    +0x0038: BugCheckCode             : 0x00000000.
    +0x0040: BugCheckCodeParameter
    +0x0080: KdDebuggerDataBlock      : 0xfffff8017fe00b20.

KdDebuggerDataBlock is at 0xfffff8017fe00b20.

Opening the dump in kd, translating (v to p), checking the content and the table entries:

0: kd> !vtop d99cd000 fffff8017fe00b20
Amd64VtoP: Virt fffff8017fe00b20, pagedir 00000000d99cd000
Amd64VtoP: PML4E 00000000d99cdf80
Amd64VtoP: PDPE 0000000004409028
Amd64VtoP: PDE 000000000440aff8
Amd64VtoP: Large page mapped phys 0000000003800b20
Virtual address fffff8017fe00b20 translates to physical address 3800b20.

0: kd> !db 3800b20
# 3800b20 70 05 e4 7f 01 f8 ff ff-70 05 e4 7f 01 f8 ff ff p.......p.......
# 3800b30 4b 44 42 47 80 03 00 00-00 00 20 7f 01 f8 ff ff KDBG...... .....
# 3800b40 70 f0 5f 7f 01 f8 ff ff-00 00 00 00 00 00 00 00 p._.............
# 3800b50 00 00 00 00 00 00 01 00-c0 a7 5f 7f 01 f8 ff ff .........._.....
# 3800b60 00 00 00 00 00 00 00 00-90 a1 e2 7f 01 f8 ff ff ................
# 3800b70 80 df e1 7f 01 f8 ff ff-d0 b5 ef 7f 01 f8 ff ff ................
# 3800b80 00 69 e1 7f 01 f8 ff ff-00 00 00 00 00 00 00 00 .i..............
# 3800b90 00 00 00 00 00 00 00 00-68 b4 ef 7f 01 f8 ff ff ........h.......

0: kd> !pte fffff801`7fe00b20
                                           VA fffff8017fe00b20
PXE at FFFFBADD6EB75F80    PPE at FFFFBADD6EBF0028    PDE at FFFFBADD7E005FF8    PTE at FFFFBAFC00BFF000
contains 0000000004409063  contains 000000000440A063  contains 8A000000038008E3  contains 0000000000000000
pfn 4409      ---DA--KWEV  pfn 440a      ---DA--KWEV  pfn 3800      --LDA--KWEV  LARGE PAGE pfn 3800  

0: kd> dt _mmpte_hardware FFFFBADD7E005FF8 Large*
nt!_MMPTE_HARDWARE
   +0x000 LargePage : 0y1

KdDebuggerDataBlock is somewhere in a data section of the kernel, mapped as a large 2MB (?) page as given by the PDE.
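To make the kd output above easier to follow, here is a minimal sketch of the same x64 4-level translation done by hand. It assumes the standard x64 paging layout (9-bit indices, 8-byte entries, bit 7 of a PDE marking a 2MB large page); the helper names `decode_va`, `entry_address` and `large_page_phys` are made up for illustration and are not part of kdmp-parser. The constants come from the `!vtop` and `!pte` output in this comment.

```python
# Hand-rolled x64 virtual-to-physical translation for a 2MB large page.
# All helper names here are hypothetical; nothing below is kdmp-parser API.

LARGE_PAGE_BIT = 1 << 7                 # "PS" bit in a PDE
PHYS_MASK_2M = 0x000FFFFFFFE00000       # bits 51:21, base of a 2MB page
OFFSET_MASK_2M = 0x1FFFFF               # low 21 bits, offset inside a 2MB page


def decode_va(va: int) -> dict:
    """Split a virtual address into its four 9-bit table indices."""
    return {
        "pml4": (va >> 39) & 0x1FF,
        "pdpt": (va >> 30) & 0x1FF,
        "pd": (va >> 21) & 0x1FF,
        "pt": (va >> 12) & 0x1FF,
    }


def entry_address(table_base: int, index: int) -> int:
    """Physical address of an 8-byte entry inside a paging table."""
    return table_base + index * 8


def large_page_phys(pde: int, va: int) -> int:
    """Physical address for a VA mapped by a large-page PDE."""
    if not pde & LARGE_PAGE_BIT:
        raise ValueError("PDE does not map a large page")
    return (pde & PHYS_MASK_2M) | (va & OFFSET_MASK_2M)


va = 0xFFFFF8017FE00B20
idx = decode_va(va)
print(hex(entry_address(0xD99CD000, idx["pml4"])))   # PML4E address
print(hex(entry_address(0x04409000, idx["pdpt"])))   # PDPE address
print(hex(entry_address(0x0440A000, idx["pd"])))     # PDE address
print(hex(large_page_phys(0x8A000000038008E3, va)))  # 0x3800b20
```

The three entry addresses match the `PML4E`, `PDPE` and `PDE` lines printed by `!vtop`, and the last value matches the translated physical address.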

Simple python code:

import pathlib

import kdmp_parser

def main():
    file_path = pathlib.Path("d:/_tmp/system_dump.dmp")
    fdmp = kdmp_parser.KernelDumpParser(file_path)

    virt_addr = 0xfffff8017fe00b20
    phy_addr = fdmp.translate_virtual(virt_addr)
    if not phy_addr:
        print("[*] failed to get phy addr.")
        return
    print(f"[*] VIRT 0x{virt_addr:016x} --> PHY: 0x{phy_addr:016x}")

    content_virt = fdmp.read_virtual_page(virt_addr)
    if not content_virt:
        print('Failed to read virtual address.')

    # this fails
    content_phy = fdmp.read_physical_page(phy_addr)
    msg = "FAILED" if content_phy is None else "SUCCEEDED"
    print(f"[*] (unaligned phy) Reading from phy addr {msg}.")

    # this works
    content_phy = fdmp.read_physical_page(phy_addr & 0xfffffffffffff000)
    msg = "FAILED" if content_phy is None else "SUCCEEDED"
    print(f"[*] (aligned phy) Reading from phy addr {msg}.")

    print("[*] Exiting")

if __name__ == "__main__":
    main()

Output:

[*] VIRT 0xfffff8017fe00b20 --> PHY: 0x0000000003800b20
[*] (unaligned phy) Reading from phy addr FAILED.
[*] (aligned phy) Reading from phy addr SUCCEEDED.

Now, if I test the same code with an address somewhere in a 4KB page, it works (e.g. PfnDatabase + 0x20 converted to a physical address) even if the physical address is not aligned on a page boundary.

Not a big deal though, once you know it.
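For anyone hitting this in the meantime, a caller-side workaround can be sketched as below. It assumes 4KB granularity for the pages stored in the dump (which is what the aligned read above suggests) and that a failed read returns `None`, as in the script above. `read_physical_bytes`, `page_align` and `page_offset` are hypothetical helpers, not kdmp-parser functions; `read_page` stands for any callable such as `fdmp.read_physical_page`.

```python
# Caller-side workaround: align the physical address before reading the page,
# then slice at the in-page offset. Helper names are hypothetical.

PAGE_SIZE = 0x1000


def page_align(addr: int) -> int:
    """Round an address down to its 4KB page boundary."""
    return addr & ~(PAGE_SIZE - 1)


def page_offset(addr: int) -> int:
    """Offset of an address inside its 4KB page."""
    return addr & (PAGE_SIZE - 1)


def read_physical_bytes(read_page, phys_addr: int, size: int):
    """Read up to `size` bytes at `phys_addr` using a page-granular reader.

    `read_page` takes a page-aligned physical address and returns the page
    content, or None on failure. The result is truncated at the end of the
    page; reads crossing a page boundary are not handled in this sketch.
    """
    page = read_page(page_align(phys_addr))
    if page is None:
        return None
    start = page_offset(phys_addr)
    return bytes(page[start:start + size])
```

With the dump used above, `read_physical_bytes(fdmp.read_physical_page, 0x3800b30, 4)` should return `b"KDBG"`, judging by the `!db` output.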

hugsy commented 10 months ago

Good catch!

It seems that the C++ kdmpparser::GetPhysicalPage makes no assumption about the alignment of its address parameter, which may result in the page not being found if the parameter is not aligned.

https://github.com/0vercl0k/kdmp-parser/blob/3bec915e6f5304c187765be7ce3cfde713d7c29b/src/lib/kdmp-parser.h#L233-L246

Almost everywhere in kdmp-parser, GetPhysicalPage is called with Page::Align, and you made me realize that this part is not completely exposed to Python. So this is now fixed (see hugsy/kdmp-parser@71d28b9), thanks!

As for the non-alignment part (it's part of the C++ code), I'll let @0vercl0k answer on it as this is an implementation choice.

Cheers!

hugsy commented 10 months ago

Happy to fix more bugs/add more improvements that you spot. Don't be shy 😎

0vercl0k commented 10 months ago

+1 for README being cleaned up / reflecting accurate information. And I'll also check the alignment thing in the C++ part; I can't remember 😅

But yes, keep them coming Neitsa - thanks a lot 🙏!

Cheers

neitsa commented 10 months ago

Hey guys, I hope you're doing well :)

Nothing to report, I have been using the library lately without any problems.

Now that Santa is (soon to be) around the corner, here are a few things I'd like to see (should I make a new issue?).

Please note that I can try to implement them if you think some of the proposals mentioned below are worth it. It's more an open discussion than "Pretty please implement it, I'm too lazy to do anything" :p

Feature Requests

What I usually expect from a parser is access to the structural shape of the file format, not necessarily parsing of the data contained inside those structures (i.e. it's not the job of the dump parser to give me the list of processes or threads or whatever you might find inside a dump; I can use the primitives given by the parser to build another library on top). From my PoV, which may be limited by my use cases, the more information I have about the structural format, the better.

I think without the header's information there's not much to do. I guess that most use cases would benefit from having access to all the information contained in the header (e.g. kernel base, loaded module list, active process head, etc.). It's not that hard to re-implement the header parsing on top of the library, but since kdmp-parser already does it, having access to this header would be great. Possible access to other parts of the header (memory runs for ex.) would be great.

It's not complicated once you have kdmp-parser (and access to the header) to get the content of that structure (available from wdbgexts.h ). I seem to understand that it might sometimes be encoded (in the various full dumps I gathered it was always in clear, though) and can be decoded.

I guess that if it was available from the get-go it could be a real plus :) Most of the interesting info is already in the header proper, though, so no big deal if you decide it's not worth the hassle.

Question

Following our previous discussion, what should a user asking for an address in a large page be given?

Another possibility would be to give some help to the end-user with an additional function that would return the base address of the page / page table entry / PFN from a physical address. Not sure if it's really helpful, though. I don't have a precise use case for that...

Possible Optimization

I mean, we all know the saying:

Premature optimization is the root of all evil -- Donald Knuth

Now, I was thinking (it happens!): the underlying C++ code is using the dump as a mapped file, and the kernel is doing a great job and not mapping the whole file in one go, but only really mapping the pages you touch in the mapped file.

The only problem is if you decide to actually scan the whole dump and touch all (or a big number of) pages, e.g. searching for something specific, like pool tags, or whatever pattern you could think of. You end up, as far as I can surmise, with the whole file mapped in memory. Depending on the size of the dump, it can become a burden on the computer's memory and you may not even have enough memory.

Since physical runs are quite limited in number (i.e. PHYSMEM_DESC and PHYSMEM_RUN in the code; something like around 80 of them at most) and are relatively easy to parse, it could technically be possible to just open the file as a "regular" file (i.e. not a mapped one) and translate a physical address to a flat file offset (or virt -> phy -> offset). On Windows it's possible to do this in one go (and avoid a SetFilePointer syscall) with just a synchronous ReadFile and an OVERLAPPED structure.
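The physical-address-to-file-offset idea could look roughly like the sketch below. It rests on assumptions worth stating loudly: that the dump is a full dump whose pages are stored back to back, in run order, right after a 0x2000-byte 64-bit header, and that each run is a (base page, page count) pair as in PHYSMEM_RUN. Other dump types (e.g. bitmap-based ones) are laid out differently and are not covered. The function names and the run values used in the example are made up for illustration.

```python
# Sketch: translate a physical address to a flat file offset using the
# physical memory runs, then read with plain file I/O (no memory mapping).
# Layout assumptions (full dump, header size) are NOT verified here.

PAGE_SIZE = 0x1000
HEADER_SIZE = 0x2000  # assumed size of the 64-bit dump header


def phys_to_file_offset(runs, phys_addr, first_page_offset=HEADER_SIZE):
    """Return the file offset backing `phys_addr`, or None if not in any run.

    `runs` is an iterable of (base_page, page_count) tuples in file order.
    """
    pfn, offset_in_page = divmod(phys_addr, PAGE_SIZE)
    file_offset = first_page_offset
    for base_page, page_count in runs:
        if base_page <= pfn < base_page + page_count:
            return file_offset + (pfn - base_page) * PAGE_SIZE + offset_in_page
        # Skip over all the pages stored for this run.
        file_offset += page_count * PAGE_SIZE
    return None


def read_physical(path, runs, phys_addr, size):
    """Read `size` bytes at `phys_addr` with a regular seek + read."""
    offset = phys_to_file_offset(runs, phys_addr)
    if offset is None:
        return None
    with open(path, "rb") as dump:
        dump.seek(offset)
        return dump.read(size)


# Hypothetical runs, for illustration only.
example_runs = [(0x1, 0x9E), (0x100, 0x200)]
print(hex(phys_to_file_offset(example_runs, 0x100020)))  # 0xa0020
```

A linear scan is used because the number of runs is small; a bisect over cumulative page counts would be the obvious next step if that ever mattered.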

I guess it would nonetheless be slower - no real idea of how much, though - than having a memory mapped file, but it could be a trade-off compared to the memory requirements of a (very?) big memory dump.

Possible implementation could be a base class from which (e.g.) KernelDumpMemoryParser and KernelDumpFileReader could both subclass, with a factory function to decide which one the end-user would like to use.
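A rough Python sketch of that shape is below, just to make the idea concrete. The two class names come from the paragraph above; the base class `PageReader`, the method `read_at` and the factory `open_dump` are hypothetical names, and the classes only read raw bytes at a file offset (no header or run parsing), so this is an outline of the design rather than a working dump parser.

```python
# Design sketch: one interface, two backends (memory-mapped vs. plain file
# reads), and a factory to pick one. Not kdmp-parser code.
import mmap
from abc import ABC, abstractmethod

PAGE_SIZE = 0x1000


class PageReader(ABC):
    """Hypothetical common interface for both backends."""

    @abstractmethod
    def read_at(self, offset: int, size: int = PAGE_SIZE) -> bytes:
        """Return `size` bytes located at file offset `offset`."""

    @abstractmethod
    def close(self) -> None:
        """Release the underlying file resources."""


class KernelDumpMemoryParser(PageReader):
    """Backs reads with a read-only memory mapping of the whole file."""

    def __init__(self, path):
        self._file = open(path, "rb")
        self._map = mmap.mmap(self._file.fileno(), 0, access=mmap.ACCESS_READ)

    def read_at(self, offset, size=PAGE_SIZE):
        return self._map[offset:offset + size]

    def close(self):
        self._map.close()
        self._file.close()


class KernelDumpFileReader(PageReader):
    """Backs reads with explicit seek + read calls on a regular file."""

    def __init__(self, path):
        self._file = open(path, "rb")

    def read_at(self, offset, size=PAGE_SIZE):
        self._file.seek(offset)
        return self._file.read(size)

    def close(self):
        self._file.close()


def open_dump(path, use_mapping: bool = True) -> PageReader:
    """Factory: let the end-user choose the backend."""
    if use_mapping:
        return KernelDumpMemoryParser(path)
    return KernelDumpFileReader(path)
```

Callers would only depend on `PageReader`, so switching backends for a very large dump would be a one-argument change.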

Or maybe just memory mapping the paging tables (although I guess they could be quite big too :| ) to do the translation, and do the actual page reading by reading the flat file.

Thanks a lot for reading this wall of text :D

hugsy commented 10 months ago

Yo!

> Ability to access the header(s) from Python (at least the main header, HEADER64).

I think that's exactly 2 lines to add. Should be doable. Maybe after #19 is reviewed and (hopefully) merged.

I can't help for the rest, I'm not the chief engineer here 😁 But maybe open different issues on Github for your suggestions, it'll make the tracking easier.

0vercl0k commented 10 months ago

Thanks for the feedback neitsa - it makes a lot of sense.

Maybe let's open a few issues then:

What do you think?

Regarding the question; I guess the ideal case is we give what the user asks for (I think the Read API kinda does this by taking in a size argument). But maybe it's a good time to revisit the interface a bit and how it works to make it useful for everybody.

For history, I made this library for https://github.com/0vercl0k/wtf and the only thing it needed was getting access to the physical memory addresses & the page contents; so that's kinda why the API is probably not very well suited for anything else 😅 Happy to improve it though, so maybe another issue for that as well?

Cheers

hugsy commented 10 months ago

> the only thing it needed was getting access to the physical memory addresses & the page contents

TBF that's the only thing you really need when you think about it. The rest is sugar coating 😀

0vercl0k commented 8 months ago

Oops, merging the PR closed this but there's still some work to be done and I still owe you guys some answers; I haven't forgotten :).

I'll be re-reading this thread and creating new issues to get answers and experiment!

Cheers