StephanvanSchaik / mmap-rs

A cross-platform and safe Rust API to create and manage memory mappings in the virtual address space of the calling process.
Apache License 2.0

Support for madvise()-based memory mapping #8

Status: Closed (vigna closed 1 year ago)

vigna commented 1 year ago

The current code tries to request huge pages directly through mmap(). On many current Linux systems, the preferred way is to use madvise(). For example:

p = mmap(NULL, n * sizeof *p, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
madvise(p, n * sizeof *p, MADV_HUGEPAGE);

This approach lets the system choose the page size; moreover, it is usually possible through /proc to ignore the advice (e.g., for systems experiencing excessive fragmentation).
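A slightly fuller sketch of the two calls above, with error handling added. The helper name map_with_thp_hint is made up for illustration, and the code assumes Linux with glibc. Since madvise() is only a hint, a failure there is reported but the mapping is still returned.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <sys/mman.h>

/* Hypothetical helper: map `len` bytes of anonymous memory and mark the
 * region as eligible for transparent huge pages.
 * Returns the mapping, or NULL if mmap() itself fails. */
void *map_with_thp_hint(size_t len)
{
    assert(len > 0);

    /* Anonymous mappings take fd = -1 and offset = 0. */
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) {
        perror("mmap");
        return NULL;
    }

#ifdef MADV_HUGEPAGE
    /* Only a hint: the kernel may ignore it, and the call can fail
     * (for example on a kernel built without THP support). */
    if (madvise(p, len, MADV_HUGEPAGE) != 0)
        perror("madvise(MADV_HUGEPAGE)");
#endif
    return p;
}
```

The caller releases the region with munmap(p, len) as usual. Nothing in the return value says whether huge pages were actually used.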

I can try a PR to implement this if you're interested. I see two possible approaches:

StephanvanSchaik commented 1 year ago

The current code tries to request huge pages directly through mmap(). On many current Linux systems, the preferred way is to use madvise().

There are two different interfaces, though. The one you are describing is called Transparent Huge Pages (THP), which involves running the khugepaged kernel thread periodically to scan for eligible areas to be replaced with huge pages. One of the modes for THP is madvise mode, in which a program needs to call madvise() with the MADV_HUGEPAGE flag to give the kernel a hint that a particular region is eligible to be considered for transparent huge pages. It does not, however, guarantee that you will actually get huge pages for that region.

With THP set to always, it mostly serves as an optimization technique for the entire system to reduce TLB thrashing, and you wouldn't have to use the madvise system call.
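To see which THP mode a machine is in, current kernels expose the setting in sysfs as a line such as "always [madvise] never", with the active mode in brackets. Below is a minimal sketch for reading it. The function names are hypothetical, and the path is assumed to be /sys/kernel/mm/transparent_hugepage/enabled, which is absent on kernels built without THP.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Extract the bracketed word from a line such as "always [madvise] never".
 * Returns 0 on success, -1 if there is no bracketed word or it does not
 * fit into `out` (including the terminating NUL). */
int thp_parse_mode(const char *line, char *out, size_t outlen)
{
    assert(line != NULL && out != NULL);

    const char *open = strchr(line, '[');
    const char *close = open ? strchr(open, ']') : NULL;
    if (close == NULL)
        return -1;

    size_t n = (size_t)(close - open - 1);
    if (n + 1 > outlen)
        return -1;
    memcpy(out, open + 1, n);
    out[n] = '\0';
    return 0;
}

/* Read the system-wide THP mode into `out`.
 * Returns -1 if the sysfs file is missing or cannot be parsed. */
int thp_read_mode(char *out, size_t outlen)
{
    FILE *f = fopen("/sys/kernel/mm/transparent_hugepage/enabled", "r");
    if (f == NULL)
        return -1;

    char line[128];
    int rc = -1;
    if (fgets(line, sizeof line, f) != NULL)
        rc = thp_parse_mode(line, out, outlen);
    fclose(f);
    return rc;
}
```

A result of "madvise" means only regions marked with MADV_HUGEPAGE are considered, "always" means the hint is unnecessary, and "never" means it is ignored.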

When using mmap with MAP_HUGETLB you are guaranteed to get huge pages from a pool of reserved huge pages. If no huge pages are available, the kernel will not try to use a different page size unless you ask it to. This use case is very common for virtualization where you want to be sure you are actually getting 2M or 1G huge pages, because page size has a huge impact on performance when virtualization is involved due to the costly page table walks.
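The explicit route can be sketched as follows, assuming Linux and a non-empty pool of reserved huge pages (for example, configured through /proc/sys/vm/nr_hugepages). The helper name and its got_huge output parameter are hypothetical. The point is that falling back to regular pages happens only when the caller opts in, and the caller is always told which kind of mapping it received.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical helper: try to map `len` bytes from the reserved huge page
 * pool. If that fails and `allow_fallback` is set, retry with regular pages.
 * `*got_huge` reports which of the two happened.
 * Returns NULL if no mapping could be created. */
void *map_huge_or_fallback(size_t len, bool allow_fallback, bool *got_huge)
{
    assert(len > 0 && got_huge != NULL);

    const int prot = PROT_READ | PROT_WRITE;
    const int flags = MAP_PRIVATE | MAP_ANONYMOUS;
    void *p = MAP_FAILED;

    *got_huge = false;
#ifdef MAP_HUGETLB
    /* `len` should be a multiple of the huge page size. The call fails
     * (typically with ENOMEM) when the pool cannot satisfy the request. */
    p = mmap(NULL, len, prot, flags | MAP_HUGETLB, -1, 0);
    if (p != MAP_FAILED) {
        *got_huge = true;
        return p;
    }
#endif
    /* The kernel does not fall back on its own; do it only on request. */
    if (allow_fallback)
        p = mmap(NULL, len, prot, flags, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}
```

With allow_fallback set to false, this mirrors the guarantee described above: either the mapping is backed by huge pages or the call fails.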

This approach lets the system choose the page size; moreover, it is usually possible through /proc to ignore the advice (e.g., for systems experiencing excessive fragmentation).

The above basically means that this is not a replacement for explicitly requesting huge pages and having a way to tell you actually got them, and I think the API should reflect that because it is paramount for virtualization. I think we can still support the madvise system call to mark regions as eligible or not eligible for THP in two ways:

  • If someone specifies HUGE_PAGES but no page size, then madvise() is used in lieu of a mmap() flag.

I think this behavior is too implicit, and that huge pages should really mean (the explicit variant of) huge pages.

vigna commented 1 year ago

I completely agree with everything you write. As I wrote in the other issue, I was trying to replicate my C/C++ setup in a way that was as transparent as possible for me. I work on bare metal most of the time, so I have fine control over the OS, and THP really works for me.

Would you be interested in a PR that adds a flag TRANSPARENT_HUGE_PAGES? I have implemented the implicit behavior above but I agree—it's not a good idea. It could also be called MADV_HUGE_PAGES—maybe it's even more precise this way.

As a side note: I got stuck for a couple of hours because I was calling with_flags() twice, assuming it was a logical OR, whereas it is a silent replacement. Maybe adding a flag combined with OR to map_anon.rs would solve the problem: newbies like me would understand immediately how to use the flags from the example. Alternatively, with_flags() could do an OR.

StephanvanSchaik commented 1 year ago

Would you be interested in a PR that adds a flag TRANSPARENT_HUGE_PAGES?

Yes, that would be much appreciated actually.

It could also be called MADV_HUGE_PAGES—maybe it's even more precise this way.

I feel like MADV_HUGE_PAGES doesn't really hint to me that this is about the THP mechanism (e.g. I had to open the man page for madvise because I forgot THP has an madvise mode where this makes sense). Having an explicit name is usually a good idea, because it helps point the user in the right direction when looking for documentation.

As a side note: I got stuck for a couple of hours because I was calling with_flags() twice, assuming it was a logical OR, whereas it is a silent replacement. Maybe adding a flag combined with OR to map_anon.rs would solve the problem: newbies like me would understand immediately how to use the flags from the example. Alternatively, with_flags() could do an OR.

I agree, changing with_flags() to simply append the flags instead of overriding them probably makes more sense. Thanks for pointing this out!