Apple-like MADV_FREE_REUSE / MADV_FREE_REUSABLE

Apple provides two madvise flags that are similar to MADV_FREE:

* MADV_FREE_REUSABLE is similar to MADV_FREE in that the kernel may take pages away at will but, unlike MADV_FREE, they do not have to be replaced until the corresponding MADV_FREE_REUSE call.
* MADV_FREE_REUSE undoes MADV_FREE_REUSABLE and tells the kernel that the pages must from now on have stable contents.
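
For illustration, here is a minimal sketch of how an allocator back end might wrap the pair of calls. The wrapper names `range_mark_reusable` and `range_mark_reuse` are hypothetical (they are not snmalloc or XNU functions), and the fallback to plain MADV_FREE on systems without the Apple flags is an assumption made only so that the sketch compiles elsewhere.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1 /* glibc: expose madvise() and MAP_ANON */
#endif
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical wrapper: the contents of [base, base + len) are no longer needed. */
static int range_mark_reusable(void *base, size_t len)
{
    assert(base != NULL && len > 0);
#if defined(MADV_FREE_REUSABLE)
    return madvise(base, len, MADV_FREE_REUSABLE);
#elif defined(MADV_FREE)
    /* Assumed fallback: the closest widely available advice. */
    return madvise(base, len, MADV_FREE);
#else
    return 0; /* no suitable advice available; do nothing */
#endif
}

/* Hypothetical wrapper: the range is about to be used again. */
static int range_mark_reuse(void *base, size_t len)
{
    assert(base != NULL && len > 0);
#if defined(MADV_FREE_REUSE)
    return madvise(base, len, MADV_FREE_REUSE);
#else
    /* Plain MADV_FREE has no explicit "undo": the next store re-dirties the page. */
    return 0;
#endif
}
```

The allocator would call the first wrapper when whole pages go back to the back end and the second before handing them out again.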

There are two big benefits to these over MADV_FREE:

* The kernel receives an explicit signal that pages are moved out of this state (with MADV_FREE it must collect the dirty bit to determine whether a page has been written to), which means that it can adjust RSS accurately and can enforce RSS limits.
* The kernel doesn't need to synchronise dirty bit state, because it is undefined for userspace to access these pages (so it is valid for the kernel to update page tables to specify no-userspace-access).

It's very easy to support this in snmalloc, which we're using for the revocation work, and this interface has some extra value in the context of revocation. First, the revoker can (at the start of a scan) mark all of these pages as no-access for userspace and then not bother to scan them. Second, we can combine it with zeroing behaviour such that any pages returned to userspace after MADV_FREE_REUSABLE are guaranteed to be zeroed. If they have been reclaimed, they are replaced with CoW copies of the zero page. If they have not been reclaimed, the kernel zeroes them in a low-priority thread and eagerly zeroes any that are still on the to-zero list in the madvise call that returns them. This gives us a simple interface that the allocator can use to guarantee that all heap allocations are zeroed.

@markjdb / @bsdjhb, do you have thoughts on this? Most of this would be useful in upstream FreeBSD, though some bits are CheriBSD specific.

I'm having trouble finding any documentation of this interface, so I'll try re-explaining it based on my understanding of what's written above.

Suppose userspace calls madvise(MADV_FREE_REUSABLE) on a virtual address range backed by anonymous memory. Then any physical pages mapped in that range enter a "reusable" state. When in this state, pages:

* don't count towards the process RSS
* can be reclaimed without paging out their contents and without clearing dirty bits

As with MADV_FREE, they can remain mapped indefinitely and might be lazily zeroed by the kernel.

Suppose userspace then calls madvise(MADV_FREE_REUSE) on the same range. Any physical pages still mapped by the range exit the "reusable" state and are zeroed before the system call returns (this could be done lazily or inline).

At no point are page tables updated (except as part of reclamation). Is this more or less accurate?

I don't quite understand the statement, "unlike MADV_FREE, they do not have to be replaced until the corresponding MADV_FREE_REUSE call."

I have a few questions:

* Do we want to clear the dirty bit from mappings of a zeroed page (i.e., a mapping that persists throughout the mapped page's "reusable" state)? I suppose it's not strictly necessary.
* How does this behave with shared mappings? Do we want to permit it at all? I looked at XNU a bit and the implementation seems to allow it (see vm_map_entry_is_reusable()) but your use case is centered around anonymous memory allocators.
* Should we reuse Apple's names? I'd be a bit worried about software assuming that we have identical semantics when that's hard to guarantee.
* Suppose I have a range in the "reusable" state and then fork. What happens when the child reads or writes to a resident page in the range? How do we zero a page that's mapped COW?

As far as the implementation goes, I think we'd want a new VPO_* flag, synchronized by the VM object lock, to indicate that a physical page is in the "reusable" state. The page daemon can cheaply reclaim such pages. madvise(MADV_FREE_REUSABLE) sets that flag on any resident pages in the range, and madvise(MADV_FREE_REUSE) clears it and handles zero'ing, using the page busy lock and object lock to interlock. Handling COW is probably the hardest part. Note also that we currently ignore MADV_FREE in some scenarios related to COW to work around the "rewind-on-fork" bug: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=240061 . The workaround added for that bug is not ideal and should be revisited as a part of this.
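
To make the proposed state transitions concrete, here is a toy userspace model, not kernel code. The `reusable` field stands in for the proposed VPO_* flag, and every name in it (`model_page`, `model_mark_reusable`, `model_reclaim`, `model_mark_reuse`) is hypothetical. It leaves out the object lock, the page busy lock and COW entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MODEL_PAGE_SIZE 4096

/* Toy stand-in for a physical page. */
struct model_page {
    bool resident;  /* page currently backs the mapping */
    bool reusable;  /* stands in for the proposed VPO_* flag */
    unsigned char data[MODEL_PAGE_SIZE];
};

/* madvise(MADV_FREE_REUSABLE): flag every resident page in the range. */
static void model_mark_reusable(struct model_page *pages, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (pages[i].resident)
            pages[i].reusable = true;
}

/* Page daemon: drop up to max reusable pages without paging them out. */
static size_t model_reclaim(struct model_page *pages, size_t n, size_t max)
{
    size_t reclaimed = 0;
    for (size_t i = 0; i < n && reclaimed < max; i++) {
        if (pages[i].resident && pages[i].reusable) {
            pages[i].resident = false;
            pages[i].reusable = false;
            reclaimed++;
        }
    }
    return reclaimed;
}

/* madvise(MADV_FREE_REUSE): clear the flag and zero pages that survived. */
static void model_mark_reuse(struct model_page *pages, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (pages[i].resident && pages[i].reusable) {
            memset(pages[i].data, 0, MODEL_PAGE_SIZE);
            pages[i].reusable = false;
        }
    }
}
```

In this model a reclaimed page is simply non-resident; a later fault would bring in a fresh zeroed page, so only the survivors need explicit zeroing at REUSE time.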

Suppose userspace calls madvise(MADV_FREE_REUSABLE) on a virtual address range backed by anonymous memory. Then any physical pages mapped in that range enter a "reusable" state. When in this state, pages:
* don't count towards the process RSS

* can be reclaimed without paging out their contents and without clearing dirty bits
  As with MADV_FREE, they can remain mapped indefinitely and might be lazily zeroed by the kernel.

That agrees with my understanding.

It's important to add that faults into a REUSABLE segment of the address space are permitted to be fatal: we're not obligated to swing a zeroed page into place to catch a fault (read or write!) until we've been told that the address space is up for REUSE.

Suppose userspace then calls madvise(MADV_FREE_REUSE) on the same range. Any physical pages still mapped by the range exit the "reusable" state and are zeroed before the system call returns (this could be done lazily or inline).

I believe the "and are zeroed" is novel here and is not part of Apple's semantics, but otherwise yes.

At no point are page tables updated (except as part of reclamation). Is this more or less accurate?

It agrees with my understanding, which may or may not be a point in its favour. ;)

I have a few questions:

* Do we want to clear the dirty bit from mappings of a zeroed page (i.e., a mapping that persists throughout the mapped page's "reusable" state)? I suppose it's not strictly necessary.

For anonymous memory mappings pushed into REUSABLE/REUSE, it may be worth clearing the pmap dirty bits so that zeroed pages are seen as clean? It might even be worth tracking zeroed-and-clean pages as such even after they make the REUSE transition so that they don't need to be re-zeroed on their next trip through REUSABLE/REUSE or if reclaimed.

* How does this behave with shared mappings? Do we want to permit it at all? I looked at XNU a bit and the implementation seems to allow it (see vm_map_entry_is_reusable()) but your use case is centered around anonymous memory allocators.

For shared mappings, I think it's arguable that the "advise" applies to the underlying VM object. Such a thing would be useful for allocators whose heaps straddle address spaces as part of process-based sandboxing, for example.

For shadowing anonymous mappings, as you note below, there are challenges with rewinding time. It may make sense to take MADV_FREE_REUSABLE as sufficient grounds to punch out a new anonymous mapping (there's no point in collapsing the shadow chain, just replace the region of the mapping). For shadowing named mappings... I think -EINVAL might be fine.
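
As a rough userspace analogy for "punching out a new anonymous mapping", the sketch below replaces a range in place with fresh anonymous memory using MAP_FIXED. The function name `punch_fresh_anon` is hypothetical, and this only illustrates the effect. The suggestion above is for the kernel to do the equivalent on the map entry internally.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1 /* glibc: expose MAP_ANON */
#endif
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Replace [base, base + len) with a brand-new private anonymous mapping.
 * base must be page-aligned. The old contents (and any shadow chain behind
 * them) are discarded; the new pages read as zero.
 */
static int punch_fresh_anon(void *base, size_t len)
{
    assert(base != NULL && len > 0);
    void *p = mmap(base, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANON | MAP_FIXED, -1, 0);
    return (p == base) ? 0 : -1;
}
```

Because the range gets a new backing object rather than modified pages in the old one, there is nothing left to "rewind" to.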

* Should we reuse Apple's names? I'd be a bit worried about software assuming that we have identical semantics when that's hard to guarantee.

I find the names kind of suspect, so I might (very softly) push for different names. Perhaps MADV_FREE_HOLE (REUSABLE) and MADV_FREE_ZERO (REUSE). If you haven't done the first before you do the second it's implicitly and immediately done for you?

* Suppose I have a range in the "reusable" state and then fork. What happens when the child reads or writes to a resident page in the range? How do we zero a page that's mapped COW?

I believe fork() should also be taken as an opportunity to do the kind of entry hole punching for shadowing entries as above, though I don't know what to do about the case I suggested could be -EINVAL above, so perhaps it can't be -EINVAL after all.

As far as the implementation goes, I think we'd want a new VPO_* flag, synchronized by the VM object lock, to indicate that a physical page is in the "reusable" state. The page daemon can cheaply reclaim such pages. madvise(MADV_FREE_REUSABLE) sets that flag on any resident pages in the range, and madvise(MADV_FREE_REUSE) clears it and handles zero'ing, using the page busy lock and object lock to interlock. Handling COW is probably the hardest part.

Given that we also want to make fatal any faults in regions in the REUSABLE state, I think we also may need to cut up map entries, so I think this may (will?) require taking the map write lock. If we have to do that, does that change the above suggestion?

Note also that we currently ignore MADV_FREE in some scenarios related to COW to work around the "rewind-on-fork" bug: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=240061 . The workaround added for that bug is not ideal and should be revisited as a part of this.

Agreed, though if we have these better operators I think my particular desire for MADV_FREE goes away.

Suppose userspace calls madvise(MADV_FREE_REUSABLE) on a virtual address range backed by anonymous memory. Then any physical pages mapped in that range enter a "reusable" state. When in this state, pages:
* don't count towards the process RSS

* can be reclaimed without paging out their contents and without clearing dirty bits
  As with MADV_FREE, they can remain mapped indefinitely and might be lazily zeroed by the kernel.
That agrees with my understanding.

It's important to add that faults into a REUSABLE segment of the address space are permitted to be fatal: we're not obligated to swing a zeroed page into place to catch a fault (read or write!) until we've been told that the address space is up for REUSE.

Hmm. That might be useful for (userspace) debugging purposes but otherwise complicates the implementation without providing a consistent guarantee: "most" of the time any resident pages will remain mapped and we won't catch reads or writes unless page tables are modified, but the overhead of such modifications is part of the motivation for this mechanism.

Suppose userspace then calls madvise(MADV_FREE_REUSE) on the same range. Any physical pages still mapped by the range exit the "reusable" state and are zeroed before the system call returns (this could be done lazily or inline).

I believe the "and are zeroed" is novel here and is not part of Apple's semantics, but otherwise yes.

Oh, ok. I tried to read Apple's implementation but it's not exactly straightforward. :)

At no point are page tables updated (except as part of reclamation). Is this more or less accurate?

It agrees with my understanding, which may or may not be a point in its favour. ;)
I have a few questions:
* Do we want to clear the dirty bit from mappings of a zeroed page (i.e., a mapping that persists throughout the mapped page's "reusable" state)? I suppose it's not strictly necessary.
For anonymous memory mappings pushed into REUSABLE/REUSE, it may be worth clearing the pmap dirty bits so that zeroed pages are seen as clean? It might even be worth tracking zeroed-and-clean pages as such even after they make the REUSE transition so that they don't need to be re-zeroed on their next trip through REUSABLE/REUSE or if reclaimed.

I'd be inclined to do as you suggest, if only so that madvise(MADV_FREE_REUSE) consistently returns clean, zeroed pages. But if the overhead of clearing dirty bits is something we want to avoid, then it might be ok to live without that.

* How does this behave with shared mappings? Do we want to permit it at all? I looked at XNU a bit and the implementation seems to allow it (see vm_map_entry_is_reusable()) but your use case is centered around anonymous memory allocators.
For shared mappings, I think it's arguable that the "advise" applies to the underlying VM object. Such a thing would be useful for allocators whose heaps straddle address spaces as part of process-based sandboxing, for example.

For shadowing anonymous mappings, as you note below, there are challenges with rewinding time. It may make sense to take MADV_FREE_REUSABLE as sufficient grounds to punch out a new anonymous mapping (there's no point in collapsing the shadow chain, just replace the region of the mapping). For shadowing named mappings... I think -EINVAL might be fine.

That seems sensible. I think I would want to also change MADV_FREE to punch out a new anonymous mapping, so as to provide a proper fix for the rewind-on-fork bug.

* Should we reuse Apple's names? I'd be a bit worried about software assuming that we have identical semantics when that's hard to guarantee.
I find the names kind of suspect, so I might (very softly) push for different names. Perhaps MADV_FREE_HOLE (REUSABLE) and MADV_FREE_ZERO (REUSE). If you haven't done the first before you do the second it's implicitly and immediately done for you?

That sounds reasonable to me. I don't really like Apple's names either.

* Suppose I have a range in the "reusable" state and then fork. What happens when the child reads or writes to a resident page in the range? How do we zero a page that's mapped COW?
I believe fork() should also be taken as an opportunity to do the kind of entry hole punching for shadowing entries as above, though I don't know what to do about the case I suggested could be -EINVAL above, so perhaps it can't be -EINVAL after all.

As far as the implementation goes, I think we'd want a new VPO_* flag, synchronized by the VM object lock, to indicate that a physical page is in the "reusable" state. The page daemon can cheaply reclaim such pages. madvise(MADV_FREE_REUSABLE) sets that flag on any resident pages in the range, and madvise(MADV_FREE_REUSE) clears it and handles zero'ing, using the page busy lock and object lock to interlock. Handling COW is probably the hardest part.

Given that we also want to make fatal any faults in regions in the REUSABLE state, I think we also may need to cut up map entries, so I think this may (will?) require taking the map write lock. If we have to do that, does that change the above suggestion?

It complicates things a bit, but not greatly. Again though I wonder how useful it is to make faults fatal when accesses to already-mapped pages in the REUSABLE state will silently succeed.

Note also that we currently ignore MADV_FREE in some scenarios related to COW to work around the "rewind-on-fork" bug: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=240061 . The workaround added for that bug is not ideal and should be revisited as a part of this.

Agreed, though if we have these better operators I think my particular desire for MADV_FREE goes away.

I could be convinced either way on faulting in a REUSABLE region, I guess. My primary concern is that the mechanism compose well with CHERI revocation: if a region is made REUSABLE, marked for revocation, and flagged for REUSE only after a round of revocation has taken place, then it will certainly be full of zeros, regardless of any UAFs up to revocation and regardless of how the kernel splits its zeroing of pages between background and foreground. I think, having given it but a few moments' thought, that preventing pages from appearing due to faults in the REUSABLE state means that, if the kernel has zeroed all pages in the background, the REUSE transition is O(1). If faults are instead resolved as normal, the REUSE transition must look at each page's metadata and zero any pages dirtied as a result of stores "behind" the background zero.

I agree with @nwf's comments, with one proviso:

The requirement for zeroing in revocation is that everything in a reusable region is zeroed and not userspace writeable at the start of revocation. This is already a global serialisation point for the process: we need to kick all system calls out of the kernel and, if they're not in the kernel already, then we're just about to make them trap as soon as they try to load a pointer from their stacks.

I think (based on no experimental evidence at all about the impact this has on revocation pause times) that it would be fine to postpone any page-table updates until the point where we start revocation and then do a bulk operation to discard all of the physical pages backing these reservations. On the fast path of free, we don't touch this. On the slightly slower path, if we're returning entire pages to the back end, the madvise would be fast and would allow the kernel to reclaim pages if necessary. Beginning revocation would be more expensive but is hopefully an infrequent operation in comparison to free.

Note also that in snmalloc, if you enable POSIX commit checks, then it will mprotect the region with PROT_NONE in the same place where we do the madvise, so for debugging purposes we could turn that on to get precise traps if you try to access reusable memory.
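
A minimal sketch of that debugging arrangement follows. It is not snmalloc's actual code; the names `debug_decommit` and `debug_commit` are hypothetical, and plain MADV_FREE is used as a stand-in for the proposed advice.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1 /* glibc: expose madvise() and MAP_ANON */
#endif
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/* Give the range back to the kernel and make any access trap immediately. */
static int debug_decommit(void *base, size_t len)
{
    assert(base != NULL && len > 0);
#if defined(MADV_FREE)
    /* Advice only: a failure here does not affect correctness. */
    (void)madvise(base, len, MADV_FREE);
#endif
    return mprotect(base, len, PROT_NONE);
}

/* Make the range usable again before the allocator hands it out. */
static int debug_commit(void *base, size_t len)
{
    assert(base != NULL && len > 0);
    return mprotect(base, len, PROT_READ | PROT_WRITE);
}
```

With this in place, a stray load or store between the two calls faults at the offending instruction, at the cost of the page-table updates that the lazy scheme is trying to avoid.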

Bike shedding the name. As it is replacing with zeros, I wonder if this should really be named as a variant of MADV_DONTNEED? It really feels like DONTNEED split into a start and end call. Perhaps,

MADV_DONTNEED_LAZY - for starting the async operation.
MADV_DONEED or perhaps MADV_DONTNEED_LAZY_FINISHED - for waiting for the async operation to complete.

I would not name this as a variant of DONTNEED. DONTNEED doesn't discard data in dirty pages (those pages are still flushed to backing store if you have a MAP_SHARED mapping of a file, and I think the pages are just left alone if they are dirty pages of a MAP_PRIVATE mapping), whereas MADV_FREE does mean it's ok to discard data in dirty pages. MADV_FREE_HOLE and MADV_FREE_ZERO do make sense to me. Perhaps though the first one could be named something like MADV_FREE_LAZY_ZERO to communicate that the two are linked and not really independent? (An open question is if the two operations are inherently linked, or if it is only snmalloc's use case that requires the two to be linked and if they might otherwise be used independently in other use cases?)

[ Writing up some discussion offline with @nwf ]

It would be good (on pre-Milan x86, where there's no broadcast TLB invalidate) if we could defer that so that:

1. On the first madvise call, we mark the page range as userspace no-access and record the range that needs invalidating in the TLBs.
2. On context switch on other cores, we do the page invalidate (I believe the x86 pmap has some code for doing this already?).
3. Once all cores have invalidated their TLBs, we make the pages available for the background zeroing thread.
4. On the second madvise call, we are in one of three states:
   - If the background thread has zeroed everything, mark the pages as read-write and return immediately (IPI the other cores that may need to INVLPG to put them back in the read-write state; this can be deferred for any core that isn't using the pages).
   - If the background thread is in the process of running, work-steal from it until everything is zeroed and return.
   - If the background thread hasn't started, remove the pages from the to-invalidate list, zero them, and IPI any cores that are running with a view of this memory and have invalidated this page.

In the CHERI case, the second madvise call will not happen until after the pages have been moved from the quarantine list, which removes a lot of the cases from step 4.

The main difference between this and an unmap is that we expect that we will reuse the virtual address space for the same kind of memory in the relatively near future and so we don't want to be faulting everything in lazily unless the system is in a low-memory state and needs to snaffle some physical memory back.
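
The three cases for the second madvise call can be sketched with a small single-threaded model. All names here (`zero_job`, `zero_job_step`, `zero_job_finish`) are hypothetical, and the model ignores page tables, TLB invalidation and IPIs. A real implementation would also need an atomic way to claim pages so that the caller and the background thread never zero the same one.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define ZERO_PAGE_SIZE 4096

enum zero_state { ZERO_NOT_STARTED, ZERO_IN_PROGRESS, ZERO_DONE };

/* A range queued for zeroing; next is the first page not yet zeroed. */
struct zero_job {
    unsigned char *base;
    size_t npages;
    size_t next;
};

static enum zero_state zero_job_state(const struct zero_job *j)
{
    if (j->next >= j->npages && j->npages > 0)
        return ZERO_DONE;
    return (j->next == 0) ? ZERO_NOT_STARTED : ZERO_IN_PROGRESS;
}

/* One unit of work, as the background thread would do it: zero one page. */
static bool zero_job_step(struct zero_job *j)
{
    if (j->next >= j->npages)
        return false;
    memset(j->base + j->next * ZERO_PAGE_SIZE, 0, ZERO_PAGE_SIZE);
    j->next++;
    return true;
}

/*
 * Second madvise call: take over whatever work is left and return the number
 * of pages the caller had to zero itself (0 if the background thread had
 * already finished).
 */
static size_t zero_job_finish(struct zero_job *j)
{
    size_t stolen = 0;
    while (zero_job_step(j))
        stolen++;
    return stolen;
}
```

The "not started" and "in progress" cases collapse into the same loop here; they differ in the real design mainly in what has to be undone on the to-invalidate list and which cores need an IPI.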

CTSRD-CHERI / cheribsd

Apple-like MADV_FREE_REUSE / MADV_FREE_REUSABLE #1318