SerenityOS / serenity

The Serenity Operating System 🐞
https://serenityos.org
BSD 2-Clause "Simplified" License

Kernel: Implement `madvise`/`posix_fadvise` #18579

Open Hendiadyoin1 opened 1 year ago

Hendiadyoin1 commented 1 year ago

`madvise` allows a userspace program to tell the kernel how it intends to access mapped memory.

Currently we only load mapped files lazily, one page at a time, when said page is accessed, which slows down serial IO-intensive workloads like multimedia decoders. For such applications it might be useful to support `MADV_SEQUENTIAL` to increase the number of pages loaded at a time, or even `MADV_WILLNEED` to start an early (asynchronous) load of the mapped data.
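As a sketch of the userspace side, assuming Serenity grows the POSIX-style advice constants Linux uses (`MADV_SEQUENTIAL`, `MADV_WILLNEED`); a multimedia decoder might do something like:

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

// Map a file read-only and hint that it will be consumed front-to-back.
// MADV_SEQUENTIAL / MADV_WILLNEED are the Linux/POSIX constants; the
// proposal is for Serenity's madvise to accept the same advice values.
void* map_for_sequential_read(int fd, size_t length)
{
    void* addr = mmap(nullptr, length, PROT_READ, MAP_PRIVATE, fd, 0);
    if (addr == MAP_FAILED)
        return nullptr;
    // Hint: access will be sequential, so a larger readahead window helps.
    madvise(addr, length, MADV_SEQUENTIAL);
    // Hint: we will need this data soon, so start loading it asynchronously.
    madvise(addr, length, MADV_WILLNEED);
    return addr;
}
```

The hints are purely advisory: the mapping behaves identically with or without them, only the kernel's paging strategy changes.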

Linux's `MADV_FREE` might also be helpful, allowing the kernel to reclaim, for example, parts of the heap that are no longer needed.

Additionally, `MADV_SEQUENTIAL` can be used as a hint during memory pressure to prefer the early pages of mapped files for eviction.

Man page

The same goes for the more POSIX-y `posix_fadvise`.
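`posix_fadvise` carries the same hints but works on a file descriptor rather than a mapping. A sketch, assuming the standard `POSIX_FADV_*` constants (the helper name is hypothetical):

```cpp
#include <fcntl.h>

// File-descriptor-based variant of the same hints. Unlike madvise this
// needs no mapping; per POSIX, a length of 0 means "to the end of the file".
int advise_sequential_fd(int fd)
{
    // Expect sequential access: the kernel may grow its readahead window.
    if (int rc = posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL); rc != 0)
        return rc;
    // Start populating the cache asynchronously.
    return posix_fadvise(fd, 0, 0, POSIX_FADV_WILLNEED);
}
```

Note that `posix_fadvise` returns an error number directly rather than setting `errno`, so 0 means success.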

Panky-codes commented 1 year ago

*`posix_madvise` instead of `madvise`.

So a prerequisite for this would be to have readahead infrastructure, right? I assume we don't have it yet.

Hendiadyoin1 commented 1 year ago

No, we don't have any read-ahead infrastructure yet. It would likely mainly entail creating an async read request, plus a way of propagating and checking for partial success of it.

This is how I imagine it could work. It's not necessarily the ideal way of doing it, and it's definitely not the easiest/least invasive way, so take it with a grain of salt:

When we expect a page to be accessed soon, we start a request for the next few pages and don't block on it. When a page in that mapping is then actually accessed, we would look up the request containing it and check the request for partial completion, i.e. whether that page is already present (or interrogate the corresponding device for that state); if it is not, we should be able to wait for such a partial success. If we run past the pre-requested memory, we would go through a similar process: block for the first page only, and let the rest load in the background.
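The bookkeeping described above can be modeled in isolation. A minimal sketch, with hypothetical names and a plain worker thread standing in for the device completing pages one at a time:

```cpp
#include <condition_variable>
#include <mutex>
#include <vector>

// Models an in-flight read request spanning several pages, with per-page
// completion tracking so callers can block on *partial* success instead
// of waiting for the whole request.
class AsyncReadRequest {
public:
    explicit AsyncReadRequest(size_t page_count)
        : m_done(page_count, false)
    {
    }

    // Called from the "device" side as each page's data arrives.
    void mark_page_complete(size_t page)
    {
        {
            std::lock_guard<std::mutex> lock(m_mutex);
            m_done[page] = true;
        }
        m_cv.notify_all();
    }

    // Non-blocking check used by the fault handler: is this page in yet?
    bool is_page_ready(size_t page)
    {
        std::lock_guard<std::mutex> lock(m_mutex);
        return m_done[page];
    }

    // Block only until *this* page is in, not until the whole request is.
    void wait_for_page(size_t page)
    {
        std::unique_lock<std::mutex> lock(m_mutex);
        m_cv.wait(lock, [&] { return m_done[page]; });
    }

private:
    std::mutex m_mutex;
    std::condition_variable m_cv;
    std::vector<bool> m_done;
};
```

A fault on a page covered by an outstanding request would call `is_page_ready`, and either map the page immediately or `wait_for_page` while the remaining pages keep arriving in the background.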

Panky-codes commented 1 year ago

IIUC, we don't have a generic page cache like Linux does; instead we have a buffer cache per block-device-based FS. So we could do something similar to what you suggest: if this flag is set, whenever we do a read, we do read + readahead and only block on the read that was actually requested. Hopefully, when the application asks for the next block, it is already in the buffer cache instead of requiring an explicit IO.

One problem I see is the limitation in our block layer: most drivers can only do a maximum of 4k of blocking IO at a time. The real benefit of RA comes when the block device can do bigger IO than the requested size, so that by the time the application does the next sequential read, the data is already in the buffer cache. So I don't see madvise bringing a real performance benefit until we fix that bottleneck in the block layer. I have some ideas there, but it will definitely take some time.
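To put rough numbers on that bottleneck (illustrative arithmetic, not measured Serenity figures): streaming a file with a hard 4 KiB cap per request costs one block-layer round trip per page, while larger requests amortize that cost away.

```cpp
#include <cstddef>

// Number of block-layer requests needed to stream `bytes` when each
// request can carry at most `max_io` bytes (rounding up the last one).
constexpr size_t requests_needed(size_t bytes, size_t max_io)
{
    return (bytes + max_io - 1) / max_io;
}

// Streaming 1 MiB with the current 4 KiB cap: 256 round trips.
static_assert(requests_needed(1 << 20, 4096) == 256);
// The same stream with hypothetical 128 KiB requests: 8 round trips.
static_assert(requests_needed(1 << 20, 128 * 1024) == 8);
```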

Hendiadyoin1 commented 1 year ago

The FS layer only reads one (fs) block at a time (1 page by "coincidence"), but it also only ever gets asked for 1 page at a time.

Panky-codes commented 1 year ago

> The FS layer only reads one (fs) block at a time (1 page by "coincidence"), but it also only ever gets asked for 1 page at a time.

Yeah. But if we have RA in place, the FS layer can still request 1 block at a time, while the block layer sends a bigger request (of course keeping EOF in mind) and has the adjacent blocks in the block cache before the FS does an explicit request for the next block.
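That EOF clamping is the one subtle part of growing the request. A minimal sketch (hypothetical helper, block-granular, treating `total_blocks` as the file/device size):

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>

// Given the block the FS actually asked for, return the half-open range
// [first, end) the block layer could fetch in one request: the requested
// block plus up to `window` readahead blocks, clamped so we never read
// past the end of the file/device.
std::pair<size_t, size_t> readahead_range(size_t block, size_t window, size_t total_blocks)
{
    size_t end = std::min(block + 1 + window, total_blocks);
    return { block, end };
}
```

In the middle of a file the range is the full `1 + window` blocks; near EOF it shrinks, degenerating to a single-block read for the very last block.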