Open Hendiadyoin1 opened 1 year ago
*posix_madvise instead of madvise.
So a prerequisite for this would be to have readahead infrastructure, right? I assume we don't have it yet.
No, we don't have any read-ahead infrastructure yet. It would likely mainly entail creating an async read request and a way of propagating and checking for partial success of it.
This is how I imagine this could work, but it might not necessarily be the ideal way of doing it, and it definitely is not the easiest or least invasive way, so take it with a grain of salt.
For example, when we are told a page will be accessed soon, we start a request for the next few pages and don't block on it. When a page in that mapping is then accessed, we look up the request containing it and check the request for partial completion, i.e. whether that page is already present (or interrogate the corresponding device for that state); if it is not, we should be able to wait for that partial success. If we run past the premapped memory, we go through a similar process: block only for the first page and let the rest run in the background.
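To make the partial-completion idea concrete, here is a minimal userspace sketch in Python. It is not kernel code: `ReadAheadRequest`, `start_read_ahead` and the fixed `PAGE_SIZE` are hypothetical names made up for this illustration, and the "device" is just a bytes object served by a background thread.

```python
import threading

PAGE_SIZE = 4096  # assumed page size for this sketch


class ReadAheadRequest:
    """Hypothetical async request covering a run of consecutive pages."""

    def __init__(self, first_page, page_count):
        self.first_page = first_page
        self.page_count = page_count
        self._pages = {}  # page index -> bytes, filled in as pages arrive
        self._cond = threading.Condition()

    def contains(self, page):
        return self.first_page <= page < self.first_page + self.page_count

    def complete_page(self, page, data):
        # Called by the (simulated) device each time one page finishes.
        with self._cond:
            self._pages[page] = data
            self._cond.notify_all()

    def is_page_ready(self, page):
        # Non-blocking check for partial success.
        with self._cond:
            return page in self._pages

    def wait_for_page(self, page):
        # Block only until this one page is present, not the whole request.
        with self._cond:
            self._cond.wait_for(lambda: page in self._pages)
            return self._pages[page]


def start_read_ahead(device, first_page, page_count):
    """Start reading pages in the background and return immediately."""
    request = ReadAheadRequest(first_page, page_count)

    def worker():
        for page in range(first_page, first_page + page_count):
            offset = page * PAGE_SIZE
            request.complete_page(page, device[offset:offset + PAGE_SIZE])

    threading.Thread(target=worker, daemon=True).start()
    return request
```

A page fault handler in this model would find the request whose `contains(page)` is true, try `is_page_ready(page)`, and fall back to `wait_for_page(page)` only if the page has not arrived yet.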
IIUC, we don't have a generic page cache like Linux does; instead we have a buffer cache per block-device-based FS. So we could do something similar to what you suggest: if this flag is set, whenever we do a read, we do read + read-ahead and only block on the read that was requested. Hopefully, when the application asks for the next block, it is already in the buffer cache and no explicit IO is needed.
One problem I see is the limitation in the block layer: most drivers can only do blocking IO of at most 4k. The real benefit of RA comes when the block device can do a bigger IO than the requested size, so that when the application does the next sequential read, the data is already in the buffer cache. So I don't see madvise giving a real performance benefit until we fix the bottleneck in the block layer. I have some ideas there, but it will definitely take some time.
The FS layer only reads one (fs) block at a time (1 page by "coincidence"), but it only ever gets asked for 1 page as well.
Yeah. But with RA in place, the FS layer can still request 1 block at a time while the block layer sends a bigger request (of course taking EOF into account), so the adjacent blocks are in the block cache before the FS explicitly requests the next block.
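As a rough illustration of that split, the following Python sketch keeps the FS-facing interface at one block per call while the cache issues larger device requests clamped to EOF. `BlockDevice`, `ReadAheadCache` and the `window` parameter are hypothetical names for this example, and the sketch is synchronous: it models only the "bigger request" part, not the asynchronous part discussed earlier.

```python
class BlockDevice:
    """Toy block device that counts how many requests it serves."""

    def __init__(self, data, block_size=4096):
        self.data = data
        self.block_size = block_size
        self.request_count = 0

    @property
    def block_count(self):
        # Round up so a short final block still counts.
        return (len(self.data) + self.block_size - 1) // self.block_size

    def read_blocks(self, first, count):
        # One device request, however many blocks it covers.
        self.request_count += 1
        size = self.block_size
        return [self.data[(first + i) * size:(first + i + 1) * size]
                for i in range(count)]


class ReadAheadCache:
    """Block cache that fetches a window of blocks on every miss."""

    def __init__(self, device, window=8):
        self.device = device
        self.window = window
        self.cache = {}  # block index -> bytes

    def read_block(self, index):
        if not 0 <= index < self.device.block_count:
            raise IndexError("block index past end of device")
        if index not in self.cache:
            # Clamp the read-ahead window so it never runs past EOF.
            count = min(self.window, self.device.block_count - index)
            blocks = self.device.read_blocks(index, count)
            for i, block in enumerate(blocks):
                self.cache[index + i] = block
        return self.cache[index]
```

With a 10-block device and a window of 8, reading blocks 0 through 9 one at a time costs two device requests (8 blocks, then the remaining 2) instead of ten.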
madvise allows a userspace program to tell the kernel how it intends to access mapped memory.
Currently we only load mapped files lazily, one page at a time, when said page is accessed, which slows down serial IO-intensive workloads like multimedia decoders. For such applications it might be useful to support MADV_SEQUENTIAL to increase the number of pages loaded at a time, or even MADV_WILLNEED to start an early (asynchronous) load of the mapped data.
Also, Linux's MADV_FREE might be helpful to allow the kernel to, for example, reclaim parts of the heap that are no longer needed.
Additionally, MADV_SEQUENTIAL can be used as a hint to the kernel during memory pressure to prefer early pages of mapped files during eviction (see the man page).
Similar goes for the more posixy posix_fadvise.
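For reference, this is roughly what the userspace side of these hints looks like, sketched with Python's standard `mmap.madvise` (Python 3.8+) and `os.posix_fadvise`. Both are only available on platforms that provide the underlying calls, so the sketch checks for them and treats every hint as advisory. `sum_mapped_file` is a made-up helper for this example, and whether the hints change anything depends entirely on the kernel.

```python
import contextlib
import mmap
import os
import tempfile


def sum_mapped_file(path):
    """Map a file read-only, hint sequential access, then walk it page by page."""
    with open(path, "rb") as f:
        size = os.fstat(f.fileno()).st_size
        # File-descriptor flavour of the same hint; failures are ignored.
        if hasattr(os, "posix_fadvise"):
            with contextlib.suppress(OSError):
                os.posix_fadvise(f.fileno(), 0, size, os.POSIX_FADV_SEQUENTIAL)
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapping:
            # Apply each hint only if this platform exposes it.
            for name in ("MADV_SEQUENTIAL", "MADV_WILLNEED"):
                advice = getattr(mmap, name, None)
                if advice is not None and hasattr(mapping, "madvise"):
                    with contextlib.suppress(OSError):
                        mapping.madvise(advice)
            total = 0
            for offset in range(0, size, mmap.PAGESIZE):
                total += sum(mapping[offset:offset + mmap.PAGESIZE])
            return total


if __name__ == "__main__":
    # Small demo on a temporary file a bit over three pages long.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"\x01" * (3 * mmap.PAGESIZE + 5))
    try:
        print(sum_mapped_file(tmp.name))
    finally:
        os.remove(tmp.name)
```

The access loop is deliberately sequential, which is the pattern MADV_SEQUENTIAL describes; a decoder streaming through a mapped media file would look similar.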