PCSX2 / pcsx2

PCSX2 - The Playstation 2 Emulator
https://pcsx2.net
GNU General Public License v3.0

[Feature Request]: File-backed memory mappings rather than normal file I/O #6235

Closed Cerrseien closed 3 months ago

Cerrseien commented 2 years ago

Description

Now that PCSX2 has made the jump to x64 and has access to more than just 4 GiB of address space, is there a reason why file-backed memory mappings (CreateFileMapping/mmap) are not being used for ISO files, and blocks are instead still read via ReadFile/read? It made sense when ISOs were so large that they couldn't possibly fit in the address space together with the rest of the process, but what's the rationale now?

As far as I understand it, file-backed memory mappings:

- only require context switches if a page is cold and needs to be loaded
- are managed by the operating system, which knows best which pages are hot and which are not
- reduce the complexity of the overall code
- could even be used for better TLB utilisation, since modern CPUs come with dedicated translation buffers for 2 MiB and/or 1 GiB pages (a +4 GiB ISO can comfortably sit in +4 1-GiB entries rather than +1,048,576 4-KiB pages)
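For concreteness, here is a minimal sketch of the kind of file-backed mapping being asked about, assuming the Windows flavour (CreateFileMappingW/MapViewOfFile). MappedIso and MapIso are illustrative names, not PCSX2 code, and error handling is trimmed:

#include <windows.h>
#include <cstdint>

struct MappedIso
{
    HANDLE file = INVALID_HANDLE_VALUE;
    HANDLE mapping = nullptr;
    const uint8_t* data = nullptr; // the whole image, addressable like an array
    uint64_t size = 0;
};

static bool MapIso(const wchar_t* path, MappedIso& out)
{
    out.file = CreateFileW(path, GENERIC_READ, FILE_SHARE_READ, nullptr,
                           OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, nullptr);
    if (out.file == INVALID_HANDLE_VALUE)
        return false;

    LARGE_INTEGER sz{};
    GetFileSizeEx(out.file, &sz);
    out.size = static_cast<uint64_t>(sz.QuadPart);

    out.mapping = CreateFileMappingW(out.file, nullptr, PAGE_READONLY, 0, 0, nullptr);
    if (!out.mapping)
        return false;

    // Map the entire file; on 64-bit this works even for multi-GiB images.
    out.data = static_cast<const uint8_t*>(MapViewOfFile(out.mapping, FILE_MAP_READ, 0, 0, 0));
    return out.data != nullptr;
}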

Reason

Since file I/O can easily be a bottleneck, and memory is fast and cheap these days, I'd like to ask what the plans are in this department.

Examples

It's an internal thing?

stenzek commented 2 years ago

Doesn't really help complexity, because you still need the read/decompress paths for compressed games.

It'll also end up slower: we do asynchronous threaded readahead with the current system, whereas if you just mmap'ed it, the EE thread would block whenever it faults.

Cerrseien commented 2 years ago

Is the decompression done in hardware on the original PS2, or by the EE? Because if it's the EE I don't see things getting more complicated other than the usual opcode translation. And if it's done in hardware it's its own special case anyway.

To prevent faults you could use VirtualLock? It could be an alternate system until people regularly have +16 GiB of memory available (which I do). EDIT: Just noticed there's _mm_prefetch, wouldn't that do the trick too?
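As an illustration of the VirtualLock idea under stated assumptions (the image has already been mapped, and the process is allowed to grow its working set); PinView and the headroom constants are made up for the example:

#include <windows.h>

// Lock a mapped view's pages into physical memory so later accesses cannot
// page-fault. VirtualLock is bounded by the process working-set minimum, so
// the limits are raised first; the 64/128 MiB headroom is arbitrary.
static bool PinView(void* view_base, SIZE_T bytes)
{
    if (!SetProcessWorkingSetSize(GetCurrentProcess(),
                                  bytes + (SIZE_T(64) << 20),
                                  bytes + (SIZE_T(128) << 20)))
        return false;
    return VirtualLock(view_base, bytes) != FALSE;
}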

stenzek commented 2 years ago

Is the decompression done in hardware on the original PS2, or by the EE?

I'm talking about compressed game dumps (chd/cso/etc).

EDIT: Just noticed there's _mm_prefetch, wouldn't that do the trick too?

No, unless you prefetched on the second thread and faulted there. And that's adding a lot of complexity for something which is already just as efficient using standard file I/O.

To prevent faults you could use VirtualLock? It could be an alternate system until people regularly have +16 GiB of memory available (which I do).

I still see zero benefit in preloading images to memory, there's usually sufficient time between issuing the I/O request and it coming back before the simulated disc seek completes.

Cerrseien commented 2 years ago

No, unless you prefetched on the second thread and faulted there

Is that because PREFETCH doesn't complete its own execution if that would cause any fault, including a (successful) page fault?

stenzek commented 2 years ago

The prefetch instruction only fetches from main RAM into the cache hierarchy. If the pages prefetched aren't currently in physical memory, I don't believe it will fault, effectively being a no-op. You'd have to manually touch a byte on each page on the prefetch thread to ensure the file's actually been read into RAM, which was what I meant by "prefetching on the second thread and faulting".
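As a hedged illustration of that distinction (not PCSX2 code): since _mm_prefetch is only a cache hint, the reliable way to force a file-backed view resident is to actually read a byte from each page, e.g. on a worker thread. TouchPages and the 4 KiB page-size assumption are made up for the example:

#include <cstddef>

// Read one byte per page so the OS has to fault the backing file data in.
// By contrast, _mm_prefetch(addr, _MM_HINT_T0) may simply do nothing if the
// page is not resident, since it is only a cache-hierarchy hint.
static void TouchPages(const volatile unsigned char* base, size_t size, size_t page_size = 4096)
{
    for (size_t offset = 0; offset < size; offset += page_size)
        (void)base[offset]; // volatile read, cannot be optimised away
}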

Cerrseien commented 2 years ago

Yes, that's what I meant by "doesn't complete its own execution". The only possible alternative I've found would've been PrefetchVirtualMemory, which, according to the documentation, "will efficiently bring in those address ranges from disk using large, concurrent I/O requests where possible". Basically an overlapped ReadFile for file-backed mappings.
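For reference, a minimal sketch of how PrefetchVirtualMemory (Windows 8+) is called on an already-mapped view; PrefetchView is a hypothetical helper, and whether the call blocks until the reads finish is exactly what is debated below:

#include <windows.h>

// Ask the OS to bring a range of a file-backed view into physical memory.
// This is advisory: the system may satisfy it with large, concurrent reads,
// or ignore it under memory pressure.
static bool PrefetchView(void* view_base, SIZE_T bytes)
{
    WIN32_MEMORY_RANGE_ENTRY range{};
    range.VirtualAddress = view_base;
    range.NumberOfBytes = bytes;
    return PrefetchVirtualMemory(GetCurrentProcess(), 1, &range, 0) != FALSE;
}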

stenzek commented 2 years ago

Which is doing the same thing as our threaded/Async reads, just with less complexity, and works with compressed formats (you would have to call that function on a worker thread because otherwise it would likely block while it reads from disk).

We're not doing tons of little accesses all over the place, which is typically where mmap shines. They're nice big blocks, well suited to traditional file I/O (or overlapped/Async in our case).
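For context, a hedged sketch of the overlapped-read pattern being referred to (illustrative only, not PCSX2's actual reader): issue a read for one block at a given offset on a handle opened with FILE_FLAG_OVERLAPPED, and let a worker wait for completion later. QueueBlockRead is a made-up name:

#include <windows.h>
#include <cstdint>

// Start an asynchronous read of `bytes` at `offset`. With FILE_FLAG_OVERLAPPED
// the call returns immediately; ERROR_IO_PENDING just means the read is in
// flight and can be waited on later (e.g. via GetOverlappedResult).
static bool QueueBlockRead(HANDLE file, uint64_t offset, void* dest, DWORD bytes, OVERLAPPED& ov)
{
    ov = {};
    ov.Offset = static_cast<DWORD>(offset & 0xFFFFFFFFu);
    ov.OffsetHigh = static_cast<DWORD>(offset >> 32);
    if (ReadFile(file, dest, bytes, nullptr, &ov))
        return true; // completed synchronously
    return GetLastError() == ERROR_IO_PENDING;
}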

If you want to implement this, go ahead, but it's not something I can see any value in doing.

Cerrseien commented 2 years ago

I would absolutely love to, but my builds throw exceptions in the EE Core thread as soon as the emulator gets past the BIOS screen, and since the callstack's a mess I can't debug it (pcsx2_exec), with the last entries in the log being:

EE DECI2 Manager version 0.06 Mar 19 2002 18:11:29
CPUID=2e20, BoardID=0, ROMGEN=2002-0319, 32M

RegisterLibraryEntries:   sifcmd version 1.01
RegisterIntrHandler: intr INT_dmaSIF1, handler 183c0

Other than that, there are already a couple of things I've noticed in FlatFileReaderWindows.cpp that could be done slightly differently, but it's really no use if I can't even run my own compiled code. The automated builds from orphis run just fine - or at least get me to the widescreen selection screen.

refractionpcsx2 commented 2 years ago

Just continue; the EE JIT throws exceptions for self-modifying code.

Cerrseien commented 2 years ago

Thank you for the notice, but shouldn't this be in https://wiki.pcsx2.net/PCSX2_Documentation/Compiling_on_Windows? Also how do I disable those exceptions? I really don't feel like constantly F5ing my way through the emulator.

refractionpcsx2 commented 2 years ago

It's not noted, no, because it's not really a compiling instruction.

There should be an area in Visual Studio that lets you disable breaking on exceptions of a specific type; I can't remember how to do it, so you might need to use some Google-fu.

F0bes commented 2 years ago

[image] Unchecking this guy usually does the trick for me.

RedDevilus commented 2 years ago

[image] Unchecking this guy usually does the trick for me.

Yeah, only need to do that once if it crashes.

Cerrseien commented 2 years ago

@stenzek

They're nice big blocks, well suited to traditional file I/O (or overlapped/Async in our case).

Err ... OK

It gets a little bit better once synced reads aren't forced anymore, but it's still well below 2 MiB (the smallest hugepage): overlapped

But it's still a bunch of useless context switches and TLB misses.

stenzek commented 2 years ago

Yes, each block is 2K in size (for ISOs, compressed formats are larger), and sequential access is more common than random access.

Cerrseien commented 2 years ago

OK, so after some extensive research I've found out a couple things:

  1. As of this moment there is no way to create a file-backed mapping that uses either SEC_LARGE_PAGES (2 MiB) or SEC_HUGE_PAGES (1 GiB). Even though these constants are defined in the latest SDKs, CreateFileMapping2 always returns ERROR_INVALID_PARAMETER because, for stupid reasons, large/huge pages need to be locked into memory and cannot be backed by the pagefile, while file mappings always need to be backed by the pagefile.

  2. Some sources out there claim that VirtualAlloc2 can be used to allocate large/huge pages, but I at least have been failing miserably on my system (Win10 21H2). However, the somewhat undocumented function NtAllocateVirtualMemoryEx appears to be working marvelously, although it needs a proper prototype:

WINBASEAPI NTSTATUS WINAPI NtAllocateVirtualMemoryEx
(
    HANDLE ProcessHandle,
    PVOID* BaseAddress,
    PSIZE_T RegionSize,
    ULONG AllocationType,
    ULONG PageProtection,
    PMEM_EXTENDED_PARAMETER ExtendedParameters,
    ULONG ExtendedParameterCount
);

as well as static binding by adding ntdll.lib to the linker inputs. I've looked at PCSX2's current linker input dependencies and noticed that ntdll.lib isn't among them as of now.
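For illustration only, here is roughly how that prototype might be exercised for a 1 GiB ("huge") page allocation, mirroring the VirtualAlloc2 extended-parameter scheme. The exact parameter combination is an assumption based on my testing rather than documented behaviour, AllocateHugePages is a made-up helper, and a recent SDK is needed for the MEM_EXTENDED_PARAMETER definitions:

#include <windows.h>
#include <winternl.h> // NTSTATUS

extern "C" NTSTATUS WINAPI NtAllocateVirtualMemoryEx(
    HANDLE ProcessHandle, PVOID* BaseAddress, PSIZE_T RegionSize,
    ULONG AllocationType, ULONG PageProtection,
    PMEM_EXTENDED_PARAMETER ExtendedParameters, ULONG ExtendedParameterCount);

// Attempt a 1 GiB page allocation. Requires SeLockMemoryPrivilege and a size
// already rounded up to a 1 GiB multiple; returns nullptr on failure so the
// caller can fall back to the regular code path.
static void* AllocateHugePages(SIZE_T bytes)
{
    MEM_EXTENDED_PARAMETER ext{};
    ext.Type = MemExtendedParameterAttributeFlags;
    ext.ULong64 = MEM_EXTENDED_PARAMETER_NONPAGED_HUGE; // ask for 1 GiB pages

    PVOID base = nullptr;
    SIZE_T size = bytes;
    const NTSTATUS status = NtAllocateVirtualMemoryEx(
        GetCurrentProcess(), &base, &size,
        MEM_RESERVE | MEM_COMMIT | MEM_LARGE_PAGES, PAGE_READWRITE, &ext, 1);
    return status >= 0 ? base : nullptr; // NT_SUCCESS(status)
}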

The pros:

- much faster response times (even SSDs are two orders of magnitude slower than RAM; latency is important)
- much better TLB utilisation (ISOs can easily take several GiBs worth of memory, and each GiB is 262,144 4-KiB pages)
- no unnecessary copies between kernel and user space in the middle of gameplay, but only once at startup
- page faults literally cannot occur because there's no pagefile, so readahead/prefetching isn't required

The cons:

- only sustainable for people with enough RAM (minimum 16 GiB, better is +32)
- CPU caches might suffer because data is no longer written into the same buffer (but should be negligible since I/O latency is much worse)
- requires group policy editing, which is something not every user is familiar with despite there being guides

The one point I'm uncertain about is compressed images - do my findings change anything about the initial assessment?

I propose the following logic:

- add the prototype for NtAllocateVirtualMemoryEx
- add ntdll.lib as dependency
- check if SeLockMemoryPrivilege (the privilege required for larger page sizes) can be enabled; if not, use the old code path
- if file is larger than 1 MiB: use large pages; if file is larger than 512 MiB: use huge pages
- check if memory mapping can be acquired; if not, use the old code path
- copy file into our mapping whole
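To make the proposal concrete, here is a minimal, hedged sketch of the privilege check and of the documented 2 MiB large-page allocation path (the NtAllocateVirtualMemoryEx call sketched above would slot in for the 1 GiB case). EnableLockMemoryPrivilege and AllocateLargePageBuffer are hypothetical helpers, not existing PCSX2 functions:

#include <windows.h>

// Try to enable SeLockMemoryPrivilege for the current process. Note that
// AdjustTokenPrivileges "succeeds" even when the privilege was not granted,
// hence the extra GetLastError() check.
static bool EnableLockMemoryPrivilege()
{
    HANDLE token = nullptr;
    if (!OpenProcessToken(GetCurrentProcess(), TOKEN_ADJUST_PRIVILEGES | TOKEN_QUERY, &token))
        return false;

    TOKEN_PRIVILEGES tp{};
    tp.PrivilegeCount = 1;
    tp.Privileges[0].Attributes = SE_PRIVILEGE_ENABLED;
    const bool ok =
        LookupPrivilegeValueW(nullptr, L"SeLockMemoryPrivilege", &tp.Privileges[0].Luid) &&
        AdjustTokenPrivileges(token, FALSE, &tp, 0, nullptr, nullptr) &&
        GetLastError() == ERROR_SUCCESS;
    CloseHandle(token);
    return ok;
}

// Allocate a buffer backed by 2 MiB pages; the size is rounded up to the
// large-page granularity reported by the OS. Returns nullptr if large pages
// are unavailable, so the caller can fall back to the old code path.
static void* AllocateLargePageBuffer(size_t bytes)
{
    const size_t large_page = GetLargePageMinimum();
    if (large_page == 0)
        return nullptr;
    const size_t rounded = (bytes + large_page - 1) & ~(large_page - 1);
    return VirtualAlloc(nullptr, rounded, MEM_RESERVE | MEM_COMMIT | MEM_LARGE_PAGES, PAGE_READWRITE);
}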

stenzek commented 2 years ago

Have you compared the actual performance? All those pros are great in theory, but my intuition is it won't make a measurable difference in practice (disc reads are a fraction of the CPU time of the EE thread), and like I said originally, it isn't worth the added complexity.

For one, reading the entire disc image in at startup is definitely a downside, especially for those who don't have high speed NVMe SSDs. And if you do, they're probably not super large, so you'd be better off with compressed dump formats.

Cerrseien commented 2 years ago

How do you propose I measure the differences? Also, with a little bit of luck the file will still be in the buffer cache.

I can only speak for myself, but I'd rather have a longer startup time than random lags during gameplay. It's especially annoying for action RPGs and QTEs (which I have had my fair share of experience with, like in KH2) - and since large/huge pages require this very specific group policy, I'd argue that people who set it really, actually want it.

stenzek commented 2 years ago

How do you propose I measure the differences? Also, with a little bit of luck the file will still be in the buffer cache.

I'd say look at maximum FPS during an I/O-intensive part of the game, but it's more likely to be bottlenecked by the other emulation components. Could log frame times, I guess. My point is, how can we justify adding all this complexity if there's no proof that it actually makes any difference?

I can only speak for myself, but I'd rather have a longer startup time than random lags during gameplay.

Is there actually any evidence for this happening? And that file I/O is the cause? There are plenty of things which can cause frame time spikes; you'll likely find it's caused by JIT compilation, shaders, power states changing, etc., rather than disc access, as the emulated seek times give more than enough of a buffer for the host to catch up, aside from extreme cases like where the HDD goes to sleep.

Cerrseien commented 2 years ago

My point is, how can we justify adding all this complexity if there's no proof that it actually makes any difference?

The reason I didn't post any numbers was that I don't know if there are any red lines that PCSX2 doesn't want to cross, like using undocumented functions and/or using ntdll.dll. If you had told me that those lines existed, we could've stopped this right there and waited until Microsoft adds proper, documented support for this.

Since those lines do not exist (?), I assume that I can now do the testing to actually determine if it is worth the effort. My plan was to disable the frame limiter and see how fast I can go with the same configuration.

(That being said, the new code path wouldn't actually be all that complex if Microsoft hadn't lobotomised large/huge page support in VirtualAlloc2. Again, since large/huge pages cannot fault, we can get rid of the readahead/prefetch here completely, and the rest is just address calculation.)

Is there actually any evidence for this happening? [...] aside from extreme cases like where the HDD goes to sleep.

Or SSD, for that matter. Because that's exactly what had been happening to some of my older ones, and I know that because I wrote a little program that would constantly write to said drive in order to prevent sleep from happening, and that "fixed" the issue back then. Now, I've since upgraded my hardware and don't have these issues during normal gameplay anymore, so it's not like I can show you the problem - but just because my new SSD reads 3500 MB/s doesn't mean that other people's SSDs even scratch the low end of RAM speeds (or latencies). (And I don't even want to think about HDDs right now.)

stenzek commented 2 years ago

If there's a clear performance win by going down this route, I'm not against using undocumented functions so long as they're reliable. But if not, being brutally honest, I don't think it's worthwhile.

Cerrseien commented 2 years ago

If there's a clear performance win

That's all I'm asking for. :)

refractionpcsx2 commented 3 months ago

ISO precaching is now a thing, so considering this complete