Persona 5: Random font/texture swapping

UnaKaya commented 6 years ago

Hello RPCS3 Team!

First off, let me just say a million thanks to the dev team. I wouldn't have been able to play this if it weren't for you guys.

I'd like to report an issue regarding Persona 5. This usually happens after playing the game for a period of time. It's the worst when my status bar isn't displayed correctly. I'm not sure if this is a normal thing but it never did this before.

Here are my specs: CPU: i5-7600K GPU: GTX 1060 6GB OS: Windows 10 RPCS3: 0.0.4-6439-ee88e7f94

Sorry, totally new to this. I'm attaching censored screenshots of the issue and as well as the log file. I censored some details to mostly keep it spoiler-free:

rpcs3_p5visbug2 rpcs3_p5visbug1 rpcs3_p5visbug4

Log link: http://www.mediafire.com/file/23u46l5p1qewoxa/RPCS3.log.gz

Drayano commented 6 years ago

It's a pain to find out which version broke the game that much since the bug happen randomly and I haven't found a way to reproduce it yet, unless playing for extended periods of time. There is an old build i was using where i'm 100% sure the bug doesn't happen which is the 0.0.4-6307-5959411ae if someone feels like bisecting which build introduced the regression.

UnaKaya commented 6 years ago

I found that exploring the Metaverse triggers the graphic glitches much faster than when going through Palaces.

UnaKaya commented 6 years ago

BTW, this was the build I used initially.

https://ci.appveyor.com/project/kd-11/rpcs3/build/0.0.2-783

UnaKaya commented 6 years ago

Finally, I noticed that the glitches occur (or WILL occur) after a “semaphore_timeout” error appears during the log.

kd-11 commented 6 years ago

That makes this not an error then. That means another thread (usually spu/mfc) has stalled and we have to make rsx "unstuck". Of course the once that happens its in an undefined state. The current stall detection timeout is 1 second (1000000 microseconds) after which the renderer will be forced to attempt to rescue the emulator state. You can make it infinite if you like in the config file under rsx hang detection timeout variable. Of course this means the whole emulator will hang instead of just aborting the lock wait.

UnaKaya commented 6 years ago

What do you advise kd-11?

UnaKaya commented 6 years ago

I’m not so sure about the “semaphore_timeout” thing since it might just be a coincidence. What i’m sure of is that this happened when I downloaded later builds.

kd-11 commented 6 years ago

Its very unlikely to be just a coincidence - its is a fairly serious issue whenever a cell core stalls and takes down the graphics chip. Try setting a much higher timeout (in older builds the waiting was infinite) like 30 seconds (30000000) in the yml configuration. Generally once you see an error like that, or the other fatal errors about a spurs kernel crashing, expect very bad things to start happening. I'd say, find the next save spot and restart the emulator. There are a few games with very broken emulation where the renderer ignoring the stalling in cell is not a big deal but those are few and its just a lucky coincidence.

UnaKaya commented 6 years ago

Alright, thanks for the support kd-11. I’m glad I reported this, hope it gets fixed!

Thanks a million!

ruipin commented 6 years ago

I see this exact same issue. Upping the timeout to 30 seconds simply causes 30-second-long freezes every 5-10 minutes, and still gives us corrupted textures (although fewer). I tried disabling all SPU tweaks to see if any could have been the cause of the stall but nothing helped.

Reverting to rpcs3-v0.0.4-2018-02-12-95c6ac69 fixes the issue completely even with the default timeout and SPU tweaks enabled. This build does not have an unlimited timeout so that's not the reason. I've been using that older build to play through the game, but sadly it has lower performance than the more recent builds.

As such, I'm not sure whether this should be closed? This looks to me like a RPCS3 regression causing one of the threads to stall, and not a bug in the game.

ruipin commented 6 years ago

@kd-11

That makes this not an error then.

I don't understand how you come to this conclusion - it is something that does not happen when running the game on an original PS3, and does not happen on 2-months-old builds. Could you explain in more detail why this is not a bug in RPCS3?

Do not take this wrong, I'm actually curious why you think this is not a RPCS3 bug. Is there any documentation on this part of the PS3 architecture? I actually do CPU design/architecture as a profession, but even though I've considered it in the past, I have never delved into an emulator before so was wondering how to best approach debugging something like this (assuming it is a RPCS3 bug). The performance difference between the February build and the most recent build on my machine while playing P5 is significant, so I was definitely considering taking a look... (the most recent build achieves stable 30FPS basically everywhere, while the February build gets drops/stutter to the 15FPS range quite often).

MSuih commented 6 years ago

it is something that does not happen when running the game on an original PS3

IIRC this graphical glitch can actually occur on real PS3 as well, it's incredibly rare though because real hardware doesn't stall like RPCS3 does. It is not a bug with emulator but rather an unusual situation that occurs more frequently on rpcs3 and which the developers haven't prepared the game for.

The actual bug behind it is the other threads stalling and taking down rsx with it, and as KD said it affects many other games and not just Persona 5. The grapical glitches caused by it is a bug in the game code.

ruipin commented 6 years ago

@MSuih Thanks for the reply. I can understand that this issue (stalling bringing down rsx) could happen in the real console, but something must have changed to cause it to occur once every 5-10 minutes compared to a 2-months-old build where I haven't had it occur once in 8 hours of gameplay... (I've been playing with a 30-second timeout and it's not triggered even once) As you say, even assuming different stalling behavior, I would expect it to be a rare event, not something that occurs this often.

In other words the RPCS3 bug/regression I'm talking about is that the stall occurs this often, the textures becoming corrupted is only a side-effect.

PS: Is there any documentation around this part of the PS3 architecture? As a CPU designer I'm curious and would like to read more into this.

ruipin commented 6 years ago

Just for further information, considering that rpcs3-v0.0.4-2018-02-12-95c6ac69 (according to my testing) does not have the stall issues every 5-10 minutes when running P5 (it didn't occur once in +/- 8h of gameplay), but 0.0.4-6439-ee88e7f94 does (according to the first post in this thread), we can easily conclude that whatever is causing stalls to occur often was caused by the merged pull request https://github.com/RPCS3/rpcs3/pull/4162.

That pull request touched among other things the RSX. You can see the diff here

I might try to roll back some of that pull request's changes and see if I figure out the exact commit/changes that caused the change in behaviour, just need to get RPCS3 to build on my system...

ruipin commented 6 years ago

I managed to build RPCS3 in my system, and started bisecting the pull request to try and find the exact commit that caused this issue.

First, I checked out ee88e7f94 and tested how reliably I could reproduce this bug: Turns out that entering rooms in the first palace triggers this issue very often. Every time I tried, entering and leaving a single room reproduced this issue very quickly, never took longer than 1 minute.

With this easy way to reproduce the issue, I tried the following commits:

3406cc9 - Could reproduce in under 60s
b67f28e - Could reproduce in under 60s
02e571a - Could not reproduce even after 10 minutes of constantly entering rooms
89c548b - Could reproduce in under 60s

As such, I can conclude that 89c548b is at fault. I believe this issue should be re-opened and we should investigate what caused this change in behaviour.

EDIT: I've spent a few hours trying unsuccessfully to figure out what in 89c548b caused the change in behavior, randomly reverting changes. However, I can always reproduce the issue no matter what I do. Only fully reverting back to 02e571a resolves the issue. I can reproduce this both on Vulkan and OpenGL too, so it looks likely to be one of the changes in texture_cache.h, but nothing stands out... Though it is definitely suspicious that a commit that touched the texture cache causes an issue that includes texture corruption. Could it be that something is getting stuck while handling the texture cache, and that's why when RPCS3 restarts the RSX thread there are corrupted textures?

I did go back to 02e571a to double-check that the issue does not occur when using it. To this end I wrote a quick script that simply moves back the thumbstick and spams X (meaning it will enter a room and exit it repeatedly). I then left it running for about 1 hour while I cooked dinner and RPCS3 never triggered the semaphore timeout, so I am sure that 89c548b caused the issue, I just don't understand why.

ruipin commented 6 years ago

My autohotkey script has been entering and leaving rooms for the last two hours without experiencing a single deadlock. The longest I've had to wait to get a deadlock on 89c548b before was about 10 minutes. As such, I'm confident to say that the diff below (applied to 89c548b) works around this issue when using OpenGL:

index e35dc30e0..8cc70a3f9 100644
--- a/rpcs3/Emu/RSX/GL/GLTextureCache.h
+++ b/rpcs3/Emu/RSX/GL/GLTextureCache.h
@@ -812,28 +812,9 @@ namespace gl
            if (context != rsx::texture_upload_context::blit_engine_dst)
            {
                cached.protect(utils::protection::ro);
-           }
-           else
-           {
-               //TODO: More tests on byte order
-               //ARGB8+native+unswizzled is confirmed with Dark Souls II character preview
-               if (gcm_format == CELL_GCM_TEXTURE_A8R8G8B8)
-               {
-                   bool bgra = (flags == rsx::texture_create_flags::native_component_order);
-                   cached.set_format(bgra? gl::texture::format::bgra : gl::texture::format::rgba, gl::texture::type::uint_8_8_8_8, false);
-               }
-               else
-               {
-                   cached.set_format(gl::texture::format::rgb, gl::texture::type::ushort_5_6_5, true);
-               }
-
-               cached.make_flushable();
-               cached.set_dimensions(width, height, depth, (rsx_size / height));
-               cached.protect(utils::protection::no);
-               no_access_range = cached.get_min_max(no_access_range);
+               update_cache_tag();
            }

-           update_cache_tag();
            return &cached;
        }

An equivalent (untested) workaround for Vulkan would be:

index 50a42e094..337be3400 100644
--- a/rpcs3/Emu/RSX/VK/VKTextureCache.h
+++ b/rpcs3/Emu/RSX/VK/VKTextureCache.h
@@ -748,17 +748,9 @@ namespace vk
            if (context != rsx::texture_upload_context::blit_engine_dst)
            {
                region.protect(utils::protection::ro);
-               read_only_range = region.get_min_max(read_only_range);
-           }
-           else
-           {
-               //TODO: Confirm byte swap patterns
-               region.protect(utils::protection::no);
-               region.set_unpack_swap_bytes(true);
-               no_access_range = region.get_min_max(no_access_range);
+               update_cache_tag();
            }

-           update_cache_tag();
            return &region;
        }

I'm not sure what the underlying issue is, the changes above I believe simply disable blit texture caching completely (something that was added by 89c548b). However, it is certain that the above works around the issue and stops Persona 5 from deadlocking the RSX every 5-10 minutes of gameplay.

I'll try porting this workaround to master and seeing if it also works there.

@kd-11 I am not familiar enough with the way RPCS3 does texture caching to understand what the cause could be. Hopefully since this was your commit you might have an idea. @UnaKaya Could you re-open the issue?

UnaKaya commented 6 years ago

@ruipgpinheiro Done and done. Appreciate what you guys are doing! I've been reading your comments through e-mail notifications and honestly forgot that I haven't reopened the issue.

Tagging @kd-11

ruipin commented 6 years ago

Just as a further update, I ported this workaround to the latest master (i.e. removed the whole "else" statement like the above diffs) and can confirm that I am unable to reproduce the issue even after 20 minutes of opening doors (using OpenGL. at least).

Would be interesting to understand what the underlying issue might be, but sadly I'm too unfamiliar with the code to take a guess.

kd-11 commented 6 years ago

Thanks for finding the trigger @ruipgpinheiro Unfortunately its not so simple as just disabling blit engine cache, and does affect the flow of cell memory read/write instructions indirectly. In fact for that blit engine resources exist but are not guarded so their contents are undefined. I'll look into it later, its kinda really hard to debug something that happens every 10 minutes or so since you never really know if its fixed...

ruipin commented 6 years ago

@kd-11 No problem, thanks for your work!

I've been playing through Persona 5 so this issue was bothering me quite a bit. Downgrading to the old version fixed it, but I noticed that I lost some performance and there were missing effects so I wanted to figure out a workaround that gets me playing on master again. Obviously, it makes no sense to simply disable the blit engine cache for every single game, but for my purposes running a custom build with it disabled for playing Persona 5 is acceptable until the underlying issue is fixed.

As for reproducing the bug, it seems to be able to occur at any point, but loading screens will trigger it most often. The easiest way I've found is to use AutoHotKey to hold down "S" (mapped to thumbstick down) and spam "X" (mapped to the X button). Then just go to the first in-game palace, place your back against a door, and activate the script. The game will then constantly open the same door (entering and leaving the room, which triggers loading screens that take 2-3 seconds on my system) multiple times per minute. 90% of the time, the deadlock occurs in under a minute, although I've had at least one instance where it took around 10 minutes.

Stressing your computer with other stuff seems to trigger it more often as well - I would usually activate the script, wait a few minutes without anything occuring, and then when I open Youtube on another screen and start watching something suddenly I got the deadlock. I also have the impression that OpenGL makes this deadlock occur more often, but I am not completely sure - I might just be imagining things.

As for debugging this, I would be willing to look into it further, but I have no idea where to even start. The texture cache code is "strange" (and I have no graphics background, I'm a computer architecture guy). If you have any tips on how to approach this I can take a deeper look next week.

ruipin commented 6 years ago

I have now had time to test the workaround more. I sadly could not get the Vulkan version of the "fix" to work, disabling the blit engine cache is not enough, apparently.

As for OpenGL, I think i have been able to narrow the workaround further. On the latest master, this diff seems to be enough to avoid the issue from occurring (1h of opening doors with no issues):

index 41a487188..70fd0da89 100644
--- a/rpcs3/Emu/RSX/GL/GLTextureCache.h
+++ b/rpcs3/Emu/RSX/GL/GLTextureCache.h
@@ -992,7 +992,7 @@ namespace gl
                    fmt::throw_exception("Unexpected gcm format 0x%X" HERE, gcm_format);
                }

-               cached.make_flushable();
+               //cached.make_flushable();
                cached.set_dimensions(width, height, depth, (rsx_size / height));
                cached.protect(utils::protection::no);
                no_access_range = cached.get_min_max(no_access_range);

Why this is, i'm not sure, but this diff makes me think it has to do with the texture flushing behavior. Could it be that flushing blit_dst textures is causing a deadlock in certain situations?

I'll see if i can figure out a way to stop the issue from occurring on Vulkan.

ruipin commented 6 years ago

In order to confirm the above suspicion that blit_engine_dst texture flushing is related to this deadlock, I tried the following code using Vulkan:

index e01d3756a..28b98e9f6 100644
--- a/rpcs3/Emu/RSX/VK/VKTextureCache.h
+++ b/rpcs3/Emu/RSX/VK/VKTextureCache.h
@@ -132,7 +132,7 @@ namespace vk
        bool is_flushable() const
        {
            //This section is active and can be flushed to cpu
-           return (protection == utils::protection::no);
+           return (protection == utils::protection::no && context != rsx::texture_upload_context::blit_engine_dst);
        }

        bool is_flushed() const

I can confirm that with the above diff applied to the latest master, I have experienced no deadlocks on Vulkan in over 30 minutes of opening doors.

kd-11 commented 6 years ago

Well, that just disables flushing on blit engine derived sections. The problem itself isn't graphical- its deeper than that. Protected memory sections are protected as their data exists on the host GPU and not the cpu where Cell is being emulated. When the game decides it want to read something that does not exist on the cpu, an access violation is raised - this is the tricky part. If the data does not exist, its forcefully copied back, a process that is fairly quick as there is a cheap prediction algorithm to anticipate this. However, it takes enough cycles to memcpy large memory areas (framebuffers and texture regions can easily consume upto 16M) to desynchronize threads. Due to lack of memory mirrors, there is also a brief moment where the memory is unprotected, but has invalid contents. i.e Picture 6 racing spu threads running some task in parallel modifying data existing on the GPU. One thread reaches the violation handler before the others and acquires the cache mutex. It unprotects the memory range to write the gpu-resident data on the cpu side. However, another thread reaches this section late due to host scheduling and finds the memory unprotected, so it proceeds to consume what is now undefined garbage since the first thread is still writing the data. Its an architectural flaw with no simple solution and it affects many other games in subtle ways that often seem to resolve themselves. An easy way to test this it to run something with interpreters only and then surround memory flush with a system pause/resume to block everything until the transfer is completed (terrible for performance and causes microstutter). Another title affected by obvious corruption is demons souls, but minimizing the undefined transfer window greatly reduced the number of occurrences.

kd-11 commented 6 years ago

I should also mention - blit engine is really just a DMA engine, it does not necessarily transfer images/textures. Its often used to move program code around as well which is why any data its holding needs to be synchronized tightly. The dma transfer has no context around it making it impossible to determine what is being transfered.

ruipin commented 6 years ago

Thank you for the explanation. So if I understand you correctly, you're saying that the semaphore freezes are likely due to one of the threads consuming an area of memory that another thread is still writing, causing it to lock up? That could certainly explain why the freeze occurs only with the blit engine (e.g. if it starts executing incomplete code in a DMA region while another thread is busy writing it).

However, I'm not sure I understood the whole issue. The emulator seems to be aware of the on-going flush, so it could acquire a per-region mutex and prevent any further accesses to that region while the flush is incomplete, to explictly prevent accesses to the invalid contents. This obviously would sacrifice some performance due to acquiring extra locks, but at least a whole system pause wouldn't be needed and the corruption would be avoided. Since this seems like a simple solution, my assumption is that I misunderstood something.

Additonally, I'm not sure I understand then what happens when I disable flushing of blit engine sections, and how that stops the deadlock from occurring. Wouldn't it simply make the data impossible to read from the CPU? As such, wouldn't any attempt by the emulated CPUs to access a GPU memory region owned by the blit engine fail? Does the emulator simply duplicate those memory regions without synchronizing?

(For now, I'll be playing Persona with the above diff applied. It does not seem to harm the game at all, and prevents the deadlock + texture corruption issue from occurring.)

kd-11 commented 6 years ago

The access does fail! This is by design. Rpcs3 implements its own segfault handlers for this reason, access->segfault->flush and make pages available->proceed. There is no way to know what a 'region' consists of, pages can be shared between many things for example but the page will still be marked no-access if its data is not present. The memory protection is the region lock and it is not manageable with mutexes outside of guaranteed execution paths (the violation handler is mutex handled in the cache, but that means nothing if the thread does not hit the violation). We need memory mirrors, but for obvious security reasons no OS will let you do this easily. As for p5, the fact that it is indeed reading memory that doesn't exist means there is something wrong with it disabled (it will read something as all 0) but its likely too subtle to notice. The same was true before some memory respecification code was made available - most users only noticed the difference when it was working properly. There is also the possibility that the fault is not caused by reading garbage but rather a partial page with shared blit resource memory and some other code getting corrupted by the flush. Hopefully its the second simpler case as the first one is really hard to fix properly without performance loss. I have also noticed the issue has seemingly gotten worse recently - no timeouts I think, just textures getting randomly garbled so it is something to look into soon.

kd-11 commented 6 years ago

My post did little to answer your questions so I should maybe clarify a bit:

No, the emulator hangs because a cell thread does not follow the intended path and either another thread commands it to abort what its doing or it writes wrong data to the destination. Thats how nvidia GPUs are synchronized with 'semaphores'.
As for how disabling blit engine flush fixes the problem - the access violation has no writes just unprotect and proceed. Its much much faster because of this so the thread will keep running. I can write a simple test case to prove what is the cause of the situation. Just add a busy loop or sleep when flushing a blit engine resource, after unprotect but before violation handler returns.
As for synchronizing between gpu and cpu sections, thats what the access violation handler is for! Disabling flush disables this synchronization.

Since all we get rsx-side is the hardware view of the application, there isn't much you can control for. There is no context to any actions the game is doing, just random commands telling the hardware units to do stuff.

kd-11 commented 6 years ago

Ran a few tests and got interesting results.

Disable the actual flushing (the memcpy, but keep the general flow the same which is still very fast) - all good
Enable flushing but intentionally write trash - broken
Repeat 2 but guard the flush code with a full emulated system pause then resume after memory has been flushed/synchronized - All good despite intentionally writing trash (intentional corruption)
Use your workaround but make the unprotect for blit engine slow if violation is caused by a write - Always hangs on loading screen.

I also checked a few things and discovered the following:

The game uses some allocation algo so memory isnt always used for the same thing every time
The corruption indeed happens because of racing SPU threads reaching the violation handler at different times. Likely some decompression handled in parallel (its a spurs job thread). This game is known for using all available spurs jobs to concurrently operate on one task.
Wrapping flush with a system pause/resume fixes the flaw but its not perfect (I'm using recompilers so they will run ahead and stop 'at a bad time' since a block is executed in its entirety before checking emu status. This shows that its indeed the writeback race condition that causes this.

Unfortunately, other than rewriting the whole virtual memory management, the only way around this is to just 'Force CPU blit emulation' (without WCB). This game does not need it - it blits over compressed textures and the result is rejected later anyway because no current graphics API allows typecasting arbitrary RGBA data into DXT/BC.

Each run was done 5 minutes at a time with load screen spamming and some code changes to increase frequency. When using OGL its easy to make the emu deadlock without actually spewing the semaphore timeout if you know where to poke.

ruipin commented 6 years ago

Thank you very much for your time and explanation, it all makes much more sense now. I've actually had a few instances of full-on deadlocks when using OpenGL without the semaphore timing out, but I just assumed I had broken something while testing, I guess it wasn't a coincidence.

I'm not sure if there's anything else I can do at this point to help - I'm a computer architecture / hardware engineer guy, not a software engineer, so while I understand what the issue is, it's a bit out of my league to fix! Let me know if there's anything else I can do.

I guess I'll just force CPU blit (with WCB disabled) while playing this game. I'll have to test what kind of impact that has on performance - worst case, I'll just use my workaround above instead!

You mentioned before that other games have corruption because of similar reasons. Am I incorrect that these games also have no easy way to fix the issue without rewriting the whole virtual memory management system? Because from what you say, if the emulator wants full compatibility then that will be necessary at some point, right?

kd-11 commented 6 years ago

Yes, these other titles also are not fixable until we can map one physical page to more than one virtual address from user space. There are ways to make it work on windows but you're basically fighting the OS by doing so. Its not a simple task but something has to be done to ensure the operation is completely hidden from the application. Do note that using cpu blit without wcb will likely remove some stuff like the semitransparent backgrounds during pause menu and the transition between loading screens so you can use your patch in the meantime.

ruipin commented 6 years ago

PS: You've probably already thought of this, so I'm probably wrong, but wouldn't it be possible to do "double-buffering" on the memory pages? For example, do not unprotect the old page, just allocate a new page, flush the data to it, and then remap the new page to the old page's virtual address. Once that is done, deallocate the old page. (or just allocate two pages from the start and simply switch which one is mapped where).

kd-11 commented 6 years ago

That has the same problem as the other solution - you cannot freely manipulate the paging tables from userspace. No OS will just give you that kind of access.

ruipin commented 6 years ago

Hmm, off the top of my head, doesn't windows support multiple memory views of the same physical address by requesting a file mapping with an invalid handle? CreateFileMapping seems to explictly state this supported:

If hFile is INVALID_HANDLE_VALUE, the calling process must also specify a size for the file mapping object in the dwMaximumSizeHigh and dwMaximumSizeLow parameters. In this scenario, CreateFileMapping creates a file mapping object of a specified size that is backed by the system paging file instead of by a file in the file system.

Then you can map that page somewhere else using MapViewOfFileEx, no? I thought mmap on linux had something similar as well.

Anyway, I digress. I trust that you know what you are doing, so my suggestions are probably just due to my lack of domain knowledge, and are a waste of your time. Thank you very much for your help, I'll use my workaround patch above to keep playing. Let me know if there's anything I can do to help.

kd-11 commented 6 years ago

Thats what I meant when I said the current memory management code would need to be written wholly to only work with file mappings. I'm also unsure if *nix OSes support this kind of feature in a straightforward manner. We currently use VirtualAlloc/VirtualFree/VirtualProtect to manage the virtual memory for the emulated ps3 system.

ruipin commented 6 years ago

This article seems to discuss the above approach on Windows and a similar approach on POSIX systems (although the latter seems to be semi-hackish).

kd-11 commented 6 years ago

I remember having a long conversation about this very issue with other developers when I first figured it out. I cannot remember the counter-points to it, but I do remember all these options were on the table back then. Maybe the emulator has come a long way enough that this is now easier to do.

ruipin commented 6 years ago

So, I've spent some time actually playing the game with the above workaround on the latest master using Vulkan, and have some updates to this bug:

While the semaphore timeout didn't occur during loading screens anymore, there is one location in-game where it occurs all the time. If I travel to the "Station Square" location, either manually of because of cutscenes, I get a dealock every 5-10 seconds even without loading screens.

I'm not sure what could be causing it. I guess there's something in Station Square that triggers the deadlock issue, so I've just started avoiding Station Square as much as possible.

Compared to 95c6ac69, there is a lot of "new" texture corruption on scene transitions. For example, when the loading screen shows up with people walking, the black background will sometimes be colored randomly (i.e. they'll contain uninitialized memory). These "random textures" will show up very often, mainly in scene transition animations, but also in the menu and anywhere where there is UI drawn on top of a static image of the normal graphics.

I bet that these animations/UI use the blit engine, and all that's happening is the workaround is making it show uninitialized memory. I'm curious as to why 95c6ac69 did not have these issues, since as far as I understand blit engine textures were not even synchronized at all.

Nevertheless, the performance increased and graphics improvements (extra effects) from the current master are important enough to me that I'll continue playing with the workaround. Just thought I should update this bug with this new information.

kd-11 commented 6 years ago

Virtual memory mirrors are being implemented by Nekotekina, so the root cause will be resolved soon. Its not hard to see why not synchronizing blit engine magically fixes everything; the flush overwrites new data in memory as it is being written when flush is enabled. Without flush but with protection active (your workaround), you still have the problem of the access violation 'detour' which is not fully transparent to the application as the faulting thread will cause a task to take too long. This is also why you will still get deadlocks without timeout message. The main cause of problems will be resolved once memory mirrors are properly integrated (all threads will access the protected region at the same time with correct data being guaranteed). Whatever is left may take a bit more time, but will likely not be as difficult to deal with.

ruipin commented 6 years ago

@kd-11 It is great to hear a fix is actively being worked on now! Thanks for your work, can't wait to have these issues resolved correctly without need for a workaround.

ruipin commented 6 years ago

I wanted to play some more and noticed that Nekotekina added support for memory mirrors in https://github.com/Nekotekina/rpcs3/commit/5d15d64ec8e5c37eb1308ce1a139f31bd074d426 but has not yet modified the texture cache to use these.

I decided to try and build rpcs3 with the texture cache using the memory mirror functionality added by the above build. Due to my lack of knowledge of code base, this is very hackish, but basically I checked out the above commit and modified vk::cached_texture_section::flush() to use vm::get_super_ptr when calculating pixels_dst. I then modified texture_cache.h to avoid any unprotect()/remove_one() call when the memory would be reprotected after the flush and disabled deferred_flush in invalidate_range_impl_base (I probably broke something in the cache while doing these changes, but haven't noticed crashes / memory leaks yet, so I don't mind).

With those changes (which I wasn't expecting to work), I have been able to play (on Windows+Vulkan) for over one hour without a single deadlock and only two corrupted textures in total (I suspect caused by me breaking something in cache; still it is much better than with the previous workaround). I have also spammed loading screens for 30 minutes with no issues. This seems to confirm that the issues were completely related to the unprotect/reprotect race condition during flush.

Diff for those interested (Remember, this is hackish and probably has many issues... Use at your own risk.)

kd-11 commented 6 years ago

Thanks for confirming. Will add a proper fix before merging the wip pr.

Seafra01 commented 6 years ago

I am using a customized version based on 0.0.5-6675(The only edit is enabled TSX for Haswell CPUs). And I found something interesting, it seems as long as I press 'triangle' button which opens the menu in game, the texture glitch could be triggered very likely. Or not, no matter how many locations I teleport, how long I have played, the graphic is still good. Seems there is a connection between them, right? I think this might be helpful. So I decided to post here. Sorry for interrupting your dialogues.

ruipin commented 6 years ago

@Seafra01 I have also noticed that recently. Repeatedly opening/closing the menu triggers the issue very quickly. I believe it's because doing anything that involves the blit engine can trigger the issue, and P5 seems to use it a lot for the UI transitions (loading screen start/stop, opening/closing menus, entering/exiting submenus, etc).

Nevertheless, the memory mirror fix seems to resolve all of them. Now that memory mirrors have been merged to master in 5d15d64ec8e5c37eb1308ce1a139f31bd074d426 we just need to wait for the texture cache to be updated to use them.

kd-11 commented 6 years ago

https://github.com/RPCS3/rpcs3/pull/4505 should fix this issue. Note that memory mirrors have a quirk that is making it possible to reach a deadlock in some race conditions, but it does not seem to be hit by this game.

ruipin commented 6 years ago

Thank you very much for your work. Next time I play I'll let you know if I see any texture corruption at all.

From reading that pull request, am I right that memory mirrors seem to have caused a bunch of regressions? People seem to be reporting that the blit engine is now broken in many games. I'm curious why that is.

kd-11 commented 6 years ago

Its resolved. Remaining issue would need a very lengthy explanation regarding how vm locks work in rpcs3 and how violation detours can lead to deadlock. Most games are now unlikely to trigger this deadlock though.

ruipin commented 6 years ago

@kd-11 I just tested 6f1c67ed3b619e2060c35b1bfd76ecb2542204a2 and got a deadlock in the first 20 seconds of opening/closing the menu... This did not happen with my very hackish fix above (I played over 4 hours using it with no deadlocks).

EDIT: Strange, I can't seem to cause a deadlock to happen again on the same rpcs3 instance by simply opening/closing menus. Huh.

EDIT2: It happened again shortly after I caused some loading screens and started opening/closing the menu again. Also had a loading screen cause it immediately afterwards.

EDIT3: texture_cache.h changed enough that I have not been able to "hackishly fix it" like before... I've tried reinstating my changes above but everything ends up in my save game not loading or deadlocks continue occurring.

(This is with Vulkan on Windows 10, SPU threads on Auto)

UnaKaya commented 6 years ago

Hello everyone! I'm no computer expert so I'm afraid I can't contribute more to the code...but I really appreciate what you guys are doing. Please do let me know if I could be of help somehow. I feel like Mishima right now haha

Anyways, gonna support @ruipgpinheiro on this one. Tried #4505 but the graphic glitch still happened

untitled

kd-11 commented 6 years ago

@UnaKaya that is a different glitch. The glitch in question isnt the menu one, its the 3d objects in the world getting corrupted. Things like the personas themselves or random objects in the game world. This error is partially related but its not going away soon (requires memory merge support which is nontrivial and waay too much effort right now for minimal gains).

UnaKaya commented 6 years ago

Ohh, I see. But if we're talking about 3D objects getting corrupted, yep it still happens. I battled a certain boss and Ann's head was a blue mess from the Matrix, Sandalphon's wings were fuzzy, etc.

I'll take a screenshot when it happens again. Quit playing P5 for a moment since it hanged haha!

RPCS3 / rpcs3

Persona 5: Random font/texture swapping #4298