bo3b / 3Dmigoto

Chiri's DX11 wrapper to enable fixing broken stereoscopic effects.

Stereo2Mono MASSIVE FPS and CPU USAGE issue #53

Closed by helifax 6 years ago

helifax commented 7 years ago

FarCry Primal:

The fix HEAVILY relies on this function! The FPS in 3D is 50 without the fix vs. 25 with it!

DarkStarSword commented 7 years ago

Notably, lowering the water quality to one below the max (don't recall off hand if that was "very high" or "high") significantly improves the framerate in SLI.

The performance of stereo2mono is largely dependent on the 3D Vision driver - there might be some improvements we can make in 3DMigoto, and it's definitely worth measuring to make sure we aren't doing anything stupid that is needlessly costing us, but I strongly suspect that the bulk of the performance impact will come from the reverse blit operation in the driver itself, more so in SLI since it will require a heavyweight synchronisation between the cards.

The trade off is fix complexity. In this case it would be possible to avoid using stereo2mono altogether by adjusting each of the shaders responsible for drawing an object in the reflection (and only when they are drawing an object in the reflection) to reverse the stereo correction applied by the driver, so that the reflection is already swapped ahead of time. The problem is that there are hundreds of these shaders, and they are very easy to miss (lesson learned from trying this approach in Far Cry 4 and still finding more shaders a year later).

Maybe later when we have scripting support in 3DMigoto we might be able to automatically apply that type of correction to every shader used with the reflection render targets to give us the best of both worlds.

helifax commented 7 years ago

Makes sense ;) I had a feeling it had something to do with the Nvidia 3D Vision driver. It only appears at resolutions higher than 1080p though; at 1080p I don't see any performance loss. I don't know what exactly can be done though - like you said, manually fixing thousands of shaders is not really an option ;(

bo3b commented 7 years ago

Which version of Witcher3 should I use to profile this test case? The one with the dynamic cursor? (I don't have Far Cry Primal, nor SLI)

I can create a fake higher resolution with my setup, to run at 1440p on my 1080p monitor. It uses the nvidia video card to do the downsampling, so as far as the game is concerned, it is running at 1440p.

bo3b commented 7 years ago

Quick look at Witcher3 performance. 1.5% is double what we had before, which is undesirable, but not too terrible. In a CPU-bound case that extra 0.7% will have an outsized impact.

[image: Visual Studio profiler results]

helifax commented 7 years ago

Hi bo3b,

Yes, the one with the dynamic cursor. Sorry for not being able to run the performance tests to show it exactly. I am not sure if the problem is in the wrapper or the Nvidia driver (I suspect the latter, as it seems to break at bigger resolutions). I can't remember who, but someone reported the same problem in FarCry Primal (GPU and CPU never reaching full usage) until they removed stereo2mono from the ini file. I'll try to make the performance comparison today to show exactly what I mean ;) In any case it would be good to know about this limitation if it proves to be a problem in the driver.

helifax commented 7 years ago

Made a quick test on FarCry Primal (5760x1080):

With Stereo2Mono: Frames: 1535 Time : 60 sec Min: 24 Max: 26 Avg: 25.583

Without Stereo2Mono: Frames: 2453 Time : 60 sec Min: 39 Max: 43 Avg: 40.883

(I simply removed all the references to stereo2mono in the ini file - all of them)

I am attaching the full Fraps Benchmark results here: Benchmarks.zip

25 avg FPS is approx. 61% of the 41 FPS baseline. This means I am losing close to 40% in performance :-s
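A quick check of that percentage from the Fraps averages above:

```python
# Average FPS from the two Fraps runs above
with_s2m = 25.583      # stereo2mono enabled
without_s2m = 40.883   # stereo2mono removed from the ini

ratio = with_s2m / without_s2m
print(f"{ratio:.0%} of baseline -> {1 - ratio:.0%} performance lost")
```

Using the averages the ratio comes out around 63%, consistent with the "close to 40% loss" figure.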

Hope this helps;) Let me know if I can help out;)

PS: The version of the Witcher 3 fix that has the dynamic crosshair doesn't suffer from this performance loss. All newer versions of 3DMigoto have this issue, though of course that old version lacks all the new features. I am wondering if something changed after that version, as it seems to affect higher resolutions (the higher the resolution, the bigger the performance impact).

bo3b commented 7 years ago

OK, but I don't have Primal in order to be able to analyze anything. I need a test case that I can attach directly with Visual Studio, this isn't going to be something I can do with a debug log or top level performance metrics.

Maybe there is another game that demonstrates the problem?

In the Witcher3 case (which I have) does the problem show up if I replace the dlls, but keep the d3dx.ini file?

Or maybe MGS_GZ, which I also have. Or other shared account games.

helifax commented 7 years ago

Not a problem bo3b ;) I'll try to make it reproducible on Witcher 3 ;) On a single screen I can't see it (but maybe that is because I run 2x 980Tis and for a single screen the GPUs can "cope" with the overhead?). I am still very puzzled why the "older" version (which I need to look up to see what version it actually is) works perfectly while all the versions after it don't :-s

I'll try to make a reproducible test-case tomorrow;)

DarkStarSword commented 7 years ago

Did I miss something - are you saying that performance of stereo2mono was better in an older version, or are you referring to some other regression?

helifax commented 7 years ago

Yes, I expect it to be some type of regression. I didn't look in the code to see exactly what changed, so I can't really say if it is a regression or whether in the previous version the feature wasn't fully implemented.

I looked up the wrapper versions: version_compare

v.1.2.1 no performance loss (I think this is among the first versions to have the stereo2mono function, and it was the wrapper used in the first Witcher 3 fix/release).

v.1.2.40 performance loss. This is the one that I used for the Witcher 3 fix update.

Please bear with me. When I get home this evening I will try to make a reproducible scenario for Witcher 3 with both DLL files to show exactly the difference.

Edit: This is the reason I released the re-fix of Witcher 3 with 2 variants and only one has the dynamic crosshair. If I use the same code for dynamic crosshair (stereo2mono) in v.1.2.40 my performance drops from 40 FPS (5760x1080@3D) down to 5 FPS.

I'll also try to make a few videos that show it as well;)

DarkStarSword commented 7 years ago

stereo2mono was introduced in 1.2.11 - if we were using a dynamic crosshair before that it would have been the less accurate version that is calculated independently for each eye (but depending on the game may still be quite acceptable).

Wow, 5fps? Nasty

DarkStarSword commented 7 years ago

Which version should I be looking at to see the same thing you are seeing? The 1.21 variant doesn't mention stereo2mono in the d3dx.ini and 1.22 has it commented out in the HBAO+ shader override section - should I uncomment that?

helifax commented 7 years ago

Yes, I remember I took your Witcher 3 version (the one that you started updating to use stereo2mono) and it was an FPS killer for me. Hence I reverted to the old version, where the dynamic crosshair was working perfectly fine ;)

Later for SBS I had to use the newer version of 3DMigoto and I noticed the same issue, so I scrapped the dynamic and stereo2mono from it;)

I made a few videos: v.1.2.1 with Dynamic Crosshair: https://youtu.be/AJ25IaTwz2Y

v.1.2.40 without Dynamic crosshair but stereo2mono enabled: https://youtu.be/VxDd_JTqENA (Also the GPU and CPU seem to "wait" quite a lot in this case)

v.1.2.40 without Dynamic crosshair and stereo2mono disabled: https://youtu.be/WyNvFGTISXA

How to reproduce:

  1. Get the fix The Witcher 3 v.1.22 - 3DMigoto 1.2.40
  2. Run it like it is (stereo2mono disabled)
  3. In the ini file find and uncomment this line: ;post ResourceWBuffer = stereo2mono o0
  4. Some UI elements will be at depth (as I changed the shaders quite a lot, the UI will be broken).
  5. Note the FPS reduction.

I am unsure where the issue lies. Yes, as you noted, 1.2.1 is before stereo2mono was introduced :( so maybe this was always the case? Yet I can't understand why this is only an issue at higher resolutions :-s

Let me know if I can provide any more information;)

Edit: for the sake of testing I was in a light environment to show that even here there is a big problem. If I go inside a city the FPS tanks down to 5 FPS :( In a "lite" scene where only a few things are drawn I still see a 40-50% performance drop (like I see in FarCry Primal).

bo3b commented 7 years ago

Thanks for those details.

Don't know what might be happening here. I am unable to reproduce the problem with single screen, no SLI, GTX 970, driver 368.22. I'm using an in-game resolution of 2560x1440.

Running two tests back to back, one with the post stereo2mono commented out, then active, I don't see any performance impact. Running the Visual Studio profiler against the two scenarios also shows no significant difference.

                                          Frames, Time (ms), Min, Max, Avg
;post ResourceWBuffer = stereo2mono o0    411,    18722,     18,  25,  21.953
post ResourceWBuffer = stereo2mono o0     410,    18674,     19,  24,  21.956
uninstall.bat                             503,    20567,     22,  28,  24.457
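The Avg column is just Frames / Time; a quick consistency check of the runs above:

```python
# (label, frames, seconds, reported average) from the Fraps runs above
runs = [
    ("stereo2mono commented out", 411, 18.722, 21.953),
    ("stereo2mono active",        410, 18.674, 21.956),
    ("uninstall.bat",             503, 20.567, 24.457),
]
for label, frames, secs, avg in runs:
    assert abs(frames / secs - avg) < 0.01, label
print("all averages consistent")
```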

helifax commented 7 years ago

Thanks for trying it out. I wonder if this isn't an SLI-specific issue. Somebody reported it to me on FarCry Primal, hence I opened this ticket. I think that user also had SLI, in which case it might explain the extra long delays as DSS explained - having to sync both GPUs.

Out of curiosity, I know you had a GTX690 at some point. Is that card still available to you? It would be good to know if this issue is at least SLI specific;)

DarkStarSword commented 7 years ago

A little background on how stereo2mono works: nvapi exposes something they call the "reverse stereo blit", which takes a stereo resource and turns it into a double width mono resource. The documentation does not state it, but I suspect that the feature was likely intended more for debugging (such as we do when we use it for frame analysis) rather than production.

There is a rather odd quirk to the feature: It only works if the destination resource is also a stereo resource (and also 2x width, so 4x total), but it only fills in one eye of that resource - fine if we pull that back to the CPU for debugging since we lose the second eye anyway, but problematic to inject that resource back into the game where we need both eyes to get the same data. Because of this, 3DMigoto performs a second copy to turn that stereo resource back into a mono resource.
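As a rough sketch of what that quirk costs in memory, here are the resource footprints implied by the description above (the 2560x1440 resolution and 32-bit depth format are illustrative assumptions, not measured values):

```python
BPP = 4  # bytes per pixel - assuming a 32-bit depth/W-buffer format

def buf_bytes(width, height, eyes):
    """Size of a (possibly stereo) buffer in bytes."""
    return width * height * BPP * eyes

w, h = 2560, 1440  # illustrative resolution

# Game's original stereo depth buffer (one image per eye):
source = buf_bytes(w, h, eyes=2)

# Reverse stereo blit destination: 2x width AND stereo, but only one
# eye of it actually gets filled in:
intermediate = buf_bytes(w * 2, h, eyes=2)

# Second copy 3DMigoto makes so the game sees both eyes' data: 2x width mono.
final = buf_bytes(w * 2, h, eyes=1)

for name, b in (("source", source), ("intermediate", intermediate), ("final", final)):
    print(f"{name:12s} {b / 2**20:6.2f} MiB")
```

So the quirk roughly doubles the peak memory footprint of the operation, on top of requiring the extra copy.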

Hypothetically, assuming that SLI is configured such that each card is rendering one eye independently of the other (there are other modes that SLI may be doing, but I'm not considering those), the communication between the cards may look something like this:

  CARD 0             CARD 1

  -----              -----
  | 1 |              | 2 |
  -----              -----

     reverse stereo blit
from 1x width stereo resource
  to 2x width stereo resource
  |
  v       <-- 2 --
---------          ---------
| 1 | 2 |          |   |   |
---------          ---------

       resource copy
from 2x width stereo resource
  to 2x width mono resource
  |   |
  v   v
---------
| 1 | 2 |
---------
         -- 1+2 -->
---------          ---------
| 1 | 2 |          | 1 | 2 |
---------          ---------

I do not know if that matches reality - if this feature was not designed for production they may have just done enough to make it work, and not necessarily efficient - it may have additional synchronisation or communication that I am not aware of. Even hypothetically speaking, this approach clearly has excessive synchronisation since card 0 has to wait until it has the data from card 1 before the second operation can even begin.

If the feature was better designed, it could theoretically do this:

  CARD 0             CARD 1

  -----              -----
  | 1 |              | 2 |
  -----              -----

     reverse stereo blit
from 1x width stereo resource
  to 2x width mono resource
  |        -- 1 -->      |
  v       <-- 2 --       v
---------          ---------
| 1 | 2 |          | 1 | 2 |
---------          ---------

The communication between cards is still going to be expensive - there's no way around that (laws of physics and all that), but this approach eliminates the ABA dependency, leading to less synchronisation and better pipelining, and it reduces the amount of data that needs to be transferred. It would also help the non-SLI case since it reduces the amount of data that needs to be copied, but the cost is already far lower on a single GPU so the gain wouldn't be as pronounced.

But anyway, a lot of that is out of our control. The question we need to answer is if there is anything we are doing in 3DMigoto that is costing us more than it should. The increased CPU usage is probably worth looking closer at - at least trying to work out if that is coming from us or from the driver.

helifax commented 7 years ago

Yeah, it makes sense. I don't think there is anything wrong in what 3DMigoto does, and I expect it is an unfinished feature in the driver, or one not optimised for surround and SLI (or SLI for that matter).

Speaking of which, you said that v.1.2.1 had an approximation and wasn't using stereo2mono. That approximation method seems to have worked perfectly fine. I was wondering if that method is still available to use, or if we could do something like a "stereo2mono2" that uses that method instead of the one exposed by the driver. Just a question ;)

DarkStarSword commented 7 years ago

Oh yes, you can still use the approximation - just copy the depth buffer without using stereo2mono and calculate the crosshair position in each eye individually. Where it is less accurate is when the crosshair lies on the edge of an object, or when an object is partially obscured by another: using stereo2mono allows you to choose whether the crosshair lines up with the closer object or the further one, while without it one eye will line up with the close object and the other with the further object. I think I was experimenting with stereo2mono for Witcher 3 to try to make the name plates more likely to stay in front of the scene.

The code to adjust from the depth buffer is a little different - crosshair.hlsl in Far Cry Primal has up to date versions of both (adjust_from_depth_buffer vs. adjust_from_stereo2mono_depth_buffer) if you want to compare (but remember that there are often differences between games that may require some small tweaks to the code), or pull the old Witcher 3 variant of adjust_from_depth_buffer() from the git history, which was removed in this commit: https://github.com/bo3b/3Dmigoto/commit/f6efcac14201b1c55363d901197f412adc320a7a
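To illustrate that accuracy difference in miniature: the driver shifts a clip-space position by eye * separation * (w - convergence), and a crosshair fix applies the same shift using a depth value sampled under the crosshair. This is a toy model, and all the numbers in it are made up purely for illustration:

```python
# Toy model of the crosshair adjustment trade-off. The driver shifts a
# clip-space position by eye * separation * (w - convergence); a crosshair
# fix applies the same shift using a depth sampled under the crosshair.
separation, convergence = 0.05, 1.0  # made-up values

def crosshair_x(eye, w):
    # eye is -1 (left) or +1 (right); returns the adjusted clip-space x
    # for a crosshair centred at x = 0
    return 0.0 + eye * separation * (w - convergence)

# With stereo2mono both eyes read the SAME depth sample, so the shifts are
# symmetric and the crosshair converges cleanly at the chosen depth:
w_mono = 10.0
assert crosshair_x(-1, w_mono) == -crosshair_x(+1, w_mono)

# Without it, each eye samples its own depth buffer. At an object edge the
# eyes can see different surfaces, and the crosshair "splits":
w_left, w_right = 10.0, 2.0  # background in one eye, a nearer edge in the other
assert crosshair_x(-1, w_left) != -crosshair_x(+1, w_right)
```

With stereo2mono both eyes agree on w, so the fix author chooses which surface to converge on; with per-eye sampling the choice is made differently in each eye.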

bo3b commented 7 years ago

DarkStarSword, since you have SLI 970s now, would you have time to do a quick Witcher3 test to see if you can reproduce the problem? If we know for sure that it's an SLI problem, I can dig out my 690 to profile it, but I don't have a lot of time at present and only want to do that if we know for sure it's SLI.

It's always possible Primal is a better test case, since you have it.

The Witcher3 test is to use the fix here: http://3dsurroundgaming.com/3DVision/Witcher_3_1.22_3DM_1.2.40.rar

and by default it has the stereo2mono commented out. Two tests back-to-back, one commented out, one active. Top-level only average FPS is fine for a quick test, because it should be obvious.

bo3b commented 7 years ago

> A little background on how stereo2mono works: nvapi exposes something they call the "reverse stereo blit", which takes a stereo resource and turns it into a double width mono resource. The documentation does not state it, but I suspect that the feature was likely intended more for debugging (such as we do when we use it for frame analysis) rather than production.
>
> There is a rather odd quirk to the feature: It only works if the destination resource is also a stereo resource (and also 2x width, so 4x total), but it only fills in one eye of that resource - fine if we pull that back to the CPU for debugging since we lose the second eye anyway, but problematic to inject that resource back into the game where we need both eyes to get the same data. Because of this, 3DMigoto performs a second copy to turn that stereo resource back into a mono resource.

Great detail here, thanks for that.

As a data point, I'm fairly sure that the reverse_stereo_blit has other code paths that are active. When copying from the back-buffer, I'm pretty sure that I did not need to make a stereo destination. And I think this is how we do stereo snapshots during Mark. On the other hand, sometimes we get a blank eye in screenshots, so maybe it would be more reliable if it were a stereo destination.

For a full screen copy, like using the back-buffer to make top/bottom images, I'd expect that SLI data copy to be important and possibly a bottleneck. However, for small stereo images like crosshairs, I'd be very surprised if the SLI copy was significant. More likely to be synchronization, or thread blocking.

helifax commented 7 years ago

Yeah, it definitely looks like a LOCK someplace is happening as both GPU and CPU seem to be lower than without it... :(

DarkStarSword commented 7 years ago

I'm only seeing about a 5fps hit from using stereo2mono in Witcher 3 with SLI 980s at 1920x1080 - roughly 45fps without it down to 40fps with it enabled. With SLI disabled I get about 30fps regardless of whether it is enabled or disabled. Notably I haven't updated drivers in a few months - I'm still on 368.81. Also worth noting that I'm on Windows 10 right now (I can repeat the test on Windows 7 later) and version 1.22 of Witcher 3.

DarkStarSword commented 7 years ago

Ok, I repeated this with DSR to simulate a higher resolution. I only checked the non-SLI case at a couple of resolutions, since even at 4xDSR stereo2mono did not seem to have any measurable impact. I had to vary the quality settings as specified below to get a starting fps in SLI that was high enough to measure the impact without exceeding the 60fps cap - I aimed for mid to high 50s. In each case postprocessing was set to the lowest preset, except that HBAO+ was always enabled since the stereo2mono directive is triggered by it:

1920 x 1080 (16MB stereo depth buffer), Ultra quality - No SLI: no measurable effect on fps; SLI: ~5-8fps hit

1.2xDSR 2103 x 1183 (22MB stereo depth buffer), Ultra quality - SLI: ~23fps hit

1.5xDSR 2351 x 1323 (24MB stereo depth buffer), High quality - SLI: ~23fps hit

1.78xDSR 2560 x 1440 (28MB stereo depth buffer), Medium quality - No SLI: no measurable effect on fps; SLI: ~26fps hit

2xDSR 2715 x 1527 (34MB stereo depth buffer), Medium quality - SLI: ~26fps hit

2.25xDSR 2880 x 1620 (36MB stereo depth buffer), Low quality - SLI: ~30fps hit

3xDSR 3325 x 1871 (48MB stereo depth buffer), Low quality - SLI: impossible to measure, only getting 2fps in 3D even on lowest settings

4xDSR 3840 x 2160 (64MB stereo depth buffer), Low quality - No SLI: no measurable effect on fps; SLI: impossible to measure, only getting 3fps in 3D even on lowest settings

Definitely limited to SLI.
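Sanity-checking the buffer sizes quoted in that table: most of them line up with width x height x 4 bytes x 2 eyes (a couple of entries differ slightly, presumably from rounding of the exact DSR dimensions):

```python
BPP = 4  # assuming a 32-bit depth/W-buffer format

def stereo_mib(w, h):
    # Size of a stereo buffer (both eyes) in MiB
    return w * h * BPP * 2 / 2**20

# Spot-check a few of the quoted sizes:
assert round(stereo_mib(1920, 1080)) == 16
assert round(stereo_mib(2560, 1440)) == 28
assert round(stereo_mib(2880, 1620)) == 36
print("sizes consistent with a 32-bit stereo buffer")
```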

At first glance the performance hit appears to scale with the size of the depth buffer - and that definitely is a factor, since it determines how long a transfer takes (and therefore the worst case for how long the other GPU must wait if it needs all the data)... but after varying the quality so much to get those numbers I noticed a few outliers that suggest there is another factor at play. E.g. in the table above I was aiming for a starting fps in the 50s, but when I tried 2560x1440 at Ultra quality (resulting in a starting fps in the 40s) I only got a ~13fps hit - if anything I think the final fps was about the same regardless of quality. I'd need to perform more tests and record more data to confirm.

One explanation that might account for that is that at a higher quality setting the GPU may have other tasks it can get on with while it waits for the data, whereas at lower settings it has fewer tasks to do. That means the time from the HBAO+ shader (where the copy is initiated) to the first HUD shader that needs the buffer is shorter, and if the transfer has not completed by then the GPU has to waste time idling until it does.

Keeping in mind my previous explanation of how stereo2mono works, the minimum amount of data transferred between GPUs (assuming they are as efficient as they could be) is half the value quoted above in one direction plus the full value in the other (so 1.5x total). The second transfer cannot begin until the first has fully completed, and the HUD shaders cannot begin until the second transfer has fully completed.
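That 1.5x figure, worked through for the 2560x1440 case (a lower bound - the real driver may transfer more):

```python
buf_mb = 28  # 2560x1440 stereo depth buffer, from the table above

# Step 1: reverse stereo blit - card 0 needs card 1's eye (half the buffer).
# Step 2: stereo->mono copy - the full-size result goes back the other way.
# Step 2 cannot start until step 1 completes, so the two serialise:
transfer_mb = 0.5 * buf_mb + buf_mb
print(f"at least {transfer_mb:.0f}MB crossing the inter-GPU link per frame")
```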

DarkStarSword commented 7 years ago

I've been thinking about this idea for a while (for other reasons), but what would happen if we downscaled the depth buffer using a custom shader before we did the stereo2mono copy? I doubt it would make a significant difference to the accuracy, but it would significantly reduce the amount of data to transfer. For some games, if we know we only need the depth at a specific location (e.g. a fixed crosshair position in first person games) we could go one better and use a custom shader to write just that value into a tiny custom resource, so we then only have something like 4 bytes to copy to the other GPU.

These ideas don't get away from the fact that there is significant latency associated with any communication between GPUs regardless of how much data is transmitted, but would mean that the transmissions take less time from start to finish... might be worth experimenting with.

It doesn't help us for FCPrimal though, since in that case we need to transfer a full colour buffer around for the reflection... although... if memory serves the size of that buffer is one of the things that the water quality setting affects (the other being how accurately the ripples are simulated and whether they follow the contours of the river or are just static), and as I already noted lowering the water quality by one notch significantly helps the SLI performance in that game... that would seem to suggest that the size of the buffer can be significant (although there may be other factors at play - I'd need to do some experiments there to confirm)...

helifax commented 7 years ago

Big thank you for making those tests. I saw you managed to reproduce it quite nicely.

What I remember reading is that in order to sync both GPUs some of the data goes over the private bus (SLI connector), but the huge chunk goes through the PCI-E lanes. Only the new SLI connector is supposed to have 4x the bandwidth of the old one, with all the sync data shared via SLI rather than through the PCI-E slot. Whether a Pascal SLI setup with the new connector would solve anything here is hard to say, but yes, it seems this is a problem with SLI and the reverse blit.

I am wondering if we can't actually upscale: render the buffer internally at a smaller resolution (3840x720) and then scale it to 5760x1080, for example. Sure, we will lose some "accuracy", but we should gain performance - if transferring the buffer data is the problem ;) It would be interesting to see if this is the case ;)

Another idea/question I have: when stereo2mono is called (for all shaders), is the reverse-blit operation done every time, or just once when we specify the shader to copy the depth buffer from, with later uses just referencing the depth-buffer copy? If it runs more than once, I wonder if we can't just run the reverse-blit once and then re-use the result. I bet you understand what I want to say exactly ;)

DarkStarSword commented 7 years ago

In this case the reverse stereo blit will only be performed once since it's triggered by the HBAO+ depth pass shader, which only runs once per frame. The HUD shaders just take a reference to it, which has essentially zero cost (provided that the GPU has finished the copy). You can check that pretty easily by setting the global analyse_options=log and running frame analysis, then searching the resulting log.txt for stereo2mono:

$ grep stereo2mono FrameAnalysis-2016-11-28-150230/log.txt
000929 3DMigoto resourcewbuffer = stereo2mono o0

tip: all the features that fall under the "command list" category are logged in the same way, and are prefixed with "3DMigoto", making them easy to find.

helifax commented 7 years ago

Yeah, I had a feeling this is the case. So it boils down to inter-GPU communication and/or the amount of data required to transfer between GPUs :(

bo3b commented 7 years ago

Just a note of caution here for this conclusion. I've done a lot of performance analysis (performance tech lead on MacOS), and it's always risky to jump to conclusions without doing the tests required to back it up.

Building a working model of a given bug is really valuable as a hypothesis. But you have to have the mental discipline to not conclude you are right before testing the hypothesis. If you do, you can spend a lot of time and effort 'fixing' something that isn't in fact the problem. In my experience debugging deep OS-level problems, the engineers' best guesses and mine were always wrong. We never picked the proper hypothesis to start, in 5 years.

In this case, I don't think we have proved that it is SLI transfer time. It's a great working model, but we still need to prove or disprove that theory.

Since DarkStarSword confirms this is an SLI and DSR based problem, I will try to set up a test with that here, to get a code-based profile.

Another way to test this would be to create a sizing experiment, where you transfer different sized texture maps to see whether the performance problem scales (presumably linearly or worse) as they get bigger.

DarkStarSword commented 7 years ago

Well, yeah I agree with you - but then... you never had me on your team back then ;-) Here's a smoking gun for you:

SLI, 1.78xDSR 2560 x 1440, Medium quality, High postprocessing sans motion blur, HBAO+ enabled

No stereo2mono: ~53fps
Full resolution (28MB) stereo2mono: ~30fps
Downsampling stereo WBuffer to 1/2 resolution (7.1MB) prior to stereo2mono: ~50fps
Downsampling stereo WBuffer to 1/4 resolution (1.8MB) prior to stereo2mono: ~52fps
Downsampling stereo WBuffer to 1/8 resolution (451KB) prior to stereo2mono: ~53fps
Downsampling stereo WBuffer to 1/16 resolution (384B) prior to stereo2mono: ~53fps

d3dx.ini:

; Original stereo WBuffer from the game:
[ResourceWBuffer]

; Downscaled versions, still stereo - this allows each GPU to downscale their
; WBuffer separately before synchronising them over SLI. We use separate
; resources for each phase so that 3DMigoto can cache them efficiently:
[ResourceWBufferHalf]
width_multiply = 0.5
height_multiply = 0.5
[ResourceWBufferQuarter]
width_multiply = 0.25
height_multiply = 0.25
[ResourceWBufferEighth]
width_multiply = 0.125
height_multiply = 0.125
[ResourceWBufferSixteenth]
width_multiply = 0.0625
height_multiply = 0.0625

; stereo2mono version - populating this can have a significant performance cost
; in SLI at high resolutions (above 1920x1080), so we downscale first:
[ResourceWBufferStereo2Mono]

[CustomShaderDownscaleWBuffer]
vs = ShaderFixes/fullscreen.hlsl
ps = ShaderFixes/downscale_half.hlsl
blend = disable
cull = none
topology = triangle_strip
ResourceWBuffer = ref o0

; Half:
ResourceWBufferHalf = copy_desc ResourceWBuffer
o0 = ref ResourceWBufferHalf
ps-t100 = ref ResourceWBuffer
draw = 4, 0

; Quarter:
ResourceWBufferQuarter = copy_desc ResourceWBuffer
o0 = ref ResourceWBufferQuarter
ps-t100 = ref ResourceWBufferHalf
draw = 4, 0

; Little benefit to continuing on my system, but we could if we wanted:
; ; 1/8th:
; ResourceWBufferEighth = copy_desc ResourceWBuffer
; o0 = ref ResourceWBufferEighth
; ps-t100 = ref ResourceWBufferQuarter
; draw = 4, 0
;
; ; 1/16th:
; ResourceWBufferSixteenth = copy_desc ResourceWBuffer
; o0 = ref ResourceWBufferSixteenth
; ps-t100 = ref ResourceWBufferEighth
; draw = 4, 0

; Uncomment the smallest we went to:
;post ResourceWBufferStereo2Mono = stereo2mono ResourceWBufferHalf
post ResourceWBufferStereo2Mono = stereo2mono ResourceWBufferQuarter
;post ResourceWBufferStereo2Mono = stereo2mono ResourceWBufferEighth
;post ResourceWBufferStereo2Mono = stereo2mono ResourceWBufferSixteenth

; Restore state:
post ps-t100 = null
post o0 = ref ResourceWBuffer

[ShaderOverrideHBAODepthPass]
Hash = 170486ed36efcc9e
post run = CustomShaderDownscaleWBuffer

And change ResourceWBuffer to ResourceWBufferStereo2Mono elsewhere in the file.

ShaderFixes/fullscreen.hlsl:

void main(
                out float4 pos : SV_Position0,
                uint vertex : SV_VertexID)
{
        // Not using vertex buffers so manufacture our own coordinates.
        switch(vertex) {
                case 0:
                        pos.xy = float2(-1, -1);
                        break;
                case 1:
                        pos.xy = float2(-1, 1);
                        break;
                case 2:
                        pos.xy = float2(1, -1);
                        break;
                case 3:
                        pos.xy = float2(1, 1);
                        break;
                default:
                        pos.xy = 0;
                        break;
        };
        pos.zw = float2(0, 1);
}

ShaderFixes/downscale_half.hlsl:

Texture2D<float4> t100 : register(t100);

void main(float4 pos : SV_Position0, out float4 result : SV_Target0)
{
        float x = pos.x * 2;
        float y = pos.y * 2;

        result  = t100.Load(float3(x + 0, y + 0, 0));
        result += t100.Load(float3(x + 1, y + 0, 0));
        result += t100.Load(float3(x + 0, y + 1, 0));
        result += t100.Load(float3(x + 1, y + 1, 0));
        result /= 4.0;
        result.w = 1;
}
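For reference, the data reduction per pass: each run of downscale_half.hlsl averages 2x2 texels, so every level is a quarter the size of the previous one. The sizes below assume the 2560x1440 case measured above with 32-bit texels:

```python
BPP = 4  # assuming 32-bit texels

# Each pass quarters the data (half the width, half the height):
sizes = []
size = 2560 * 1440 * BPP * 2  # stereo buffer at 1.78xDSR, both eyes
for level in ("full", "half", "quarter", "eighth"):
    sizes.append((level, size))
    size //= 4

for level, b in sizes:
    human = f"{b / 2**20:.2f} MiB" if b >= 2**20 else f"{b / 2**10:.0f} KiB"
    print(f"{level:8s} {human}")
```

which roughly matches the 28MB -> 7.1MB -> 1.8MB -> ~450KB progression in the measurements.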

DarkStarSword commented 7 years ago

You know what we need... conditional logic in the command lists so we can dynamically enable downscaling only when needed (SLI + high resolution), plus a whole bunch of other possibilities that would open up (Hell, the only reason I haven't added it already is because there's no good syntax for it that still conforms to a standard .ini file and I haven't decided how much I care about that). Even better if we could use it combined with actual performance numbers from the GPU (I'm thinking ID3D11Device::CreateQuery and co)...

helifax commented 7 years ago

Wow, very interesting find. So if we downscale the buffer first, we do get the performance back. I did some quick calculations:

Full resolution (28MB) stereo2mono: ~30fps - sustaining 50 FPS would require syncing 1400 MB/s between the GPUs. In practice we get 30 FPS, which means we can only move about 840 MB/s.

Downsampling stereo WBuffer to 1/2 resolution (7.1MB) prior to stereo2mono: ~50fps - this means we only need to send 355 MB/s between the GPUs at 50 FPS, which is clearly doable since it looks like we are hitting the bottleneck around 1GB/s (I say 1GB/s and not 840 MB/s as the rest is surely used by the GPUs to share other data).

This CLEARLY is in sync with this: (https://en.wikipedia.org/wiki/Scalable_Link_Interface#Implementation)

NVIDIA has 3 types of SLI bridges:
Standard Bridge (400 MHz pixel clock, 1GB/s bandwidth)
LED Bridge (540 MHz pixel clock)
High-Bandwidth Bridge (650 MHz pixel clock)

Again, 1000 MB/s is the max we can send = 100%. We are trying to send 1400 MB/s = 140%, so 40% of the data can't be sent in time.

This is exactly what we see in the real-life scenario: 50 FPS = 100%, 30 FPS = 60% - we lose 40% of the performance ;)

Which shows the problem is here, as seen both in the bandwidth numbers and in the framerate data.
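The same arithmetic in Python, treating the standard bridge's quoted 1GB/s as the only path (an assumption - some traffic also goes over PCI-E):

```python
fps_target = 50
full_mb, half_mb = 28.0, 7.1  # depth buffer sizes from DSS's measurements

# Bandwidth needed to sustain the target frame rate:
need_full = full_mb * fps_target  # well over a standard bridge's 1GB/s
need_half = half_mb * fps_target  # comfortably under it

bridge = 1000.0  # MB/s, standard SLI bridge figure quoted above
print(f"full res needs {need_full:.0f}MB/s, half res needs {need_half:.0f}MB/s")
print(f"bridge alone would cap full-res syncing at ~{bridge / full_mb:.0f}fps")
```

A cap in the mid 30s from the bridge alone is at least in the same ballpark as the ~30fps observed, though as bo3b says, this remains a hypothesis until profiled.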

Very awesome, DSS, that you already implemented that one! (I expect this version of 3DMigoto is not released yet, but when it is, can I ask you to also update the Witcher 3 fix with this method? It would serve as an example of what I need to do for FarCry Primal ;) or other games that use this feature.)

Bo3b, I get exactly what you mean! And although the numbers and theory seem to fit, I think extra profiling would be very beneficial ;) Too bad there isn't a way I know of to profile the SLI bus and see if the problem is there, like we believe it is. If you know of any tools that might allow us to do that, it would be awesome! MSI Afterburner doesn't seem to have one, and I looked in NVAPI but couldn't find anything that would return the bus load on the SLI bridge :(

Thank you guys, much appreciated!

DarkStarSword commented 7 years ago

Actually, that doesn't need anything new in 3DMigoto - we've had copy_desc with width & height multiply for months, but I think only oomek used it for his non 3D mod before now - the challenge is mostly just understanding how to combine all the various features together to achieve whatever we want, but there is very little left that we can't do :)

Very interesting info on the SLI bandwidth - if we can add support for dynamically deciding whether to downsample and how much by that could provide a good target to aim for :)

I can go ahead and update Witcher 3 and FCPrimal if you like - FCPrimal is actually one of the games I was already thinking of experimenting with using downsampling to smooth out the depth buffer to see if I could make the HUD a little less jumpy. I hadn't planned to downsample the reflection buffer though - if you kill the stereo2mono in each of the [ShaderOverrideAmbient*] sections (which are the depth buffer for the HUD) leaving the ones in the reflection shaders enabled and turn the water quality setting in game down a notch or two how is your frame rate?

This also gives me an idea of how to improve the SBS shader in SLI - once I add conditional logic to the command lists I can downscale the back buffer in whichever direction suits the current setting before doing the stereo2mono to cut the bandwidth requirement in half :) Now... just got to decide what syntax I want to use for the conditional logic...

bo3b commented 7 years ago

Well, yeah I agree with you - but then... you never had me on your team back then ;-) Here's a smoking gun for you:

Heh! Oh man, it would have been awesome to have you on our team back then. :-> We had a lot of good people, but almost no one ever understood what I was talking about.


This is a good example of what I meant by making a good test case, or getting exact proof. Getting a smoking gun is the key aspect of performance profiling. In this case, as Helifax notes, our current tools, including VS profiling, aren't going to show stuff at the SLI transfer layer. And since that's the working theory, it's better to do what you did here by making a specific test case.

It's really interesting that it follows a step function effect, instead of a curve for the performance. That's consistent with a hard bottleneck like the SLI bus.

I can't take a deeper look at this right now, but if I were doing the experiment, I would want to further prove the theory by using DSR to tune different resolutions, and see if I couldn't confirm the 1GB/s bottleneck, using a 1/2 size copy. i.e. use half-size copy which fixed the problem, then be able to reproduce the problem by scaling up DSR until it kicked over 1GB/s. Scaling up DSR is convenient because it's % based.
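That DSR experiment could be planned with the same back-of-the-envelope model - assuming bridge traffic scales linearly with pixel count, the theory predicts roughly where the half-size copy would saturate the bridge again (all figures are the thread's estimates, not measurements):

```python
# Predict at which DSR factor a half-size stereo2mono copy would again
# saturate a ~1 GB/s SLI bridge. Assumes bridge traffic scales linearly
# with pixel count; all figures are estimates from this thread.

BRIDGE_LIMIT_MBPS = 1000
TARGET_FPS = 50
HALF_COPY_MB = 7.1  # half-resolution WBuffer size at native res (estimate)

# DSR factors are pixel-count (area) multipliers, e.g. 4.00x = 2x per axis
for dsr_factor in (1.00, 1.20, 1.50, 1.78, 2.25, 4.00):
    mbps = HALF_COPY_MB * dsr_factor * TARGET_FPS
    verdict = "saturates bridge" if mbps > BRIDGE_LIMIT_MBPS else "fits"
    print(f"DSR {dsr_factor:.2f}x: {mbps:6.1f} MB/s -> {verdict}")

# Crossover: the theory predicts the frame rate should drop again once
# the DSR factor exceeds BRIDGE_LIMIT_MBPS / (HALF_COPY_MB * TARGET_FPS),
# i.e. somewhere around 2.8x
```

If the measured frame rate drops near that predicted crossover, that would be a further point of confirmation for the 1 GB/s theory.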

bo3b commented 7 years ago

For syntax, as a modest suggestion, maybe it makes sense to just add another operator to your copy operation?

Something like a scale_half, scale_quarter, or scale_eighth.

e.g.

Hash = baz
ps-t50 = o2 unless_null scale_half

Where the idea would be that as a general fix, no one needs to worry about this, as it's a rare scenario to hit, but it's easy for an SLI user at 4K or Surround to add the scale_quarter operator to a specific known copy to solve their problem.

If possible, I'd prefer that we don't complicate the 95% case, and make the complexity an add-on where it's needed.

DarkStarSword commented 7 years ago

Do you mind if I merge the three Witcher 3 folders together taking Helifax_Update_1.22_3DM.1.2.40 to be the master? It's a little confusing having three copies separated like that - IMO the git history and released zip files should be more than enough to retrieve old versions if we need them, and we should have a single copy in master as the current work in progress that will eventually become the next release.

In my own tree I only maintain separate folders if I have both DX9 and DX11 versions of a fix since there is no overlap - if I need a slight variant of a fix for any other reason I use a git topic branch (if I need both 32bit and 64bit versions I do use subdirectories - but only for the DLLs).

Also, we should lose the precompiled DLL files - those don't belong in a source tree (in my own tree I use symbolic links that my mkrelease.sh script turns into the real thing).

bo3b commented 7 years ago

Do you mind if I merge the three Witcher 3 folders together taking Helifax_Update_1.22_3DM.1.2.40 to be the master? It's a little confusing having three copies separated like that - IMO the git history and released zip files should be more than enough to retrieve old versions if we need them, and we should have a single copy in master as the current work in progress that will eventually become the next release.

Definitely seems like the way to go. I had left the older version with the dynamic crosshair, on the off chance that someone would have preferred that over the latest version which includes SBS/TB.

The default right now in the primary fix instructions is the dynamic crosshair version 1.21, with the newer version for people who need SBS/TB.


I don't think we need to keep older versions (especially if we can regenerate them from github), unless there is something compelling about them. Like they work on older versions of drivers, or alternate game versions, and the latest version isn't a superset.

DarkStarSword commented 7 years ago

That's a good suggestion for the syntax to turn this particular custom shader into a convenience feature, but that falls under the category of providing a library of convenience functions that we can call that expand to a number of predefined custom resource and shader sections, as well as the shaders themselves, and if we want to go down that path (which I think eventually we do, but it's not a high priority for me) we should probably consider a more generic syntax that accepts any number of inputs and outputs.

But right now I was thinking more about the syntax for adding generic conditional logic to the command lists. There's been a few times I would have liked this, but so far I've always had a good alternative like moving the logic into the shaders.

For example, at the moment we ship with the run line for the SBS shader commented out so that there is no performance hit for people who don't use it. The SBS vertex shader includes some conditional logic to abort early if SBS is disabled, but by that point we have already run the stereo2mono so SLI users have already paid the price. If we had conditional logic in the command list, we could do something like:

[Present]
if (Stereo_Enabled && x7 != 0) {
    run = CustomShader3DVision2SBS
}

And for dynamically deciding whether to downsample we could do something like:

[ShaderOverrideHBAODepthPass]
Hash = 170486ed36efcc9e
if (SLI && (o0.Width * o0.Height * 4 > SLI_Bandwidth)) {
    post run = CustomShaderDownscaleWBuffer
} else {
    post ResourceWBufferStereo2Mono = stereo2mono o0
}

The new lines don't necessarily have to look like C - that's just one possibility (and possibly they should be something else so people don't try to put everything on one line - replacing the curly brackets with keywords "else" and "endif" might help ensure people don't do that), and of course the above example requires a full syntax tree, which is more complicated than the first pass I was thinking of starting with (but a decent goal to work towards because of the power it offers).

But no matter what I think they could look like, I can't see any good way to make them look like a regular ini file because there is no place for an = sign in most of those lines. The current ini parsing API we use allows me to parse lines without an = sign, so that's not a problem - this really just boils down to aesthetics, and whether we care about it looking like an ini file or not (as it is the order of lines in the file is now significant in the command list sections and duplicate lines are allowed in those same sections, so in a way we are already not a pure ini file).

bo3b commented 7 years ago

There is always that tension between adding features and resulting complexity.

We can definitely add more syntax here, but it seems like we are straying farther and farther from a simple .ini file and maybe should consider dropping the weak sauce .ini parsing code for something more modern like an xml parser or whatever is best today.


For the SBS/TB in particular, that seems like a feature that is so compelling that we might be better served by making it a direct part of the code, instead of the current implementation that is leveraging your command lists.

The reason to consider that would include being able to have a simple description in the .ini like enable_SBS instead of the current "uncomment the line", or the equally obscure "set x7=1". I get a question about this about once a week on assorted HelixModBlog pages where people can't get it working. It would also allow us to do smarter early exit checks when it's disabled, and make it easier to manage SLI support.

However, the current setup nicely handles the cycling of variants, which would be weird to add directly to code.


For SLI support, if we are certain that it's a hard 1GB/s limit, that seems like something that we also would rather have directly encoded in 3Dmigoto, instead of using commands lists.

It would be a better end-user experience to have it automatically decide the optimal buffer size, which it could decide based on screen size, SLI support, and texture size. Maybe even for combined results of multiple copies required for a single frame.

Auto-sizing based on SLI bandwidth would reduce complexity for the end-user. But I might be missing other use cases you are thinking of where flexibility is more valuable.

BTW, when I say end-user, I mean the casual user just using our fixes. Not a ShaderHacker end-user of 3Dmigoto. Our casual users are already nearly stumped by the complexity of installing fixes and the various workarounds and profile setups, and .ini tweaks. Wherever possible I'd really like to preempt those questions by making stuff native, because answering the same questions over and over is a drag.

DarkStarSword commented 7 years ago

I've been thinking this for a while, but anything we expect an end user might want to modify might be better off going into an entirely new section that we can put right at the top of the d3dx.ini, then we reference those settings by name elsewhere in the file. That solves the problem where some fixes currently ask the user to modify specific values in the [Constants] section and we can actually give them a name, and adding the conditional logic I'm proposing would also solve it for cases like the SBS shader and this downsampling.... but the shaderhacker who authored the fix would have to have included the conditional logic to make it work - it doesn't do anything to simplify the flow in the case where a fix does not include those sections already. A quick mockup of what this might look like:

[Options]
enable_sbs = true
enable_auto_hud = true
; The following option improves performance in SLI at high resolutions:
downsample_wbuffer = 2

[Constants]
y = enable_auto_hud

[Present]
if enable_sbs then
    run = CustomShader3DVision2SBS
endif

[ShaderOverrideHBAODepthPass]
hash = ...
if downsample_wbuffer then
    post run = CustomShaderDownscaleWBuffer
else
    post ResourceWBufferStereo2Mono = stereo2mono o0
endif

And this would potentially integrate quite well as a way to provide user parameters into any convenience functions we supply later on that would remove the need to ship the custom shader sections we currently do. Of course this in itself would be a bit of work to integrate into all the existing code so that it works as expected everywhere someone might try it, and not just in specific commands, and we'd probably need to reserve a bunch of names that we already use (x1, y1, etc), or use $variables instead, but it's doable... what do you think?

DarkStarSword commented 7 years ago

Oh btw - one potential problem with automatic downsampling in 3DMigoto is we can't know the most appropriate downsampling technique - here I just used an average of four pixels, which is fine for a linear WBuffer, but probably not ideal for a logarithmic ZBuffer - it will probably work well enough in most cases, but it wouldn't always be the correct result - how much that will matter in practice I'm not sure. At least something to keep in mind anyway. For a colour buffer we would ideally use the average of the pixels squared (most graphics programs get this wrong, even photoshop, so not a deal breaker if we don't), but if it is a HDR colour buffer that would be wrong... See the dilemma?

It would also be problematic for games which use a special value on the ZBuffer to indicate a pixel hasn't been drawn (although I'm not sure if I've hit that in DX11 games - I certainly hit it a lot in DX9), and averaging that might throw out the depth calculations enormously (there's one spot on the map in Miasmata where something similar to this occurs while looking out towards the horizon, which completely throws the crosshair out of whack at that location).

Also, in some cases we might not want the average - the minimum or maximum values might be more appropriate depending on what we are doing (and they have the benefit of not mattering if it is a linear or logarithmic depth buffer).
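The dilemma in the last few paragraphs can be made concrete with a toy 2x2 reduction (plain Python, not actual shader code - the value 1.0 standing in for an "undrawn" far-plane sentinel is a hypothetical example):

```python
# Sketch of the downsampling dilemma described above: the "right" 2x2
# reduction depends on what the buffer holds. Toy example, not shader code.

def downsample_2x2(block, mode="avg"):
    """Reduce a 2x2 block of depth values to a single value."""
    if mode == "avg":
        return sum(block) / len(block)   # fine for a linear WBuffer
    if mode == "min":
        return min(block)                # nearest sample wins
    if mode == "max":
        return max(block)                # farthest sample wins
    raise ValueError(mode)

# Normal case: all four pixels drawn, values are similar
print(downsample_2x2([0.40, 0.41, 0.42, 0.43], "avg"))  # ~0.415, reasonable

# Pathological case: one pixel holds a "not drawn" sentinel (say 1.0)
print(downsample_2x2([0.40, 0.41, 0.42, 1.0], "avg"))   # ~0.5575, thrown way off
print(downsample_2x2([0.40, 0.41, 0.42, 1.0], "max"))   # 1.0, and min/max also
                                                        # don't care whether the
                                                        # encoding is linear or log
```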

Of course, if we add downsampling as an explicit modifier to stereo2mono we could have a couple of variants. Perhaps we choose one automatically if SLI is enabled and the resolution is large enough, but allow the modifier to override that or disable downsampling.

@helifax I've updated Witcher 3 in the git repo with the downsampling code and brought it up to parity with the latest d3dx.ini template and the changes that were suggested in https://github.com/bo3b/3Dmigoto/pull/54, but I haven't brought back the adjustment in the HUD shaders, which I guess was removed in 1.22? Just wanted to check in with you before I do anything else in case you had HUD code you wanted to use or any other changes?

BTW I'm currently working on getting 3DMigoto to be able to write the profile directly, and this is one of the main games that is intended to help... but I notice there are now two separate profiles suggested - I haven't examined the differences between them, but is there a particular reason that old drivers need the old profile?

helifax commented 7 years ago

Hi Guys,

Regarding the syntax I would go with something easy like:

post ResourceWBufferStereo2Mono = stereo2mono 0.65 o0 -> 65% size
post ResourceWBufferStereo2Mono = stereo2mono 1.00 o0 -> 100% size

This way we can scale it as much as we want, and people can easily just change a parameter on a line to scale it up/down. I don't know how feasible it is to implement, but I think it would certainly be easy to use.

@DarkStarSword : I'll definitely check the Witcher 3 version and I'll try to get the old UI back;) as well as the crosshair;) I know i modified the shaders for the non-dynamic one. It shouldn't be hard to get it back to speed;) Big thank you for looking into this:)

Regarding the profile: This is the one I use now: https://forums.geforce.com/default/topic/546943/3d-vision/the-witcher-3-wild-hunt/post/5011334/#5011334

The previous one had problems with SLI and water ripples. Other than that, either of them should work ;)

Edit: I'll rework this probably in the weekend;)

On a side note: I can't see the FarCry Primal fix on Github :( I was wondering how different that one is and what I need to add to the fix to make this work ;) I've looked at Witcher 3 briefly and I see a few new HLSL shaders, but didn't quite get how they all stick together ;) Edit2: Never mind, I saw how the new HLSL shaders are used ^_^ Cheers!

bo3b commented 7 years ago

I've been thinking this for a while, but anything we expect an end user might want to modify might be better off going into an entirely new section that we can put right at the top of the d3dx.ini, then we reference those settings by name elsewhere in the file. That solves the problem where some fixes currently ask the user to modify specific values in the [Constants] section and we can actually give them a name, and adding the conditional logic I'm proposing would also solve it for cases like the SBS shader and this downsampling.... but the shaderhacker who authored the fix would have to have included the conditional logic to make it work - it doesn't do anything to simplify the flow in the case where a fix does not include those sections already. A quick mockup of what this might look like:

This seems like a good idea to me. Having named variables in an [Options] section would simplify any number of setups. As part of that, it would also be good to switch all the remaining old school variants for bool/int to your much better wrapper functions. IIRC there are a few left.

This would really help make the ini file more readable and sensible for workarounds. With the only caveat being that a lot of the ShaderHackers will probably not use the mechanism that well. They tend to be in the get-it-to-work-bam-ship mindset. But if we provide good examples or starting templates, they'll use them.

For handling older variants, I don't think we need an exhaustive list of prior uses. The only use case I would worry about would be allowing people to drop in new dlls into an old fix. That is fairly common in order to get new features like SBS or upcoming SLI fix or auto-profile. But actually reworking the fix is fairly rare. Unless I'm missing something, it seems like x7=1, and the uncomment techniques should still work.

For things of this sort, I'd recommend taking the path of least resistance, instead of trying to make it conceptually solid. It's all a fairly large hack, so adding a bit more hack to it is OK in my book, while saving some time and testing that would be required to make it more logically consistent. As long as the use cases don't have common pain points, I'm OK with hacks. So for example, if reserving x7 type variables is a lot of work, but the chances of misuse or errors is essentially zero, then the work would be low value. You'll have a much better sense for whether it's necessary or not though, so I defer to your judgment.


For the "if" logic, I'm even less sure that would be used by other ShaderHackers, but if it solves problems for you, I think that is good enough.

It won't add everyday complexity to regular ShaderHackers, and gives you some flexibility. As long as we keep the 90% use cases simple and clear, I'm OK with added complexity.

It does seem like this is sort of at the wrong level though, it feels like that sort of stuff should really be in the HLSL code instead, but of course that's after expensive ops like the stereo2mono. So, that's just a gut feeling, and probably has to be done at this level.

The only other spot would be 3Dmigoto code itself. But if we can't come up with a general approach in code itself (like your downsampling example), then making if logic in the .ini is the needed flexibility.

Of course, if we add downsampling as an explicit modifier to stereo2mono we could have a couple of variants. Perhaps we choose one automatically if SLI is enabled and the resolution is large enough, but allow the modifier to override that or disable downsampling.

On the other hand, we can always add other 3Dmigoto code paths to handle known specific use cases, and trigger those based on simple assignments in the .ini file. I like this idea better than making general 'if' statements in the .ini, mostly because I feel like the 'if' logic will be a lot of work for low payoff.

The thing that is missing at .ini time would be system specific parameters, like screen size and SLI use. That is easy to find at 3Dmigoto code time, but maybe makes sense to provide to the .ini parser in order to make 'if' logic work. It seems like we'd need some system defined variables that ShaderHackers could use to determine which custom HLSL to run. As a general rule, we don't want people to be required to tweak the .ini for their setups. In your example:

[Options]
enable_sbs = true
enable_auto_hud = true
; The following option improves performance in SLI at high resolutions:
downsample_wbuffer = (SLI_enabled) & (screen.height > 1440)

However, I'll defer to your judgment here as well. If you think the 'if' logic is worth it, I have no strong objection.

bo3b commented 7 years ago

@helifax FarCryPrimal fix is on DarkStarSword's github instead of 3Dmigoto: https://github.com/DarkStarSword/3d-fixes/tree/master/Far%20Cry%20Primal

BTW I'm currently working on getting 3DMigoto to be able to write the profile directly, and this is one of the main games that is intended to help... but I notice there are now two separate profiles suggested - I haven't examined the differences between them, but is there a particular reason that old drivers need the old profile?

Super cool! I've been wanting to add that for a long time. This will dramatically lower the number of questions we get about fixes. Easily the biggest pain point for end-users is profile tweaking/management.

I added that second profile to the Witcher3 page, because it was not clear to me from the forum discussion whether it was a superset of the old, or required new drivers. Whatever Helifax suggests is what I'd go with.

helifax commented 7 years ago

Yes, the new "Witcher 3" profile should supersede the old one ;) It's kinda the same, just a few minor changes to the SLI bits and default convergence settings, if I remember right ;)

Also, having the ability to just DROP a profile in Nvidia format next to the fix and have 3DMigoto automatically write it would be awesome ;) However, I had a problem here with the default NVAPI. I posted it here a while ago, but last I checked it wasn't fixed :( https://forums.geforce.com/default/topic/847809/3d-vision/list-of-3d-vision-problems/post/4579546/#4579546

I don't know how much this will affect things, but it is always good to make a few trial & error runs ;) Maybe the predefined value doesn't matter in most cases (except CM, which doesn't matter here). ^_^

bo3b commented 7 years ago

Let's move further discussion of profile setting over to: https://github.com/bo3b/3Dmigoto/issues/29

DarkStarSword commented 7 years ago

I've updated FCPrimal to drastically improve the framerate in SLI, and indeed - using max() to downscale the depth buffer did have the desired effect of making the crosshair less jumpy, so that took care of two birds with one stone :-)

FCPrimal only uses a 16bit depth buffer, so the improvements from downscaling that were far less pronounced than they were in Witcher 3, but the water reflections were a totally different matter - downscaling them made a huge difference and a further improvement was made by limiting them to once per frame (frame analysis showed that the stereo2mono operation was being performed up to 14 times a frame otherwise).
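The once-per-frame limit matters because the reverse-blit cost multiplies with every invocation; a rough sketch with hypothetical buffer sizes:

```python
# Rough estimate (hypothetical figures) of why limiting stereo2mono to once
# per frame matters: the reverse blit cost multiplies by every invocation.

def bridge_mbps(buffer_mb, times_per_frame, fps):
    """MB/s that must cross the SLI bridge for this one resource."""
    return buffer_mb * times_per_frame * fps

# e.g. a reflection buffer copied 14 times a frame vs once a frame at 50 FPS
print(bridge_mbps(7.1, 14, 50))  # ~4970 MB/s - far beyond a ~1 GB/s bridge
print(bridge_mbps(7.1, 1, 50))   # ~355 MB/s - comfortably within it
```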

Note that thanks to driver heuristics, the downscaled reflections were created as mono resources. The way to force these to stereo may not be immediately obvious (possibly I should add an alternate method to the [Resource] sections), but the thing to remember is that since the resource section has not been completely filled in their creation is delayed until the "copy_desc" operation, so we force them to stereo at that time:

ResourceReflectionHalf = stereo copy_desc ResourceReflection
helifax commented 7 years ago

Awesome job! I did try it last evening and indeed I saw the performance improvements which are quite big in my setup;) I'll still need to make a SBS compare with the previous fix to see all the differences in the ini and how to use this feature;)

One thing I want to ask though: how did you discover the downscaled reflections are created as mono? Is the original resource stereo, but when the downscaling happens the driver makes a mono resource? I am also curious how you discovered that we need to force them to stereo at the "copy_desc" instruction.

I believe the frame analysis option can show this, but I don't know how to set it up for this :-s

DarkStarSword commented 7 years ago

The first clue in this case was that the resulting texture after the stereo2mono only had an image in one eye, so one eye was seeing the reflection and the other was just seeing near-black water. But normally, when a buffer is created mono it won't be so immediately apparent, since usually both eyes will still see an image, though dumping out the buffer with frame analysis will show an image in only one eye.

You can also display the resource live in the game, which will make it immediately apparent if it is stereo or mono (although, note that thanks to driver heuristics the act of observing the resource may change its decision - I've had different results depending on whether I copy a resource by reference or value, and it does not always make much sense) - in the FCPrimal d3dx.ini find this section and uncomment it:

;[CustomShaderDebug2D]
;vs = ShaderFixes\full_screen.hlsl
;ps = ShaderFixes\debug_2d.hlsl
;blend = disable
;cull = none
;topology = triangle_strip
;o0 = bb
;ps-t100 = ResourceReflectionHalf
;Draw = 4, 0
;post ps-t100 = null

And uncomment this line in the [Present] section:

;run = CustomShaderDebug2D
helifax commented 7 years ago

Big thanks for the answer;) I did play a bit with it and I can see what you are seeing;) Really awesome stuff!

Thank you for taking the time to answer it;) Much appreciated!

DarkStarSword commented 6 years ago

3DMigoto 1.3.11 now includes an SLI optimised version of the 3DVision2SBS shader that automatically downscales prior to the stereo2mono pass. Other users of stereo2mono can use the "if sli/else/endif" construct to enable their own downscaling if SLI is enabled, so I think that's all we can really do inside 3DMigoto (except for maybe providing a shortcut to make it easier to perform common types of downscaling), so I'm closing this issue.