Xenosaga Episode I: audio / video lag

orbea commented 8 years ago

OS: Slackware64-current Linux-4.6.2 GeForce GTX 780 Ti (GK110B) AMD FX-6350 pcsx2-0857902_2016.06.09_master-x86_64-1_git GSdx (GCC 5.3.0, AVX) 1.1.0 [libGSdx] mesa-compat32-7d7e015_2016.06.09_master-x86_64-1_gitcompat32 xf86-video-nouveau-1da8a93_2016.06.02_master-x86_64-1_git apitrace-compat32-7.1-x86_64-2_SBocompat32

When playing Xenosaga Episode I: Der Wille zur Macht in the second cut scene (Still before the game play starts) just as the character Kos-mos is introduced for the first time the game experiences heavy video and audio lagging / stuttering. The issue persists for a while and then dissipates just to return again, this pattern repeats. In the past this was far worse and has since improved, but not enough yet to be considered playable.

In the opening tutorial the game lags more the closer your approach Kos-mos and less the farther you move your character away. This issue reoccurs more or less throughout the game including map screens, battle and cut scenes.

Things I have tried:

Occurs with both OpenGL and software renderers, but not as much with sw.
GS user hacks mostly make no difference, Safe Accurate Blending however makes it much worse.
Skipdraw of 100 does not make a difference either.

Apitrace: Cut scene - http://ks392457.kimsufi.com/orbea/stuff/trace/PCSX2-Xenosaga-cutscene-bad.trace.xz Tutorial - http://ks392457.kimsufi.com/orbea/stuff/trace/PCSX2-Xenosaga-tutorial-bad.trace.xz I suspect there is something wrong with either mesa, apitrace and / or PCSX2 since when I replay the apitrace I get a lot of new graphical issues and a lot of Mesa: User error: GL_INVALID_OPERATION in glValidateProgramPipeline failed to validate the pipeline. The second trace looks especially amusing / broken... In the past when I made traces with PCSX2 and Xenosaga they replayed normally. Edit: apitrace replay --benchmark will replay the traces properly.

Multiple frame GS dump: Cut scene - http://ks392457.kimsufi.com/orbea/stuff/games/pcsx2/snaps/gsdx_Xenosaga_cutscene_20160611125248.gs.xz Tutorial - http://ks392457.kimsufi.com/orbea/stuff/games/pcsx2/snaps/gsdx_Xenosaga_tutorial_20160611130210.gs.xz

Savestates: Cut scene - http://ks392457.kimsufi.com/orbea/stuff/games/pcsx2/sstates/SLUS-20469%20(6D1276AB).00.p2s.xz Tutorial - http://ks392457.kimsufi.com/orbea/stuff/games/pcsx2/sstates/SLUS-20469%20(6D1276AB).01.p2s.xz

Is there any other info I can help to debug this?

FlatOutPS2 commented 8 years ago

Are you sure this isn't just your system not being able to handle these more demanding scenes?

orbea commented 8 years ago

Yes and no, its probably a mixture of Xenosaga being coded poorly combined with nouveau being imperfect. I have a gtx 780 TI (GK110B) card so if it is going to work at all with nouveau it should work with my card (nvidia drivers should not be considered a solution).

After watching the discussion between Gregory and some nouveau devs at #nouveau I figured reporting this issue in full (Rather than mentioning it on irc only) might help when optimizing PCSX2 more?

Edit: Could it also be a CPU limitation? I have an AMD FX-6350.

ramapcsx2 commented 8 years ago

You need a very fast CPU for these scenes. The game has several of them but the one in the intro is a good example. You can lower the accuracy options in GSdx for a slight speedup but for full speed, I'd recommend an Intel dual core or better at 3.2Ghz or better.

orbea commented 8 years ago

That might be nice for some games, but unfortunately I am not sure that is a financial reality when my motherboard has an AM3+ restriction...

Do you think this issue is worth keeping open for further optimizations? Or should it just be closed as an hardware limitation?

gregory38 commented 8 years ago

I will give it a look when I reboot my computer to nouveau. But not fast CPU + not fully optimized driver == likely bad performance.

orbea commented 8 years ago

That is fully understandable, but I would think that anything that can be done to improve a demanding game like Xenosaga with the given scenario would go a long way towards improving perf with PCSX2 in general? Not to mention it would be greatly appreciated. :)

gregory38 commented 8 years ago

Safe Accurate Blending draw primitive by primitive so yes it is slow. It is a debug option. The apitrace replay error is due to a Mesa bug => https://bugs.freedesktop.org/show_bug.cgi?id=96358

Unfortunately, improving the speed on a game barely translate to others game. However a fast driver will help everyone ;) I'm currently looking how I can multithread GSdx. But it is weeks of work ! Texture upload is slow on the free driver. On a game that does lots of transfer it is 87 (nvidia) vs 48 (nouveau) fps. So I'm not sure MT will help if I'm stalling to upload a texture.

Could you try those option to improve speed

disable accurate blending
disable geometry shader ( @ramapcsx2 by the way it provides a +20% speed boost on shadow hearts)
Use CRC level to full

orbea commented 8 years ago

I have tried these:

I disabled Blending Unit Accuracy, but does not have much impact.
Forcing Geometry Shader as off has a minor speed up.
Changing the CRC level does not seem to have any noticeable impact.

With all of this the speed up is minor at best.

However there is some noticeable difference if I disable the Geometry Shader and use the software renderer.

gregory38 commented 8 years ago

Fwiw, on nvidia (cpu at 3Ghz), native resolution

MT off GS on tutorial = 107 fps cutscene = 52 fps

MT off GS off tutorial = 105 fps cutscene = 52 fps

MT on GS off tutorial = 230 fps cutscene = 117 fps

MT on GS on tutorial = 228 fps cutscene = 106 fps

So yeah, Geometry Shader isn't helpful here. It would be interesting to understand when GS is useful. Potentially it could be replaced by https://www.opengl.org/registry/specs/NV/fill_rectangle.txt in future hardware. Anyway, as you see MT can provide an interesting speed boost. When I reboot I will do the test on nouveau so value can be compared apple to apple.

gregory38 commented 8 years ago

Oh by the way, geometry shader are only used in HW. So it mustn't have any impact for the SW renderer.

orbea commented 8 years ago

MT?

I found I could measure fps in PCSX2 if I started it with GALLIUM_HUD="fps" PCSX2.

Non-affected areas, frame limiting off tutorial = 50+ fps cutscene = 130+ fps

Default settings tutorial = 22 fps cutscene = 11 fps

Blending Accuracy off, CRC full, GS off tutorial = 24 fps cutcscene = 13 fps

sw renderer tutorial = 50 fps cutscene = 25 fps

gregory38 commented 8 years ago

Nvidia supports a Multi Thread optimization. What do you mean by non-affected area.

orbea commented 8 years ago

That is the normal (Playable) fps when there is no lagging. For the cutscene its the several minutes proceeding the savestate and the tutorial its the edge of the map away from the npc Kos-mos. The issue is not always present, it will come and go.

orbea commented 8 years ago

So I tried a few new things to further improve performance.

First I enabled DRI3.

$ cat /etc/X11/xorg.conf.d/nouveau-dri3.conf 
Section "Device"
   Identifier  "780Ti"
   Driver      "nouveau"
   Option      "DRI" "3"
EndSection

This was able to improve fps in the cutscene by 2-3.

Next I installed the nouveau_4.5_reclocking branch for the kernel. https://github.com/karolherbst/linux/tree/nouveau_4.5_reclocking https://www.phoronix.com/scan.php?page=news_item&px=Nouveau-Kernel-Spin-RC Edit: I used the wrong nouveau_reclocking branch, this is the current one, it does not make much difference in perf though (Use the Makefile in drm). https://github.com/karolherbst/nouveau/tree/stable_reclocking_kepler_v5

Default settings tutorial = 34 fps cutscene = 13-14 fps

Blending Accuracy off, CRC full, GS off tutorial = 35 fps cutcscene = 14-15 fps

Strangely enough the lagged part of the cutscene does not last nearly as long anymore. Now it only lasts for 2 lines of dialogue before speeding up to 55+ fps and then slowing down again after ~1 minute. This does not seem to change based on PCSX2 options, but I'm not sure if this is due to the change in kernel or DRI3 yet.

gregory38 commented 8 years ago

Could you try with/without the 8 bit texture option?

orbea commented 8 years ago

I do not enable that by default, it seems to make the game a little slower.

With: tutorial = 29 fps cutscene = 9 fps (41 after the first two lines)

Without: tutorial = 30 fps cutscene = 12 fps (47 after two lines)

orbea commented 6 years ago

@gregory38 This issue is still very much present for me, could you discuss this with the nouveau developers and see if there is anything they could do to improve this? :)

gregory38 commented 6 years ago

What will help is GL multithreading

Require DRI3 (technically not required but easier). Or potentially the EGL backend of GSdx could be enough.
Some patches (I send them nearly 1 year ago) to allow threading of texture transfer with PBO (I should check Mesa status)

orbea commented 6 years ago

pcsx2-2018.02.08_c23241c_master-x86_64-1_git mesa-2018.02.10_831fb29_master-x86_64-1_git

So I have DRI3 enabled and rebuilt PCSX2 with -DEGL_API=ON, I'm getting around 22 fps with full nouveau reclocking. I also tried with mesa_glthread=true and it did not make much difference.

gregory38 commented 6 years ago

Ok. I think it might need my patch to allow asynchronous PBO (aka texture upload) transfer.

orbea commented 6 years ago

Do you have a link for this patch?

gregory38 commented 6 years ago

Check around, there are 3 patches and it is maybe not latest version. Anyway it might require a rebase. Note I don't know if someone did something for recent mesa

https://patchwork.freedesktop.org/patch/168988/

orbea commented 6 years ago

Rebasing the patches was pretty straight forward and then I squashed it as well. This seems to allow mesa_glthread to work fine with PCSX2 which includes a performance boost, especially in the non-affected areas which are now well beyond full speed. However this doesn't really help that much in the affected areas where I now get around 26 fps instead of 22.

To me it seems like its hitting some really slow path in nouveau, is this a possible hypothesis?

diff --git a/src/mapi/glapi/gen/ARB_direct_state_access.xml b/src/mapi/glapi/gen/ARB_direct_state_access.xml
index fa90ffcdd2..5629c639c1 100644
--- a/src/mapi/glapi/gen/ARB_direct_state_access.xml
+++ b/src/mapi/glapi/gen/ARB_direct_state_access.xml
@@ -374,7 +374,7 @@
       <param name="fixedsamplelocations" type="GLboolean" />
    </function>

-   <function name="TextureSubImage1D" no_error="true">
+   <function name="TextureSubImage1D" no_error="true" marshal="upbo">
       <param name="texture" type="GLuint" />
       <param name="level" type="GLint" />
       <param name="xoffset" type="GLint" />
@@ -384,7 +384,7 @@
       <param name="pixels" type="const GLvoid *" />
    </function>

-   <function name="TextureSubImage2D" no_error="true">
+   <function name="TextureSubImage2D" no_error="true" marshal="upbo">
       <param name="texture" type="GLuint" />
       <param name="level" type="GLint" />
       <param name="xoffset" type="GLint" />
@@ -396,7 +396,7 @@
       <param name="pixels" type="const GLvoid *" />
    </function>

-   <function name="TextureSubImage3D" no_error="true">
+   <function name="TextureSubImage3D" no_error="true" marshal="upbo">
       <param name="texture" type="GLuint" />
       <param name="level" type="GLint" />
       <param name="xoffset" type="GLint" />
@@ -410,7 +410,7 @@
       <param name="pixels" type="const GLvoid *" />
    </function>

-   <function name="CompressedTextureSubImage1D" no_error="true">
+   <function name="CompressedTextureSubImage1D" no_error="true" marshal="upbo">
       <param name="texture" type="GLuint" />
       <param name="level" type="GLint" />
       <param name="xoffset" type="GLint" />
@@ -420,7 +420,7 @@
       <param name="data" type="const GLvoid *" />
    </function>

-   <function name="CompressedTextureSubImage2D" no_error="true">
+   <function name="CompressedTextureSubImage2D" no_error="true" marshal="upbo">
       <param name="texture" type="GLuint" />
       <param name="level" type="GLint" />
       <param name="xoffset" type="GLint" />
@@ -432,7 +432,7 @@
       <param name="data" type="const GLvoid *" />
    </function>

-   <function name="CompressedTextureSubImage3D" no_error="true">
+   <function name="CompressedTextureSubImage3D" no_error="true" marshal="upbo">
       <param name="texture" type="GLuint" />
       <param name="level" type="GLint" />
       <param name="xoffset" type="GLint" />
@@ -523,7 +523,7 @@
       <param name="texture" type="GLuint" />
    </function>

-   <function name="GetTextureImage">
+   <function name="GetTextureImage" marshal="ppbo">
       <param name="texture" type="GLuint" />
       <param name="level" type="GLint" />
       <param name="format" type="GLenum" />
@@ -532,7 +532,7 @@
       <param name="pixels" type="GLvoid *" />
    </function>

-   <function name="GetCompressedTextureImage">
+   <function name="GetCompressedTextureImage" marshal="ppbo">
       <param name="texture" type="GLuint" />
       <param name="level" type="GLint" />
       <param name="bufSize" type="GLsizei" />
diff --git a/src/mapi/glapi/gen/ARB_robustness.xml b/src/mapi/glapi/gen/ARB_robustness.xml
index 1f6ac4696b..5924545677 100644
--- a/src/mapi/glapi/gen/ARB_robustness.xml
+++ b/src/mapi/glapi/gen/ARB_robustness.xml
@@ -82,7 +82,7 @@
         <param name="img" type="GLvoid *" output="true"/>
     </function>

-    <function name="ReadnPixelsARB" no_error="true">
+    <function name="ReadnPixelsARB" no_error="true" marshal="ppbo">
         <param name="x" type="GLint"/>
         <param name="y" type="GLint"/>
         <param name="width" type="GLsizei"/>
diff --git a/src/mapi/glapi/gen/gl_API.dtd b/src/mapi/glapi/gen/gl_API.dtd
index b464250777..447b03a41d 100644
--- a/src/mapi/glapi/gen/gl_API.dtd
+++ b/src/mapi/glapi/gen/gl_API.dtd
@@ -122,14 +122,16 @@ param:
         offset data should be padded to the next even number of dimensions.
         For example, this will insert an empty "height" field after the
         "width" field in the protocol for TexImage1D.
-     marshal - One of "sync", "async", "draw", or "custom", defaulting to
-        async unless one of the arguments is something we know we can't
-        codegen for.  If "sync", we finish any queued glthread work and call
+     marshal - One of "sync", "async", "draw", "ppbo", "upbo" or "custom",
+        defaulting to async unless one of the arguments is something we know we
+        can't codegen for.  If "sync", we finish any queued glthread work and call
         the Mesa implementation directly.  If "async", we queue the function
         call to be performed by glthread.  If "custom", the prototype will be
         generated but a custom implementation will be present in marshal.c.
         If "draw", it will follow the "async" rules except that "indices" are
-        ignored (since they may come from a VBO).
+        ignored (since they may come from a VBO). If "ppbo"/"upbo", it will
+        follow the "async" rules when a pack/unpack pixel buffer is bound
+        otherwise it will follow the "sync" rules.
      marshal_fail - an expression that, if it evaluates true, causes glthread
         to switch back to the Mesa implementation and call it directly.  Used
         to disable glthread for GL compatibility interactions that we don't
diff --git a/src/mapi/glapi/gen/gl_API.xml b/src/mapi/glapi/gen/gl_API.xml
index 38c1921047..b03bd087fb 100644
--- a/src/mapi/glapi/gen/gl_API.xml
+++ b/src/mapi/glapi/gen/gl_API.xml
@@ -2149,7 +2149,7 @@
         <glx rop="108"/>
     </function>

-    <function name="TexImage1D" no_error="true">
+    <function name="TexImage1D" no_error="true" marshal="upbo">
         <param name="target" type="GLenum"/>
         <param name="level" type="GLint"/>
         <param name="internalformat" type="GLint"/>
@@ -2161,7 +2161,7 @@
         <glx rop="109" large="true"/>
     </function>

-    <function name="TexImage2D" es1="1.0" es2="2.0" no_error="true">
+    <function name="TexImage2D" es1="1.0" es2="2.0" no_error="true" marshal="upbo">
         <param name="target" type="GLenum"/>
         <param name="level" type="GLint"/>
         <param name="internalformat" type="GLint"/>
@@ -2640,7 +2640,7 @@
         <glx rop="172"/>
     </function>

-    <function name="ReadPixels" es1="1.0" es2="2.0" no_error="true">
+    <function name="ReadPixels" es1="1.0" es2="2.0" no_error="true" marshal="ppbo">
         <param name="x" type="GLint"/>
         <param name="y" type="GLint"/>
         <param name="width" type="GLsizei"/>
@@ -3299,7 +3299,7 @@
         <glx rop="4122"/>
     </function>

-    <function name="TexSubImage1D" no_error="true">
+    <function name="TexSubImage1D" no_error="true" marshal="upbo">
         <param name="target" type="GLenum"/>
         <param name="level" type="GLint"/>
         <param name="xoffset" type="GLint"/>
@@ -3311,7 +3311,7 @@
         <glx rop="4099" large="true"/>
     </function>

-    <function name="TexSubImage2D" es1="1.0" es2="2.0" no_error="true">
+    <function name="TexSubImage2D" es1="1.0" es2="2.0" no_error="true" marshal="upbo">
         <param name="target" type="GLenum"/>
         <param name="level" type="GLint"/>
         <param name="xoffset" type="GLint"/>
@@ -4011,7 +4011,7 @@
         <glx rop="4113"/>
     </function>

-    <function name="TexImage3D" es2="3.0" no_error="true">
+    <function name="TexImage3D" es2="3.0" no_error="true" marshal="upbo">
         <param name="target" type="GLenum"/>
         <param name="level" type="GLint"/>
         <param name="internalformat" type="GLint"/>
@@ -4025,7 +4025,7 @@
         <glx rop="4114" large="true"/>
     </function>

-    <function name="TexSubImage3D" es2="3.0" no_error="true">
+    <function name="TexSubImage3D" es2="3.0" no_error="true" marshal="upbo">
         <param name="target" type="GLenum"/>
         <param name="level" type="GLint"/>
         <param name="xoffset" type="GLint"/>
@@ -4507,7 +4507,7 @@
         <glx rop="229"/>
     </function>

-    <function name="CompressedTexImage3D" es2="3.0" marshal="sync"
+    <function name="CompressedTexImage3D" es2="3.0" marshal="upbo"
               no_error="true">
         <param name="target" type="GLenum"/>
         <param name="level" type="GLint"/>
@@ -4521,7 +4521,7 @@
         <glx rop="216" handcode="client"/>
     </function>

-    <function name="CompressedTexImage2D" es1="1.0" es2="2.0" marshal="sync"
+    <function name="CompressedTexImage2D" es1="1.0" es2="2.0" marshal="upbo"
                no_error="true">
         <param name="target" type="GLenum"/>
         <param name="level" type="GLint"/>
@@ -4534,7 +4534,7 @@
         <glx rop="215" handcode="client"/>
     </function>

-    <function name="CompressedTexImage1D" marshal="sync" no_error="true">
+    <function name="CompressedTexImage1D" marshal="upbo" no_error="true">
         <param name="target" type="GLenum"/>
         <param name="level" type="GLint"/>
         <param name="internalformat" type="GLenum"/>
@@ -4545,7 +4545,7 @@
         <glx rop="214" handcode="client"/>
     </function>

-    <function name="CompressedTexSubImage3D" es2="3.0" marshal="sync"
+    <function name="CompressedTexSubImage3D" es2="3.0" marshal="upbo"
               no_error="true">
         <param name="target" type="GLenum"/>
         <param name="level" type="GLint"/>
@@ -4561,7 +4561,7 @@
         <glx rop="219" handcode="client"/>
     </function>

-    <function name="CompressedTexSubImage2D" es1="1.0" es2="2.0" marshal="sync"
+    <function name="CompressedTexSubImage2D" es1="1.0" es2="2.0" marshal="upbo"
               no_error="true">
         <param name="target" type="GLenum"/>
         <param name="level" type="GLint"/>
@@ -4575,7 +4575,7 @@
         <glx rop="218" handcode="client"/>
     </function>

-    <function name="CompressedTexSubImage1D" marshal="sync" no_error="true">
+    <function name="CompressedTexSubImage1D" marshal="upbo" no_error="true">
         <param name="target" type="GLenum"/>
         <param name="level" type="GLint"/>
         <param name="xoffset" type="GLint"/>
@@ -4586,7 +4586,7 @@
         <glx rop="217" handcode="client"/>
     </function>

-    <function name="GetCompressedTexImage">
+    <function name="GetCompressedTexImage" marshal="ppbo">
         <param name="target" type="GLenum"/>
         <param name="level" type="GLint"/>
         <param name="img" type="GLvoid *" output="true"/>
@@ -5062,7 +5062,7 @@
         <glx ignore="true"/>
     </function>

-    <function name="DeleteBuffers" es1="1.1" es2="2.0" no_error="true">
+    <function name="DeleteBuffers" es1="1.1" es2="2.0" no_error="true" marshal="custom">
         <param name="n" type="GLsizei" counter="true"/>
         <param name="buffer" type="const GLuint *" count="n"/>
         <glx ignore="true"/>
diff --git a/src/mapi/glapi/gen/gl_marshal.py b/src/mapi/glapi/gen/gl_marshal.py
index 6a2c0d7b31..0f5487b46b 100644
--- a/src/mapi/glapi/gen/gl_marshal.py
+++ b/src/mapi/glapi/gen/gl_marshal.py
@@ -170,6 +170,7 @@ class PrintCode(gl_XML.gl_print_base):
              'const struct marshal_cmd_{0} *cmd)').format(func.name))
         out('{')
         with indent():
+            flavor = func.marshal_flavor()
             for p in func.fixed_params:
                 if p.count:
                     p_decl = '{0} * {1} = cmd->{1};'.format(
@@ -178,7 +179,11 @@ class PrintCode(gl_XML.gl_print_base):
                     p_decl = '{0} {1} = cmd->{1};'.format(
                             p.type_string(), p.name)

-                if not p_decl.startswith('const '):
+                # Data is written to destination user pointer (if no PBO
+                # are bound) therefore the pointer must be non-const
+                nonconst_pointer = p.is_pointer() and flavor == 'ppbo'
+
+                if not p_decl.startswith('const ') and not nonconst_pointer:
                     # Declare all local function variables as const, even if
                     # the original parameter is not const.
                     p_decl = 'const ' + p_decl
@@ -252,6 +257,21 @@ class PrintCode(gl_XML.gl_print_base):
                     out('return;')
                 out('}')

+            flavor = func.marshal_flavor()
+            if flavor == "upbo":
+                need_fallback_sync = True
+                out('if (!ctx->GLThread->bound_pixel_unpack_buffer) {')
+                with indent():
+                    out('goto fallback_to_sync;')
+                out('}')
+
+            if flavor == "ppbo":
+                need_fallback_sync = True
+                out('if (!ctx->GLThread->bound_pixel_pack_buffer) {')
+                with indent():
+                    out('goto fallback_to_sync;')
+                out('}')
+
             out('if (cmd_size <= MARSHAL_MAX_CMD_SIZE) {')
             with indent():
                 self.print_async_dispatch(func)
@@ -332,7 +352,7 @@ class PrintCode(gl_XML.gl_print_base):
             flavor = func.marshal_flavor()
             if flavor in ('skip', 'custom'):
                 continue
-            elif flavor == 'async':
+            elif flavor == 'async' or flavor == 'ppbo' or flavor == 'upbo':
                 self.print_async_body(func)
                 async_funcs.append(func)
             elif flavor == 'sync':
diff --git a/src/mapi/glapi/gen/marshal_XML.py b/src/mapi/glapi/gen/marshal_XML.py
index d761e58ce8..aece64a789 100644
--- a/src/mapi/glapi/gen/marshal_XML.py
+++ b/src/mapi/glapi/gen/marshal_XML.py
@@ -44,21 +44,32 @@ class marshal_function(gl_XML.gl_function):
         if element.get('name') != self.name:
             return

+        # Store the "marshal" attribute, if present.
+        self.marshal = element.get('marshal')
+        self.marshal_fail = element.get('marshal_fail')
+
         # Classify fixed and variable parameters.
         self.fixed_params = []
         self.variable_params = []
         for p in self.parameters:
             if p.is_padding:
                 continue
-            if p.is_variable_length():
+            if self.marshal == "upbo" and p.is_pointer():
+                # There are some texture upload functions (the
+                # CompressedTexImage family) which provide an imageSize together
+                # with the image pointer. The default marshaling behavior is to
+                # copy the contents of image to a variable length section, which
+                # is incorrect when image refers to a pixel buffer.
+                #
+                # This ensures that the pointer is marshaled as is. The non-PBO
+                # case is always synchronous and so does not use marshaling
+                # anyway.
+                self.fixed_params.append(p)
+            elif p.is_variable_length():
                 self.variable_params.append(p)
             else:
                 self.fixed_params.append(p)

-        # Store the "marshal" attribute, if present.
-        self.marshal = element.get('marshal')
-        self.marshal_fail = element.get('marshal_fail')
-
     def marshal_flavor(self):
         """Find out how this function should be marshalled between
         client and server threads."""
diff --git a/src/mesa/main/glthread.h b/src/mesa/main/glthread.h
index 306246ca1c..e5c8b79a97 100644
--- a/src/mesa/main/glthread.h
+++ b/src/mesa/main/glthread.h
@@ -95,6 +95,16 @@ struct glthread_state
     * buffer) binding is in a VBO.
     */
    bool element_array_is_vbo;
+
+   /**
+    * Tracks on the main thread side the bound unpack pixel buffer
+    */
+   GLuint bound_pixel_unpack_buffer;
+
+   /**
+    * Tracks on the main thread side the bound pack pixel buffer
+    */
+   GLuint bound_pixel_pack_buffer;
 };

 void _mesa_glthread_init(struct gl_context *ctx);
diff --git a/src/mesa/main/marshal.c b/src/mesa/main/marshal.c
index 8f8e8c78ed..285a46457d 100644
--- a/src/mesa/main/marshal.c
+++ b/src/mesa/main/marshal.c
@@ -32,6 +32,7 @@
 #include "marshal.h"
 #include "dispatch.h"
 #include "marshal_generated.h"
+#include "hash.h"

 struct marshal_cmd_Flush
 {
@@ -195,6 +196,45 @@ _mesa_marshal_ShaderSource(GLuint shader, GLsizei count,
 }

+static void track_buffers_destruction(struct gl_context *ctx,
+      GLsizei n, const GLuint * buffers)
+{
+   GLsizei i;
+   struct glthread_state *glthread = ctx->GLThread;
+
+   if (n < 0 || !buffers)
+      return;
+
+   for (i = 0; i < n ; i++) {
+      if (buffers[i] == glthread->bound_pixel_pack_buffer)
+         glthread->bound_pixel_pack_buffer = 0;
+
+      if (buffers[i] == glthread->bound_pixel_unpack_buffer)
+         glthread->bound_pixel_unpack_buffer = 0;
+   }
+}
+
+/* DeleteBuffers: custom marshal to track buffers destruction */
+void GLAPIENTRY
+_mesa_marshal_DeleteBuffers(GLsizei n, const GLuint * buffer)
+{
+   GET_CURRENT_CONTEXT(ctx);
+   _mesa_glthread_finish(ctx);
+   debug_print_sync("DeleteBuffers");
+
+   // It is done before CALL_DeleteBuffers to avoid any ABA multithread issue.
+   track_buffers_destruction(ctx, n, buffer);
+
+   CALL_DeleteBuffers(ctx->CurrentServerDispatch, (n, buffer));
+}
+
+void
+_mesa_unmarshal_DeleteBuffers(struct gl_context *ctx,
+                           const struct marshal_cmd_DeleteBuffers *cmd)
+{
+   assert(0);
+}
+
 /* BindBufferBase: marshalled asynchronously */
 struct marshal_cmd_BindBufferBase
 {
@@ -204,7 +244,20 @@ struct marshal_cmd_BindBufferBase
    GLuint buffer;
 };

-/** Tracks the current bindings for the vertex array and index array buffers.
+/**
+ * Check that buffer is a valid buffer handle
+ * Always return false for ID 0.
+ */
+static bool
+is_bufferobj(struct gl_context *ctx, GLuint buffer)
+{
+   if (buffer == 0)
+      return false;
+   else
+      return _mesa_HashLookup(ctx->Shared->BufferObjects, buffer) != NULL;
+}
+
+/** Tracks the current bindings of GL buffer targets
  *
  * This is part of what we need to enable glthread on compat-GL contexts that
  * happen to use VBOs, without also supporting the full tracking of VBO vs
@@ -226,9 +279,14 @@ struct marshal_cmd_BindBufferBase
  * instead of updating the binding.  However, compat GL has the ridiculous
  * feature that if you pass a bad name, it just gens a buffer object for you,
  * so we escape without having to know if things are valid or not.
+ *
+ * Pixel buffers are tracked to decide whether pixel transfer goes to a user
+ * pointer (must be synchronous) or a GL buffer (can be asynchronous). Unlike
+ * for VBOs, we do need accurate tracking, since user pointers can be used in
+ * GL core contexts.
  */
 static void
-track_vbo_binding(struct gl_context *ctx, GLenum target, GLuint buffer)
+track_buffers_binding(struct gl_context *ctx, GLenum target, GLuint buffer)
 {
    struct glthread_state *glthread = ctx->GLThread;

@@ -243,6 +301,18 @@ track_vbo_binding(struct gl_context *ctx, GLenum target, GLuint buffer)
        */
       glthread->element_array_is_vbo = (buffer != 0);
       break;
+   case GL_PIXEL_UNPACK_BUFFER:
+      if (ctx->API == API_OPENGL_COMPAT || is_bufferobj(ctx, buffer))
+         glthread->bound_pixel_unpack_buffer = buffer;
+      else
+         glthread->bound_pixel_unpack_buffer = 0;
+      break;
+   case GL_PIXEL_PACK_BUFFER:
+      if (ctx->API == API_OPENGL_COMPAT || is_bufferobj(ctx, buffer))
+         glthread->bound_pixel_pack_buffer = buffer;
+      else
+         glthread->bound_pixel_pack_buffer = 0;
+      break;
    }
 }

@@ -274,7 +344,7 @@ _mesa_marshal_BindBuffer(GLenum target, GLuint buffer)
    struct marshal_cmd_BindBuffer *cmd;
    debug_print_marshal("BindBuffer");

-   track_vbo_binding(ctx, target, buffer);
+   track_buffers_binding(ctx, target, buffer);

    if (cmd_size <= MARSHAL_MAX_CMD_SIZE) {
       cmd = _mesa_glthread_allocate_command(ctx, DISPATCH_CMD_BindBuffer,
diff --git a/src/mesa/main/marshal.h b/src/mesa/main/marshal.h
index 63e0295576..e7f681213c 100644
--- a/src/mesa/main/marshal.h
+++ b/src/mesa/main/marshal.h
@@ -187,6 +187,7 @@ struct marshal_cmd_ClearBuffer;
 #define marshal_cmd_ClearBufferiv   marshal_cmd_ClearBuffer
 #define marshal_cmd_ClearBufferuiv  marshal_cmd_ClearBuffer
 #define marshal_cmd_ClearBufferfi   marshal_cmd_ClearBuffer
+struct marshal_cmd_DeleteBuffers;

 void
 _mesa_unmarshal_Enable(struct gl_context *ctx,
@@ -281,4 +282,11 @@ void GLAPIENTRY
 _mesa_marshal_ClearBufferfi(GLenum buffer, GLint drawbuffer,
                             const GLfloat depth, const GLint stencil);

+void GLAPIENTRY
+_mesa_marshal_DeleteBuffers(GLsizei n, const GLuint * buffer);
+
+void
+_mesa_unmarshal_DeleteBuffers(struct gl_context *ctx,
+                           const struct marshal_cmd_DeleteBuffers *cmd);
+
 #endif /* MARSHAL_H */

gregory38 commented 6 years ago

I don't remember well what the game does but I think it doesn't fly on Nvidia too. The game likely changes state (or maybe uniforms) every draws. Surely mesa/nouveau misses some optimizations.

gregory38 commented 6 years ago

Note: potentially gsdx could be improved too

orbea commented 6 years ago

There was something done somewhere a long while ago that removed some of the slowdowns, but not all of them. What steps could be done to identify what the specific issue is?

gregory38 commented 6 years ago

The slowdown was about the emulation of the bit blending of the alpha channel. Initial code requires to draw primitive 1 by 1. However alpha channel isn't blended, just written so we could do it with 1 flush + 1 draw. Potentially the slowdown is still linked to this issue, because it triggers various state change. In particular I'm afraid that it uploads new uniform to GPU every draw.

What steps could be done to identify what the specific issue is?

Look into an apitrace to spot bad pattern, for example uniform upload/new program every draw (in short frequent state change).... Check a perf report to find where cpu cycle are spent. Well if it is a CPU issue.

rcaridade145 commented 6 years ago

Wonder if using NIR ( https://www.phoronix.com/scan.php?page=news_item&px=Nouveau-NIR-V2 ) NV50_PROG_USE_NIR=1 would make a difference.

orbea commented 6 years ago

I suspect the bottleneck is related to nouveau. When starting PCSX2 with this command.

mesa_glthread=true GALLIUM_HUD=fps,cpu,API-thread-offloaded-slots+API-thread-direct-slots+API-thread-num-syncs PCSX2

I found that fps is around 130 and cpu usage is pretty stable around 30. As soon as the first slowdown occurs the fps plummets and the cpu usage increases to around 40. This lasts about two lines of dialogue until the fps increases to around 70 and the cpu usage jumps to around 60. This lasts a while longer until it slows down again and returns to previous values.

Using NV50_PROG_USE_NIR=1 doesn't seem to make a difference.

gregory38 commented 6 years ago

I know that glsl compiler isn't on par with Nvidia (didn't test a recent version) but this scene is typically CPU/driver dependent.

orbea commented 5 years ago

I upgraded GPUs to a Radeon RX Vega 56 using amdgpu and gained around 15 fps. The first slow section is now around 45 fps, I guess this further shows this issue is CPU bound.

orbea commented 5 years ago

@gregory38 I tested this again and am now getting around 20 fps instead of the 45 fps I recorded in November 2018.

With perf top I am noticing excessive usage with memcpy so I replaced instances of memcpy with memmove in first pcsx2 which revealed a smaller second bottleneck (Not yet investigated) and then in mesa which eventually revealed this line.

https://github.com/mesa3d/mesa/blob/66a82ec6f0fa3586fecee001da6bae1fc33f12d1/src/gallium/auxiliary/util/u_threaded_context.c#L1652

    21.19%  libc-2.29.so            [.] memcpy
     1.83%  [processor]             [k] init_module
     1.56%  libpthread-2.29.so      [.] __pthread_mutex_lock
     1.06%  libpthread-2.29.so      [.] __pthread_mutex_unlock
     0.98%  libc-2.29.so            [.] memcmp
     0.83%  [kernel]                [k] __switch_to
     0.74%  [kernel]                [k] deactivate_slab.isra.0
     0.73%  libpthread-2.29.so      [.] sem_post@@GLIBC_2.1
     0.70%  libc-2.29.so            [.] _int_malloc
     0.68%  [kernel]                [k] entry_INT80_compat

After replacing that line with memmove pcsx2 gains 10 additional fps for around a total of 30.

     3.33%  [processor]         [k] init_module
     3.22%  libc-2.29.so        [.] memcpy
     1.43%  libpthread-2.29.so  [.] __pthread_mutex_lock
     1.33%  libpthread-2.29.so  [.] __pthread_mutex_unlock
     1.16%  libc-2.29.so        [.] memcmp
     1.03%  libc-2.29.so        [.] memmove
     0.94%  libpthread-2.29.so  [.] sem_post@@GLIBC_2.1
     0.90%  libm-2.29.so        [.] __floorf
     0.82%  libc-2.29.so        [.] _int_free
     0.81%  [kernel]            [k] entry_INT80_compat

Could this reveal more clues about what causes these slowdowns? I am mostly recording this here so its not forgotten.

orbea commented 5 years ago

I asked about this in #radeon @ freenode and was linked this issue.

https://bugs.freedesktop.org/show_bug.cgi?id=107670

After reading it I recompiled glibc in Slackware with --enable-multi-arch and i686 instead of i586. The performance jumped from 20 fps to 40.

PCSX2 / pcsx2

Xenosaga Episode I: audio / video lag #1403