anholt / mesa

this repo is dead. See https://gitlab.freedesktop.org/mesa/mesa master branch for latest usable vc4 and v3d, and https://gitlab.freedesktop.org/anholt/mesa for old vc4/v3d WIP branches

vc4_validate_shader_recs issues in Stellarium? #62

Open gzotti opened 7 years ago

gzotti commented 7 years ago

Hello!

I have updated to the current NOOBS Raspbian (Mesa 13.0.0). Closed issue #26 is fixed perfectly, and Stellarium runs at about 18fps (on a Pi3). However, in some viewing directions the screen suddenly turns completely white, and stderr reports "Draw call returned Das Argument ist ungültig [Invalid argument]. Expect corruption." Indeed, the screen stays white.

dmesg reports groups of errors like this:

[drm: vc4_validate_shader_recs [vc4]] ERROR UBO clamp would allow reads outside of UBO
[drm] Texture p0 at 176: 0x000ab00a
[drm] Texture p1 at 180: 0x20040040
[drm] Texture p2 at -1: 0x00000000
[drm] Texture p3 at -1: 0x00000000
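Since these messages arrive in repeating groups, it may help to collect them into records before comparing runs. The following is a minimal sketch, assuming the line shapes quoted in this thread; the function name and record layout are made up for illustration, and the optional asterisks around ERROR are an assumption about how the raw kernel log looks.

```python
import re

# Line shapes as quoted in this thread; the optional asterisks around ERROR
# are an assumption about the raw kernel log format.
ERROR_RE = re.compile(r"\[drm:\s*(\w+) \[vc4\]\]\s*\*?ERROR\*?\s*(.+)")
TEXTURE_RE = re.compile(r"\[drm\] Texture p(\d) at (-?\d+): (0x[0-9a-fA-F]{8})")

SAMPLE = """\
[drm: vc4_validate_shader_recs [vc4]] ERROR UBO clamp would allow reads outside of UBO
[drm] Texture p0 at 176: 0x000ab00a
[drm] Texture p1 at 180: 0x20040040
[drm] Texture p2 at -1: 0x00000000
[drm] Texture p3 at -1: 0x00000000
"""


def group_validation_errors(dmesg_text):
    """Group each ERROR line with the 'Texture pN' lines that follow it."""
    records = []
    for line in dmesg_text.splitlines():
        error = ERROR_RE.search(line)
        if error:
            # A new ERROR line starts a new record.
            records.append({"function": error.group(1),
                            "message": error.group(2).strip(),
                            "params": {}})
            continue
        texture = TEXTURE_RE.search(line)
        if texture and records:
            index, offset, value = texture.groups()
            # Store (offset in the uniform stream, raw 32-bit word).
            records[-1]["params"][f"p{index}"] = (int(offset), int(value, 16))
    return records


if __name__ == "__main__":
    for record in group_validation_errors(SAMPLE):
        print(record["function"], record["message"], record["params"])
```

Feeding it the output of `dmesg` makes it easy to see whether the same p0/p1 words recur in every group.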

I am not familiar with low-level driver details. Is this something we (Stellarium) could avoid, or is it a driver issue? It works on all other platforms, so I guess it's the latter. Or what could we do here?

Kind regards, gzotti, Stellarium team (http://stellarium.org)

anholt commented 7 years ago

Pure driver issue. I'll need to get an apitrace of stellarium when the problem occurs and then figure out what's going wrong with optimization that we're not meeting the kernel's validation requirements.

gzotti commented 7 years ago

Thanks. Now, I would need instructions about apitrace (command line details?), or I try to explain how to reproduce.
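On the apitrace question, a capture is typically made by running the program under `apitrace trace`. Below is a rough sketch that only assembles and launches that command; the helper names and the output file name are hypothetical, and whether the `gl` (GLX) or `egl` API is the right choice depends on how Qt was built on the system.

```python
import shutil
import subprocess


def build_trace_command(program, trace_file, api="gl"):
    """Return the argv for capturing `program` with `apitrace trace`."""
    # api is an assumption: "gl" for GLX, "egl" if the Qt platform plugin uses EGL.
    return ["apitrace", "trace", f"--api={api}", f"--output={trace_file}", program]


def capture_trace(program="stellarium", trace_file="stellarium.trace", api="gl"):
    """Run the capture if apitrace is installed; returns the CompletedProcess."""
    if shutil.which("apitrace") is None:
        raise RuntimeError("apitrace not found on PATH")
    return subprocess.run(build_trace_command(program, trace_file, api), check=False)


if __name__ == "__main__":
    print(" ".join(build_trace_command("stellarium", "stellarium.trace")))
```

The resulting trace file can then be compressed and attached to the issue.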

Install stellarium via apt-get (the usual repository has a slightly outdated version, but it is enough to reproduce; it requires lots of Qt5 packages). Then start it and press F5 for the time panel. I think the issue occurs more frequently when the sun is above the horizon, so set some useful afternoon hour. Then just drag the view around with the mouse and cursor. Possibly it has to do with the sun coming close to the screen center, but it usually triggers far before that, and I am not really sure which limits there are. It produces either a white screen or some stuttering behaviour. In both cases it is clearly misbehaving, and dmesg shows the same message.

Kind regards, gzotti

anholt commented 7 years ago

Hmm, didn't reproduce with Mesa/kernel master. Checking out back to Mesa 13 now.

anholt commented 7 years ago

Didn't reproduce on Mesa 13.0.0 either. Mesa 13 predates FS threading, so the kernel version shouldn't matter either, unless you're as far back as the kernel not having branching. Does VC4_DEBUG=qpu stellarium dump instructions, and are some of those instructions branch.all_zs or branch.all_zc?
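If the dump is long, the check for those two branch instructions can be automated. This is a small sketch that assumes the dump was saved to a text file first (for example by redirecting the program's stderr); the file name and the sample lines are invented, and only the two mnemonics come from the comment above.

```python
import re

# Only the two mnemonics named in the comment are searched for.
BRANCH_RE = re.compile(r"\bbranch\.all_z[sc]\b")


def count_branch_instructions(dump_text):
    """Count occurrences of branch.all_zs and branch.all_zc in a QPU dump."""
    counts = {"branch.all_zs": 0, "branch.all_zc": 0}
    for match in BRANCH_RE.finditer(dump_text):
        counts[match.group(0)] += 1
    return counts


if __name__ == "__main__":
    # "qpu_dump.txt" is a hypothetical file holding the captured debug output.
    with open("qpu_dump.txt", encoding="utf-8", errors="replace") as dump:
        print(count_branch_instructions(dump.read()))
```

An empty dump file would simply mean the debug option produced no output, which is a different finding from a dump with zero branches.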

I'm using stellarium 0.15.0-1 from debian -- is this the same version from raspbian?

gzotti commented 7 years ago

Raspbian has only 0.13.1 and older Qt (5.3.2). But the issue appears if I build from sources, so should be here still as of 0.15.1.

uname -a returns Linux raspberrypi 4.4.38-v7+ #938 SMP Thu Dec 15 15:22:21 GMT 2016 armv7l GNU/Linux

glxinfo starts with:

name of display: :0.0
libGL error: MESA-LOADER: failed to retrieve device information
MESA-LOADER: failed to retrieve device information
MESA-LOADER: failed to retrieve device information
display: :0 screen: 0
direct rendering: Yes
server glx vendor string: SGI
server glx version string: 1.4
server glx extensions: ...

and further down:

OpenGL vendor string: Broadcom
OpenGL renderer string: Gallium 0.4 on VC4 V3D 2.1
OpenGL version string: 2.1 Mesa 13.0.0
OpenGL shading language version string: 1.20
OpenGL extensions:...
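Because much of this thread turns on which Mesa version is actually installed, the version can be pulled out of the glxinfo text mechanically. A minimal sketch, assuming the "OpenGL version string" line has the shape quoted above; the function name is made up.

```python
import re

# Matches e.g. "OpenGL version string: 2.1 Mesa 13.0.0".
VERSION_RE = re.compile(r"OpenGL version string:\s*(\S+) Mesa (\d+(?:\.\d+)*)")


def mesa_version(glxinfo_text):
    """Return the Mesa version as a tuple of ints, or None if not found."""
    match = VERSION_RE.search(glxinfo_text)
    if match is None:
        return None
    return tuple(int(part) for part in match.group(2).split("."))


if __name__ == "__main__":
    print(mesa_version("OpenGL version string: 2.1 Mesa 13.0.0"))
```

Tuples compare element by element, so `mesa_version(text) >= (17, 0, 0)` is enough to tell an old distribution build from a newer one.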

This is a NOOBS installation with just apt-get dist-upgrades done.

VC4_DEBUG=gpu does not add anything to output :-( I had also tried to build apitrace, but build failed. Had no time for more last weekend.

Now I just thought I cannot reproduce any longer myself (sun right in screen center), but with sun to the right, again I have whiteout. It seems it does have to do with the planets (or rather, most likely, the Sun) displayed. When I disable them in the settings, I have no problem.

gzotti commented 7 years ago

It might have to do with the moon, not the sun! Is it a problem for the VC4 driver when a color or light brightness value in the shader gets >1? We do that deliberately for a slight overexposure effect.

anholt commented 7 years ago

Successfully looked at the moon today.

I guess VC4_DEBUG wouldn't work because you're on a release build of Mesa. gl_FragColor > 1 should be just fine -- it gets clamped to [0,1] by the GL 2 spec and the HW.
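As a toy illustration of that clamping, here is what happens to an overexposed colour when each component is limited to [0,1]. This only models the arithmetic on the CPU and is not driver code.

```python
def clamp01(value):
    """Limit a single colour component to the range [0, 1]."""
    return min(max(value, 0.0), 1.0)


def clamp_color(rgba):
    """Clamp every component of an RGBA tuple, as a fixed-point framebuffer would."""
    return tuple(clamp01(component) for component in rgba)


if __name__ == "__main__":
    print(clamp_color((1.4, 0.5, -0.2, 1.0)))
```

So a deliberately overexposed value only saturates to full brightness; it is not an error condition.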

gzotti commented 7 years ago

OK, if you can see the moon on screen, the issue may have been fixed already, it is just not in Raspbian yet. I was pretty sure now that the whiteout appears as soon as the moon should come onto screen. Thanks anyways for looking into this. Now I hope updates propagate soon to the users...

ldslaron commented 7 years ago

I'm seeing a similar issue running Raspbian Pixel on a Pi3 using a built Stellarium 0.90.0.9217. I can confirm that the display apparently starts to jitter when the moon comes into view. On the latest download of Stellarium the screen turns white instead.

I see this error in the console output: Initializing planets GL shaders... Draw call returned Invalid argument. Expect corruption.

uname -a returns: Linux raspberrypi 4.4.50-v7+ #970 SMP Mon Feb 20 19:18:29 GMT 2017 armv7l GNU/Linux

Any additional thoughts? Thanks!

ldslaron commented 7 years ago

Please find an apitrace of the error scenario attached.

stellarium.zip

gzotti commented 7 years ago

Thanks, ldslaron. It seems anholt has fixed the issue already (he can see the moon on a Debian build, so the shader problem seems fixed), but these updates have not yet arrived in "regular" Raspbian. I will wait for the next release of Raspbian (Stretch) before I invest my time again.

ldslaron commented 7 years ago

Hi, gzotti. I'm wondering whether the problem might be manifest only under certain conditions. I'm probably missing something - anholt can correct me if I'm wrong - but in vc4_validate.c:reloc_tex(), it looks as though p0 and p1 are compared directly to 32-bit addr/size values. Outside the if statement, however, p0 and p1 are treated as bitfields. Is this to be expected? Could the UBO clamp error you noted from the dmesg output be a false positive resulting in an early abort? I am certainly not an expert here but was curious about this code.
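One possible reading of the direct comparison, offered purely as an assumption and not checked against the kernel source here, is that for direct memory (UBO-style) reads the two words are not packed texture configuration at all but a byte offset and a clamp value. Under that assumption a bounds check could look roughly like this; the function name and the first error string are hypothetical, and only the second message is taken from the dmesg output in this thread.

```python
def check_direct_read(bo_size, offset, clamp):
    """Return None if a 4-byte direct read stays inside the buffer, else a message.

    Assumption: `offset` and `clamp` are plain byte values, not bitfields.
    """
    if bo_size < 4 or offset > bo_size - 4:
        return "offset outside of buffer"  # hypothetical wording
    remaining = bo_size - offset
    if clamp > remaining - 4:
        # This wording is the one seen in dmesg in this thread.
        return "UBO clamp would allow reads outside of UBO"
    return None


if __name__ == "__main__":
    print(check_direct_read(4096, 0, 4092))
    print(check_direct_read(4096, 0, 4096))
```

If ordinary packed texture words were ever run through a check like this, a value such as 0x20040040 would exceed the remaining size of any realistically sized buffer, so the false-positive question raised above seems worth asking.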

As a sidenote, the mask (without shift) to extract the local var offset also struck me as odd since offset appears to start at bit 12.
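To make the mask-versus-shift point concrete: if the offset field occupies bits 31..12, masking without shifting yields a byte offset aligned to 4 KiB, while shifting yields a page number. The sketch below decodes the words from the first dmesg report under an assumed bit layout (offset in 31..12, type in 7..4, mip levels in 3..0 for p0; height in 30..20 and width in 18..8 for p1). That layout is my recollection of the VC4 texture config words and should be verified against the driver headers.

```python
def bits(value, high, low):
    """Extract bits high..low (inclusive) of a word, shifted down to bit 0."""
    return (value >> low) & ((1 << (high - low + 1)) - 1)


def decode_p0(p0):
    """Decode texture config word 0 under the assumed layout."""
    return {
        "offset_bytes": p0 & 0xFFFFF000,   # mask only: 4 KiB aligned byte offset
        "offset_pages": bits(p0, 31, 12),  # mask and shift: page number
        "type": bits(p0, 7, 4),
        "miplvls": bits(p0, 3, 0),
    }


def decode_p1(p1):
    """Decode texture config word 1 under the assumed layout."""
    return {"height": bits(p1, 30, 20), "width": bits(p1, 18, 8)}


if __name__ == "__main__":
    print(decode_p0(0x000AB00A))
    print(decode_p1(0x20040040))
```

Under this layout the mask without shift is not necessarily a bug, since the low 12 bits of a 4 KiB aligned address are zero anyway.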

gzotti commented 7 years ago

I am far from being a GPU driver expert, and never actually tried to look into the VC4 code to fix this myself. My observation is that whenever a "Planet" object comes into view and triggers a certain shader to be used, we have this jittering image with the dmesg messages. The dmesg messages may be false positives of something else which causes the jitter; I cannot say. The moon is the largest such object, and it always gets rendered with that shader. When zooming in on the other planets so that they become large enough to be rendered as spheres, the effect also appears. Sorry, I have no time in these months to follow this, install Debian, build Mesa from sources, ..., especially when anholt reports he can see the moon on screen without problems. We would like to support this platform, but as users. I hope Stretch will include all fixes (and some Qt5.7.x).

anholt commented 7 years ago

(FWIW, I've been waiting on this bug, assuming that when the submitters update Mesa it'll resolve itself)

gzotti commented 7 years ago

Yes, OK. I am also waiting for the next official update. And no time in these weeks.

gzotti commented 7 years ago

I just ran a quick test with an update of Raspbian (fresh NOOBS install) to Stretch following http://rrobek.de/rasendehimbeere.html. It includes Qt5.7.1 and Stellarium 0.15.0. This basically works at over 20fps (much faster than my netbook!), but the flicker bug is still there, with the same errors in dmesg (but no error on stderr). The logfile identifies Mesa 13.0.6/Gallium 0.4 on VC4 V3D 2.1. Because of an EGLFS limitation, a task switch or leaving full-screen leads to exiting X11. This may be out of your control though. I also experienced a few display freezes when resizing other windows (e.g. Chromium).

gzotti commented 7 years ago

I just tried with current NOOBS (early July 2017). This is still Raspbian Jessie. uname -a gives Linux raspberrypi 4.9.35-v7+ #1014 SMP Fri Jun 30 14:47:43 BST 2017 armv7l GNU/Linux

After installing Stellarium (0.13.1 from Raspbian repo, which installs the required qt 5.3.2 libs), I still see the mentioned bug. All it takes to trigger is a zoom onto any planet so that it would be rendered as textured sphere.

glxinfo reports Gallium 0.4 on VC4 V3D 2.1, Mesa 13.0.0/OpenGL 2.1, GLSL 1.20.

It's depressing: Eric had solved it by early February, and after at least two kernel updates and new Raspbian releases it has still not arrived at the users. :-( Which version would include the bugfix?

gzotti commented 6 years ago

Some news: Seeing that Debian, and with it Raspbian, will keep Mesa 13 for some time longer, I followed your instructions here to build libdrm and Mesa (17.4.0-devel). Finally this issue can be declared fixed!

Two new observations, not sure whether they should go into new issues or we keep in this "Raspberry/Stellarium" issue: (1) On Raspbian and latest Stellarium (built from sources, sorry! At least Qt5.7 is here now.), activating 3D OBJ planets (to show irregular shaped moons) throws the same error on stderr: Draw call returned invalid argument. Expect corruption.

In dmesg, I have torrents of repeating entries, in fact our old "friend" :-(

[ 133.940995] [drm:vc4_validate_shader_recs [vc4]] ERROR UBO clamp would allow reads outside of UBO
[ 133.941007] [drm] Texture p0 at 1148: 0x00000000
[ 133.941010] [drm] Texture p1 at 1156: 0x40040090
[ 133.941013] [drm] Texture p2 at 1160: 0x00000000
[ 133.941015] [drm] Texture p3 at -1: 0x00000000
[ 134.111992] [drm:vc4_validate_shader_recs [vc4]] ERROR UBO clamp would allow reads outside of UBO
[ 134.112002] [drm] Texture p0 at 1148: 0x00000000
[ 134.112007] [drm] Texture p1 at 1156: 0x40040090
[ 134.112009] [drm] Texture p2 at 1160: 0x00000000
[ 134.112012] [drm] Texture p3 at -1: 0x00000000
[ 134.267381] [drm:vc4_validate_shader_recs [vc4]] ERROR UBO clamp would allow reads outside of UBO
[ 134.267392] [drm] Texture p0 at 1148: 0x00000000
[ 134.267395] [drm] Texture p1 at 1156: 0x40040090
[ 134.267397] [drm] Texture p2 at 1160: 0x00000000
[ 134.267400] [drm] Texture p3 at -1: 0x00000000

To replicate, use F3 to search for e.g. Deimos, zoom in, and see Mars and its 2 moons as spheres. Then activate 3D shapes ("Use more accurate 3D models") in the view dialog (F4).

This bug is at least not a showstopper for general use of the program (just switch off OBJ planets...)

(2) I also tried Ubuntu Mate, updated to 16.04.3 which has Mesa 17.0.7, followed https://ubuntu-mate.community/t/tutorial-activate-opengl-driver-for-ubuntu-mate-16-04/7094 to activate VC4, and tried Stellarium 0.16.1 from our ppa.

It seems much faster (>10fps vs. about 6fps?) than the self-built Mesa on Raspbian, but maybe Mesa needs some compiler settings?

Sometimes on high zoom and with DSS display (online image loading) it crashed with stuck system, likely because of memory issues. I increased gpu_mem in config.txt to 256, this seemed to help. But I still have crashes which the system survives, dmesg reports:

[ 313.382375] vc4-drm soc:gpu: failed to allocate buffer with size 1048576
[ 313.382406] [drm:vc4_bo_create [vc4]] ERROR Failed to allocate from CMA:
[ 313.382410] [drm] kernel: 5744kb BOs (1)
[ 313.382413] [drm] V3D: 185556kb BOs (450)
[ 313.382416] [drm] V3D shader: 516kb BOs (119)
[ 313.382420] [drm] dumb: 16kb BOs (1)
[ 313.382423] [drm] binner: 16384kb BOs (1)
[ 313.382426] [drm] RCL: 16kb BOs (1)
[ 313.382428] [drm] BCL: 4kb BOs (1)
[ 313.422557] vc4-drm soc:gpu: failed to allocate buffer with size 352256

Maybe we want severely too much texture memory? What is the most useful memory split, should I assign 512MB to the GPU? Could there be a more graceful behaviour, like simply not loading texture?
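To put a number on the "too much texture memory" question, the per-category figures in the allocation failure dump can be summed. A minimal sketch, assuming the "<label>: <N>kb BOs (<count>)" shape shown in the dmesg excerpt; the function names are made up.

```python
import re

# Matches lines such as "[drm] V3D shader: 516kb BOs (119)".
BO_RE = re.compile(r"\[drm\]\s+([\w ]+?):\s+(\d+)kb BOs \((\d+)\)")

SAMPLE_DUMP = """\
[ 313.382410] [drm] kernel: 5744kb BOs (1)
[ 313.382413] [drm] V3D: 185556kb BOs (450)
[ 313.382416] [drm] V3D shader: 516kb BOs (119)
[ 313.382420] [drm] dumb: 16kb BOs (1)
[ 313.382423] [drm] binner: 16384kb BOs (1)
[ 313.382426] [drm] RCL: 16kb BOs (1)
[ 313.382428] [drm] BCL: 4kb BOs (1)
"""


def summarize_bo_stats(dmesg_text):
    """Map each buffer category to (size in kb, number of buffer objects)."""
    return {label.strip(): (int(kb), int(count))
            for label, kb, count in BO_RE.findall(dmesg_text)}


def total_kb(stats):
    """Total size across all categories, in the same kb unit as the log."""
    return sum(kb for kb, _count in stats.values())


if __name__ == "__main__":
    stats = summarize_bo_stats(SAMPLE_DUMP)
    print(stats)
    print(total_kb(stats), "kb in total")
```

For the dump above the total comes to 208236 kb, nearly all of it in the 450 "V3D" buffers, which suggests the application's textures and buffers had already consumed most of the available pool when the allocation failed.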

Interestingly, on Ubuntu Mate, the message flood mentioned in issue (1) only has the shorter

[ 1775.867438] [drm] Texture p0 at 1148: 0x00000000
[ 1775.867445] [drm] Texture p1 at 1160: 0x40040090
[ 1775.867448] [drm] Texture p2 at 1164: 0x00000000
[ 1775.867452] [drm] Texture p3 at -1: 0x00000000
[ 1775.949725] [drm] Texture p0 at 1148: 0x00000000
[ 1775.949733] [drm] Texture p1 at 1160: 0x40040090
[ 1775.949736] [drm] Texture p2 at 1164: 0x00000000
[ 1775.949738] [drm] Texture p3 at -1: 0x00000000
[ 1776.034051] [drm] Texture p0 at 1148: 0x00000000
[ 1776.034060] [drm] Texture p1 at 1160: 0x40040090
[ 1776.034062] [drm] Texture p2 at 1164: 0x00000000
[ 1776.034065] [drm] Texture p3 at -1: 0x00000000
[ 1776.119124] [drm] Texture p0 at 1148: 0x00000000
[ 1776.119131] [drm] Texture p1 at 1160: 0x40040090
[ 1776.119135] [drm] Texture p2 at 1164: 0x00000000
[ 1776.119137] [drm] Texture p3 at -1: 0x00000000

anholt commented 6 years ago

installing current stellarium debs to give the new reproducing instructions a shot, and if that fails then I'll try from source.

Don't assign any memory to the GPU in config.txt; the 3D engine doesn't get any of it. You just want as much CMA as you can have (I use 384MB, and some of the downstream kernels do that as well). We do throw GL_OUT_OF_MEMORY on texture allocations that fail, but there's not a whole lot else we can do. (We could try to swap things out to non-CMA memory, but that's going to be brutally slow, probably to the point that just exiting the app is better)
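To see how much CMA a running system actually has, the CmaTotal and CmaFree fields of /proc/meminfo can be read (they are present when the kernel is built with CMA support). A small sketch follows; the sample numbers in it are invented for illustration.

```python
def cma_status(meminfo_text):
    """Return the CmaTotal and CmaFree values (in kB) found in meminfo text."""
    fields = {}
    for line in meminfo_text.splitlines():
        name, _sep, rest = line.partition(":")
        if name in ("CmaTotal", "CmaFree"):
            # Lines look like "CmaTotal:   393216 kB".
            fields[name] = int(rest.split()[0])
    return fields


# Invented sample values; 393216 kB corresponds to 384 MB.
SAMPLE_MEMINFO = """\
MemTotal:         948304 kB
CmaTotal:         393216 kB
CmaFree:          120000 kB
"""

if __name__ == "__main__":
    try:
        with open("/proc/meminfo", encoding="ascii") as meminfo:
            print(cma_status(meminfo.read()))
    except OSError:
        print(cma_status(SAMPLE_MEMINFO))
```

Watching CmaFree while the application loads textures gives an early warning before allocations start to fail.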

gzotti commented 6 years ago

Thanks for the note on gpu_mem not being used by your driver; I found a note on this just after writing the notes above. Can I configure gpu_mem=0, or is there some minimum required (I have read some minimum should be kept for video decoding?), so that I had better keep 16 or 64? Various forums give diverse instructions on gpu_mem, and some say CMA can be at most 256. I also increased swap space to an unreasonable 1GB. While terribly slow if used, it at least avoids an out-of-memory situation.

So for assigning CMA, what would be the optimal parameter in cmdline.txt? (Maybe you can add these notes to your wiki page on VC4 or performance?)
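The Linux kernel has a generic `cma=` boot parameter that takes a size with an optional unit suffix. Whether the Raspbian kernel honours it from cmdline.txt, rather than taking the size from a device tree overlay, is an assumption that needs checking on the device. The sketch below only parses such a parameter out of a command line; the sample command line is hypothetical, and 384M is simply the figure anholt mentions in this thread.

```python
import re

# Parses "cma=<number>[K|M|G]"; any trailing placement suffix is ignored.
CMA_RE = re.compile(r"(?:^|\s)cma=(\d+)([KMG]?)", re.IGNORECASE)
UNITS = {"": 1, "K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}


def cma_bytes_from_cmdline(cmdline):
    """Return the requested CMA size in bytes, or None if cma= is absent."""
    match = CMA_RE.search(cmdline)
    if match is None:
        return None
    number, unit = match.groups()
    return int(number) * UNITS[unit.upper()]


if __name__ == "__main__":
    # Hypothetical command line; on a live system read /proc/cmdline instead.
    sample = "console=tty1 root=/dev/mmcblk0p7 rootwait cma=384M"
    print(cma_bytes_from_cmdline(sample))
```

Comparing this value with CmaTotal from /proc/meminfo after a reboot shows whether the setting actually took effect.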

GL_OUT_OF_MEMORY should be fine, we should be able to handle this then.