curv3d / curv

a language for making art using mathematics
Apache License 2.0

Black squares in output. #78

Closed rewolff closed 5 years ago

rewolff commented 5 years ago

On the Nouveau graphics driver, with a GF119 [GeForce GT 610] as the hardware, I get black boxes in the output. It happens where the normal of the surface is perpendicular to the viewing direction.

The boxes are 4x8 pixels (4 wide, 8 high). I don't do drag-n-drop. So I can't attach a picture here. I've uploaded it. http://prive.bitwizard.nl/curv_black_boxes.png

doug-moen commented 5 years ago

Hypothesis:

image

rewolff commented 5 years ago

Hypothesis: It looks to me as if the core-group runs into a division by zero because the normal is perpendicular to the ray. (Say: "what is the tan of the angle at which the ray hits the pixel?" is answered by "infinity" or a division by zero.) In the example above, you can see that where mathematically a precisely perpendicular ray would exist on each edge pixel, only a few get the division by zero. This is because the actual ray is likely to miss the precise perpendicular point.

doug-moen commented 5 years ago

The math is done in floating point. Division by zero on a GPU doesn't cause an exception, it produces either Infinity or NaN values, and the computation keeps running. There's nothing that the Nouveau driver can do to change this behaviour. The Nvidia proprietary driver doesn't produce these glitches, and nobody has reported this problem before. So we are looking for a mechanism that is specific to Nouveau.
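A minimal sketch of that non-trapping behaviour, written in Python for illustration. Python itself raises on float division by zero, so the hypothetical helper `gpu_style_divide` emulates what IEEE 754 arithmetic with exceptions masked would return. The exact result a given GLSL implementation produces may differ; this only shows that the computation carries on with Infinity or NaN rather than stopping.

```python
import math

def gpu_style_divide(a: float, b: float) -> float:
    """Divide without trapping: return Inf or NaN where Python would raise."""
    try:
        return a / b
    except ZeroDivisionError:
        # 0/0 and NaN/0 have no meaningful value, so the result is NaN.
        if a == 0.0 or math.isnan(a):
            return math.nan
        # Otherwise the result is an infinity whose sign follows both operands
        # (copysign also distinguishes +0.0 from -0.0 in the divisor).
        sign = math.copysign(1.0, a) * math.copysign(1.0, b)
        return math.copysign(math.inf, sign)

slope = gpu_style_divide(1.0, 0.0)       # Infinity, no exception
undefined = gpu_style_divide(0.0, 0.0)   # NaN, no exception

# The bad values simply flow through later arithmetic.
print(slope, undefined, undefined * 0.5)
```

A pixel computed from such values may come out wrong, but nothing in this path aborts the shader, which is why the explanation has to lie elsewhere.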

Those glitches appear in places where the sphere-tracing algorithm that I use for ray-marching will normally use an increased number of iterations. We can test my hypothesis once I make those constants configurable.

doug-moen commented 5 years ago

I just added two new command line options:

curv -Oray_max_iter=200 -Oray_max_depth=400 ...

You can try reducing ray_max_iter from its default value of 200 to see if the glitches go away in the rainbow.curv model. This value directly limits the number of iterations in the ray-marching loop.

ray_max_depth indirectly limits the number of iterations. If the ray has travelled more than 400 units (or whatever value you supply), then the loop terminates early. I don't think this one will fix the glitches in rainbow.curv. However, reducing the value may speed up rendering and reduce glitching in some other shapes.
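To make the roles of the two limits concrete, here is a minimal sphere-tracing sketch in Python. It is not Curv's shader code: the function names, the unit-sphere scene and the `epsilon` hit threshold are assumptions chosen for illustration. Only the parameter names `ray_max_iter` and `ray_max_depth` and their defaults come from the options above.

```python
import math

def sphere_sdf(p, radius=1.0):
    """Signed distance from point p to a sphere centred at the origin."""
    return math.sqrt(p[0] ** 2 + p[1] ** 2 + p[2] ** 2) - radius

def march(origin, direction, sdf, ray_max_iter=200, ray_max_depth=400.0,
          epsilon=1e-4):
    """Sphere-trace one ray; return (hit, iterations_used, distance_travelled)."""
    t = 0.0
    for i in range(ray_max_iter):          # direct limit on iterations
        p = tuple(o + t * d for o, d in zip(origin, direction))
        dist = sdf(p)
        if dist < epsilon:                 # close enough to the surface: hit
            return True, i + 1, t
        t += dist                          # safe step: nothing is nearer than dist
        if t > ray_max_depth:              # indirect limit: ray has gone too far
            return False, i + 1, t
    return False, ray_max_iter, t          # ran out of iterations

forward = (0.0, 0.0, 1.0)
head_on = march((0.0, 0.0, -5.0), forward, sphere_sdf)
grazing = march((0.9999, 0.0, -5.0), forward, sphere_sdf)
grazing_capped = march((0.9999, 0.0, -5.0), forward, sphere_sdf, ray_max_iter=20)
clear_miss = march((3.0, 0.0, -5.0), forward, sphere_sdf)

print("head-on:", head_on)                 # (True, 2, 4.0)
print("grazing:", grazing)
print("grazing, ray_max_iter=20:", grazing_capped)
print("clear miss:", clear_miss)
```

The head-on ray hits in two iterations. The grazing ray, which skims the silhouette of the sphere, takes far more because each safe step shrinks as it runs alongside the surface. With `ray_max_iter=20` the grazing ray is abandoned before it reaches the surface. The ray that misses entirely ends through `ray_max_depth` after only a few iterations.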

rewolff commented 5 years ago

Thanks! Pulled, compiled and tested. Starting from the default you show above, neither of the two parameters seems to do anything. I tested 2x more and 2x less.

Edit: Ah! the "change by a factor of two" was not aggressive enough. 80: no change. 40: Only two black blobs. 20: no blobs. (5 is the "normal" number of blobs for the rainbow-cylinder that I use to test).

doug-moen commented 5 years ago

You tried -Oray_max_iter=100.

Now try -Oray_max_iter=50, -Oray_max_iter=25, and smaller...
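A small sketch of that halving search, in Python. It only builds the command lines and does not run anything. `curv_command` and `halving_sweep` are hypothetical helper names, and passing the model file as a trailing argument is an assumption; the two `-O` options are the ones introduced above.

```python
def curv_command(model, ray_max_iter, ray_max_depth=400):
    """Build an argv list for one curv run (the list is not executed here)."""
    return [
        "curv",
        f"-Oray_max_iter={ray_max_iter}",
        f"-Oray_max_depth={ray_max_depth}",
        model,  # assumption: the model file is a trailing positional argument
    ]

def halving_sweep(start=200, stop=5):
    """Yield start, start // 2, ... for as long as the value is >= stop."""
    n = start
    while n >= stop:
        yield n
        n //= 2

for n in halving_sweep():
    print(" ".join(curv_command("rainbow.curv", n)))
```

Once two neighbouring values behave differently, the same helper can be called with every integer in between to find the exact threshold.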

rewolff commented 5 years ago

yeah. With 40 "some" are fixed, but at 20 they are all fixed and render correctly. So 40 is "on the edge" and 20 is enough....

rewolff commented 5 years ago

Testing with: rainbow.curv, counting only the black blocks on one side of the cylinder:

0-19: does not render correctly (part of the cylinder is missing).
20-37: renders correctly.
38: renders with one black block.
39-42: renders with two black blocks.
43: renders with three black blocks.
44: renders with four black blocks.
45-200: renders with five black blocks.

doug-moen commented 5 years ago

All GPU drivers render by partitioning the viewport into tiles, and rendering the pixels within each tile in parallel. Multiple tiles are also rendered in parallel, depending on how many cores you have.

In Curv, the time required to compute a tile can vary greatly. Background tiles are usually very fast. Certain tiles, like the rounded edges of the rainbow cylinder, can be slow. There could easily be a 50 to 1 or 100 to 1 difference in rendering times between tiles, depending on the shape, but if the slow tiles are rare, then you still get fast average tile rendering times, and the user can't tell the difference.

The Nouveau driver appears to impose a hard limit on the rendering time of each tile. It is the slow tiles that are turning black, and we can eliminate the problem by speeding up the ray marcher. I would guess that Nouveau attempts to guarantee 30 frames per second, based on the assumption that all tiles take the same time to render, and imposes a hard time limit based on these assumptions. If the slowest tile in a Curv program is required to meet this deadline, then the net effect is as if the GPU is 10 or 50 times slower than it actually is.
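A toy model of this hypothesis, in Python. It is only an illustration of the guess above, not Nouveau's actual mechanism, which this thread does not confirm. The function name `blacked_out_tiles`, the iteration budget of 40 and the sample grid are all assumptions; the 4x8 tile size is the one reported at the top of the issue.

```python
TILE_W, TILE_H = 4, 8  # black boxes were reported as 4 wide, 8 high

def blacked_out_tiles(iter_counts, budget):
    """Return (tile_x, tile_y) indices of tiles whose slowest pixel exceeds budget.

    iter_counts is a list of rows of per-pixel ray-marching iteration counts.
    Pixels in a tile are assumed to run in lockstep, so a tile costs as much
    as its slowest pixel.
    """
    height = len(iter_counts)
    width = len(iter_counts[0])
    black = []
    for ty in range(0, height, TILE_H):
        for tx in range(0, width, TILE_W):
            tile_cost = max(
                iter_counts[y][x]
                for y in range(ty, min(ty + TILE_H, height))
                for x in range(tx, min(tx + TILE_W, width))
            )
            if tile_cost > budget:
                black.append((tx // TILE_W, ty // TILE_H))
    return black

# An 8x8 viewport of cheap pixels with one expensive grazing-ray pixel.
grid = [[5] * 8 for _ in range(8)]
grid[3][6] = 150

print(blacked_out_tiles(grid, budget=40))    # [(1, 0)]

# Capping iterations (as ray_max_iter does) brings every tile under budget.
capped = [[min(count, 20) for count in row] for row in grid]
print(blacked_out_tiles(capped, budget=40))  # []
```

In this model a single slow pixel is enough to lose the whole tile, which would match whole 4x8 boxes turning black rather than individual pixels.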

There is at least one more simple trick for getting a bit more performance out of the ray marcher, but no easy way to get 10x or 50x more performance. I think that Nouveau is not suitable for use with Curv, and I recommend installing the Nvidia proprietary GPU driver.

doug-moen commented 5 years ago

I couldn't find a clear explanation why Nouveau works this way. No other Mesa based GPU driver has this "black rectangle" bug.

But, we do know that Nouveau suffers performance problems because Nvidia is blocking the Nouveau project from doing thermal management. (Those APIs are blocked, due to a requirement for digitally signed firmware on some hardware models, and due to implied legal threats if they reverse engineer the proprietary driver.) This means that Nouveau must be careful to avoid doing anything that would cause your GPU to overheat and become damaged. This is consistent with my theory that Nouveau aborts a SIMD group if it runs too long.

I looked to see if there is a way of disabling the "black rectangle" behaviour, but I couldn't find anything.

rewolff commented 5 years ago

So, now we have a "workaround in curv" and possibly a demonstration case, I think it is time to report this as a bug in Nouveau. When I find that 20-37 "renders correctly" I suspect that this holds true for the very simple case of the colored cylinder and not for more complex objects, and that "> 100" is likely required for realistic 3D-printable objects. That's why you set the default here to 200.

For my understanding: you have a "ray_max_depth" that says how far from the viewpoint the rays can be broken off. This explains why some things that look infinite seem to have an end, but when you move the viewpoint the actual end stays just as far as the ray depth is measured from the camera position. Right?

doug-moen commented 5 years ago

A suggestion from the Nouveau bug tracker is to use this environment variable:

export LIBGL_ALWAYS_SOFTWARE=1 

This will disable the Nouveau GPU driver and use software rendering of OpenGL calls instead (meaning the work is done on the CPU). The results may be unacceptably slow, but there should be no rendering artifacts.

This is not a serious or practical suggestion, due to the loss of rendering performance, but I'm including it for completeness of the historical record.

doug-moen commented 5 years ago

The Nouveau driver is not supported until this issue is resolved upstream. I don't think this is a simple bug fix. Instead, Nvidia will need to change their corporate policy and support the Nouveau project before the issue can be resolved.

rewolff commented 5 years ago

Might I make a suggestion? The "not supported" means to you: "won't work without issues". When I first read that I interpreted it as: "you don't stand a chance of getting that to work".

I think the difference is important: I almost gave up on "giving curv a test-run" because of your "not supported" status. While in fact it is quite usable, if you know that the black rectangles are a rendering artifact.

Getting people to test-drive curv, and subsequently get interested in it, works both ways: with a bit of luck someone might fix the nouveau bugs that cause this issue, or maybe someone fixes it by modifying curv in such a way that the nouveau issues no longer occur.

doug-moen commented 5 years ago

Here's sort of good news, a way to work around the Nouveau driver bug. But in the end, it's still easier and safer to just install the Nvidia proprietary driver.

More information about the Nouveau bug:

The main blocker for making the open-source NVIDIA driver viable for Linux desktop users and gamers though is re-clocking support for newer generations of hardware... With the GeForce GTX 900 Maxwell series and newer, there isn't yet any re-clocking support so graphics cards are stuck to operating at their boot frequencies, which generally is quite low compared to their rated base/boost clock frequencies. Until NVIDIA releases the signed PMU firmware or the Nouveau developers achieve a workaround, any GPUs newer than the GTX 600/700 Kepler or GTX 750 Maxwell series is a no-go if you want decent performance. It's not known if/when a solution will be in place for better supporting these newer generations of NVIDIA GPUs.

Thus for now the best Nouveau open-source driver support remains with the GTX 600/700 Kepler series since at least there the graphics card can be manually re-clocked by writing a value to DebugFS... Still no automatic/dynamic re-clocking, but at least users can force their Kepler (and GTX 750 Maxwell1) parts to the rated frequencies.

And here is the official Nouveau web site.

It looks like the "black squares" performance problem can be mitigated by "manual reclocking", at least on the older pre-GTX-900 GPUs that support this. This is a risky procedure that involves setting nouveau.pstate=1 (for kernels earlier than 4.5) and then writing a value to /sys/... (the path is kernel dependent). How to do this? The Nouveau wiki provides some information:

WARNING: Power management is a very experimental feature and is not expected to work. If you decided to upclock your GPU, please acknowledge that your card may overheat. Please check the temperature of your GPU at all time!

Raising the card performance mode might help. Ask on IRC, #nouveau channel, how to do that. Instructions are not given here, because in the worst case, it may destroy your card, because power management is still a work in progress.

Phoronix provides more helpful instructions: https://www.phoronix.com/scan.php?page=news_item&px=linux-4.5-nouveu-pstate-howto

I don't recommend following this procedure. It's far less difficult, and far less risky, to install the Nvidia proprietary driver. And you'll get better results than with the Nouveau driver + reclocking.

doug-moen commented 5 years ago

@rewolff

Might I make a suggestion? The "not supported" means to you: "won't work without issues". When I first read that I interpreted it as: "you don't stand a chance of getting that to work".

I updated the GPU requirement section of the README with better wording and more information. Thanks for the suggestion.