Replace tooled system with faster system (HLBVH) for BoT ray tracing

ianebeckett commented 4 months ago

Traditional ray tracing takes a long time to shoot. ADRT takes a long time to build. This implementation uses HLBVH to reduce both shot and build times compared to ADRT for most of the models we tested. It replaces both traditional ray tracing and ADRT.

starseeker commented 4 months ago

Two regression tests won't pass:

The following tests FAILED:
    934 - regress-bots (Failed)
    985 - regress-licenses (Failed)

The regress-bots test is reporting off-by-one failures, which isn't terribly surprising given the nature of this change... the regress-licenses test can be fixed though (cut_hlbvh.h needs a footer as well as a header):

FILE /home/cyapp/temp/brlcad/src/librt/cut_hlbvh.h has no info

tr-tamu commented 4 months ago

Thank you! We have added barebones license information in ee9fb6f25309187d063dc5b64621bdbdb8e60d03.

starseeker commented 4 months ago

Digging a little more into the regress-bots failure, I've attached a comparison image showing the differences in shading. The visual difference when inspecting the output images is fairly subtle, but it would be good to understand the reason for the difference.

[image: diff]

For this sort of thing, it may come down to stepping a ray through the process in the old and new code in a debugger and seeing where the math of the answers changes. It can be helpful to have a simple input in such cases, and a quick test indicates a facetized arb4 primitive (4 triangles) will also show the differences:

[image: diff]

tr-tamu commented 4 months ago

Is this happening on just the float version of the codebase? Or does this happen on both float and double?

I ask because the old float implementation in traditional rt used signed chars to store normal information. We are using full floats to store normal information, which could lead to the banding effect seen on the triangle and sphere faces.

starseeker commented 4 months ago

I used the default compile behaviors for both images, so the control image is default rt behavior from a build of the main branch, and the test image was the default behavior of the HLBVH branch.

tr-tamu commented 4 months ago

If you wouldn't mind, what models are you using? Raytracing the small sphere.g model, pixdiff gives me this comparison image, using "[hlbvh/release]/bin/rt -B -s 1024 -o SphereSmall.png SphereSmall.g SphereSmall" (we broke out the components of spheres.g to automate testing).

starseeker commented 4 months ago

https://brlcad.org/~starseeker/arb4.g https://brlcad.org/~starseeker/sph_test.g

My test platform is a fairly recent Ubuntu Linux.

starseeker commented 4 months ago

I'm using an image diffing tool based off of libicv's diff, which is still experimental and may be reporting more aggressively in the image than pixdiff (I've not yet done a rigorous validation of it.)
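Since the regress-bots failure above is reported as off-by-one, a rough sketch of that kind of classification may help when comparing two diffing tools. This is not the pixdiff or libicv implementation; the function name `classify_bytes` is hypothetical, and the sketch assumes both images are raw 8-bit buffers of equal size (as in a .pix file).

```python
def classify_bytes(a: bytes, b: bytes) -> dict:
    """Count byte values that match, differ by exactly 1, or differ by more."""
    if len(a) != len(b):
        raise ValueError("images must have the same size")
    counts = {"matching": 0, "off_by_1": 0, "off_by_many": 0}
    for x, y in zip(a, b):
        delta = abs(x - y)
        if delta == 0:
            counts["matching"] += 1
        elif delta == 1:
            counts["off_by_1"] += 1
        else:
            counts["off_by_many"] += 1
    return counts


if __name__ == "__main__":
    # Tiny made-up buffers, not data from the models in this thread.
    control = bytes([10, 20, 30, 40])
    test = bytes([10, 21, 33, 40])
    print(classify_bytes(control, test))
```

A tool that flags every nonzero delta will look more aggressive than one that separates off-by-one from larger differences, even on identical inputs.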

starseeker commented 4 months ago

Based on your image, it looks like there is a difference in the grazing hit behavior?

starseeker commented 4 months ago

Just to be clear - minor shading differences are not necessarily showstoppers, but we want to understand why they are there so we can offer an explanation if anyone does notice.

tr-tamu commented 4 months ago

So from what investigation I've been able to do yesterday and today, there are multiple differences in the setup before we get into shooting rays. Traditional rt calculates the same aabb that hlbvh does, but modifies it in rt_bot_prep_pieces to be further away from zero by 5/10,000. I'm unsure whether this is related to, or caused by, that change, but a_ray is also different between traditional rt and hlbvh. r_pt seems to be off by roughly 3 units in magnitude, with traditional rt being larger.

Again, I'm unsure if this could account for the discrepancies in the visualizations, but it seems reasonable that it might.

brlcad commented 4 months ago

Can you give an example of what you mean by r_pt being off by 3 units in magnitude?

As for the 5/10000, that is the default calculation tolerance (0.0005), which is the maximal extent a value can fluctuate before being "different" or "wrong". So bot_prep_pieces_*() nudges the bounding boxes +/- that tolerance to ensure we never miss a triangle due to minuscule floating point differences (which can happen just due to instruction reordering). Not accounting for that would indeed cause the silhouette effect your image showed with grazing rays.

Imagine a ray that just barely goes through a grazed triangle vertex. If triangle vertices or edges lie on a face of the AABB, we might skip testing the triangle for intersection entirely, because the ray misses the box purely due to how the math works out, even though it could've/would've/should've (numerically) hit the triangle. The common fix is to nudge bounding boxes by a specified distance tolerance during prep, so that no hits get missed.
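The grazing-ray case described above can be sketched with a small slab test. This is an illustrative Python model, not the librt code: `pad_aabb` and `ray_hits_aabb` are hypothetical names, the unit box and the ray are made up, and only the 0.0005 tolerance comes from the thread.

```python
import math

TOL = 0.0005  # default distance tolerance quoted in the thread


def pad_aabb(bmin, bmax, tol=TOL):
    """Grow the box by tol on every side (min moves down, max moves up)."""
    return [v - tol for v in bmin], [v + tol for v in bmax]


def ray_hits_aabb(origin, direction, bmin, bmax):
    """Standard slab test; axis-parallel rays are handled explicitly."""
    t_near, t_far = -math.inf, math.inf
    for o, d, lo, hi in zip(origin, direction, bmin, bmax):
        if d == 0.0:
            # Ray is parallel to this slab: it must start inside it.
            if o < lo or o > hi:
                return False
            continue
        t0, t1 = (lo - o) / d, (hi - o) / d
        if t0 > t1:
            t0, t1 = t1, t0
        t_near = max(t_near, t0)
        t_far = min(t_far, t1)
    return t_near <= t_far and t_far >= 0.0


if __name__ == "__main__":
    bmin, bmax = [0.0, 0.0, 0.0], [1.0, 1.0, 1.0]
    # A ray meant to graze the x = 1 face, displaced by a tiny rounding error.
    origin, direction = [1.0 + 1e-12, 0.5, -5.0], [0.0, 0.0, 1.0]
    print("tight box hit: ", ray_hits_aabb(origin, direction, bmin, bmax))
    print("padded box hit:", ray_hits_aabb(origin, direction, *pad_aabb(bmin, bmax)))
```

With the tight box the ray is rejected before any triangle on that face is tested; with the padded box it passes the box test and the triangle test gets to decide.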

tr-tamu commented 4 months ago

a_ray.r_pt on hlbvh is {514.68058007134368, 162.15413328074072, 345.41312303825464}. a_ray.r_pt on trad rt is {516.73930239162905, 162.80274981386353, 346.79477553040761}.

brlcad commented 4 months ago

Do you know if those are primary or secondary rays?

tr-tamu commented 4 months ago

So we were able to get the pixdiff on the spherical models to match exactly. The command I used for testing was

parallel --tag --jobs $(ls | grep "sphere" | grep "\.g$" | wc -l) '../debug/bin/rt -B -s 1024 -c "viewsize 850;" -o {}Hlbvh.pix {}.g {} && ~/Git/brlcad/debug/bin/rt -B -s 1024 -c "viewsize 850;" -o {}Main.pix {}.g {} && ../release/bin/pixdiff {}Main.pix {}Hlbvh.pix > {}Diff.pix' ::: $(ls | grep "sphere" | grep "\.g$" | sed "s/\.g//") 2>&1 | tee rtspherediff.log

and the file produced was rtspherediff.log

starseeker commented 4 months ago

@tr-tamu Just to be sure - that diff succeeded with the code currently in this pull request?

tr-tamu commented 4 months ago

Yes. It also succeeded with the arb4.g model - I wasn't able to reproduce the striations that you got, so there might be some kind of disconnect.

starseeker commented 4 months ago

I've been working on isolating a single-ray test case, and I've come up with the following. Using the moss.g from http://brlcad.org/~starseeker/moss.g, which contains a facetization "all.bot", I run the following nirt commands with main and the hlbvh branch on Ubuntu Linux, and observe a (small) delta in the output. I will attach a couple of text files showing the two sessions, to allow for exact reproduction.

starseeker commented 4 months ago

Of potential interest - if I turn backout on (the default for nirt) the results are identical.

starseeker commented 4 months ago

OK, text files corrected with full sequence:

nirt_main.txt nirt_hlbvh.txt

tr-tamu commented 3 months ago

Thank you for creating an easily reproducible test case. The first thing I notice is that the difference is extremely small, less than one trillionth of a millimeter. The second is that both main and hlbvh are off from the backout result by exactly the same amount, but in different directions. These things together lead me to believe this is the result of floating point rounding.

$ hlbvh/bin/nirt -s moss.g all.bot -f diff -e "units mm;xyz -22.63664822825653999 -33.06808385906127512 -88.63070794182381462;dir 0.00000000000000000 0.00000000000000000 1.00000000000000000;backout 0;s;q" | tail -n 1 | cut -d"," -f 12-14
43.36619415028438596,-43.36619415028438596,22.93002633002313928

$ main/bin/nirt -s moss.g all.bot -f diff -e "units mm;xyz -22.63664822825653999 -33.06808385906127512 -88.63070794182381462;dir 0.00000000000000000 0.00000000000000000 1.00000000000000000;backout 0;s;q" | tail -n 1 | cut -d"," -f 12-14
43.36619415028441438,-43.36619415028441438,22.93002633002316770

$ main/bin/nirt -s moss.g all.bot -f diff -e "units mm;xyz -22.63664822825653999 -33.06808385906127512 -88.63070794182381462;dir 0.00000000000000000 0.00000000000000000 1.00000000000000000;s;q" | tail -n 1 | cut -d"," -f 12-14
43.36619415028440017,-43.36619415028440017,22.93002633002313928

The nirt values out are, in order, in x, y, z, in distance, out x, y, z, out distance, los, scaled los(?), oblique in, and oblique out

These three values are, in order, the z out, the out distance, and the los (length of segment). This ray hits 4 triangles: 27, 310, 219, and 236. The difference is in triangle 236. Looking through the code for nirt, the difference starts in the z value for out, and then cascades into the out distance and los.
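The rounding explanation can be checked by expressing the gaps in units in the last place (ULPs) of a double. The sketch below uses only the first of the three values from each run above; the helper name `ulps_apart` is hypothetical, and `math.ulp` requires Python 3.9 or later.

```python
import math

# First field of the three nirt outputs above (units mm).
HLBVH = 43.36619415028438596    # hlbvh, backout 0
MAIN = 43.36619415028441438     # main, backout 0
BACKOUT = 43.36619415028440017  # main, default backout


def ulps_apart(value: float, reference: float) -> float:
    """Signed gap between value and reference, in ULPs of the reference."""
    return (value - reference) / math.ulp(reference)


if __name__ == "__main__":
    print("main  vs backout (ULPs):", ulps_apart(MAIN, BACKOUT))
    print("hlbvh vs backout (ULPs):", ulps_apart(HLBVH, BACKOUT))
    print("main  vs hlbvh   (mm):  ", MAIN - HLBVH)
```

On IEEE-754 doubles the two branches land 2 ULPs above and 2 ULPs below the backout result, which is the symmetric, last-digits pattern expected from rounding rather than from a geometric miss.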

The origin of this difference comes from r_pt. Looking at bot_piece_shot_double (for main) and bot_shot_hlbvh_flat (for hlbvh), r_pt is the same in x and y dimensions. The z dimension in main is 38.046803513291266 and the z dimension in hlbvh is -23.99949999999998. Tracing back this difference as well, it comes from the ray being advanced through the top-level grid in rt_shootray in main, while this isn't happening in rt_shootray in hlbvh.

In conclusion, this seems to be a minor floating-point difference that stems from the fact that main is advancing through the global grid partitioning while hlbvh is not advancing through the global grid partitioning.
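As a toy model of that conclusion, the sketch below computes the same exit distance along a +z ray twice: once restarting from the r_pt z reported for main and once from the r_pt z reported for hlbvh. It is not how rt_shootray is written; `exit_distance` is a hypothetical helper and `PLANE_Z` is an assumed exit plane chosen only for illustration.

```python
import math

ORIGIN_Z = -88.63070794182381462    # z of the nirt xyz in the commands above
RESTART_MAIN = 38.046803513291266   # r_pt z reported for main
RESTART_HLBVH = -23.99949999999998  # r_pt z reported for hlbvh
PLANE_Z = 43.36619415028440017      # assumed exit plane, for illustration only


def exit_distance(restart_z, plane_z=PLANE_Z, origin_z=ORIGIN_Z):
    """Distance from the original origin to the plane, via a restarted ray."""
    advanced = restart_z - origin_z  # how far the ray was moved before the shot
    local_t = plane_z - restart_z    # distance measured from the restart point
    return advanced + local_t        # re-expressed from the original origin


if __name__ == "__main__":
    via_main = exit_distance(RESTART_MAIN)
    via_hlbvh = exit_distance(RESTART_HLBVH)
    print("via main restart: ", repr(via_main))
    print("via hlbvh restart:", repr(via_hlbvh))
    print("difference:       ", via_main - via_hlbvh)
```

Each path rounds different intermediate values, so the two results agree to many digits but are not guaranteed to be bit-identical.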

starseeker commented 1 month ago

@tr-tamu In the regress-bots test file bots.g in the build directory, are you able to generate an image with the sph.volume.lh.bot model? I'm getting a blank rendering here.

starseeker commented 1 month ago

Built using the last SHA1 in the PR before I did the main merge: https://github.com/ianebeckett/brlcad/tree/600b5c559e0aef05508fe914c33baf1e1f0f1ddd

starseeker commented 1 month ago

Oh, I should explain - I merged this pull request to get the changes onto main, since we were starting to get conflicts. I backed out the changes after the merge in a single commit, to allow the backout to be easily reverted (i.e. for the hlbvh changes to be restored.) Commit f5f01498684 is the backout in main, so reverting that will put all the hlbvh bits back.

starseeker commented 1 month ago

Got lh BoTs to raytrace: e4901e4

BRL-CAD / brlcad

Replace tooled system with faster system (HLBVH) for BoT ray tracing #126