Add support for b4_SSE2 batched mode.

johnfea commented 4 weeks ago

Description

Add support for b4_SSE2 batched mode, enabling batched execution for all x86-64 CPUs that don't support AVX.

Is there interest in adding this into OSL? I still support CPUs with vector width of 4 in my project that uses OSL and it would be nice to not to have to resort to scalar processing.

Tests

Quick tests were run to see that output with procedural and texture based materials looked ok and proper SSE2 batched code was being generated for wide/ functions. EDIT: Additionally, all tests pass now.

Checklist:

[x] I have read the contribution guidelines.
[x] I have updated the documentation, if applicable.
[x] I have ensured that the change is tested somewhere in the testsuite (adding new test cases if necessary).
[x] My code follows the prevailing code style of this project. If I haven't already run clang-format v17 before submitting, I definitely will look at the CI test that runs clang-format and fix anything that it highlights as being nonconforming.

linux-foundation-easycla[bot] commented 4 weeks ago

The committers listed above are authorized under a signed CLA.

:white_check_mark: login: johnfea (999774785b4958a7d115640bbd4feda85282f316)

AlexMWells commented 4 weeks ago

@johnfea thanks, I think the question (possibly unanswered) was: With many of the noise and other functions doing internal SIMD using x,y,z or r,g,b,a to take advantage of SSE would running a Batched<4> be an improvement? The Batched<4> would execute the entire shader logic and other features in SIMD, so could be a faster. If you have any feedback on that we would love to hear it.

Also please attempt to add a testing configuration that exercises batched<4> to .github/workflows/ci.yml You should see some batch-b8avx2 config you can copy.

johnfea commented 4 weeks ago

Based on a quick benchmark, batched SSE2 was 20% to 40% faster in total render speed with low complexity material when a single material filled the screen.

20% was with a single 3D fBm and 40% was with mix of fBm and two different procedural shaders.

It seems definitely an improvement.

There was at least two maybe related issues, however. A string from a shader didn't end up with correct ustringhash in output closure and its string representation was empty string. Also, backfacing() in a shader didn't work correctly anymore.

AlexMWells commented 3 weeks ago

@johnfea great to hear it is speeding things up. Curious how Batched<8> does with AVX on the same workloads vs Batched<4> with SSE.

As far as backfacing() not working, it should just pull a varying value out of the BatchShaderGlobals<4>::varing.backfacing. So make sure you are populating it with value values before executing the shadernetwork. Not sure this would affect this but take a look at: https://github.com/AcademySoftwareFoundation/OpenShadingLanguage/pull/1827

As far as the closures, once you have added a b4_SSE2 workflow to the .github/workflows/ci.yml you can craft a failing unit test that reproduces the issue, you can update this pull request and we could take a look at the failing CI.

johnfea commented 3 weeks ago

b4_SSE2 ran at 60% the speed of b8_AVX2 mode with single 3D fBm when measured in total render time, but for this measurement all parts of rendering were switched and not just OSL part. b4_SSE2 ran at 55% the speed of b8_AVX2 mode with mix of fBm and two different procedural shaders.

I updated the pull request with failing closure test unit, but I've no idea how to make the tests run so it probably won't run without some changes. Since testshade didn't seem to read output closures and testrender doesn't support batched, I added new testminimal. This issue doesn't happen with printf from a shader like done in some testshade tests.

It seems the backfacing() issue occurs with batched mode at any width but it doesn't affect output. When using backfacing() as input to a mix shader, non-batched mode doesn't output the cancelled shader in closures, but batched mode does with zero weights, which is missed optimization.

Also, the issue reproduced by this unit test is also generic to batched mode like #1801 and #1800 and not just with SSE2.

johnfea commented 2 weeks ago

I've updated the pull request, now "closure-string" test fails in batched mode only and for some reason also for non-batched icc/icx. This is the only test that fails for b8_AVX2 but for b4_SSE2 there's also "Invalid bitcast" <4x i1> to i8.

johnfea commented 2 weeks ago

I removed closure-string and testminimal from this PR and added those into a separate PR #1831.

Now for b4_SSE2 only "'Invalid bitcast' <4x i1> to i8" should fail in tests and b8_AVX2 should pass.

johnfea commented 1 week ago

All tests for this PR pass now.

I'm not sure why AVX512 needs cases for width 8 in src/liboslexec/llvm_util.cpp and why AVX needs cases for width 16. I tried removing them and EDIT: it made no difference for either b16 or b8 in tests. Anyway, I added cases in those places too for width 4. There's an alternative branch here with all of those cases removed. EDIT2: Apparently, b8 AVX512 is a path in liboslexec CMakeList.txt but b16 AVX is not. My take would be to not add any b4 paths/cases for AVX512 or AVX, but retain the existing non-typical width cases for AVX512 and AVX.

lgritz commented 1 day ago

Did you close on purpose?

johnfea commented 1 day ago

No, I was just trying to fix CLA check failing.

AcademySoftwareFoundation / OpenShadingLanguage