AcademySoftwareFoundation / OpenColorIO

A color management framework for visual effects and animation.
https://opencolorio.org
BSD 3-Clause "New" or "Revised" License

GPU unit test failures on M1 Macs #1754

Closed doug-walker closed 5 months ago

doug-walker commented 1 year ago

Currently, a few of the GPU unit tests are failing on M1 Macs. This task is to investigate and fix them.

On initial inspection, most of the problems are due to results being only very slightly outside the existing tolerances, but in other cases there are bigger errors that may be the result of differences in NaN handling or other issues.
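To illustrate why NaN handling matters for a tolerance check, here is a minimal sketch. It is not OCIO's actual test code; the function names `within_tolerance_naive` and `within_tolerance_nan_aware` are hypothetical. It shows that a comparison written as "fail if the difference exceeds epsilon" silently passes when one side is NaN, because every ordered comparison with NaN is false.

```python
import math


def within_tolerance_naive(expected, actual, eps):
    # A NaN difference compares False against eps, so this check
    # reports success even though the values do not match.
    return not (abs(expected - actual) > eps)


def within_tolerance_nan_aware(expected, actual, eps):
    # Treat two NaNs as matching; a NaN on only one side is a failure.
    if math.isnan(expected) or math.isnan(actual):
        return math.isnan(expected) and math.isnan(actual)
    return abs(expected - actual) <= eps


nan = float("nan")
print(within_tolerance_naive(0.5, nan, 1e-3))      # True (misleading pass)
print(within_tolerance_nan_aware(0.5, nan, 1e-3))  # False
print(within_tolerance_nan_aware(nan, nan, 1e-3))  # True
```

Whether two NaNs should count as a match is a policy choice; the sketch assumes they should, since both code paths then agree that the value is undefined.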

The same failures happen in both the OpenGL and Metal tests.

doug-walker commented 1 year ago

Update: the failures were on a system running macOS 12.6 (Monterey), and the unit tests print the following info:

3: GL Vendor: Apple
3: GL Renderer: Apple M1 Pro
3: GL Version: 2.1 Metal - 76.3
3: GLSL Version: 1.20

We tested again on Ventura, which prints a higher Metal version, and the tests pass, so we may be able to close this as fixed on the Apple side, after investigating a bit further.

JGoldstone commented 1 year ago

On the latest Ventura:

2/2 Test #4: test_metal .......................***Failed 2.39 sec

GL Vendor: Apple
GL Renderer: Apple M1 Max
GL Version: 2.1 Metal - 83.1
GLSL Version: 1.20

OpenColorIO_Core_GPU_Unit_Tests

Here are the first few errors in test_metal. Interestingly, of the 211 tests, all pass up to and including test 186, but test 187 fails as shown below, as does every single test thereafter, even those that have nothing to do with LUT inversion. Is it possible that test 187, or one or more of its near neighbors, could fail in such a way as to poison the test state so that everything past that point appears to fail?

I've included (if I am using this GitHub UI correctly) a compressed text file with the output of ctest run with --rerun-failed --output-on-failure as requested. Note that there are some CPU failures as well.

If there's anything unusual here, it might be that I'm running Python 3.12.0a6, but the Python tests passed just fine, FWIW. Any suggested next steps I might take?

[187/211] [Lut3DOp / inv3dlut_file_spi3d_linear ] - FAILED -
Maximum error: 0.4069002271 at pixel: 38758 on component 1 larger than epsilon.
scr = {0.7742004395, 0.7742118835, 0.7742233276, 0.7742347717}
cpu = {0.5893889666, 0.587680161, 0.6203746796, 0.7742347717}
gpu = {0.9860675335, 0.9945803881, 0.9656506181, 0.7742347717}
absolute tolerance=0.001200000057
[188/211] [Lut3DOp / inv3dlut_file_spi3d_tetra ] - FAILED -
Maximum error: 0.4069002271 at pixel: 38758 on component 1 larger than epsilon.
scr = {0.7742004395, 0.7742118835, 0.7742233276, 0.7742347717}
cpu = {0.5893889666, 0.587680161, 0.6203746796, 0.7742347717}
gpu = {0.9860675335, 0.9945803881, 0.9656506181, 0.7742347717}
absolute tolerance=0.001200000057
[189/211] [Lut3DOp / 3dlut_file_spi3d_bizarre_linear ] - FAILED -
Maximum error: 0.356887579 at pixel: 36363 on component 0 larger than epsilon.
scr = {0.66456604, 0.6645774841, 0.6645889282, 0.6646003723}
ocio-test-failure.txt.gz
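For readers checking the numbers in the log above, the following sketch recomputes the reported maximum error for test 187 from the printed cpu and gpu values. It assumes the four values in each brace are the components of the failing pixel and that the error is the per-component absolute difference; `max_abs_error` is a hypothetical helper, not a function from the OCIO test suite.

```python
def max_abs_error(cpu, gpu):
    """Return (largest absolute difference, index of that component)."""
    errors = [abs(c - g) for c, g in zip(cpu, gpu)]
    worst = max(range(len(errors)), key=errors.__getitem__)
    return errors[worst], worst


# Values copied from the [187/211] entry in the log.
cpu = [0.5893889666, 0.587680161, 0.6203746796, 0.7742347717]
gpu = [0.9860675335, 0.9945803881, 0.9656506181, 0.7742347717]
tolerance = 0.0012  # log prints 0.001200000057 (the float32 value)

error, component = max_abs_error(cpu, gpu)
print(f"max error {error:.10f} on component {component}")
# max error 0.4069002271 on component 1
print("FAILED" if error > tolerance else "passed")
# FAILED
```

Under those assumptions the result matches the logged "Maximum error: 0.4069002271 ... on component 1", which is several hundred times the tolerance, so this is not a borderline precision miss. The fourth component is identical in scr, cpu, and gpu.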

doug-walker commented 1 year ago

@JGoldstone, the tests are independent, so one failing should not be able to cause the following ones to fail. There seems to be something strange happening, because so many of your CPU tests failed too. We have seen this sort of thing either with different chip architectures or with "fast math" type compilation options. It sounds like you were not enabling any unusual math options, so I'm guessing it is related either to the specific M1 chip or to the specific compiler version you are using.
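As background on why "fast math" options can change results: flags such as `-ffast-math` in GCC and Clang let the compiler reorder floating-point operations as if addition were associative, which it is not. The sketch below is a generic illustration in Python, not a reproduction of this failure; it shows that merely regrouping a sum changes the rounded result.

```python
a, b, c = 0.1, 0.2, 0.3

left = (a + b) + c   # grouping as written in the source
right = a + (b + c)  # a grouping a reassociating compiler might choose

print(left == right)       # False
print(abs(left - right))   # tiny, on the order of 1e-16
```

A single regrouping gives a difference near machine precision, which can accumulate through long chains of operations. That mechanism fits results landing slightly outside a tight tolerance better than it fits errors as large as the 0.4 shown in the log above.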

My development environment is an M1 Mac, so it is definitely a platform that we believe works, although I'm not using the latest OS/compiler. I have it on my todo list to upgrade to the latest and will check if I run into similar issues.