PixarAnimationStudios / OpenUSD

Universal Scene Description
http://www.openusd.org

regression in testUsdImagingGLInstancing_nestedInstance #2097

Closed (pmolodo closed this issue 1 year ago)

pmolodo commented 1 year ago

Description of Issue

Somewhere in the commit range d27b1be4..9d21755b, we started seeing consistent failures of the testUsdImagingGLInstancing_nestedInstance test:

Baseline: (image attached)

Generated: (image attached)

Possibly related, or may just be a red herring - noticed this in my error log:

------- Stderr: -------
Warning: in RunTest at line 724 of /buildAgent/work/f651207570849f20/USD/pxr/usdImaging/usdImagingGL/unitTestGLDrawing.cpp -- Draw mode  not supported!
Warning: in RunTest at line 736 of /buildAgent/work/f651207570849f20/USD/pxr/usdImaging/usdImagingGL/unitTestGLDrawing.cpp -- Cull style  not supported!

Steps to Reproduce

  1. Enable GL image tests, build, and run testUsdImagingGLInstancing_nestedInstance (a rough sketch of the commands follows below)
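For reference, a rough sketch of the kind of invocation involved, assuming the standard build_usd.py workflow (the install path is a placeholder, and enabling the GL image tests may require additional CMake options on your setup):

# Build USD with tests enabled via the bundled build script.
python build_usd.py --tests /path/to/usd-install

# Run just the failing test through CTest from the generated build directory.
cd /path/to/usd-install/build/USD
ctest -R testUsdImagingGLInstancing_nestedInstance --output-on-failure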

System Information (OS, Hardware)

Ubuntu 20.04.5 LTS, NVIDIA A10G

Package Versions

USD 9d21755b

sunyab commented 1 year ago

I believe this is fixed in the latest set of changes pushed to dev. Can you retry the test on your side and see if that's the case for you?

pmolodo commented 1 year ago

We're still seeing this on latest dev, https://github.com/PixarAnimationStudios/USD/commit/43671cf0004de7268d5aec393d677c177908d86a.

I'll see if I can set up an automated bisect to get an exact commit where the test first started failing for us...
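For reference, a minimal sketch of how such an automated bisect could be driven with git bisect run, using the commit range from the initial report; run_nested_instance_test.sh is a placeholder for a script that rebuilds USD, runs the test, and exits 0 on pass and non-zero on failure:

# 9d21755b is known bad, d27b1be4 is known good.
git bisect start 9d21755b d27b1be4

# git checks out candidate commits and runs the script at each step.
git bisect run ./run_nested_instance_test.sh

# Inspect the result, then restore the original checkout.
git bisect log
git bisect reset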

pmolodo commented 1 year ago

Hi - so it looks like the commit that introduced the regression was bf181901633b9677a21a543824cbec151e61d8ba, which was actually where one of our MRs was merged: https://github.com/PixarAnimationStudios/USD/pull/1856.

And I'm now remembering that we hit the same problem earlier: in the discussion for that MR, you can see essentially the same image pair that I've posted here. @WRMA6 mentioned that a fix related to testUsdImagingGLInstancing_nestedInstance had gone in, which I took to mean the rendering error we were seeing was resolved... but it would seem not.

I think we were always getting this result, but the test fix now properly exposed it. I'm assuming the test is passing on your infrastructure?

sunyab commented 1 year ago

Yes, I double-checked the test results from prior to my last push to dev and the test was passing; the images look correct as well.

I think the test had been failing internally as of the push before my last one, but it was fixed sometime afterwards. So I'm surprised it's still failing for you. I can re-run the tests tomorrow with the latest set of internal changes to see how things look.

sunyab commented 1 year ago

Filed as internal issue #USD-7782

FlorianZ commented 1 year ago

Closing this as fixed, but feel free to re-open if this is still failing for you.

pmolodo commented 1 year ago

I just checked again, and I'm still seeing this on the latest dev (dce6cf9da8ca8a03291b8f7bbe9f1e4c4af32796).

This time my system info was:

OS: Ubuntu 20.04.5 LTS
GPU: NVIDIA GeForce RTX 3080
Driver: 510.108.03

pmolodo commented 1 year ago

Also, I don't think I have permission to re-open the issue. Should I create a new one?

spiffmon commented 1 year ago

I'll also reopen the internal ticket. @pmolodo, were you or others going to work on fixing the regression, or should we triage this? It may be tricky if we cannot reproduce it locally!

pmolodo commented 1 year ago

Thanks, @spiffmon. Unfortunately, I don't think I'm going to be able to tackle this immediately, as it's not really a blocker for us (we've disabled the test in our build system), and the focus around here is more on our RTX renderer than on Storm support. On the other hand, I hate seeing bugs and regressions go unaddressed, and I can't expect you folks to tackle a bug you can't reproduce. The situation seems to be that this bug reproduces reliably on all of our systems and on none of yours. So I'll at least see whether I can reproduce it on my own personal Ubuntu box (which should be relatively vanilla) using a straight invocation of build_usd.py, and see where that leaves us.

spiffmon commented 1 year ago

Much appreciated, @pmolodo !

pmolodo commented 1 year ago

So I was eventually able to reproduce the issue on my personal Ubuntu box, and what's more, was able to create a Dockerfile that can be used to build (and test, if you have nvidia-docker2 installed):

https://github.com/pmolodo/usd_docker_ubuntu/blob/testUsdImagingGLInstancing_nestedInstance/Dockerfile

Instructions for doing so are in the Dockerfile.
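For anyone else trying this, the authoritative steps are in the Dockerfile itself, but a typical invocation would look something like the following (the image tag is arbitrary, and the exact run command depends on how the Dockerfile is set up):

# Build the image from the linked Dockerfile.
docker build -t usd-nested-instance-repro .

# Run the test inside the container with GPU access; requires the NVIDIA
# container runtime (nvidia-docker2) on the host. Newer Docker can use
# --gpus all; older setups may need --runtime=nvidia instead.
docker run --rm --gpus all usd-nested-instance-repro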

Unless you never use NVIDIA hardware, I'm guessing the key difference is Ubuntu (you folks probably use some rpm-based distribution?). It's probably the execution environment rather than the build environment, since we actually do all our builds on an old CentOS image, though it would be good to have that confirmed too.

Anyway, let me know if you're able to replicate the issue using this!

davidgyu commented 1 year ago

Thanks, @pmolodo. I've been able to repro using your Dockerfile, but interestingly not when using a clean dev build on the same host. This is also on a personal Ubuntu box:

OS: Ubuntu 20.04.5 LTS
GPU: NVIDIA RTX 3080 Ti
Driver: 525.60.11

Being able to repro is super helpful, though! Hopefully, more info soon.

pmolodo commented 1 year ago

Glad you were able to repro! Odd that it didn't repro with a build on the same host... Out of curiosity, did you both build and test in the Docker image, or just build there and then test on the host?

davidgyu commented 1 year ago

I've only been able to run the repro inside of Docker. When I try to run the binaries from the Docker build on the host, I hit GLIBC version errors on my Ubuntu 20.04 system.

davidgyu commented 1 year ago

Alright! The existing test baseline represents an invalid result. The two meshes that are rendering incorrectly are quad-dominant subdivision surface meshes bound to a triangle Ptex texture, and the ptexFaceOffset values specified in the test asset are invalid. I've updated the test to use a quad Ptex texture with correct ptexFaceOffset values, and I'm now seeing consistent results. I've also filed an internal ticket to improve error checking for these cases, since ideally we'd have caught these errors earlier.

pmolodo commented 1 year ago

Great work, thanks! 🙏

davidgyu commented 1 year ago

One more thing... I did add --ptex to the arg list for the build_usd.py run step in the Dockerfile since Ptex is not enabled by default. I think this by itself might be enough to get the tests passing, but the underlying test asset fixes are good to have as well. We've also fixed the spurious warning message mentioned above in the initial post on this issue.
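For anyone following along, opting into Ptex through the build script looks roughly like this (the install path is a placeholder, and the --ptex flag is the one mentioned above):

# Ptex support is off by default in build_usd.py; enable it explicitly,
# along with tests, so testUsdImagingGLInstancing_nestedInstance is built.
python build_usd.py --ptex --tests /path/to/usd-install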

pmolodo commented 1 year ago

Hmm... Does that mean this test should only be run if there's ptex support enabled?

davidgyu commented 1 year ago

Yes, that's correct. We have a ticket open here to fix this. I suspect this got overlooked when we added the test registration to CMake since we always have Ptex enabled in our internal builds.