RobotLocomotion / drake

Model-based design and verification for robotics.
https://drake.mit.edu

[render_vtk] Unit test failure on Mac Ventura M1 #19424

Open jwnimmer-tri opened 1 year ago

jwnimmer-tri commented 1 year ago

Part of #18327. For prior art see #17566.

We're not sure if this is our buggy code, or buggy graphics driver stack, or buggy CI hardware, or etc.

For now, I'm going to nerf the test to get CI passing.


[ RUN      ] RenderEngineVtkTest.BoxTest
geometry/render_vtk/test/internal_render_engine_vtk_test.cc:328: Failure
Value of: CompareColor(expected_outlier_color_, color, screen_coord)
  Actual: false (Expected: (0, 0, 0, 255) at (10, 10), tested: (254, 127, 0, 255) with tolerance: 1.0009999999999999)
Expected: true
Color at: (10, 10) for test: Box test - diffuse color
geometry/render_vtk/test/internal_render_engine_vtk_test.cc:331: Failure
Value of: IsExpectedDepth(depth, screen_coord, expected_outlier_depth_, kDepthTolerance)
  Actual: false (Expected depth at (10, 10) to be 3. Found inf. Difference inf is greater than tolerance 0.0010000000474974513)
Expected: true
Depth at: (10, 10) for test: Box test - diffuse color
geometry/render_vtk/test/internal_render_engine_vtk_test.cc:333: Failure
Expected equality of these values:
  label.at(x, y)[0]
    Which is: 32766
  expected_outlier_label_
    Which is: 32764
Label at: (10, 10) for test: Box test - diffuse color
geometry/render_vtk/test/internal_render_engine_vtk_test.cc:328: Failure
Value of: CompareColor(expected_outlier_color_, color, screen_coord)
  Actual: false (Expected: (0, 0, 0, 255) at (10, 469), tested: (254, 127, 0, 255) with tolerance: 1.0009999999999999)
Expected: true
Color at: (10, 469) for test: Box test - diffuse color
geometry/render_vtk/test/internal_render_engine_vtk_test.cc:331: Failure
Value of: IsExpectedDepth(depth, screen_coord, expected_outlier_depth_, kDepthTolerance)
  Actual: false (Expected depth at (10, 469) to be 3. Found inf. Difference inf is greater than tolerance 0.0010000000474974513)
Expected: true
Depth at: (10, 469) for test: Box test - diffuse color
geometry/render_vtk/test/internal_render_engine_vtk_test.cc:333: Failure
Expected equality of these values:
  label.at(x, y)[0]
    Which is: 32766
  expected_outlier_label_
    Which is: 32764
Label at: (10, 469) for test: Box test - diffuse color
[  FAILED  ] RenderEngineVtkTest.BoxTest (3880 ms)

EricCousineau-TRI commented 1 year ago

Assigning Zach for triage / delegation per component lead chart

jwnimmer-tri commented 1 year ago

Since we're not sure if this is a broken CI machine or not, I'll change the assignment to @svenevs for the moment, and move this to the project board. I don't think it'll be urgent.

svenevs commented 1 year ago

f2f note: when trying to bisect vtk:

svenevs commented 1 year ago

while debugging the tests, dump images to disk, example:

bindings/pydrake/visualization/test/video_test.py:        filename = os.environ["TEST_UNDECLARED_OUTPUTS_DIR"] + "/color.gif"
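
As a minimal sketch of the same idea, the snippet below writes a tiny image into the directory named by `TEST_UNDECLARED_OUTPUTS_DIR`, which Bazel sets for test actions. The helper names `output_dir` and `dump_ppm` are hypothetical and not part of Drake, and the fallback to a temporary directory outside of Bazel is an assumption made for illustration. The PPM format is used only because it needs no image library.

```python
import os
import tempfile


def output_dir():
    """Return the directory where debug artifacts should be written.

    Bazel sets TEST_UNDECLARED_OUTPUTS_DIR for test actions. Outside of
    Bazel this sketch falls back to a fresh temporary directory (an
    assumption, not Drake behavior).
    """
    return os.environ.get("TEST_UNDECLARED_OUTPUTS_DIR") or tempfile.mkdtemp()


def dump_ppm(pixels, name):
    """Write rows of (r, g, b) tuples to a binary PPM file; return its path."""
    height = len(pixels)
    width = len(pixels[0])
    path = os.path.join(output_dir(), name)
    with open(path, "wb") as f:
        # PPM header: magic number, dimensions, maximum channel value.
        f.write(f"P6\n{width} {height}\n255\n".encode("ascii"))
        for row in pixels:
            for r, g, b in row:
                f.write(bytes((r, g, b)))
    return path


# A 2x2 image using the orange color reported in the failure log.
image = [[(254, 127, 0), (0, 0, 0)],
         [(0, 0, 0), (254, 127, 0)]]
saved_path = dump_ppm(image, "color.ppm")
```

When run under `bazel test`, files written this way should end up in the test's undeclared outputs, where they can be inspected after the run.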

jwnimmer-tri commented 1 year ago

I should clarify the goal here.

The first call to action is to characterize what's happening. Are the test failures minor (akin to a tolerancing issue), or do they indicate major problems? In case the errors are in image-comparison test cases, we could dump the images to disk and visually compare them to try to get a sense of what's going wrong. Are the test failures reproducible locally vs only in CI? Are the failures dependent on the order that test cases are run in, or which cases are enabled/disabled?
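
One rough way to separate a tolerancing issue from a major problem is to count how many pixels differ and by how much. The sketch below is a hypothetical helper, not Drake's `CompareColor`; it assumes images are plain nested lists of RGBA tuples and reuses the colors and tolerance that appear in the failure log above.

```python
def compare_pixels(expected, actual, tolerance):
    """Return (mismatches, max_diff) for two same-sized RGBA images.

    Each image is a list of rows of (r, g, b, a) tuples. A pixel counts as
    a mismatch when any channel differs by more than `tolerance`.
    """
    mismatches = []
    max_diff = 0
    for y, (exp_row, act_row) in enumerate(zip(expected, actual)):
        for x, (exp, act) in enumerate(zip(exp_row, act_row)):
            # Largest per-channel difference for this pixel.
            diff = max(abs(e - a) for e, a in zip(exp, act))
            max_diff = max(max_diff, diff)
            if diff > tolerance:
                mismatches.append((x, y))
    return mismatches, max_diff


background = (0, 0, 0, 255)
orange = (254, 127, 0, 255)
expected = [[background, background], [background, orange]]
actual = [[orange, background], [background, orange]]
mismatches, max_diff = compare_pixels(expected, actual, tolerance=1.001)
print(mismatches, max_diff)
```

A handful of mismatches with a maximum difference just above the tolerance would suggest a tolerancing issue. A difference like background versus the box's diffuse color, as in the log, points to the geometry covering pixels it should not.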

Once we have that kind of information, we can decide how much more effort to invest in trying to fix it.

jwnimmer-tri commented 6 months ago

FYI I'm elevating the priority of this, since we're going to lose our x86 macOS CI coverage soon.

svenevs commented 6 months ago

https://github.com/RobotLocomotion/drake-ci/blob/5f0457c43e881caf610981b5e6854c78e1bea300/ctest_driver_script_wrapper.bash#L40-L46

This may very well be the root cause of the problem. If you deploy a VM and ScreenShare in, the tests all pass, same as being on a non-virtualized m1.

$ man arch
...
The arch_name argument must be one of the currently supported
     architectures:
           i386     32-bit intel
           x86_64   64-bit intel
           x86_64h  64-bit intel (haswell)
           arm64    64-bit arm
           arm64e   64-bit arm (Apple Silicon)

When nightlies are done / I can run tests via jenkins again, the plan is to try arm64e. This one is certainly very peculiar. It does seem like the orka3 update improved the results on #20522: 2/2 tests fail at the same location. Will push something up to see if it repeats, but we may want to consider restoring an intermediate solution (only filter out the known-failing tests).
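
When comparing the CI VM, a ScreenShare session, and a non-virtualized machine, it may help to record which architecture a process actually reports in each environment. The following is only a sketch using Python's standard `platform` module; the function name `describe_runtime` is hypothetical, and whether the reported value differs between these environments is an open question rather than a confirmed finding.

```python
import platform
import sys


def describe_runtime():
    """Return the machine architecture and interpreter details for this process."""
    return {
        "machine": platform.machine(),    # e.g. "arm64" or "x86_64"
        "platform": platform.platform(),  # OS name, version, and architecture
        "python": sys.version.split()[0],
    }


info = describe_runtime()
print(info)
```

Logging this at the start of a test run would make it easier to compare environments side by side.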

svenevs commented 6 months ago

Some additional findings: the test appears to be too resource heavy somehow. I thought that was because of the image saving tests from #20470, but as it turns out, it happens even if you are only running internal_render_engine_vtk_test (https://github.com/RobotLocomotion/drake/pull/20470#pullrequestreview-1950245426).

It seems like maybe there are some JVM arguments that could be getting added to the orka agent on the Jenkins cloud settings, but it is not clear to me at this juncture what those arguments may or may not be.

See also: https://github.com/RobotLocomotion/drake-ci/pull/269#issuecomment-2010575474

There's some weird setup as to where things are actually building on the macs. It could possibly be related to that, but it seems unlikely. On the Ubuntu side, we have the init_script create the filesystem for /tmp on the newly attached EBS volume when the instance launches with jvm options -Djava.io.tmpdir=/media/ephemeral0/tmp. On macOS, we just use everything directly (since the storage is already there).

But things that are worth trying would be to change the heap size, or possibly some other java flags related to memory and/or cpu usage (?). This is hard to diagnose :disappointed:

svenevs commented 6 months ago

Ok, I'm starting to run out of ideas. No java flags made a discernible impact. The only other thought I had was that we could try splitting internal_render_engine_vtk_test into multiple different test files. I gave an initial (dirty and quick...) attempt at that, but splitting it into a library isn't valid: none of the tests actually run (ASSERT_EQ(true, false) for example...).

https://github.com/RobotLocomotion/drake/commit/841e3a836523b78b1c977b062636482bd766bc6a

Is there a way to do that in a straightforward way? Take the class definitions and put them in a header file, and just have multiple different test files #include it? (rather than link against a test library trying to do the same thing) I'm not particularly a gflags / gtest expert, but the idea was basically: if all the tests being in one file is too resource intensive, splitting it into multiple smaller tests may be successful.

Unfortunately, I'm not quite sure what else to try :cry: