RobotLocomotion / drake

Model-based design and verification for robotics.
https://drake.mit.edu

CI passing for M1 builds on Apple silicon (ARM64) #17009

Closed: jwnimmer-tri closed this issue 2 years ago

jwnimmer-tri commented 2 years ago

CI support for ARM64 builds on M1 machines will be ready soon (#16380). Once it's ready, we should work to get CI passing in that Nightly Unprovisioned Release configuration.

We can use this issue to track the overall goal, and an inventory of known problems.

The goal here is to open various pull requests that each monotonically improve the situation, until at the end we have passing CI. It's fine to disable Drake features (e.g., dReal) in order to finish this work more quickly; we can open a tracking issue to remind ourselves to re-enable those features later.

See also #13514 for an overall tracking issue, and some discussions about known problems.

jwnimmer-tri commented 2 years ago

CC @mwoehlke-kitware @svenevs FYI

svenevs commented 2 years ago

For parties interested in working on this, the jobs are currently listed under the Mac M1 Jenkins tab. Due to technical complications, only the unprovisioned configuration is available for now. Install time is around 10 minutes; to test the unprovisioned M1 build, comment @drake-jenkins-bot mac-m1-monterey-unprovisioned-clang-bazel-experimental-release please. on your PR.

svenevs commented 2 years ago

I went through the remaining output and am calling out some notable failures. The failed solver builds are likely related, so fixing any tests should wait until we resolve the build issues. The Jenkins job log is here.

multibody/plant/test/tamsi_solver_test.cc

Possibly an easy fix by making the tolerance more generous; see the sketch after the log below.

[8:52:21 PM]  multibody/plant/test/tamsi_solver_test.cc:1006: Failure
[8:52:21 PM]  Value of: CompareMatrices(J, J_expected, J_tolerance, MatrixCompareType::relative)
[8:52:21 PM]    Actual: false (Values at (0, 0) exceed tolerance: 247883.36025055303 vs. 247883.36025055134, diff = 1.6880221664905548e-09, tolerance = 6.6613381477509392e-15, relative tolerance = 1.6512348838296977e-09
[8:52:21 PM]  m1 =
[8:52:21 PM]   247883.36025055303 -259.07339118332754  247882.36025055303
[8:52:21 PM]                    0  3120.6847005105024                   0
[8:52:21 PM]   247882.36025055303 -259.07339118332754  247883.36025055303
[8:52:21 PM]  m2 =
[8:52:21 PM]   247883.36025055134 -259.07339118332595  247882.36025055134
[8:52:21 PM]                    0  3120.6847005105024                   0
[8:52:21 PM]   247882.36025055134 -259.07339118332595  247883.36025055134
[8:52:21 PM]  delta=
[8:52:21 PM]   1.6880221664905548e-09 -1.5916157281026244e-12  1.6880221664905548e-09
[8:52:21 PM]                        0                       0                       0
[8:52:21 PM]   1.6880221664905548e-09 -1.5916157281026244e-12  1.6880221664905548e-09)
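
For reference, the likely fix is just a more generous relative tolerance at this comparison. Below is a minimal, self-contained sketch of the idea using plain Eigen rather than Drake's CompareMatrices helper; the CompareRelative name and the 128 * eps value are hypothetical choices for illustration, not vetted constants:

#include <algorithm>
#include <cmath>
#include <iostream>
#include <limits>

#include <Eigen/Dense>

// Element-wise relative comparison in the spirit of
// CompareMatrices(..., MatrixCompareType::relative): each entry must agree
// to within rel_tol times its own magnitude.
bool CompareRelative(const Eigen::MatrixXd& m1, const Eigen::MatrixXd& m2,
                     double rel_tol) {
  for (int i = 0; i < m1.rows(); ++i) {
    for (int j = 0; j < m1.cols(); ++j) {
      const double scale = std::max(std::abs(m1(i, j)), std::abs(m2(i, j)));
      if (std::abs(m1(i, j) - m2(i, j)) > rel_tol * scale) return false;
    }
  }
  return true;
}

int main() {
  // Entry (0, 0) from the failure above: the two values agree to about
  // 15 significant digits, but not to the ~30 * epsilon tolerance used.
  Eigen::MatrixXd m1(1, 1), m2(1, 1);
  m1 << 247883.36025055303;
  m2 << 247883.36025055134;
  const double eps = std::numeric_limits<double>::epsilon();
  std::cout << CompareRelative(m1, m2, 30 * eps) << "\n";   // 0 (fails).
  std::cout << CompareRelative(m1, m2, 128 * eps) << "\n";  // 1 (passes).
}

Whether 128 * eps (or some other multiple) is acceptable is exactly the judgment call to hand to the test's maintainer.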

solvers/test/cost_test.cc

Four failures related to similar tolerance problems.

[8:55:57 PM]  z.cwiseAbs().sum() evaluates to 5.7460000000000004,
[8:55:57 PM]  math::ExtractValue(y)[0] evaluates to 5.7460000000000013.
[8:55:57 PM]  The abs_error parameter 1e-16 evaluates to 9.9999999999999998e-17 which is smaller than the minimum distance between doubles for numbers of this magnitude which is 8.8817841970012523e-16, thus making this EXPECT_NEAR check equivalent to EXPECT_EQUAL. Consider using EXPECT_DOUBLE_EQ instead.
[8:55:57 PM]  solvers/test/cost_test.cc:406: Failure
[8:55:57 PM]  The difference between z.cwiseAbs().sum() and y[0].Evaluate(env) is 1.7763568394002505e-15, which exceeds 1e-15, where
[8:55:57 PM]  z.cwiseAbs().sum() evaluates to 5.7460000000000004,
[8:55:57 PM]  y[0].Evaluate(env) evaluates to 5.7460000000000022, and
[8:55:57 PM]  1e-15 evaluates to 1.0000000000000001e-15.
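
The gtest output above essentially names the fix: doubles near 5.746 are spaced about 8.9e-16 apart, so an absolute bound of 1e-16 (or even 1e-15) demands near-bitwise equality. A hedged sketch of the two usual repairs, EXPECT_DOUBLE_EQ (which allows 4 ULPs) or an epsilon-scaled bound; the values are copied from the log, and the 4 * eps * |expected| bound is an illustrative choice, not the actual cost_test fix:

#include <cmath>
#include <limits>

#include <gtest/gtest.h>

GTEST_TEST(ToleranceDemo, NearVsDoubleEq) {
  const double expected = 5.7460000000000004;  // z.cwiseAbs().sum()
  const double actual = 5.7460000000000013;    // math::ExtractValue(y)[0]
  // Would fail: 1e-16 is below one ULP (~8.9e-16) at this magnitude.
  // EXPECT_NEAR(actual, expected, 1e-16);
  // Passes: the two values differ by about one ULP.
  EXPECT_DOUBLE_EQ(actual, expected);
  // Also passes: a relative bound that scales with the magnitude.
  const double eps = std::numeric_limits<double>::epsilon();
  EXPECT_NEAR(actual, expected, 4 * eps * std::abs(expected));
}

The second failure in the log (diff 1.7763568394002505e-15 against a 1e-15 bound) is the same story: the difference is two ULPs, so either repair above would absorb it.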

geometry/proximity/test/penetration_as_point_pair_callback_test.cc

[8:57:34 PM]  geometry/proximity/test/penetration_as_point_pair_callback_test.cc:240: Failure
[8:57:34 PM]  Expected equality of these values:
[8:57:34 PM]    ExtractDoubleOrThrow(first_result.depth)
[8:57:34 PM]      Which is: 0.1
[8:57:34 PM]    point_pairs_double[0].depth
[8:57:34 PM]      Which is: 0.1
[8:57:34 PM]  [  FAILED  ] PenetrationAsPointPairCallbackTest.SphereCapsuleAutoDiffXd (0 ms)

multibody/plant/test/multibody_plant_reaction_forces_test.cc

Some close calls with tolerances:

[8:59:23 PM]  multibody/plant/test/multibody_plant_reaction_forces_test.cc:468: Failure
[8:59:23 PM]  Value of: CompareMatrices(F_BJb_Jb.translational(), f_BJb_Jb_expected, kTolerance)
[8:59:23 PM]    Actual: false (Values at (1, 0) exceed tolerance: -37.499999999999993 vs. -37.5, diff = 7.1054273576010019e-15, tolerance = 4.4408920985006262e-15
[8:59:23 PM]  m1 =
[8:59:23 PM]  -2.5834338933852281e-15
[8:59:23 PM]      -37.499999999999993
[8:59:23 PM]                       15
[8:59:23 PM]  m2 =
[8:59:23 PM]      0
[8:59:23 PM]  -37.5
[8:59:23 PM]     15
[8:59:23 PM]  delta=
[8:59:23 PM]  -2.5834338933852281e-15
[8:59:23 PM]   7.1054273576010019e-15
[8:59:23 PM]                        0)
[8:59:23 PM]  Expected: true

systems/controllers/test/finite_horizon_linear_quadratic_regulator_test.cc

[9:07:12 PM]  systems/controllers/test/finite_horizon_linear_quadratic_regulator_test.cc:267: Failure
[9:07:12 PM]  Value of: CompareMatrices(result.S->value(t0), lqr_result.S, 1e-5)
[9:07:12 PM]    Actual: false (Values at (0, 0) exceed tolerance: 0.13183043253623802 vs. 0.13184591346023167, diff = 1.5480923993643758e-05, tolerance = 1.0000000000000001e-05)

examples/compass_gait/test/compass_gait_test.cc

[9:12:09 PM]  [ RUN      ] CompassGaitTest.TestCollisionDynamics
[9:12:09 PM]  examples/compass_gait/test/compass_gait_test.cc:208: Failure
[9:12:09 PM]  The difference between angular_momentum_before and angular_momentum_after is 2.4868995751603507e-14, which exceeds 5e-15, where
[9:12:09 PM]  angular_momentum_before evaluates to 26.381829820086555,
[9:12:09 PM]  angular_momentum_after evaluates to 26.38182982008653, and
[9:12:09 PM]  5e-15 evaluates to 5e-15.

jwnimmer-tri commented 2 years ago

Yes, generally we should defer work on unit test failures until the end, especially numerical-tolerance ones.

For any tolerance failures, we will not try to fix them within the build system team in any case. We will open a ticket against the feature test's maintainer to resolve them (after the build fixups are complete).

jwnimmer-tri commented 2 years ago

We will open a ticket against the feature test's maintainer to resolve it.

One amendment to this... if we like, there's a slightly different plan that we (the build system team) could use:

For each test-tolerance failure (i.e., each test program whose only failures are epsilon-scale comparison differences), we could open a Pull Request that increases the comparison tolerance, and assign it to the feature's maintainer. If they are satisfied with the change, they can approve and we just merge it without the extra hoopla of an issue number. If they would like to investigate on their own and solve it a different way, they can do so and either push tweaks to the PR, or open a new PR instead.

The pull request number serves the same purpose an issue number would -- a linkable, central point to discuss a particular test failure and what to do about it.

svenevs commented 2 years ago

I've been working on the OpenGL context issues; there seem to be some odd failures in RenderEngineVtk:

[ RUN      ] RenderEngineVtkTest.TerrainTest
geometry/render/test/render_engine_vtk_test.cc:223: Failure
Value of: CompareColor(test_color, *color, ScreenCoord{x, y})
  Actual: false (Expected: (0, 0, 0, 255) at (0, 0), tested: (254, 127, 0, 255) with tolerance: 1.0009999999999999)
Expected: true
Google Test trace:
geometry/render/test/render_engine_vtk_test.cc:524: Valid depth return: 4.9999
geometry/render/test/render_engine_vtk_test.cc:233: Failure
Expected equality of these values:
  label->at(x, y)[0]
    Which is: 32766
  value
    Which is: 32764
At pixel (0, 0)
Google Test trace:
geometry/render/test/render_engine_vtk_test.cc:524: Valid depth return: 4.9999
geometry/render/test/render_engine_vtk_test.cc:251: Failure
The difference between depth->at(x, y)[0] and value is inf, which exceeds kDepthTolerance, where
depth->at(x, y)[0] evaluates to inf,
value evaluates to 4.9998998641967773, and
kDepthTolerance evaluates to 0.001.
Google Test trace:
geometry/render/test/render_engine_vtk_test.cc:524: Valid depth return: 4.9999
geometry/render/test/render_engine_vtk_test.cc:223: Failure
Value of: CompareColor(test_color, *color, ScreenCoord{x, y})
  Actual: false (Expected: (0, 0, 0, 255) at (0, 0), tested: (254, 127, 0, 255) with tolerance: 1.0009999999999999)
Expected: true
Google Test trace:
geometry/render/test/render_engine_vtk_test.cc:534: Closer than near
geometry/render/test/render_engine_vtk_test.cc:233: Failure
Expected equality of these values:
  label->at(x, y)[0]
    Which is: 32766
  value
    Which is: 32764
At pixel (0, 0)
Google Test trace:
geometry/render/test/render_engine_vtk_test.cc:534: Closer than near
geometry/render/test/render_engine_vtk_test.cc:251: Failure
The difference between depth->at(x, y)[0] and value is inf, which exceeds kDepthTolerance, where
depth->at(x, y)[0] evaluates to inf,
value evaluates to 0, and
kDepthTolerance evaluates to 0.001.
Google Test trace:
geometry/render/test/render_engine_vtk_test.cc:534: Closer than near
[  FAILED  ] RenderEngineVtkTest.TerrainTest (4459 ms)

and

[ RUN      ] RenderEngineVtkTest.HorizonTest
geometry/render/test/render_engine_vtk_test.cc:599: Failure
The difference between expected_horizon and actual_horizon is 263.17645019878171, which exceeds 1.001, where
expected_horizon evaluates to 263.17645019878171,
actual_horizon evaluates to 0, and
1.001 evaluates to 1.0009999999999999.
z = 2
[  FAILED  ] RenderEngineVtkTest.HorizonTest (511 ms)

and one that seems like it could be ignored

[ RUN      ] RenderEngineVtkTest.MeshTest
[2022-04-29 17:52:14.723] [console] [warning] Requested diffuse map could not be found: bad_path
[       OK ] RenderEngineVtkTest.MeshTest (1454 ms)

AFAIU, bazel test //geometry/render:render_engine_vtk_test does not get hit by any of the other compilation-related issues, so it seems likely the tests themselves will need to be adapted.

jwnimmer-tri commented 2 years ago

For work on this issue, I'd like to prioritize getting the compilation issues resolved first (so that people stop bothering Russ about that), and treat test failures as a lower priority. Of course that will be subject to available staffing/skills, so it's not a hard-and-fast rule, but I thought I'd at least make the priority clear, and we can match it within the limits of the practical.

svenevs commented 2 years ago

Works for me :+1: The OpenGL one has a ticket out; that's not something anyone else can solve, so for now it's on hold.

For the remaining ones, we really just want to disable them by default using Drake's Bazel configs (e.g., disable PETSc), right?

jwnimmer-tri commented 2 years ago

We can't easily disable PETSc -- it's not a wrapped SolverInterface (the way dReal and IBEX are wrapped); it's called directly by our library code. In #16432, I explained the next step for getting PETSc working. I think there's a >50% chance that it's a simple, one-hour fix.

For dReal and IBEX, it's fine to disable them for now. If we go that route, it needs to happen automatically on M1, even for CMake builds. I don't want to "disable" them via documentation instructions.
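
For background on why the wrapped solvers are the easy case: a disabled solver back-end still presents its shim, which simply reports itself unavailable, so callers keep compiling and linking. A minimal sketch of that seam, using a hypothetical stand-in interface (Drake's real drake::solvers::SolverInterface has more methods, but the pattern is the same):

#include <iostream>

// Hypothetical stand-in for a wrapped solver back-end interface.
class SolverBackend {
 public:
  virtual ~SolverBackend() = default;
  virtual bool available() const = 0;
};

// When the real dReal library is excluded from the build (e.g., on M1),
// a stub like this keeps the rest of the code base compiling; callers
// check available() before attempting to solve.
class DrealStub final : public SolverBackend {
 public:
  bool available() const final { return false; }
};

int main() {
  const DrealStub dreal;
  std::cout << "dReal available: " << dreal.available() << "\n";
  // PETSc has no such seam: library code calls its C API directly, so if
  // PETSc does not build on a platform, neither does the calling code.
  return 0;
}

This is why dReal and IBEX can be switched off per-platform, while PETSc has to actually build.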

mwoehlke-kitware commented 2 years ago

I explained the next step for getting PETSc working. I think there's a >50% chance that it's a simple, one-hour fix.

The issue there is going to be that petscconf.h is different on M1 vs. x86, and I'm not sure yet how to handle that. Otherwise, yes, it shouldn't be too bad. IIRC, though, I had a successful build at one point after a bunch of bludgeoning away at petscconf.h, so (relatively) "simple" should at least be true. I'll post more in #16432.

RussTedrake commented 2 years ago

FWIW, re: quadmath -- my patch has been building fine by simply removing the dependency. And if you're able to resolve this one for me, I can (finally) stop carrying the patch.

svenevs commented 2 years ago

Re: quadmath. My patch has been building fine by simply removing the dependency.

@RussTedrake I apologize for my ignorance ... would you be willing to link the patch? I'm having a hard time finding what we're doing or what the problem even is with quadmath. I went through the build logs again, but if there's something there, I missed it / it gets drowned out by the petsc output. The only other description here is the checklist item "quadmath something something" :no_mouth: The only related thing in the code is gfortran, but I couldn't find any M1-specific patches touching that on your fork.

OpenGL Context problems

Update on this one: I'm 95% certain it is resolved now, but some images need to be rebuilt. There's a related ticket out on that, but if they don't respond soon I will just rebuild them now. It's a test failure (not a build failure), and the VTK tests will need to be fixed regardless of CI image status: https://github.com/RobotLocomotion/drake/issues/17009#issuecomment-1113582278

RussTedrake commented 2 years ago

My patch is here: https://github.com/RussTedrake/drake/commit/38b013a3a0ceb21a16d43c89a6458efdc03f0b1c (sorry I didn't link it before). The IBEX part of the patch is moot for now, since IBEX is disabled on arm64.

To be clear, this patch has been enough for me to run an install, but not enough to pass all the tests (nor even to build all of Drake; petsc, for instance, was not included in any install artifacts).

RussTedrake commented 2 years ago

And to be clear, the reason it doesn't look "M1 specific" is that I carry this patch around whenever I develop on M1, but do not try to use it on any other platform. Adding a conditional in the Bazel rules to remove quadmath iff arm64 might be enough that I don't need the patch anymore.

svenevs commented 2 years ago

Ok great, thanks! I'll work on removing linking against quadmath in the gfortran repo tomorrow :slightly_smiling_face:

jwnimmer-tri commented 2 years ago

I just peeked at https://packages.ubuntu.com/focal/libgfortran5 and can confirm that quadmath is [amd64, i386] only. When building on ARM64, we should neither try to find -lquadmath nor link against it.

jwnimmer-tri commented 2 years ago

I'll work on removing linking against quadmath in the gfortran repo tomorrow

Please always use the https://github.com/RobotLocomotion/drake/projects/2 prioritization to guide your work. This ticket is in "To Do" -- Kitware should not start working on it until the other, higher-priority cards are finished (all of "In Progress", and anything higher up in the To Do list).

jwnimmer-tri commented 2 years ago

For the record, as of two nights ago the build step works perfectly now. There are still some runtime test failures to dig through, of various causes.

jwnimmer-tri commented 2 years ago

FYI I've demoted this from "To do" to "On deck" because, per @ToffeeAlbina-TRI, we do not yet have a contract in place to pay for the M1 virtual machines. Without CI to back us up, it's not worth trying to fix build or test errors anymore. We can revisit once we have a virtual machine contract in place.

jwnimmer-tri commented 2 years ago

We have $$ for this CI now. We can resume work.

jwnimmer-tri commented 2 years ago

Setting the Gurobi issues aside (#16380) and looking only at mac-m1-monterey-unprovisioned-clang-bazel-nightly-release, the only remaining problem seems to be a test failure in //geometry/render_vtk:internal_render_engine_vtk_test -- shown in more detail above at https://github.com/RobotLocomotion/drake/issues/17009#issuecomment-1113582278.

jwnimmer-tri commented 2 years ago

... the only remaining problem seems to be a test failure ...

Filed as #17566 for better tracking, so closing this now.