BetsyMcPhail opened 1 year ago
See slack discussion: https://drakedevelopers.slack.com/archives/C270MN28G/p1675172012911669
After comparing package versions on a working job and a failing job, we found that updating libglapi-mesa, which updates libegl-mesa0, libgbm1, libgl1-mesa-dri, and libglx-mesa0 and installs libllvm15, causes the failure.
We don't directly install any of these gl/mesa packages; instead, they are installed as dependencies of xvfb as part of the CI setup. Therefore, none of the gl/mesa packages or xvfb are installed on the unprovisioned images. If we simply try to apt-mark hold the gl/mesa packages, xvfb can't be installed.
I ran the following commands to update the unprovisioned Jammy image:
sudo apt-get update # Do NOT upgrade
sudo apt-get install <package name>=22.0.1-1ubuntu2 # For the packages listed below
sudo apt-mark hold libglapi-mesa libgbm1 libegl-mesa0 libgl1-mesa-dri libglx-mesa0
sudo apt-get install xvfb # Should see message about packages *not* being upgraded
sudo apt-get update
sudo apt-get upgrade
Then, fix the prompt issue (slack thread) by following the instructions here.
xvfb and its dependencies are already installed on the provisioned Jammy image, so that image only needed:
sudo apt-mark hold libglapi-mesa libgbm1 libegl-mesa0 libgl1-mesa-dri libglx-mesa0
sudo apt-get update
sudo apt-get upgrade
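To double-check that the holds and pins took effect before handing an image back to CI, a quick sanity check (not part of the original procedure; apt-mark showhold and apt-cache policy are standard apt commands):
apt-mark showhold  # should list the five held gl/mesa packages
apt-cache policy libglapi-mesa  # "Installed:" should read 22.0.1-1ubuntu2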
Note: Both updated images are dated 2023-02-02
Last night, Jammy unprovisioned builds failed with the error:
[10:23:14 AM] Errors were encountered while processing:
[10:23:14 AM] shim-signed
[10:23:14 AM] needrestart is being skipped since dpkg has failed
[10:23:15 AM] E: Sub-process /usr/bin/dpkg returned an error code (1)
As this was also an error on Jammy unprovisioned images, adding it to this ticket.
Slack discussion here: https://drakedevelopers.slack.com/archives/C270MN28G/p1678114026913369
mkdir /dev/shm
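If creating the directory alone turns out not to be enough, a fuller sketch (assumption: the image lost its shared-memory mount entirely; only the mkdir above is confirmed by the thread) would be:
sudo mkdir -p /dev/shm  # confirmed fix from the thread
sudo mount -t tmpfs tmpfs /dev/shm  # assumption: also remount tmpfs if the mount itself is missing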
Are these packages still marked "on hold" in CI jobs? I didn't see any code in drake-ci that reflects the hold.
> Are these packages still marked "on hold" in CI jobs? I didn't see any code in drake-ci that reflects the hold.
Given my experience with https://github.com/RobotLocomotion/drake-blender/pull/10, I suspect that the hold is still required. I suppose the call to action for this issue, then, is to reflect the required package configuration details back into either the drake-ci scripting or, at worst, some CI documentation.
> Are these packages still marked "on hold" in CI jobs? I didn't see any code in drake-ci that reflects the hold.
They are on hold in the Jammy AMIs. The drake-ci scripts haven't been updated to reflect this, but they should be.
A linux-jammy-gcc-bazel-continuous-release instance today, for //bindings/pydrake/visualization:py/model_visualizer_test:
[11:46:25 AM] FAIL: //bindings/pydrake/visualization:py/model_visualizer_test (see /media/ephemeral0/ubuntu/workspace/linux-jammy-gcc-bazel-continuous-release/_bazel_ubuntu/e49b56940e12c1ddd58d01fade4e7033/execroot/drake/bazel-out/k8-opt/testlogs/bindings/pydrake/visualization/py/model_visualizer_test/test.log)
[11:46:25 AM] ==================== Test output for //bindings/pydrake/visualization:py/model_visualizer_test:
[11:46:25 AM] Matplotlib created a temporary config/cache directory at /tmp/matplotlib-dv8v3sr5 because the default path (/home/ubuntu/.config/matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.
[11:46:25 AM]
[11:46:25 AM] Running tests...
[11:46:25 AM] ----------------------------------------------------------------------
[11:46:25 AM] .INFO:drake:Meshcat listening for connections at http://localhost:7003/
[11:46:25 AM] X Error of failed request: BadValue (integer parameter out of range for operation)
[11:46:25 AM] Major opcode of failed request: 130 (MIT-SHM)
[11:46:25 AM] Minor opcode of failed request: 3 (X_ShmPutImage)
[11:46:25 AM] Value in failed request: 0x5c0
[11:46:25 AM] Serial number of failed request: 46
[11:46:25 AM] Current serial number in output stream: 47
[11:46:25 AM] ================================================================================
Another instance, in linux-focal-clang-bazel-nightly-everything-debug:
[2:53:33 AM] .INFO:drake:Meshcat listening for connections at http://localhost:7000/
[2:53:33 AM] X Error of failed request: BadValue (integer parameter out of range for operation)
[2:53:33 AM] Major opcode of failed request: 130 (MIT-SHM)
[2:53:33 AM] Minor opcode of failed request: 3 (X_ShmPutImage)
[2:53:33 AM] Value in failed request: 0x5c0
[2:53:33 AM] Serial number of failed request: 45
[2:53:33 AM] Current serial number in output stream: 46
This happened again in linux-jammy-gcc-bazel-continuous-debug.
INFO:drake:Meshcat listening for connections at http://localhost:7001
X Error of failed request: BadValue (integer parameter out of range for operation)
Major opcode of failed request: 130 (MIT-SHM)
Minor opcode of failed request: 3 (X_ShmPutImage)
Value in failed request: 0x5c0
Serial number of failed request: 46
Current serial number in output stream: 47
Matplotlib created a temporary config/cache directory at /tmp/matplotlib-7q4juliq because the default path (/home/ubuntu/.config/matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.
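One way to probe whether MIT-SHM itself is implicated (a diagnostic sketch only; the thread doesn't confirm anyone tried this, and the display number :99 and screen geometry are assumptions) is to rerun a failing test under an Xvfb that has the extension disabled, using the X server's standard -extension flag:
Xvfb :99 -screen 0 1280x1024x24 -extension MIT-SHM &  # run a virtual display with MIT-SHM disabled
DISPLAY=:99 bazel test //bindings/pydrake/visualization:py/model_visualizer_test
If the BadValue error disappears with the extension off, the shared-memory path is implicated rather than the display setup as a whole.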
It seems issue #18726 is causing CI failures in PR #19803. Here is the text of the error message that I see for linux-jammy-gcc-bazel-experimental-debug:
Matplotlib created a temporary config/cache directory at /tmp/matplotlib-sco6bphl because the default path (/home/ubuntu/.config/matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.
Running tests...
INFO:drake:Meshcat listening for connections at http://localhost:7010/
X Error of failed request: BadValue (integer parameter out of range for operation)
Major opcode of failed request: 130 (MIT-SHM)
Minor opcode of failed request: 3 (X_ShmPutImage)
Value in failed request: 0x5c0
Serial number of failed request: 46
Current serial number in output stream: 47
Matplotlib created a temporary config/cache directory at /tmp/matplotlib-iwm76bst because the default path (/home/ubuntu/.config/matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.
Came up in https://github.com/RobotLocomotion/drake/pull/20621 (https://drake-cdash.csail.mit.edu/test/1322485744), in drake/examples/hardware_sim/hardware_sim_cc; relaunched.
Came up again in #20620 (https://drake-cdash.csail.mit.edu/test/1325106110), again in drake/examples/hardware_sim/hardware_sim.cc.
We've also been seeing elevated rates of this flake in our TRI-internal CI. cc @ggould-tri FYI.
If I had to guess, I'd say that the proximate cause would be either: (1) Some Ubuntu library has been upgraded out from underneath us. (2) Some RenderEngineVtk change increased brittleness (e.g., #20385, #20492) -- but those are old enough that they don't seem to correlate with these elevated rates.
FYI:
Our internal project is seeing this error on a reliably different request code, which may indicate a different problem. I recommend looking at the X source on the other end of the RPC call to see under what situations calls to X_ShmPutImage could lead to BadShmSeg (start here: https://www.x.org/releases/X11R7.7/doc/xextproto/shm.html; this is shared-memory IPC, so maybe a server process crash?). If it is the absence of a relevant display, it might be the same problem.
For our internal X failure problem (which may be unrelated), I have not yet found a correlation with an Ubuntu library change. I suspect a race between a test that is crashing or otherwise forcing a restart of a display-supporting process and another test that is detecting and using that display; as such, it is nearly always our longest-running display-requiring tests that show the failure.
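For anyone chasing that race theory, two quick checks with standard X utilities (the display number :99 is an assumption; CI may use a different one):
xdpyinfo -display :99 | grep -i shm  # confirm the server still advertises MIT-SHM
pgrep -a Xvfb  # confirm the display server process survived the test run
If the second command comes up empty right after a failure, that would support the crash-and-restart hypothesis.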
I've given up trying to log all of these, but they turn up a lot in PRs. https://drake-jenkins.csail.mit.edu/job/linux-focal-gcc-bazel-experimental-everything-release/11393/consoleFull
What happened?
Some tests began failing in Jammy unprovisioned jobs on January 30th. For example, https://drake-jenkins.csail.mit.edu/view/Nightly%20Production/job/linux-jammy-unprovisioned-gcc-bazel-nightly-release/201/
The failing tests are:
//bindings/pydrake/geometry:py/scene_graph_test
//bindings/pydrake/visualization:py/model_visualizer_test
//bindings/pydrake/visualization:py/video_test
//examples/hardware_sim:py/hardware_sim_cc_test
//examples/hardware_sim:py/hardware_sim_py_test
//geometry/render_gltf_client:py/acceptance_test
//geometry/render_gltf_client:py/integration_test
//tutorials:py/rendering_multibody_plant_test
They all fail with the message:
Version
No response
What operating system are you using?
Ubuntu 22.04
What installation option are you using?
No response
Relevant log output
No response