BetsyMcPhail opened 1 year ago
See slack discussion: https://drakedevelopers.slack.com/archives/C270MN28G/p1675172012911669
After comparing package versions on a working job and a failing job, we found that updating libglapi-mesa, which updates libegl-mesa0, libgbm1, libgl1-mesa-dri, and libglx-mesa0 and installs libllvm15, causes the failure.
We don't directly install any of these gl/mesa packages; instead, they are installed as dependencies of xvfb as part of the CI setup. Therefore, none of the gl/mesa packages or xvfb are installed on the unprovisioned images. If we simply try to apt-mark hold the gl/mesa packages, xvfb can't be installed.
I ran the following commands to update the unprovisioned Jammy image:
sudo apt-get update # Do NOT upgrade
sudo apt-get install <package name>=22.0.1-1ubuntu2 # For the packages listed below
sudo apt-mark hold libglapi-mesa libgbm1 libegl-mesa0 libgl1-mesa-dri libglx-mesa0
sudo apt-get install xvfb # Should see message about packages *not* being upgraded
sudo apt-get update
sudo apt-get upgrade
Then, fix the prompt issue (slack thread) by following the instructions here.
xvfb and its dependencies are already installed on the provisioned Jammy image, so that image only needed:
sudo apt-mark hold libglapi-mesa libgbm1 libegl-mesa0 libgl1-mesa-dri libglx-mesa0
sudo apt-get update
sudo apt-get upgrade
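To double-check that the holds and pins took effect before handing an image back to CI, a quick sanity check (not part of the original procedure; apt-mark showhold and apt-cache policy are standard apt commands):
apt-mark showhold  # should list the five held gl/mesa packages
apt-cache policy libglapi-mesa  # "Installed:" should read 22.0.1-1ubuntu2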
Note: Both updated images are dated 2023-02-02
Last night, Jammy unprovisioned builds failed with the error:
[10:23:14 AM] Errors were encountered while processing:
[10:23:14 AM] shim-signed
[10:23:14 AM] needrestart is being skipped since dpkg has failed
[10:23:15 AM] E: Sub-process /usr/bin/dpkg returned an error code (1)
As this was also an error on Jammy unprovisioned images, adding it to this ticket.
Slack discussion here: https://drakedevelopers.slack.com/archives/C270MN28G/p1678114026913369
mkdir /dev/shm
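If creating the directory alone turns out not to be enough, a fuller sketch (assumption: the image lost its shared-memory mount entirely; only the mkdir above is confirmed by the thread) would be:
sudo mkdir -p /dev/shm  # confirmed fix from the thread
sudo mount -t tmpfs tmpfs /dev/shm  # assumption: also remount tmpfs if the mount itself is missing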
Are these packages still marked "on hold" in CI jobs? I didn't see any code in drake-ci that reflects the hold.
> Are these packages still marked "on hold" in CI jobs? I didn't see any code in drake-ci that reflects the hold.
Given my experience with https://github.com/RobotLocomotion/drake-blender/pull/10, I suspect that the hold is still required. I suppose the call to action for this issue, then, is to reflect the required package configuration details back into either the drake-ci scripting or, at worst, some CI documentation.
> Are these packages still marked "on hold" in CI jobs? I didn't see any code in drake-ci that reflects the hold.
They are on hold in the Jammy AMIs. The drake-ci scripts haven't been updated to reflect this, but they should be.
A linux-jammy-gcc-bazel-continuous-release instance today, for //bindings/pydrake/visualization:py/model_visualizer_test:
[11:46:25 AM] FAIL: //bindings/pydrake/visualization:py/model_visualizer_test (see /media/ephemeral0/ubuntu/workspace/linux-jammy-gcc-bazel-continuous-release/_bazel_ubuntu/e49b56940e12c1ddd58d01fade4e7033/execroot/drake/bazel-out/k8-opt/testlogs/bindings/pydrake/visualization/py/model_visualizer_test/test.log)
[11:46:25 AM] ==================== Test output for //bindings/pydrake/visualization:py/model_visualizer_test:
[11:46:25 AM] Matplotlib created a temporary config/cache directory at /tmp/matplotlib-dv8v3sr5 because the default path (/home/ubuntu/.config/matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.
[11:46:25 AM]
[11:46:25 AM] Running tests...
[11:46:25 AM] ----------------------------------------------------------------------
[11:46:25 AM] .INFO:drake:Meshcat listening for connections at http://localhost:7003/
[11:46:25 AM] X Error of failed request: BadValue (integer parameter out of range for operation)
[11:46:25 AM] Major opcode of failed request: 130 (MIT-SHM)
[11:46:25 AM] Minor opcode of failed request: 3 (X_ShmPutImage)
[11:46:25 AM] Value in failed request: 0x5c0
[11:46:25 AM] Serial number of failed request: 46
[11:46:25 AM] Current serial number in output stream: 47
[11:46:25 AM] ================================================================================
Another instance, in linux-focal-clang-bazel-nightly-everything-debug:
[2:53:33 AM] .INFO:drake:Meshcat listening for connections at http://localhost:7000/
[2:53:33 AM] X Error of failed request: BadValue (integer parameter out of range for operation)
[2:53:33 AM] Major opcode of failed request: 130 (MIT-SHM)
[2:53:33 AM] Minor opcode of failed request: 3 (X_ShmPutImage)
[2:53:33 AM] Value in failed request: 0x5c0
[2:53:33 AM] Serial number of failed request: 45
[2:53:33 AM] Current serial number in output stream: 46
This happened again in linux-jammy-gcc-bazel-continuous-debug.
INFO:drake:Meshcat listening for connections at http://localhost:7001
X Error of failed request: BadValue (integer parameter out of range for operation)
Major opcode of failed request: 130 (MIT-SHM)
Minor opcode of failed request: 3 (X_ShmPutImage)
Value in failed request: 0x5c0
Serial number of failed request: 46
Current serial number in output stream: 47
Matplotlib created a temporary config/cache directory at /tmp/matplotlib-7q4juliq because the default path (/home/ubuntu/.config/matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.
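One way to probe whether MIT-SHM itself is implicated (a diagnostic sketch only; the thread doesn't confirm anyone tried this, and the display number :99 and screen geometry are assumptions) is to rerun a failing test under an Xvfb that has the extension disabled, using the X server's standard -extension flag:
Xvfb :99 -screen 0 1280x1024x24 -extension MIT-SHM &  # run a virtual display with MIT-SHM disabled
DISPLAY=:99 bazel test //bindings/pydrake/visualization:py/model_visualizer_test
If the BadValue error disappears with the extension off, the shared-memory path is implicated rather than the display setup as a whole.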
It seems issue #18726 is causing CI failures in PR #19803. Here is the text of the error message that I see for linux-jammy-gcc-bazel-experimental-debug:
Matplotlib created a temporary config/cache directory at /tmp/matplotlib-sco6bphl because the default path (/home/ubuntu/.config/matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.
Running tests...
INFO:drake:Meshcat listening for connections at http://localhost:7010/
X Error of failed request: BadValue (integer parameter out of range for operation)
Major opcode of failed request: 130 (MIT-SHM)
Minor opcode of failed request: 3 (X_ShmPutImage)
Value in failed request: 0x5c0
Serial number of failed request: 46
Current serial number in output stream: 47
Matplotlib created a temporary config/cache directory at /tmp/matplotlib-iwm76bst because the default path (/home/ubuntu/.config/matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.
Came up in https://github.com/RobotLocomotion/drake/pull/20621 (https://drake-cdash.csail.mit.edu/test/1322485744), in drake/examples/hardware_sim/hardware_sim_cc; relaunched.
Came up again in #20620 (https://drake-cdash.csail.mit.edu/test/1325106110), again in drake/examples/hardware_sim/hardware_sim.cc.
We've also been seeing elevated rates of this flake in our TRI-internal CI. cc @ggould-tri FYI.
If I had to guess, I'd say that the proximate cause would be either: (1) Some Ubuntu library has been upgraded out from underneath us. (2) Some RenderEngineVtk change increased brittleness (e.g., #20385, #20492) -- but those are old enough that they don't seem to correlate with these elevated rates.
FYI:
Our internal project is seeing this error on a reliably different request code, which may indicate a different problem. I recommend looking at the X source on the other end of the RPC call to see under what situations calls to X_ShmPutImage could lead to BadShmSeg (start here: https://www.x.org/releases/X11R7.7/doc/xextproto/shm.html; this is shared-memory IPC, so maybe a server process crash?). If it is the absence of a relevant display, it might be the same problem.
For our internal X failure problem (which may be unrelated), I have not yet found a correlation with an Ubuntu library change. I suspect a race between a test that is crashing or otherwise forcing a restart of a display-supporting process and another test that is detecting and using that display; as such, it is nearly always our longest-running display-requiring tests that show the failure.
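For anyone chasing that race theory, two quick checks with standard X utilities (the display number :99 is an assumption; CI may use a different one):
xdpyinfo -display :99 | grep -i shm  # confirm the server still advertises MIT-SHM
pgrep -a Xvfb  # confirm the display server process survived the test run
If the second command comes up empty right after a failure, that would support the crash-and-restart hypothesis.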
I've given up trying to log all of these, but they turn up a lot in PRs. https://drake-jenkins.csail.mit.edu/job/linux-focal-gcc-bazel-experimental-everything-release/11393/consoleFull
What happened?
Some tests began failing in Jammy unprovisioned jobs on January 30th. For example, https://drake-jenkins.csail.mit.edu/view/Nightly%20Production/job/linux-jammy-unprovisioned-gcc-bazel-nightly-release/201/
The failing tests are:
//bindings/pydrake/geometry:py/scene_graph_test
//bindings/pydrake/visualization:py/model_visualizer_test
//bindings/pydrake/visualization:py/video_test
//examples/hardware_sim:py/hardware_sim_cc_test
//examples/hardware_sim:py/hardware_sim_py_test
//geometry/render_gltf_client:py/acceptance_test
//geometry/render_gltf_client:py/integration_test
//tutorials:py/rendering_multibody_plant_test
They all fail with the message:
Version
No response
What operating system are you using?
Ubuntu 22.04
What installation option are you using?
No response
Relevant log output
No response