RobotLocomotion / drake

Model-based design and verification for robotics.
https://drake.mit.edu

vtk: RenderEngineVtk can cause Xorg server to crash on certain machines / GPUs / drivers #14022

Closed EricCousineau-TRI closed 1 year ago

EricCousineau-TRI commented 4 years ago

Filed on Anzu initially (Anzu 5388), but was able to reproduce this in pure Drake unittests.

Background

I had a test that would instantiate different Diagram, Simulator pairs, and in the diagram was a SceneGraph with a registered RenderEngineVtk. On the first instantiation, rendering and all that would be fine (I could render as many times as I wanted). However, on the second instantiation, I would get a BadValue error on (GLX, X_GLXCreateContext), and it would crash Xorg.

This would only happen on CI machines. On my laptop and desktop, I did not receive this error.

CI Configuration

Workaround

The workaround is to make a "primer" RenderEngineVtk instance, render with it once, and keep it alive for the duration of the program. Most likely, this works because VTK uses a "scoped singleton" setup (e.g. on first render, allocate the GLX context; on destruction of the last renderer, deallocate it; then reallocate the next time someone wants something).
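To illustrate the suspected mechanism, here is a minimal sketch of a "scoped singleton" in plain C++. It is not VTK's actual implementation: `FakeContext`, `AcquireContext`, `FakeRenderer`, and `CountContextCreations` are hypothetical names, and the "context" is just a counter standing in for a GLX context. It only shows why a long-lived primer would prevent the deallocate / reallocate cycle, assuming VTK behaves roughly this way.

```cpp
#include <cassert>
#include <memory>

// Hypothetical counters so the allocate / deallocate cycle is observable.
int g_contexts_created = 0;
int g_contexts_alive = 0;

// Hypothetical stand-in for a GLX context.
struct FakeContext {
  FakeContext() { ++g_contexts_created; ++g_contexts_alive; }
  ~FakeContext() { --g_contexts_alive; }
};

// "Scoped singleton": one shared context exists while any user holds it.
// It is destroyed when the last user goes away and recreated on next demand.
std::shared_ptr<FakeContext> AcquireContext() {
  static std::weak_ptr<FakeContext> cache;
  std::shared_ptr<FakeContext> context = cache.lock();
  if (!context) {
    context = std::make_shared<FakeContext>();
    cache = context;
  }
  return context;
}

// Hypothetical renderer that holds the shared context for its lifetime.
class FakeRenderer {
 public:
  FakeRenderer() : context_(AcquireContext()) { assert(context_ != nullptr); }

 private:
  std::shared_ptr<FakeContext> context_;
};

// Creates and destroys `count` renderers one after another, optionally
// keeping a primer alive throughout. Returns how many contexts were created.
int CountContextCreations(bool use_primer, int count) {
  const int before = g_contexts_created;
  std::unique_ptr<FakeRenderer> primer;
  if (use_primer) primer = std::make_unique<FakeRenderer>();
  for (int i = 0; i < count; ++i) {
    FakeRenderer renderer;  // Destroyed at the end of each iteration.
  }
  return g_contexts_created - before;
}
```

In this sketch, `CountContextCreations(false, 2)` creates the context twice (once per renderer), while `CountContextCreations(true, 2)` creates it only once, because the primer keeps the shared context alive. If the real crash is triggered by the second context creation, that would be consistent with the primer avoiding it.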

Min Repro

With the following code on 1392df106 (statically or dynamically linked), --use_primer=false can reproduce the error; --use_primer=true can work around it.

Also on this commit: https://github.com/EricCousineau-TRI/drake/tree/65e41868549ec4c437bed575a7f255a76b0a62d6/tmp (branch: issue-anzu5388-wip)

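#include <cmath>   // M_PI
#include <memory>  // std::unique_ptr

#include <gflags/gflags.h>

#include "drake/common/text_logging.h"
#include "drake/common/unused.h"
#include "drake/geometry/render/camera_properties.h"
#include "drake/geometry/render/render_engine_vtk_factory.h"
#include "drake/systems/sensors/image.h"  // ImageRgba8U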

DEFINE_bool(use_primer, true, "");
DEFINE_int32(count, 2, "");
DEFINE_int32(render_count, 3, "");

using namespace drake;
using namespace drake::geometry;
using namespace drake::geometry::render;
using namespace drake::systems::sensors;

void EmptyRender(int render_count) {
  auto renderer = MakeRenderEngineVtk(RenderEngineVtkParams());
  CameraProperties camera_prop(
      640, 480, M_PI / 4, "doesn't matter");
  ImageRgba8U image(camera_prop.width, camera_prop.height);

  for (int r = 0; r < render_count; ++r) {
    drake::log()->info("  Render {}", r);
    renderer->RenderColorImage(camera_prop, false, &image);
  }
}

int main(int argc, char* argv[]) {
  gflags::ParseCommandLineFlags(&argc, &argv, true);

  std::unique_ptr<RenderEngine> renderer;
  if (FLAGS_use_primer) {
    drake::log()->info("Priming...");
    renderer = MakeRenderEngineVtk(RenderEngineVtkParams());
    CameraProperties camera_prop(4, 3, M_PI / 4, "");
    ImageRgba8U image(camera_prop.width, camera_prop.height);
    renderer->RenderColorImage(camera_prop, false, &image);
  }

  for (int i = 0; i < FLAGS_count; ++i) {
    drake::log()->info("i: {}", i);
    EmptyRender(FLAGS_render_count);
  }
  drake::log()->info("[ Done ]");
  return 0;
}

Thanks to @jwnimmer-tri and @SeanCurtis-TRI for helping w/ debugging (and rubber ducking!)

Setting priority to low since we have a workaround.

Per convo w/ Jeremy, hope is that upgrading to latest VTK magically fixes this...

EricCousineau-TRI commented 4 years ago

Output from runs:

$ bazel run //tmp:repro_min_cc
...
[2020-09-05 14:59:58.570] [console] [info] Priming...
[2020-09-05 14:59:58.672] [console] [info] i: 0
[2020-09-05 14:59:58.673] [console] [info]   Render 0
[2020-09-05 14:59:58.691] [console] [info]   Render 1
[2020-09-05 14:59:58.692] [console] [info]   Render 2
[2020-09-05 14:59:58.696] [console] [info] i: 1
[2020-09-05 14:59:58.697] [console] [info]   Render 0
[2020-09-05 14:59:58.714] [console] [info]   Render 1
[2020-09-05 14:59:58.715] [console] [info]   Render 2
[2020-09-05 14:59:58.718] [console] [info] [ Done ]
ubuntu@ip-10-100-3-181:~/workspace/drake$ bazel run //tmp:repro_min_cc -- --use_primer=false
...
[2020-09-05 15:00:16.977] [console] [info] i: 0
[2020-09-05 15:00:16.978] [console] [info]   Render 0
[2020-09-05 15:00:17.083] [console] [info]   Render 1
[2020-09-05 15:00:17.085] [console] [info]   Render 2
[2020-09-05 15:00:17.090] [console] [info] i: 1
[2020-09-05 15:00:17.090] [console] [info]   Render 0
X Error of failed request:  BadValue (integer parameter out of range for operation)
  Major opcode of failed request:  154 (GLX)
  Minor opcode of failed request:  3 (X_GLXCreateContext)
  Value in failed request:  0x0
  Serial number of failed request:  61
  Current serial number in output stream:  62

$ systemctl status xorg.service
● xorg.service - X Server
   Loaded: loaded (/lib/systemd/system/xorg.service; enabled; vendor preset: enabled)
   Active: failed (Result: core-dump) since Thu 2020-09-03 01:40:16 UTC; 1min 12s ago
  Process: 105854 ExecStart=/usr/bin/X :0 (code=dumped, signal=ABRT)
...
$ cat /var/log/Xorg.0.log
...
[ 42183.080] (EE) Backtrace:
[ 42183.080] (EE) 0: /usr/lib/xorg/Xorg (xorg_backtrace+0x4d) [0x55c4afb49a9d]
[ 42183.080] (EE) 1: /usr/lib/xorg/Xorg (0x55c4af991000+0x1bc839) [0x55c4afb4d839]
[ 42183.080] (EE) 2: /lib/x86_64-linux-gnu/libpthread.so.0 (0x7f67d65cc000+0x128a0) [0x7f67d65de8a0]
[ 42183.080] (EE) 3: /usr/lib/x86_64-linux-gnu/nvidia/xorg/nvidia_drv.so (0x7f67d2f6d000+0x4790ec) [0x7f67d33e60ec]
[ 42183.080] (EE) 
[ 42183.080] (EE) Segmentation fault at address 0xac
...

jamiesnape commented 4 years ago

Have you tried a P2 or P3 instance at all?

jwnimmer-tri commented 3 years ago

Not to my knowledge, no. We do use P3 for deep learning stuff, but for basic image rendering we'd prefer if the G3 family works -- much more cost-effective.

jwnimmer-tri commented 3 years ago

> Per convo w/ Jeremy, hope is that upgrading to latest VTK magically fixes this...

Retrieving a bit of history -- it looks like when this was filed in September 2020, we were using VTK 8.2 so the "upgrade to latest" would be referring to VTK 9 (#13253), which has not yet started.

jwnimmer-tri commented 1 year ago

Closing for lack of current reproducer.