gazebosim / gazebo-classic

Gazebo classic. For the latest version, see https://github.com/gazebosim/gz-sim
http://classic.gazebosim.org/
Other
1.21k stars 484 forks source link

Race condition leading to non responsive server when writing heightmap cache files #3299

Open scpeters opened 1 year ago

scpeters commented 1 year ago

Environment

Description

Steps to reproduce

  1. rm -rf ~/.gazebo/paging
  2. wget https://github.com/gazebosim/gazebo-classic/raw/gazebo11/worlds/lensflare_wideangle_cam.world
  3. gazebo --verbose lensflare_wideangle_cam.world

Output

Failed run

Screen Shot 2023-02-10 at 12 04 37 PM
$ rm -rf ~/.gazebo/paging/
$ gazebo --verbose worlds/lensflare_wideangle_cam.world -u
Gazebo multi-robot simulator, version 11.12.0
Copyright (C) 2012 Open Source Robotics Foundation.
Released under the Apache 2 License.
http://gazebosim.org

[Msg] Waiting for master.
Gazebo multi-robot simulator, version 11.12.0
Copyright (C) 2012 Open Source Robotics Foundation.
Released under the Apache 2 License.
http://gazebosim.org

[Msg] Waiting for master.
[Msg] Connected to gazebo master @ http://127.0.0.1:11345
[Msg] Publicized address: 172.27.1.28
[Msg] Loading world file [/home/scpeters/clone/gazebo/worlds/lensflare_wideangle_cam.world]
Node::Advertise(): Error advertising topic [default/wide_angle_cameras_near_heightmap/link/camera_sensor_lensflare/image]. Did you forget to start the discovery service?
Node::Advertise(): Error advertising topic [default/wide_angle_cameras_near_heightmap/link/camera_sensor_without_lensflare/image]. Did you forget to start the discovery service?
Node::Advertise(): Error advertising topic [default/wide_angle_cameras_above_heightmap/link/camera_sensor_lensflare/image]. Did you forget to start the discovery service?
Node::Advertise(): Error advertising topic [default/wide_angle_cameras_above_heightmap/link/camera_sensor_without_lensflare/image]. Did you forget to start the discovery service?
Node::Advertise(): Error advertising topic [default/wide_angle_cameras_occluded_low/link/camera_sensor_lensflare/image]. Did you forget to start the discovery service?
Node::Advertise(): Error advertising topic [default/wide_angle_cameras_occluded_low/link/camera_sensor_without_lensflare/image]. Did you forget to start the discovery service?
Node::Advertise(): Error advertising topic [default/wide_angle_cameras_occluded_high/link/camera_sensor_lensflare/image]. Did you forget to start the discovery service?
Node::Advertise(): Error advertising topic [default/wide_angle_cameras_occluded_high/link/camera_sensor_without_lensflare/image]. Did you forget to start the discovery service?
Node::Advertise(): Error advertising topic [default/wide_angle_cameras_occluded_higher/link/camera_sensor_lensflare/image]. Did you forget to start the discovery service?
Node::Advertise(): Error advertising topic [default/wide_angle_cameras_occluded_higher/link/camera_sensor_without_lensflare/image]. Did you forget to start the discovery service?
[Msg] Connected to gazebo master @ http://127.0.0.1:11345
[Msg] Publicized address: 172.27.1.28
[Msg] Loading heightmap: heightmap_bowl
[Msg] Heightmap loaded. Process took: 0.214018 seconds
[Msg] Saving heightmap cache data to /home/scpeters/.gazebo/paging/heightmap_bowl
[Msg] Loading heightmap: heightmap_bowl
[Msg] Loading heightmap cache data: /home/scpeters/.gazebo/paging/heightmap_bowl/gazebo_terrain_LOD0_00000000.dat
[Msg] Heightmap cache data saved. Process took: 0.285364 seconds.

Subsequent successful run

$ gazebo --verbose worlds/lensflare_wideangle_cam.world -u
Gazebo multi-robot simulator, version 11.12.0
Copyright (C) 2012 Open Source Robotics Foundation.
Released under the Apache 2 License.
http://gazebosim.org

[Msg] Waiting for master.
Gazebo multi-robot simulator, version 11.12.0
Copyright (C) 2012 Open Source Robotics Foundation.
Released under the Apache 2 License.
http://gazebosim.org

[Msg] Waiting for master.
[Msg] Connected to gazebo master @ http://127.0.0.1:11345
[Msg] Publicized address: 172.27.1.28
[Msg] Loading world file [/home/scpeters/clone/gazebo/worlds/lensflare_wideangle_cam.world]
Node::Advertise(): Error advertising topic [default/wide_angle_cameras_near_heightmap/link/camera_sensor_lensflare/image]. Did you forget to start the discovery service?
Node::Advertise(): Error advertising topic [default/wide_angle_cameras_near_heightmap/link/camera_sensor_without_lensflare/image]. Did you forget to start the discovery service?
Node::Advertise(): Error advertising topic [default/wide_angle_cameras_above_heightmap/link/camera_sensor_lensflare/image]. Did you forget to start the discovery service?
Node::Advertise(): Error advertising topic [default/wide_angle_cameras_above_heightmap/link/camera_sensor_without_lensflare/image]. Did you forget to start the discovery service?
Node::Advertise(): Error advertising topic [default/wide_angle_cameras_occluded_low/link/camera_sensor_lensflare/image]. Did you forget to start the discovery service?
Node::Advertise(): Error advertising topic [default/wide_angle_cameras_occluded_low/link/camera_sensor_without_lensflare/image]. Did you forget to start the discovery service?
Node::Advertise(): Error advertising topic [default/wide_angle_cameras_occluded_high/link/camera_sensor_lensflare/image]. Did you forget to start the discovery service?
Node::Advertise(): Error advertising topic [default/wide_angle_cameras_occluded_high/link/camera_sensor_without_lensflare/image]. Did you forget to start the discovery service?
Node::Advertise(): Error advertising topic [default/wide_angle_cameras_occluded_higher/link/camera_sensor_lensflare/image]. Did you forget to start the discovery service?
Node::Advertise(): Error advertising topic [default/wide_angle_cameras_occluded_higher/link/camera_sensor_without_lensflare/image]. Did you forget to start the discovery service?
[Msg] Connected to gazebo master @ http://127.0.0.1:11345
[Msg] Publicized address: 172.27.1.28
[Msg] Loading heightmap: heightmap_bowl
[Msg] Loading heightmap cache data: /home/scpeters/.gazebo/paging/heightmap_bowl/gazebo_terrain_LOD0_00000000.dat
[Msg] Heightmap loaded. Process took: 0.211706 seconds
[Msg] Loading heightmap: heightmap_bowl
[Msg] Loading heightmap cache data: /home/scpeters/.gazebo/paging/heightmap_bowl/gazebo_terrain_LOD0_00000000.dat
[Msg] Heightmap loaded. Process took: 0.211959 seconds

The Node::Advertise() errors occur in both cases, so I think they can be ignored.

scpeters commented 1 year ago

I attached gdb to the gzclient processes to get backtraces of the failing case to see where it is stuck. I also introspected gzserver a bit and it didn't seem to be stuck. It appeared to keep taking physics steps, but there was no output from gz stats. I don't have anything conclusive on what gzserver is doing, but it didn't seem stuck like gzclient is

gdb gzclient

(gdb) bt
#0  futex_abstimed_wait_cancelable (private=<optimized out>, abstime=0x7ffdc825f460, clockid=<optimized out>, expected=0, futex_word=0x7ffdc825f570)
    at ../sysdeps/nptl/futex-internal.h:320
#1  __pthread_cond_wait_common (abstime=0x7ffdc825f460, clockid=<optimized out>, mutex=0x7ffdc825f520, cond=0x7ffdc825f548) at pthread_cond_wait.c:520
#2  __pthread_cond_timedwait (cond=0x7ffdc825f548, mutex=0x7ffdc825f520, abstime=0x7ffdc825f460) at pthread_cond_wait.c:665
#3  0x00007fe6eb242f48 in  () at /lib/x86_64-linux-gnu/libOgreTerrain.so.1.9.0
#4  0x00007fe6eb231668 in Ogre::Terrain::waitForDerivedProcesses() () at /lib/x86_64-linux-gnu/libOgreTerrain.so.1.9.0
#5  0x00007fe6eb235a5e in Ogre::Terrain::~Terrain() () at /lib/x86_64-linux-gnu/libOgreTerrain.so.1.9.0
#6  0x00007fe6eb23615d in Ogre::Terrain::~Terrain() () at /lib/x86_64-linux-gnu/libOgreTerrain.so.1.9.0
#7  0x00007fe6eb247857 in Ogre::TerrainGroup::TerrainSlot::freeInstance() () at /lib/x86_64-linux-gnu/libOgreTerrain.so.1.9.0
#8  0x00007fe6eb2484bb in Ogre::TerrainGroup::handleResponse(Ogre::WorkQueue::Response const*, Ogre::WorkQueue const*) ()
    at /lib/x86_64-linux-gnu/libOgreTerrain.so.1.9.0
#9  0x00007fe6f1f53b70 in Ogre::DefaultWorkQueueBase::processResponse(Ogre::WorkQueue::Response*) () at /lib/x86_64-linux-gnu/libOgreMain.so.1.9.0
#10 0x00007fe6f1f5739b in Ogre::DefaultWorkQueueBase::processRequestResponse(Ogre::WorkQueue::Request*, bool) ()
    at /lib/x86_64-linux-gnu/libOgreMain.so.1.9.0
#11 0x00007fe6f1f57942 in Ogre::DefaultWorkQueueBase::addRequest(unsigned short, unsigned short, Ogre::Any const&, unsigned char, bool, bool) ()
    at /lib/x86_64-linux-gnu/libOgreMain.so.1.9.0
#12 0x00007fe6eb2461c1 in Ogre::TerrainGroup::loadTerrainImpl(Ogre::TerrainGroup::TerrainSlot*, bool) () at /lib/x86_64-linux-gnu/libOgreTerrain.so.1.9.0
#13 0x00007fe6eb24624f in Ogre::TerrainGroup::loadAllTerrains(bool) () at /lib/x86_64-linux-gnu/libOgreTerrain.so.1.9.0
#14 0x00007fe6f155133d in gazebo::rendering::Heightmap::Load() () at /lib/x86_64-linux-gnu/libgazebo_rendering.so.11
#15 0x00007fe6f1552364 in gazebo::rendering::Heightmap::LoadFromMsg(boost::shared_ptr<gazebo::msgs::Visual const> const&) ()
    at /lib/x86_64-linux-gnu/libgazebo_rendering.so.11
#16 0x00007fe6f15bc0a0 in gazebo::rendering::Scene::ProcessVisualMsg(boost::shared_ptr<gazebo::msgs::Visual const> const&, gazebo::rendering::Visual::VisualType) () at /lib/x86_64-linux-gnu/libgazebo_rendering.so.11
#17 0x00007fe6f15bd0e8 in gazebo::rendering::Scene::PreRender() () at /lib/x86_64-linux-gnu/libgazebo_rendering.so.11
#18 0x00007fe6f293dc02 in  () at /lib/x86_64-linux-gnu/libgazebo_gui.so.11
#19 0x00007fe6f293273c in gazebo::gui::GLWidget::paintEvent(QPaintEvent*) () at /lib/x86_64-linux-gnu/libgazebo_gui.so.11
#20 0x00007fe6f0f1e2b6 in QWidget::event(QEvent*) () at /lib/x86_64-linux-gnu/libQt5Widgets.so.5
#21 0x00007fe6f0edba66 in QApplicationPrivate::notify_helper(QObject*, QEvent*) () at /lib/x86_64-linux-gnu/libQt5Widgets.so.5
#22 0x00007fe6f0ee50f0 in QApplication::notify(QObject*, QEvent*) () at /lib/x86_64-linux-gnu/libQt5Widgets.so.5
#23 0x00007fe6f233c80a in QCoreApplication::notifyInternal2(QObject*, QEvent*) () at /lib/x86_64-linux-gnu/libQt5Core.so.5
#24 0x00007fe6f0f16f4a in QWidgetPrivate::sendPaintEvent(QRegion const&) () at /lib/x86_64-linux-gnu/libQt5Widgets.so.5
#25 0x00007fe6f0f17799 in QWidgetPrivate::drawWidget(QPaintDevice*, QRegion const&, QPoint const&, int, QPainter*, QWidgetBackingStore*) ()
    at /lib/x86_64-linux-gnu/libQt5Widgets.so.5
#26 0x00007fe6f0eec09d in QWidgetPrivate::repaint_sys(QRegion const&) () at /lib/x86_64-linux-gnu/libQt5Widgets.so.5
#27 0x00007fe6f0f06318 in QWidgetPrivate::syncBackingStore() () at /lib/x86_64-linux-gnu/libQt5Widgets.so.5
#28 0x00007fe6f0f1ecac in QWidget::event(QEvent*) () at /lib/x86_64-linux-gnu/libQt5Widgets.so.5
#29 0x00007fe6f0edba66 in QApplicationPrivate::notify_helper(QObject*, QEvent*) () at /lib/x86_64-linux-gnu/libQt5Widgets.so.5
#30 0x00007fe6f0ee50f0 in QApplication::notify(QObject*, QEvent*) () at /lib/x86_64-linux-gnu/libQt5Widgets.so.5
#31 0x00007fe6f233c80a in QCoreApplication::notifyInternal2(QObject*, QEvent*) () at /lib/x86_64-linux-gnu/libQt5Core.so.5
#32 0x00007fe6f233f488 in QCoreApplicationPrivate::sendPostedEvents(QObject*, int, QThreadData*) () at /lib/x86_64-linux-gnu/libQt5Core.so.5
#33 0x00007fe6f2394e37 in  () at /lib/x86_64-linux-gnu/libQt5Core.so.5
#34 0x00007fe6ed34417d in g_main_context_dispatch () at /lib/x86_64-linux-gnu/libglib-2.0.so.0
#35 0x00007fe6ed344400 in  () at /lib/x86_64-linux-gnu/libglib-2.0.so.0
#36 0x00007fe6ed3444a3 in g_main_context_iteration () at /lib/x86_64-linux-gnu/libglib-2.0.so.0
--Type <RET> for more, q to quit, c to continue without paging--c
#37 0x00007fe6f2394435 in QEventDispatcherGlib::processEvents(QFlags<QEventLoop::ProcessEventsFlag>) () at /lib/x86_64-linux-gnu/libQt5Core.so.5
#38 0x00007fe6f233b3ab in QEventLoop::exec(QFlags<QEventLoop::ProcessEventsFlag>) () at /lib/x86_64-linux-gnu/libQt5Core.so.5
#39 0x00007fe6f2343116 in QCoreApplication::exec() () at /lib/x86_64-linux-gnu/libQt5Core.so.5
#40 0x00007fe6f2946a4b in gazebo::gui::run(int, char**) () at /lib/x86_64-linux-gnu/libgazebo_gui.so.11
#41 0x0000564164182c89 in  ()
#42 0x00007fe6f1755083 in __libc_start_main (main=0x564164182c40, argc=4, argv=0x7ffdc82617b8, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7ffdc82617a8) at ../csu/libc-start.c:308
#43 0x0000564164182d3e in  ()
(gdb)
scpeters commented 1 year ago

looking a bit more closely at the console output from the failing case:

[Msg] Loading heightmap: heightmap_bowl
[Msg] Heightmap loaded. Process took: 0.214018 seconds
[Msg] Saving heightmap cache data to /home/scpeters/.gazebo/paging/heightmap_bowl
[Msg] Loading heightmap: heightmap_bowl
[Msg] Loading heightmap cache data: /home/scpeters/.gazebo/paging/heightmap_bowl/gazebo_terrain_LOD0_00000000.dat
[Msg] Heightmap cache data saved. Process took: 0.285364 seconds.

I believe these console messages are coming from different processes (gzserver and gzclient), so I'll annotate them with 1 and 2 depending on which process I think they are from:

1 [Msg] Loading heightmap: heightmap_bowl
1 [Msg] Heightmap loaded. Process took: 0.214018 seconds
1 [Msg] Saving heightmap cache data to /home/scpeters/.gazebo/paging/heightmap_bowl
2 [Msg] Loading heightmap: heightmap_bowl
2 [Msg] Loading heightmap cache data: /home/scpeters/.gazebo/paging/heightmap_bowl/gazebo_terrain_LOD0_00000000.dat
1 [Msg] Heightmap cache data saved. Process took: 0.285364 seconds.

I believe the problem is that process that loads the heightmap first starts to write it but doesn't complete the write before the second process starts reading from that folder. It doesn't look like they are both trying to write to the folder, but that one process needs to wait before reading

scpeters commented 1 year ago

we could use flock on Linux to lock the files, but I'm not sure if there's a cross-platform API implemented somewhere else

scpeters commented 1 year ago

we could use flock on Linux to lock the files, but I'm not sure if there's a cross-platform API implemented somewhere else

we can also use a boost API, since gazebo-classic already depends on boost

scpeters commented 1 year ago

we can also use a boost API, since gazebo-classic already depends on boost

but we don't depend on the interprocess component, so that would be a significant change. Still easiest to make a Linux-only fix with flock