ethz-asl / panoptic_mapping

A flexible submap-based framework towards spatio-temporally consistent volumetric mapping and scene understanding.
BSD 3-Clause "New" or "Revised" License

Intermittent panoptic_mapper crashing (segfaulting) #47

Closed abhileshborode closed 2 years ago

abhileshborode commented 2 years ago

Hello @Schmluk

First of all, you have a great package here. I was able to compile it on my Ubuntu 18.04 machine and successfully run the flat dataset demo files as well.

However, when I tried to run the same node on my custom dataset, I am facing intermittent segmentation faults from the panoptic_mapper side. These segfaults happen at variable timestamps within the same rosbag file over multiple runs. I have had 1 or 2 instances where the crash did not occur over the same rosbag play duration. This is what the stack trace showed:

I0218 00:26:42.016423 24927 single_tsdf_integrator.cpp:120] Allocate: 306ms, Integrate: 1208ms.
I0218 00:26:42.081941 24927 map_manager.cpp:101] Pruned active blocks in 65ms.
Pruned 131 blocks from submap 0 (Unknown) in 65ms.
W0218 00:26:42.095261 24927 single_tsdf_visualizer.cpp:63] No Map to visualize.
I0218 00:26:42.108364 24927 panoptic_mapper.cpp:315] Processed input data.
(tracking: 0 + integration: 1515 + management: 78 + visual: 0 = 1594, frame: 1627ms)
terminate called after throwing an instance of 'std::out_of_range'
  what():  _Map_base::at
*** Aborted at 1645136547 (unix time) try "date -d @1645136547" if you are using GNU date ***
PC: @     0x7f88cc039fb7 gsignal
*** SIGABRT (@0x4e7b) received by PID 20091 (TID 0x7f88697fa700) from PID 20091; stack trace: ***
    @     0x7f88cc03a040 (unknown)
    @     0x7f88cc039fb7 gsignal
    @     0x7f88cc03b921 abort
    @     0x7f88cc690957 (unknown)
    @     0x7f88cc696ae6 (unknown)
    @     0x7f88cc696b21 std::terminate()
    @     0x7f88cc696d54 __cxa_throw
    @     0x7f88cc692837 (unknown)
    @     0x7f88c9404ebe panoptic_mapping::SubmapCollection::getSubmapPtr()
    @     0x7f88c94823b7 panoptic_mapping::SingleTsdfIntegrator::processInput()
    @     0x7f88cc9d4aea panoptic_mapping::PanopticMapper::processInput()
    @     0x7f88cc9d57f4 panoptic_mapping::PanopticMapper::inputCallback()
    @     0x7f88cd7f9dd7 ros::TimerManager<>::TimerQueueCallback::call()
    @     0x7f88cd81e829 ros::CallbackQueue::callOneCB()
    @     0x7f88cd81fb15 ros::CallbackQueue::callOne()
    @     0x7f88cd876f44 ros::AsyncSpinnerImpl::threadFunc()
    @     0x7f88ca8c3bcd (unknown)
    @     0x7f88cb4aa6db start_thread
    @     0x7f88cc11c71f clone

This made me think that I probably have some wrong parameters set, due to which the mapper is trying to access a submap that has already been pruned. However, I did come across 1-2 instances in which the same node did not crash; here is the console output from such a run. (Note: No parameters were changed between runs.)

I0218 00:23:16.871966 21811 single_tsdf_integrator.cpp:120] Allocate: 6ms, Integrate: 65ms.
I0218 00:23:16.973881 21811 planning_visualizer.cpp:186] Map lookups based on 1 submaps took 0.3+/-0.2, max 14.3us.
I0218 00:23:17.072490 21811 panoptic_mapper.cpp:315] Processed input data.
(tracking: 0 + integration: 71 + management: 0 + visual: 102 = 174, frame: 456ms)
I0218 00:23:17.122951 21806 single_tsdf_integrator.cpp:120] Allocate: 2ms, Integrate: 44ms.
I0218 00:23:17.229359 21806 planning_visualizer.cpp:186] Map lookups based on 1 submaps took 0.3+/-0.2, max 17.8us.
I0218 00:23:17.317369 21806 panoptic_mapper.cpp:315] Processed input data.
(tracking: 0 + integration: 47 + management: 0 + visual: 107 = 154, frame: 244ms)
I0218 00:23:17.365311 21800 single_tsdf_integrator.cpp:120] Allocate: 2ms, Integrate: 42ms.
I0218 00:23:17.468127 21800 planning_visualizer.cpp:186] Map lookups based on 1 submaps took 0.3+/-0.2, max 24.1us.
I0218 00:23:17.552709 21800 panoptic_mapper.cpp:315] Processed input data.
(tracking: 0 + integration: 44 + management: 0 + visual: 103 = 148, frame: 235ms)
I0218 00:23:17.605499 21806 single_tsdf_integrator.cpp:120] Allocate: 2ms, Integrate: 40ms.
I0218 00:23:17.704886 21806 planning_visualizer.cpp:186] Map lookups based on 1 submaps took 0.3+/-0.6, max 77.6us.
I0218 00:23:17.791018 21806 panoptic_mapper.cpp:315] Processed input data.
(tracking: 0 + integration: 43 + management: 0 + visual: 100 = 143, frame: 238ms) 

Here is the output from the Rviz TSDF pointcloud: Screenshot from 2022-02-17 19-22-54 (1)

However even in this run I could not see any visual output from the voxblox mesh on RViz.

My voxel size is set to 0.05, which I don't think is the issue, since I was able to run the same data through the Kimera-Semantics mapping library (which also uses voxblox as its backend) successfully and got a decent output as well.

E.g. (output from the Kimera mapping package): with_person

Could you please give me a hint as to which parameter I should look into (or how I should try to debug this) to get a proper output with the generated mesh? Or do you think this issue is caused by something else entirely? I am attaching my custom config file along with its label CSV file as well. (Note: I build my code in Release mode.)

Any help is appreciated ... thanks!!

custom.csv

custom (1).txt

Schmluk commented 2 years ago

Hi @abhileshborode

Thank you for your interest! Yes, there seem to be some issues with the config. This is a bit of a blessing and a curse of modular frameworks: not all modules are compatible with each other.

For visualization I'd recommend the mesh rather than the TSDF pointcloud. Also worth knowing: panoptic_mapping expects the input poses in the optical frame, i.e. with z pointing forward, x right, and y down.
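As a concrete illustration (a minimal sketch; the body-frame convention of x forward, y left, z up is an assumption, following the common ROS convention), the fixed axis remapping from such a body frame into the optical frame described above looks like this:

```cpp
#include <array>

using Vec3 = std::array<double, 3>;

// Remap a vector from a body/camera frame (x forward, y left, z up --
// an assumed convention) into the optical frame (z forward, x right,
// y down). This is a pure axis permutation with sign flips:
//   optical_x = -body_y  (right is the opposite of left)
//   optical_y = -body_z  (down is the opposite of up)
//   optical_z =  body_x  (forward stays forward)
Vec3 bodyToOptical(const Vec3& v) {
  return {-v[1], -v[2], v[0]};
}
```

In a real setup this rotation is usually expressed as a static TF between the camera link and its `*_optical_frame` rather than applied by hand.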

I hope this helps, let us know how it goes!

abhileshborode commented 2 years ago

Hi @Schmluk ,

Thanks for the advice.
I tried your recommended steps but am still facing the intermittent node crashes. Initially I was getting a lot of Unable to lookup transform between 'world' and 'camera4_infra1_optical_frame' at time '1637684095.711249113' over '0.1s', skipping inputs. Exception: 'Lookup would require extrapolation into the past. Requested time 1637684095.711249113 but the earliest data is at time 1637684168.967064142, when looking up transform from frame [camera4_infra1_optical_frame] to frame [world]'. (My bag is playing with the --clock arg along with use_sim_time set to true.)

This happened even though my /tf was publishing at 40 Hz and rosrun tf tf_echo frame1 frame2 was outputting successfully. I bypassed the issue by setting max_input_queue_length: 10.

After that I tried the base configs for single_tsdf and multi-map, but still experienced the crashes sometimes. Even in the instances with non-crashing runs, I did not see any understandable/meaningful output in RViz, at least. The mesh was being published (I checked via rostopic hz), but almost nothing showed up in RViz. (I could only see the free-space TSDF PCL, which by itself I could not tell was correct or not.)

When you say input poses are expected in the optical frame, do you mean that the input poses have to be published in the optical frame? As of now I have my pose in the world frame, and I have a /tf which is publishing the transform from the world frame to the optical frame. (I assumed this package follows the same convention as voxblox.)

I think I may be using this package incorrectly at this point, which might be causing these crashes. I am inputting an infrared image ("mono8") instead of "bgr8", a depth image ("16UC1") instead of "32FC1", and a segmented image ("bgr8") instead of "32SC1". I made changes to input_synchronizer.cpp to adjust the datatypes of their subscribers accordingly. I did not go through how all the data structures are operated on in the backend ... do you think these could lead to those crashes? These are my input args:

    <!-- Input -->
    <remap from="color_image_in" to="/camera4/infra1/image_rect_raw"/>
    <remap from="depth_image_in" to="/camera4/depth_topic/image_rect_raw"/>
    <remap from="segmentation_image_in" to="/camera4/deeplab/mask_overlay"/>
    <!--<remap from="labels_in" to="$(arg namespace)/segmentation_labels"/> -->
    <!--<remap from="pose" to="/odometry/world" /> -->
  </node>

Also, I noticed the examples in the package expect the use_detectron flag to be true. Is it possible to run the package without using Detectron or an equivalent? (I have it set to false.) My custom dataset is 80% static classes, and the 20% dynamic classes usually only show up as a single instance, so I don't really need the Detectron2 version (which provides an instance segmentation layer). My CSV file [attached] just labels the panoptic labels accordingly. My semantic segmentation image is the output of a DeepLab model. Is it possible to integrate this package with pure semantically segmented images alone? If yes, could you please advise how I might go about it? I am thinking that because I have no instances as input, the id_tracker part of the stack might be causing problems?

Looking forward to your reply custom.csv

abhileshborode commented 2 years ago

Hi @Schmluk ,

Just wanted to follow up on this. It seems the accuracy issue and some of the segfaults were coming from the fact that my input depth image was 16UC1 instead of the expected 32FC1. I was able to generate somewhat understandable output after doing the appropriate conversion. Here are some screenshots (I have detectron set to false for both instances):

Generated Via single_tsdf config mesh-single-tsdf

non_seg_panoptic

I am trying to investigate why there are so many holes in the generated mesh. Could it be voxel size related? I tried increasing the voxel size in some instances, but I am still unable to generate a continuous mesh. Also, there seems to be only 1 submap mesh generated (as you can see in the image). Is there a parameter for specifically enabling multiple submaps in the single_tsdf config? I recall the package generating multiple submaps for the flat dataset example using the same config. (I thought I had the mapping in my CSV file incorrect, but my multi-TSDF config was generating multiple submaps.) Is this result expected while using the single_tsdf config?

Multi-TSDF config result: class_projective

Coming back to the segfaults, I am still intermittently getting this one: terminate called after throwing an instance of 'std::out_of_range' what(): _Map_base::at. It seems to arise from return submaps_[id_to_index_.at(id)].get(); in panoptic_mapping/src/map/submap_collection.cpp (SubmapCollection::getSubmapPtr). And surprisingly, when it happens, it only happens at the start of the ROS node. Could this be related to some specific parameter that is causing the node to access an unallocated block?
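For context, the throwing call can be reproduced with a plain std::unordered_map (a minimal sketch, not the actual SubmapCollection code): at() throws std::out_of_range for an ID that was never inserted or was already erased, whereas a find()-based lookup fails gracefully:

```cpp
#include <stdexcept>
#include <unordered_map>

// Hypothetical stand-in for the id -> index bookkeeping. at() throws
// std::out_of_range for a missing id, which matches the abort seen above.
int lookupOrThrow(const std::unordered_map<int, int>& id_to_index, int id) {
  return id_to_index.at(id);  // throws std::out_of_range if id is absent
}

// Alternative that lets the caller handle the miss (e.g. a submap that
// was already pruned) instead of terminating.
bool lookupSafe(const std::unordered_map<int, int>& id_to_index, int id,
                int* index_out) {
  auto it = id_to_index.find(id);
  if (it == id_to_index.end()) return false;
  *index_out = it->second;
  return true;
}
```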

Any help is appreciated !!

Schmluk commented 2 years ago

Hi @abhileshborode Apologies for not coming back to you earlier. Regarding your previous questions: yes, it is important that all your input is converted to the formats expected by Panoptic Mapping, otherwise all image lookups (which are templated on OpenCV Mats) will fail. And yes, setting detectron to false will just use the IDs in the input image.

I hope this helps, otherwise could you provide the complete settings you are using and the full failure log?

abhileshborode commented 2 years ago

Hi @Schmluk ,

Your suggestion worked. After a little tuning (setting truncation_distance: 0.4) I am able to generate continuous meshes: default_single_tsdf

I had a question regarding the generated planning output ... Is it possible to mask out certain classes from showing up on the planning slice layer entirely (i.e. they should not even be considered to begin with)? I am asking because many pixels are segmented (in the input segmentation image) as background (class name is background, e.g. sky). These pixels usually have very erroneous depth measurements, so they often show up on the maps and get detected as occupied space on the planning slice, e.g. the black dots in the image shown below: black

Is it possible to semantically mask out certain classes fed in from the input segmentation image, so that they don't get added to / disturb the planning layer or mapping layer (without changing the planning slice resolution, so as to maintain planning information for other small classes, e.g. poles)? Can we rewire the config in some way to get this behavior?

I also wanted to ask whether it's possible to modify this package so that it could take a pointcloud as input (similar to voxblox / Kimera). This could enable passing in pointclouds from multiple sources (e.g. 2 cameras for a wider FOV), assuming I am providing a /tf from that pointcloud's frame to the world frame at all times. I know this would need some development, but if possible could you point me to which file I would have to modify?

Regarding the segfaults, I think I have located the cause: it came from another erroneous image conversion of the input segmentation image. I am still seeing memory growth (non-crashing) in htop while running this package for longer durations though (on 8-10 min bags): the memory usage grows to 15 G, along with 300% CPU usage and more. I ran the same stack on the example flat dataset and got about 1 G of memory usage along with 35-40% CPU usage. So it looks like my input data is causing high CPU / memory usage (which grows over time). The only noticeable difference in my dataset is that it is input at 6 Hz and my images are 848x640, which by itself I believe should not cause such a drastic degradation in computational performance over time. What are your thoughts on this?

The results from your package have been great, by the way, when compared with Kimera / voxblox. Looking forward to your reply.

abhileshborode commented 2 years ago

Hi @Schmluk ,

Just wanted to follow up on this ... I increased the slice height for the planning layer, and it mostly filters out the background class labels from erroneous depth measurements appearing on the planning slice. But I am seeing a bit of inconsistent behavior with this.

Example 1: In this case the planning layer properly ignores the background class voxels (black / blue voxels) that show up on the map due to erroneous depth. Screenshot from 2022-02-24 18-47-37

Example 2: In this case it does not ignore the background voxels (black / blue voxels) that show up, and the planning slice marks that space as occupied.

Screenshot from 2022-02-28 00-22-10

Do you have any idea why this might be happening? I feel like I am misunderstanding how this planning layer functions.

Schmluk commented 2 years ago

Hi @abhileshborode

Great to hear that it worked for you! Regarding using classes for planning: the planning lookup happens here. There you can add conditions to e.g. skip submaps of specific labels. The planning layer just queries the submap collection, and nothing is ignored unless a submap's change state is absent or unknown.
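As an illustration of such a condition (a minimal sketch with a hypothetical, simplified Submap type, not the actual panoptic_mapping classes):

```cpp
#include <vector>

// Hypothetical, simplified submap carrying only a semantic class ID.
struct Submap {
  int class_id;
};

// Count how many submaps the planning lookup would actually consult when
// one class ID (e.g. a "background"/sky class) is masked out entirely,
// i.e. skipped before its TSDF is ever queried.
int countQueriedSubmaps(const std::vector<Submap>& submaps,
                        int ignored_class_id) {
  int queried = 0;
  for (const Submap& submap : submaps) {
    if (submap.class_id == ignored_class_id) {
      continue;  // masked class: never contributes to the planning slice
    }
    ++queried;
  }
  return queried;
}
```

In the real lookup the skip condition would sit next to the existing change-state check, so masked classes are filtered with no resolution change for the remaining submaps.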

Regarding the multiple cameras: it is in principle possible to use any pointcloud, but one would need to implement a projective integrator for general pointclouds. This is not yet implemented, but could be done similarly to the way voxblox does it. What is easy to add is support for multiple cameras. You could e.g. check in the input or transform frame which camera you are using, and run the standard pipeline with multiple camera objects.

There should be no memory leaks, but if you happen to find one I'd be happy to hear about it!

abhileshborode commented 2 years ago

Hi @Schmluk hope you are doing well. Just wanted to follow up on this thread.

I wrapped multiple camera objects inside the global.h class and did the corresponding camera object lookups based on the input frame, and it worked like a charm. The tracker does a really good job of tracking objects within the overlapping fields of view of multiple cameras as well.

As far as memory leaks go, at this point I don't think there are any. I was using images at 848x480 resolution ... which was just too much data to handle and led to extremely high CPU usage and memory growth. Applying a decimation factor of 3 prevents this issue.
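For context, decimation here means simple stride subsampling of the input images (a minimal sketch on a raw row-major buffer; where exactly this happens in the pipeline is setup-specific):

```cpp
#include <cstddef>
#include <vector>

// Subsample a row-major image by keeping every `factor`-th pixel in both
// dimensions, e.g. 848x480 with factor 3 -> 283x160 (output sizes are
// rounded up). No filtering is done here; a real pipeline might average
// or blur first to reduce aliasing.
std::vector<float> decimate(const std::vector<float>& img, size_t width,
                            size_t height, size_t factor) {
  const size_t out_w = (width + factor - 1) / factor;
  const size_t out_h = (height + factor - 1) / factor;
  std::vector<float> out(out_w * out_h);
  for (size_t y = 0; y < out_h; ++y) {
    for (size_t x = 0; x < out_w; ++x) {
      out[y * out_w + x] = img[(y * factor) * width + (x * factor)];
    }
  }
  return out;
}
```

A factor of 3 cuts the pixel count by roughly 9x, which matches the drop in integration load described above.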

I did, however, notice an increase in CPU usage, and throttling of the planning layer frequency, when I enable the generate mesh option. There also seems to be an improvement in object reconstruction performance, alongside the tracking of multiple objects/instances, as well.

Here is a snippet with the generate mesh option enabled (link)

Here is a snippet without the generate mesh option enabled (link)

If you look at the output with the mesh option enabled, it is much more stable, and the tracker is able to do the data association with its own per-class object instance submaps much more accurately. (The visual output is the surface pointcloud generated from the TSDF submaps.)

It seems to be coming from the submap.updateMesh() line of that function, while it iterates through all of the submap instances. I am trying to understand how this is different from the regular TSDF integrator update within the panoptic_mapper processInput callback, and why the output is so much more stable after the submap.updateMesh() step, especially since that extra submap update step seems to be computationally expensive.

I went through your published computational performance metrics. You mentioned that you got significantly faster results (almost 30 Hz ... wow!!) on RIO at 224x172 resolution with fewer object detections. What did you mean by fewer objects? Were you just performing tracking/integration updates for the object instances of the submaps? Also, did this performance validation include the mesh update for every input frame?

abhileshborode commented 2 years ago

Hi again @Schmluk ... If I understand correctly, the multi-mesh viz update queries all the submaps in its collection and performs full volumetric reconstruction at full resolution, which seems to be really expensive and affects the output FPS as well. This update, however, seems to be key in accurately tracking and reconstructing scenes with semantic consistency, and it has a much more gradual temporal decay (for dynamic stuff) as well. I was wondering if it is possible to limit this nice behavior to submaps of specific semantic class IDs only, so that specific object classes could be fully reconstructed with semantic-temporal consistency instead of doing it for the entire scene?

I was looking into leveraging the hierarchical structure within the submaps to do this for high-speed object tracking and reconstruction. I tried looking up the submaps by their class IDs here, and only performed a submap.updateMesh() step if it was a submap of semantic interest; however, it still seemed to update all of the submaps instead. Could you please point me to where/how I could make this change to reconstruct specific objects at their set resolution? Should this step be added after filtering the submaps based on the IoU associations, for faster performance? Or did I misunderstand, and something else is responsible for doing this object-level full reconstruction?

Schmluk commented 2 years ago

Hi @abhileshborode

Great to hear that this worked for you. Updating the meshes has nothing to do with the TSDF update; i.e. first the TSDF is updated during integration, then the mesh (and therewith the iso-surface points) is extracted using marching cubes. The iso-surface points are rendered during tracking, i.e. all submaps you want to track need to be meshed. The mesh updates are incremental per block, so they should not be expensive even if queried often. Regarding the visualization in RViz: this takes the mesh and converts it into an OGRE-compatible RViz message, which can be expensive, I guess. Filtering by IDs for visualization should definitely be possible. Since a submap is allocated for every detected object, fewer object detections lead to fewer submaps being updated in parallel.

abhileshborode commented 2 years ago

Hi @Schmluk ... thanks, that makes more sense to me now. I had a quick question regarding which module updates the mesh. I was trying to debug renderTrackingInfoApproximate inside the projective ID tracker, as I was getting 0 tracking matches with it, compared to renderTrackingInfoVertices, which produced a good number of tracking matches; renderTrackingInfoVertices, however, seems to ray-trace into each submap, which is a bit inefficient. From what I could see, the mesh update step seems to happen here. Doing a backtrace while using renderTrackingInfoVertices, the mesh update seems to come from manageSubmapActivity within the map_manager, but I could not locate where exactly that call comes from, and from what I searched, the updateMesh function is only called from within submap_visualizer. Could you point me to where exactly the mesh update step happens? I believe the mesh update is not happening in my case when I use renderTrackingInfoApproximate in the projective ID tracker, which would explain the 0 tracking matches compared to renderTrackingInfoVertices. Since the submaps are not tracked, they are also pruned by the map_manager. Do you think this reasoning makes sense? Or do you think something else is causing the submap tracking match failures specifically for renderTrackingInfoApproximate?

Schmluk commented 2 years ago

Yes, if the meshes are not updated, the tracking does not work. It might actually be that this currently only happens during visualization, but it should also happen in the main loop. Since it is incremental and caches the progress, you can just add it to the main loop. renderTrackingInfoVertices was just a test to directly look up the values, but it's less efficient and less accurate.