dynamic objects handling + stasis assumption + LMMs interest?

Hi @nathanhhughes,

Thanks for the great work with hydra!

I have questions arising from the recent interest our robotics laboratory is having in the toolkit:

in #1 and #31 there are mentions of dynamic objects handling, deriving from kimera-semantics: 1) how are they handled in hydra (e.g., are they filtered out from the reconstructed 3D scene graph or an agent pose graph is being created)? 2) is it stopping at human beings or are there multiple classes of dynamic objects being detected?
in one of the papers (cannot remember if hydra's or hydra-multi's) you mention relaxing the stasis assumption as future work, meaning relaxing the assumptions that objects and the disposition of the scene remains static during the registration: is the assessment still correct or have there been on-going efforts in hydra about this limitation?
what's your view of large multi-modal models (LMMs) and do you see a place for them in hydra in the future allowing a deeper characterization of objects (e.g., material, color, weight, affordances) in the scene, as recent works like conceptgraphs have tried to do, albeit at a much smaller 3D scene graph scale?

The latter would be crucial IMO for enabling efficient and complex high-level planning of heterogeneous robots in large-scale indoor environments following the foundation of works like sayplan or similar.

I am curious about your view on this because at our laboratory, drawing inspiration from hydra, we have been working on extending and improving a conceptgraphs-inspired 3D scene graph generation pipeline, combining some of your geometric approaches for tree decomposition and iterative scene graph construction, registration and optimization, with more and more efficient LMMs (a fine-tuned cogVLM ONNX export) for generating more informative object descriptors and estimating room descriptors. As a workaround on the first two bullet points we are filtering out objects the LMM deemes as dynamic from the global scene graph and are keeping a somewhat informative room descriptor aggregating its objects descriptors, enabling efficient high-level cross-room planning with LLMs using the partial knowledge of what objects characterize any given room and how, without bothering knowing in advance the specific object poses in the graph, especially for those that might be easily moved around. The same process being repeated at a higher-level in the hierarchy, aggregating room descriptors into floor plan descriptors. A small, local and densely accurate scene graph is being created in real-time for any given room once a robot enters it or requires access to it.

Of course, we are at a much more embryonic stage, and not at real-time runtime (yet not too far away either), but I was interested in your view on these (arguably challenging) topics, and on how they're being handled in hydra or if they are currently being addressed in the research roadmap behind hydra.

Thanks in advance for taking the time to answer this.

Hi, thanks for your interest in our work and the great questions!

1) how are they handled in hydra (e.g., are they filtered out from the reconstructed 3D scene graph or an agent pose graph is being created)? 2) is it stopping at human beings or are there multiple classes of dynamic objects being detected?

We simply do not integrate any dynamic classes of semantics into the underlying metric semantic reconstruction that is used to build the scene graph. We specify which of our closed set concepts are dynamic beforehand (humans or otherwise depending on the 2D semantic segmentation source). We don't track dynamic objects as agents for Hydra, that was the focus of the recently published follow-on work Khronos

is the assessment still correct or have there been on-going efforts in hydra about this limitation?

Great question! This was also the focus of the follow-on work by Lukas and co. in Khronos (which is very much worth the read!)

what's your view of large multi-modal models (LMMs) and do you see a place for them in hydra in the future allowing a deeper characterization of objects (e.g., material, color, weight, affordances) in the scene, as recent works like conceptgraphs have tried to do, albeit at a much smaller 3D scene graph scale?

We've explored this topic a little bit with Clio and Language-Enabled Spatial Ontologies, but there's obviously still a lot of room and work to be done to understand how best to use VLMs and LLMs to semantically enrich scene graphs! The approach you've outlined sounds interesting, please feel free to reach out if you have any updates you'd like to share!

MIT-SPARK / Hydra

dynamic objects handling + stasis assumption + LMMs interest? #46