Jianghanxiao / RoboEXP

RoboEXP: Action-Conditioned Scene Graph via Interactive Exploration for Robotic Manipulation
https://jianghanxiao.github.io/roboexp-web/

Some questions about the paper and code #2


BJHYZJ commented 1 month ago

Thank you for your excellent work. After reading your paper and code, I have a few questions and would like to hear your thoughts and guidance on them.

  1. The ability to dynamically update the scene graph during a robot's interaction with the environment is a very interesting concept. The scene graph can clearly record how many steps are required to complete a task. Theoretically, this record of steps would be useful for subsequent tasks, but how should it be utilized? Moreover, if the scene needs to be reconstructed with each exploration, wouldn't that be a significant efficiency issue?

  2. Regarding relationships between objects: the paper and code seem to address only simple relationships (on, belong, inside). While this is sufficient for simple grasping tasks, relying solely on threshold-based methods to process point clouds seems difficult to generalize and unlikely to produce accurate results. Do you think there are better methods to solve this?

  3. The practical usage of action-conditioned scene graphs, i.e., how to actually apply them, is still an open issue. Relying on instance-level scene recording can sometimes make it difficult for the robot to locate the right place. For example, among the nodes, a cabinet and a table may be treated as completely different objects (possibly due to the grounding algorithm), but in a household scenario they might be one and the same object.

Jianghanxiao commented 1 month ago

Thanks for the kind words and thanks for asking. The questions are valuable and interesting; below are some of my thoughts and answers:

  1. Regarding the utilization of the constructed action-conditioned scene graph: in our supplement, we show three different use cases, including (a) simply judging object existence; (b) object retrieval planning by traversing the graph from the root to the object node, which yields the list of actions the agent needs to perform to fetch the object (a minimal sketch of this traversal is given after the reference below); and (c) more advanced usage, similar to the approach proposed by Gu et al. [1], integrating the ACSG into an LLM or LMM to enable the robot to respond to human preferences expressed in natural language (e.g., fetching a coke when the person is thirsty) or through visual cues (e.g., fetching a mug when the table is dirty).

I don't quite get what you mean by "the scene needs to be reconstructed with each exploration." The scene graph is dynamically updated from new observations throughout the exploration process, and after a one-time exploration it can absorb further changes, such as the human interventions in our experiments. Or are you asking about the reconstruction process for each observation? That process is actually very fast, since we adopt voxel-based reconstruction; further design around hierarchical reconstruction and scene registration could accelerate it even more. Feel free to ask again if I've misunderstood your question.

Ref: [1] Gu, Qiao, et al. "ConceptGraphs: Open-vocabulary 3D scene graphs for perception and planning." ICRA 2024.
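To make the retrieval-planning use case concrete, here is a minimal, hypothetical sketch in Python: it assumes a simple tree of nodes where a node can carry the action required to expose it (e.g., opening a drawer), and it collects those actions along the root-to-target path. The `Node` and `plan_retrieval` names are illustrative, not RoboEXP's actual API.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Node:
    name: str
    # Action needed to expose this node from its parent (e.g. "open(drawer)");
    # None when the node is directly observable/reachable.
    action: Optional[str] = None
    children: List["Node"] = field(default_factory=list)

def plan_retrieval(node: Node, target: str,
                   actions: Optional[List[str]] = None) -> Optional[List[str]]:
    """Depth-first traversal from the root; the actions accumulated along
    the path to the target node form the manipulation plan."""
    actions = (actions or []) + ([node.action] if node.action else [])
    if node.name == target:
        return actions + [f"pick({target})"]
    for child in node.children:
        plan = plan_retrieval(child, target, actions)
        if plan is not None:
            return plan
    return None

# Toy scene: a key inside a closed drawer of a cabinet in the room.
graph = Node("room", children=[
    Node("cabinet", children=[
        Node("drawer", action="open(drawer)",
             children=[Node("key")])])])
print(plan_retrieval(graph, "key"))  # ['open(drawer)', 'pick(key)']
```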

  2. This is a very good question. In our work, we balance the tradeoff between relation complexity and the generality of our system. In real life there can be far more complicated relations, some not even easily described by humans. Therefore, one important design choice in our ACSG is to keep the low-level geometry alongside the high-level semantic graph. Based on our investigation (we drew some ideas from the BEHAVIOR task definitions), these three spatial relations are the most common and already convey a lot of information. It would be very interesting to define more complicated relations, and with the low-level geometry available they can be easily added to our framework.

Some further thoughts from the process: we actually considered using an LMM to judge the relations among objects, which can definitely provide more complicated relation descriptions. However, how to leverage such relations in downstream tasks is still an open question; if you simply feed them back into an LLM/LMM, they will surely help somewhat, but a more systematic unification of all spatial relations would be far more interesting. My current view is that all high-level spatial relations are just summaries of the low-level spatial information. So another strategy for the future is to directly leverage coordinate-based relations (the sizes and poses of two objects) in downstream tasks; a minimal sketch of this coordinate-based view is given below. How to best make use of them is an interesting open question. Feel free to discuss if you have better ideas!
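As a concrete illustration of the coordinate-based view mentioned above, here is a minimal sketch that derives coarse "inside"/"on" relations from per-node geometry via axis-aligned bounding boxes. The function names, threshold values, and the z-up assumption are all placeholders, not the actual criteria used in RoboEXP.

```python
import numpy as np

def aabb(points: np.ndarray):
    """Axis-aligned bounding box (min, max corners) of an (N, 3) point cloud."""
    return points.min(axis=0), points.max(axis=0)

def is_inside(inner: np.ndarray, outer: np.ndarray, margin: float = 0.01) -> bool:
    """'inside': the inner box fits within the outer box, up to a margin (m)."""
    (imin, imax), (omin, omax) = aabb(inner), aabb(outer)
    return bool(np.all(imin >= omin - margin) and np.all(imax <= omax + margin))

def is_on(top: np.ndarray, base: np.ndarray, gap: float = 0.02) -> bool:
    """'on': the bottom of `top` touches the top of `base` and they overlap in xy."""
    (tmin, tmax), (bmin, bmax) = aabb(top), aabb(base)
    touching = abs(tmin[2] - bmax[2]) < gap        # assumes z is the up axis
    overlap_xy = bool(np.all(tmin[:2] < bmax[:2]) and np.all(tmax[:2] > bmin[:2]))
    return touching and overlap_xy
```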

  3. If I understand correctly, the question is about how to make sure the instances in the scene graph are reasonable; for example, a combined object such as a cabinet-table may sometimes be segmented problematically. That's a good point. We cannot guarantee the perception system always gives the best understanding of objects (even humans sometimes cannot determine the clear category or boundary of an object). But existing perception systems can give us reasonable and potentially consistent object information. Whether it is a cabinet or a table, and whether it is counted as one object or two, our ACSG stores the low-level geometry for each node. For the cabinet-table, the cabinet may be judged as being on the table; even if that understanding is not entirely accurate, nothing blocks us from constructing a "correct-enough" ACSG. If there is a key in the cabinet, then no matter what the graph looks like, as long as it reveals which drawer the key is in, it is okay. So one answer is to ask whether "accuracy" (perhaps there is no single accurate answer at all) matters more than a "consistent understanding."

Another answer concerns the fact that perception can never be perfect; that's also a reason we leverage an LMM in the decision module, which can help correct potential errors raised by the perception module. In addition, we have added some mobile-robot experiments, which we will release in the next version of our project. Feel free to ask again if I haven't understood your question accurately.

Jianghanxiao commented 2 weeks ago

Closing for now; feel free to reopen if you have further questions.