Closed: snitchyang closed this issue 2 months ago
The MLP and view encoder map and navigate simultaneously, without any fine-tuning. They were pre-trained only on the HM3D dataset and were never trained on any navigation scenes from the Matterport3D dataset. A VLN agent must be able to perform navigation tasks in unseen environments, so fine-tuning in a specific scene before task execution is not allowed.
A core feature of the HNR model is that it does not require fine-tuning for specific scenes. In a new navigation scene, it only needs to collect the agent's observations for inference. Real-time performance and generalizability are critical requirements for similar works on embodied tasks, e.g. https://github.com/MrZihan/Sim2Real-VLN-3DFF and https://geff-b1.github.io/ .
Thank you! Does this mean it will be hard for the agent to navigate to an unobserved position (e.g., something in another room), because volume rendering can only generate future representations in areas that have already been observed?
For example, if the goal is outside the yellow area, will the robot be confused about which candidate waypoint to choose, because none of them renders a future feature matching the language instruction?
In a way, yes. Accurately predicting unobserved areas is indeed a challenge, but with the help of diffusion models it might be possible to predict these unobserved areas, using the already observed 3D features as the condition. That could be the goal of the next step.
Currently, in HNR, I attempted to enhance the prediction capability for unobserved regions by using a view encoder and local semantic alignment supervision (similar to Kaiming He's MAE, which can use about 30% of the visible patches to predict the other 70%). This approach has shown some effectiveness, but it does not fundamentally solve the problem.
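To make the MAE analogy concrete, here is a minimal sketch of the random-masking step: keep a small fraction of patch tokens and mark the rest as targets to be reconstructed. This is an illustration only, not the HNR code; the function name, the keep ratio, and the toy token array are all made up for the example.

```python
import numpy as np

def random_mask(tokens, keep_ratio=0.3, seed=0):
    """MAE-style random masking: keep `keep_ratio` of the patch tokens
    (the "visible" set) and return the indices of the masked tokens,
    which a decoder would be trained to predict."""
    rng = np.random.default_rng(seed)
    n = tokens.shape[0]
    n_keep = max(1, int(n * keep_ratio))
    perm = rng.permutation(n)
    keep_idx = np.sort(perm[:n_keep])   # visible patches fed to the encoder
    mask_idx = np.sort(perm[n_keep:])   # masked patches to reconstruct
    return tokens[keep_idx], keep_idx, mask_idx

# Toy example: 16 patch tokens of dimension 4.
tokens = np.arange(16 * 4, dtype=np.float32).reshape(16, 4)
visible, keep_idx, mask_idx = random_mask(tokens, keep_ratio=0.3)
print(visible.shape, len(mask_idx))  # 4 visible tokens, 12 masked
```

The supervision then aligns the decoder's predictions at `mask_idx` with the ground-truth semantic features, which is what pushes the model to extrapolate into regions it has not directly observed.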
Thanks for your kind reply!
Hi! I was wondering whether the fine-tuning stage of the MLP for volume rendering happens before task execution, or whether it is fine-tuned online while navigating. In other words, is the agent simultaneously mapping and navigating, or is the MLP fine-tuned in a specific scene before the language navigation task in that scene?