MrZihan / HNR-VLN

Official implementation of Lookahead Exploration with Neural Radiance Representation for Continuous Vision-Language Navigation (CVPR'24 Highlight).

MLP training problem #8

Closed snitchyang closed 2 months ago

snitchyang commented 2 months ago

Hi! I wonder whether the fine-tuning stage of the MLP for volume rendering happens before task execution, or whether it is fine-tuned online while navigating. In other words, does the agent map and navigate simultaneously, or is the model fine-tuned in a specific scene before performing the language navigation task in that scene?

MrZihan commented 2 months ago

The MLP and view encoder map and navigate simultaneously, without any fine-tuning. They were pre-trained only on the HM3D dataset and have never been trained on any navigation scenes from the Matterport3D dataset. The VLN agent must be able to perform navigation tasks in unseen environments, so fine-tuning in a specific scene before task execution is not allowed.
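To make the frozen, zero-shot usage concrete, here is a minimal sketch, assuming tiny stand-in networks; the module architectures, shapes, and aggregation step are all hypothetical placeholders, not the repository's actual code:

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins for the pre-trained view encoder and rendering MLP;
# the real architectures live in the repository, these are for illustration only.
view_encoder = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 32))
render_mlp = nn.Sequential(nn.Linear(32 + 3, 64), nn.ReLU(), nn.Linear(64, 32))

# Both networks stay frozen in unseen scenes: eval mode, no parameter updates.
view_encoder.eval()
render_mlp.eval()
for p in list(view_encoder.parameters()) + list(render_mlp.parameters()):
    p.requires_grad_(False)

# Dummy observation (e.g. per-point RGB) and 3D query points for a lookahead view.
obs_rgb = torch.rand(1024, 3)
query_xyz = torch.rand(4096, 3)

with torch.no_grad():  # pure inference: simultaneous mapping and navigating
    feats = view_encoder(obs_rgb)                     # encode observed points
    scene_feat = feats.mean(dim=0, keepdim=True)      # toy aggregation step
    cond = scene_feat.expand(query_xyz.shape[0], -1)  # condition each query point
    predicted = render_mlp(torch.cat([cond, query_xyz], dim=-1))
```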

MrZihan commented 2 months ago

A core feature of the HNR model is that it does not require fine-tuning for specific scenes. In new navigation scenes, it only needs to collect the agent's observations for inference. Real-time performance and generalizability are critical requirements for embodied tasks; see similar works such as https://github.com/MrZihan/Sim2Real-VLN-3DFF and https://geff-b1.github.io/.
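As a toy illustration of this "collect observations, then infer" loop, a sketch of the control flow is below; `env`, `agent`, the `feature_bank`, and all function signatures are hypothetical placeholders, and the point is only that no gradient step ever happens inside the loop:

```python
import torch

def navigate(env, agent, view_encoder, render_mlp, max_steps=50):
    """Toy loop: map and navigate simultaneously, with frozen networks.

    env, agent, and the feature-field logic are hypothetical placeholders
    standing in for the real pipeline.
    """
    feature_bank = []            # 3D features accumulated from observations so far
    obs = env.reset()
    for _ in range(max_steps):
        with torch.no_grad():    # inference only: the networks are never updated
            feature_bank.append(view_encoder(obs))        # map: store new features
            lookahead = [render_mlp(wp, feature_bank)     # render each candidate view
                         for wp in agent.candidate_waypoints(obs)]
        obs, done = env.step(agent.choose(lookahead))     # navigate
        if done:
            break
```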

snitchyang commented 2 months ago

Thank you! Does this mean it will be hard for the agent to navigate to an unobserved position (e.g., something in another room), because volume rendering can only generate future representations in areas that have already been observed?

snitchyang commented 2 months ago

[image]

For example, if the goal is outside the yellow area, will the robot be confused about which candidate waypoint to choose, because none of them renders a future feature that matches the language instruction?

MrZihan commented 2 months ago

In a way, yes. Accurately predicting unobserved areas is indeed a challenge, but with the help of diffusion models it might be possible to predict these unobserved areas, using the already observed 3D features as the condition. This could be a goal for the next step.
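A very rough sketch of what such conditioning could look like is below. This is only one reading of the idea, not an implementation from this repository: the denoiser architecture, the linear noising schedule, and all shapes are made up for illustration.

```python
import torch
import torch.nn as nn

# Hypothetical denoiser: predicts the noise added to unobserved-region features,
# conditioned on features aggregated from the observed regions.
class ConditionalDenoiser(nn.Module):
    def __init__(self, feat_dim=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim * 2 + 1, 128), nn.ReLU(),
            nn.Linear(128, feat_dim))

    def forward(self, noisy_target, observed_cond, t):
        # t is broadcast as a scalar timestep embedding (kept trivially simple here)
        t_emb = t.expand(noisy_target.shape[0], 1)
        return self.net(torch.cat([noisy_target, observed_cond, t_emb], dim=-1))

denoiser = ConditionalDenoiser()
target = torch.rand(16, 32)   # features of an unobserved region (training only)
cond = torch.rand(16, 32)     # condition: already observed 3D features
t = torch.rand(1)             # diffusion timestep in [0, 1]

noise = torch.randn_like(target)
noisy = (1 - t) * target + t * noise                      # toy linear noising schedule
loss = ((denoiser(noisy, cond, t) - noise) ** 2).mean()   # eps-prediction loss
```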

MrZihan commented 2 months ago

Currently, in HNR, I attempted to enhance the prediction capability for unobserved regions by using a view encoder and local semantic alignment supervision (similar to Kaiming He's MAE, which can reconstruct the ~75% of masked patches from the remaining ~25% of visible ones). This approach has shown some effectiveness, but it has not fundamentally solved the problem.
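For readers unfamiliar with the MAE idea being referenced, here is a minimal sketch of masked-patch prediction; this is generic MAE-style code with toy stand-in networks, not the HNR view encoder itself:

```python
import torch
import torch.nn as nn

num_patches, dim, mask_ratio = 196, 64, 0.75  # MAE's default masks ~75% of patches

patches = torch.rand(num_patches, dim)        # patch embeddings of one view
perm = torch.randperm(num_patches)
num_visible = int(num_patches * (1 - mask_ratio))
visible_idx, masked_idx = perm[:num_visible], perm[num_visible:]

# Tiny stand-ins for the encoder/decoder; the real MAE uses ViT blocks.
encoder = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
decoder = nn.Linear(dim, dim)

encoded = encoder(patches[visible_idx])             # encode only the visible ~25%
pooled = encoded.mean(dim=0, keepdim=True)          # toy aggregation of visible patches
pred = decoder(pooled.expand(len(masked_idx), -1))  # predict the masked ~75%
loss = ((pred - patches[masked_idx]) ** 2).mean()   # reconstruction objective
```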

snitchyang commented 2 months ago

Thanks for your kind reply!