Chenfeng1271 opened 3 years ago
Hi! Thanks for sharing! Sounds like a nice idea. Although difficult, I guess it would be very cool if we could somehow evaluate what the agent has learnt during the early training stage, when lots of self-exploring is happening, and investigate its action pattern, language attention, or the types of paths it can solve --- to learn why it helps (even when instructions and paths are not well-aligned).
Besides, one paper that might be quite relevant to your idea is "BabyWalk: Going Farther in Vision-and-Language Navigation by Taking Baby Steps"; they apply curriculum learning + sub-instructions in training, and I think they also allow the agent to explore.
Cheers, Yicong
Hi! Today I have a new idea that I think is interesting. I read your paper 'The Road to Know-Where: An Object-and-Room Informed Sequential BERT for Indoor Vision-Language Navigation' and I considered: (1) the Transformer is very powerful at fusing multiple sources of information, e.g., language, images, objects; (2) do we really need objects, or can we use other alternatives? Objects may correlate with scene understanding, textual landmarks, or location information, but I think these are too implicit. I think we need depth information, which is more accurate with respect to the agent's actions across time steps, so the model could form an accurate overview of the indoor environment (of course, we can map it to instances/objects/stuff; originally I wanted to solve this task with depth-aware video panorama segmentation, but it is too difficult). The agent could maintain an accurate depth map of the house, and its orientation would benefit from it. You can imagine the benefits being like those of an autonomous car with a 3D sensor.

This consideration reminds me: how could we evaluate how much a model overfits the environment? Some datasets are small, e.g., R2R, so by the final training epoch, enough trials would overfit the environment. But does the agent really know the details of each room? Could it even build a panoramic graph of the house? I don't think so. Instead of denouncing the agent for never knowing which cues link a sub-instruction to an observation, I believe we blame it too much: the training forces the agent to swallow the environment without chewing. So we should let it understand more about each image it views.

Do you think depth information or depth awareness is necessary? I searched for related papers sharing a similar idea. It seems that 'Depth-Guided AdaIN and Shift Attention Network for Vision-And-Language Navigation' already realizes this idea, but I think the way they do it is clumsy.
Hi! Thanks for sharing your thoughts and the cool idea!
About the depth features, I totally agree with you that they contain valuable information for perceiving the environment and learning to navigate. Honestly, I have tried it with my Recurrent-VLN-BERT: by simply using an MLP to merge (concat and project) the RGB features, depth features, and the directional encoding before feeding them to the transformer, the results on R2R are improved! So yeah, I agree that it is definitely worth trying.
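For reference, a minimal sketch of that kind of fusion could look like the following. The dimensions (2048-d RGB, 128-d depth, 128-d directional encoding) and the module name are hypothetical placeholders, not taken from the Recurrent-VLN-BERT code:

```python
import torch
import torch.nn as nn

class VisualFusion(nn.Module):
    """Concatenate RGB, depth, and directional features, then project with an MLP."""
    def __init__(self, rgb_dim=2048, depth_dim=128, dir_dim=128, hidden_dim=768):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(rgb_dim + depth_dim + dir_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.LayerNorm(hidden_dim),
        )

    def forward(self, rgb_feat, depth_feat, dir_enc):
        # rgb_feat:   (batch, num_views, rgb_dim)
        # depth_feat: (batch, num_views, depth_dim)
        # dir_enc:    (batch, num_views, dir_dim)
        fused = torch.cat([rgb_feat, depth_feat, dir_enc], dim=-1)
        return self.mlp(fused)  # (batch, num_views, hidden_dim), fed to the transformer

# usage with random tensors, just to illustrate the shapes
fusion = VisualFusion()
out = fusion(torch.randn(2, 36, 2048), torch.randn(2, 36, 128), torch.randn(2, 36, 128))
```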
However, I believe that depth features in MP3D (discrete) are not as necessary as in Habitat-MP3D (continuous), because in MP3D the connectivity graph is pre-defined, meaning that the agent does not need to infer accessibility or decide how far it should go in a certain direction. Nevertheless, depth features should help the agent learn about its transitions in space and how far it is from a certain landmark (e.g., stop when it reaches the target, not when it sees the target).
Certainly, you will need to come up with some clever method to leverage the depth features. I would say "Depth-Guided AdaIN" is a nice attempt, but probably not very convincing to everyone. Speaking of mapping, another idea is to build a semantic 3D map with the help of depth (and semantic segmentation); you can learn about this idea in the work from Devendra Singh Chaplot.
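To make the mapping idea a bit more concrete, here is a rough sketch of projecting a per-pixel semantic segmentation into an egocentric top-down map using depth and a pinhole camera model. The field of view, height thresholds, map resolution, and class count are made-up values for illustration; this is not the actual pipeline from Chaplot's papers:

```python
import numpy as np

def depth_to_topdown_semantic_map(depth, sem_labels, hfov_deg=90.0,
                                  num_classes=40, map_size=100, cell_size=0.05):
    """Unproject depth to a 3D point cloud, then bin semantic labels into a
    top-down egocentric grid (agent at the bottom-center, facing +z)."""
    h, w = depth.shape
    f = (w / 2.0) / np.tan(np.deg2rad(hfov_deg) / 2.0)  # focal length in pixels
    xs, ys = np.meshgrid(np.arange(w), np.arange(h))

    # camera coordinates: x right, y down, z forward (metres)
    z = depth
    x = (xs - w / 2.0) * z / f
    y = (ys - h / 2.0) * z / f

    # keep points roughly at obstacle height (drop floor and ceiling); purely a heuristic
    valid = (z > 0.1) & (y > -1.0) & (y < 0.5)

    # convert metric x/z to grid indices
    col = (x[valid] / cell_size + map_size / 2.0).astype(int)
    row = (z[valid] / cell_size).astype(int)
    labels = sem_labels[valid]

    inside = (col >= 0) & (col < map_size) & (row >= 0) & (row < map_size)
    sem_map = np.zeros((num_classes, map_size, map_size), dtype=np.float32)
    np.add.at(sem_map, (labels[inside], row[inside], col[inside]), 1.0)
    return sem_map  # per-class point counts per map cell

# usage with a fake frame
depth = np.random.uniform(0.5, 5.0, size=(256, 256)).astype(np.float32)
sem = np.random.randint(0, 40, size=(256, 256))
m = depth_to_topdown_semantic_map(depth, sem)
```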
Cheers
Thanks for your amazing sharing. I am a novice to VLN, but I am still motivated by your ideas. I notice that it is inevitable for an agent to make mistakes, which come mainly from the mismatch between a sub-instruction and the observation (or location; I think you know what I mean). This issue has many causes, such as weak alignment or weak perception. Self-exploring, which shares the same idea as exploration in RL, wants to access some unobserved information that may be undesired for the current instruction but useful somewhere else. However, the agent may lack the motivation to remember it, since it obviously gains nothing at this step and the benefit only appears over a much longer range. My idea is: why not initially let the agent learn from easy instructions or sub-instructions and then focus on the hard ones? Of course, this idea is about training a machine with a human-like curriculum, i.e., Curriculum Learning, and it has no conflict with self-exploring. If we combine them, the idea becomes: only explore when the agent truly doesn't know what to do. I believe that in this way the search space would be reduced greatly (when purely exploring, the space may be large), and the behavior would be more human-like.
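A minimal sketch of what such a difficulty-ordered curriculum could look like, assuming episodes stored as dicts and using path length as a stand-in for instruction difficulty (the field names, growth schedule, and difficulty measure are all hypothetical):

```python
import random

def curriculum_batches(episodes, num_epochs, difficulty=lambda ep: len(ep["path"])):
    """Yield per-epoch training pools that grow from easy to hard episodes.

    `episodes` is a list of dicts; `difficulty` maps an episode to a sortable
    score (here, path length as a proxy for instruction difficulty)."""
    ordered = sorted(episodes, key=difficulty)
    for epoch in range(num_epochs):
        # fraction of the (sorted) data available this epoch, growing from 30% to 100%
        frac = min(1.0, 0.3 + 0.7 * epoch / max(1, num_epochs - 1))
        pool = ordered[: max(1, int(len(ordered) * frac))]
        random.shuffle(pool)
        yield epoch, pool

# usage with toy episodes
eps = [{"instr": f"instr {i}", "path": list(range(random.randint(3, 8)))} for i in range(20)]
for epoch, pool in curriculum_batches(eps, num_epochs=5):
    print(epoch, len(pool))
```

Combining this with self-exploration could then mean triggering exploration only on the harder episodes where the agent is uncertain about what to do, rather than exploring everywhere.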