allenai / ai2thor-rearrangement

πŸ”€ Visual Room Rearrangement
https://ai2thor.allenai.org/rearrangement
Apache License 2.0

Active Neural SLAM implementation #13

Closed Andrewzh112 closed 3 years ago

Andrewzh112 commented 3 years ago

Could you guys also include the ANS implementation that was mentioned in the paper?

Thanks!

Lucaweihs commented 3 years ago

Hi @Andrewzh112 ,

We are working on making this code ready for release. Before this occurs, if you are interested in generating ground-truth semantic maps please see this discussion.

Andrewzh112 commented 3 years ago

Thanks!

Lucaweihs commented 3 years ago

I'm going to reopen this issue in case anyone else is interested; I'll close it again once the implementation is public.

Lucaweihs commented 3 years ago

Hi @Andrewzh112, these experiments have been merged in, see PR #14. I haven't uploaded the pretrained model weights just yet but those will be coming shortly. Note that you'll have to update your version of AllenAct to the newest version (0.2.3) as that's where I've distributed the ActiveNeuralSLAM model (relevant PR).

Note that I've put a decent amount of work into making the mapping sensors efficient (GPU accelerated) but they are still noticeably slower than running without them. I get around 100-150 FPS during training when running on several GPUs. Let me know if you have any questions.

ugurbolat commented 3 years ago

@Lucaweihs thanks for the implementation.

As far as I can tell from the example script and active_neural_slam.py, you provide a semantic mapping capability by introducing 70 extra channels (i.e., 210x210x72). And I assume that no object segmentation or detection with something like MaskRCNN is implemented yet? Could you confirm or correct my understanding on this point?

Lucaweihs commented 3 years ago

Hi @ugurbolat, if I'm understanding you correctly, yes that's right. We don't currently do any explicit pixel-to-pixel semantic segmentation. That said, AI2-THOR can provide ground truth instance/semantic segmentation frames. While you shouldn't give these frames directly to the agent at inference time (we only allow agents access to RGB+depth for the challenge), you could use them to fine-tune a MaskRCNN.
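
In case it's useful, here is a minimal sketch (not from this thread) of pulling those ground-truth frames out of AI2-THOR for offline fine-tuning; the flag and attribute names assume a recent ai2thor release, and the scene choice is just an example:

    # Minimal sketch: grab ground-truth segmentation frames from AI2-THOR
    # for offline MaskRCNN fine-tuning. Flag/attribute names assume a
    # recent ai2thor release.
    from ai2thor.controller import Controller

    controller = Controller(
        scene="FloorPlan1",               # example scene
        renderInstanceSegmentation=True,  # enables event.instance_masks
        renderDepthImage=True,
    )

    event = controller.step(action="RotateRight")

    rgb = event.frame                                 # (H, W, 3) uint8 RGB
    instance_seg = event.instance_segmentation_frame  # color-coded instances
    # event.instance_masks maps objectId -> boolean (H, W) mask, which is
    # essentially the per-instance target format MaskRCNN training expects.
    for object_id, mask in event.instance_masks.items():
        pass  # e.g., save (rgb, mask, object_id) to disk here

    controller.stop()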

ugurbolat commented 3 years ago

@Lucaweihs thanks for the quick reply.

To be more precise, I plan to use only the semantic mapping capability of Active Neural SLAM (AN-SLAM) to evaluate the quality of the predicted map, without getting into the action side of things such as navigation and planning.

For example, the agent should build a semantic map for both the walkthrough and unshuffle sessions, and those two maps should be compared against each other and against the ground truth for evaluation. One downside is that, for my experiments, the navigation actions for exploration would have to be given. The complete ReArrangement task is too complex for me at the moment πŸ˜…
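
As a purely illustrative example of the comparison I have in mind, assuming binary maps of shape 210x210x72 and per-channel IoU as the metric (the metric choice is my own assumption, not from this repo):

    # Illustrative sketch: per-channel IoU between two binary semantic maps
    # of shape (210, 210, 72); using IoU here is my own choice.
    import numpy as np

    def per_channel_iou(map_a, map_b, eps=1e-8):
        """Return an array of shape (C,) with the IoU of each channel."""
        a, b = map_a.astype(bool), map_b.astype(bool)
        intersection = np.logical_and(a, b).sum(axis=(0, 1))
        union = np.logical_or(a, b).sum(axis=(0, 1))
        return intersection / (union + eps)

    # Stand-ins for maps predicted in the walkthrough/unshuffle sessions.
    walkthrough_map = np.random.rand(210, 210, 72) > 0.95
    unshuffle_map = np.random.rand(210, 210, 72) > 0.95
    print(per_channel_iou(walkthrough_map, unshuffle_map).mean())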

Also, I've noticed that AllenAct and AI2-THOR are well-written frameworks; you have already implemented AN-SLAM's semantic map as a baseline, along with nice utility functions for top-down mapping. In the future, I would like to integrate benchbot into AllenAct as an extra environment since I find your framework quite modular.

Lucaweihs commented 3 years ago

@ugurbolat it would be fantastic to have benchbot integrated into AllenAct. I was optimistically thinking I might do this myself at some point (especially given the RVSU challenge) but I just don't have the time given other projects/commitments. I'm happy to provide any support you might need in this regard though.

One downside is that for my experiments, the navigation actions for exploration should be given.

Gotcha, in case you haven't seen it, we have a test in AllenAct (see here) that might be useful if you'd like to see one way to generate the AN-SLAM map outside of AllenAct's train/test functions.

Also, if it's relevant to you, we do support having the agent follow expert actions during training. See, for instance, the projects/objectnav_baselines/experiments/objectnav_mixin_dagger.py file, in particular the line

                    teacher_forcing=LinearDecay(startp=1.0, endp=1.0, steps=tf_steps,),

results in the agent following the expert's actions for tf_steps training steps. In theory you could design your task so that the expert actions were just your pre-determined navigation actions.
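
For context, here is a condensed sketch of how that schedule plugs into a PipelineStage; import paths assume a recent AllenAct release, and tf_steps and the loss name are placeholders:

    # Condensed sketch: wiring the teacher-forcing schedule into an
    # AllenAct PipelineStage. Import paths assume a recent AllenAct
    # release; tf_steps and the loss name are placeholders.
    from allenact.utils.experiment_utils import LinearDecay, PipelineStage

    tf_steps = int(1e6)  # placeholder: steps during which to follow the expert

    dagger_stage = PipelineStage(
        loss_names=["imitation_loss"],
        max_stage_steps=tf_steps,
        # startp == endp == 1.0 means the expert action is used at every step.
        teacher_forcing=LinearDecay(startp=1.0, endp=1.0, steps=tf_steps),
    )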

Let me know if I can be of any more help.

ugurbolat commented 3 years ago

@Lucaweihs I initially got interested in benchbot especially because of the Scene Change Detection challenge, but AI2-THOR seems like a better choice for my initial experiments since it is more lightweight. And I want to see if I can simplify the ReArrangement task into the Scene Change Detection task, as I consider the latter a subset of the former.

Thanks for the pointers. That seems like exactly what I was looking for.

Let me dig more into those.

Lucaweihs commented 3 years ago

Gotcha @ugurbolat, yes that change should be pretty straightforward (in theory!).

ugurbolat commented 3 years ago

we do support having the agent follow expert actions during training.

@Lucaweihs I've experimented a bit with the expert actions provided by GreedyUnshuffleExpert. Since those actions are for the ReArrangement task, I am not sure they would trace an optimal path for exploring the environment to build a semantic map or fine-tune the MaskRCNN. How should I approach manually recording/building a trajectory that explores all scenes, so that I can create a training dataset?

Lucaweihs commented 3 years ago

Hi @ugurbolat, that's an interesting question. I suspect you'd like your trajectory to exhaustively explore the environment and see all of the objects, correct? If I were going to do this, I think I would create a new heuristic "expert" which follows a simple greedy strategy ensuring the agent sees every object. Namely, I would create a seen_objects set to store all of the objects my agent has seen so far and then, in a loop (see the sketch after this list):

  1. Grab all of the objects from the THOR metadata and use this metadata to see which object not in my seen_objects set is closest to my agent's current position. Call this object obj.
  2. As in the GreedyUnshuffleExpert, use lines similar to
        interactable_positions = env._interactable_positions_cache.get(
            scene_name=env.scene, obj=obj, controller=env.controller,
        )

    to figure out which positions obj is visible from.

  3. Use a ShortestPathNavigatorTHOR object (again as in GreedyUnshuffleExpert) to find the next action that would take me to the closest interactable position.
  4. Once the object is visible, record this in the seen_objects set and repeat.
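
Putting the above together, a rough (untested) sketch of that loop might look as follows; next_action_toward is a hypothetical helper standing in for the ShortestPathNavigatorTHOR logic, and env is assumed to expose the same interface used by GreedyUnshuffleExpert:

    # Rough, untested sketch of the greedy "see every object" expert above.
    # `next_action_toward` is a hypothetical stand-in for the
    # ShortestPathNavigatorTHOR logic used by GreedyUnshuffleExpert.
    def closest_unseen_object(env, seen_objects):
        # Step 1: nearest object (by THOR's distance metadata) not yet seen.
        unseen = [
            o for o in env.last_event.metadata["objects"]
            if o["objectId"] not in seen_objects
        ]
        return min(unseen, key=lambda o: o["distance"], default=None)

    def obj_is_visible(env, object_id):
        # THOR's metadata includes a per-object "visible" flag.
        return next(
            o["visible"]
            for o in env.last_event.metadata["objects"]
            if o["objectId"] == object_id
        )

    def explore_all_objects(env, next_action_toward):
        seen_objects = set()
        while True:
            obj = closest_unseen_object(env, seen_objects)
            if obj is None:
                break  # every object has been seen
            # Step 2: positions from which obj is interactable/visible.
            interactable_positions = env._interactable_positions_cache.get(
                scene_name=env.scene, obj=obj, controller=env.controller,
            )
            # Step 3: step toward the closest such position.
            while not obj_is_visible(env, obj["objectId"]):
                action = next_action_toward(env, interactable_positions)
                env.controller.step(action=action)
            # Step 4: record the object and repeat.
            seen_objects.add(obj["objectId"])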

Once you've built this expert you have a few different options for training your mapping / the MaskRCNN:

  1. You can set teacher_forcing in the TrainingPipeline to equal something like LinearDecay(steps=training_steps, startp=1, endp=1) (i.e. always follow the expert action) and then just train your auxiliary models using whichever losses you like as usual.
  2. You can build an offline dataset (you'll need to write your own script to generate this dataset by following the heuristic expert's actions; see the sketch after this list) and then either:
    • Follow our off-policy training tutorial to get a sense of how to train your agent using AllenAct with off-policy data, or
    • Simply use the data you've collected with some existing external training code (e.g. I imagine there's a lot of code out there made specifically for training MaskRCNN models and, while you could reimplement this in AllenAct, it might be more practical to use that existing code directly).

Let me know if that helps or if anything is unclear :)!