askforalfred / alfred

ALFRED - A Benchmark for Interpreting Grounded Instructions for Everyday Tasks
MIT License

Does a predicted mask have an impact on predicting an action? #18

Closed bhkim94 closed 4 years ago

bhkim94 commented 4 years ago

Hi, I'm trying to understand how masks affect action prediction.

I couldn't find any equations in the paper that relate actions to masks. According to the paper, the next action is determined by the visual and linguistic features, the previous action, the previous hidden state, and learnable parameters, but no mask is used in the action prediction.

So I've been reading through your code to find out how the masks affect the predicted actions, but I can't find where that happens.

If I've misunderstood the paper, could you help me understand how a mask affects the prediction of an action, and what the mask is used for?

Thanks for replying!

MohitShridhar commented 4 years ago

Hi @bhkim94,

The mask prediction is an output, not an input to the model. The mask complements an interactive action like Pickup by indicating what to pick up in the visual frame. The mask is conditioned on the same things as the action.
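To make the "output, not input" point concrete, here is a toy sketch of two output heads reading the same hidden state. It is not the ALFRED model: the sizes, the random weights, and the names `W_action`, `W_mask`, and `predict` are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
HIDDEN, NUM_ACTIONS, MASK_H, MASK_W = 16, 5, 4, 4

# Two independent output heads (random, untrained weights).
W_action = rng.normal(size=(NUM_ACTIONS, HIDDEN))
W_mask = rng.normal(size=(MASK_H * MASK_W, HIDDEN))


def predict(hidden):
    """Return (action_index, binary_mask) computed from one shared hidden state."""
    action_logits = W_action @ hidden
    mask_logits = (W_mask @ hidden).reshape(MASK_H, MASK_W)
    action = int(np.argmax(action_logits))
    # Thresholding logits at 0 is the same as sigmoid(logits) > 0.5.
    mask = mask_logits > 0
    return action, mask


# Stand-in for the state built from vision, language, and the previous action.
hidden = rng.normal(size=HIDDEN)
action, mask = predict(hidden)
```

Neither head reads the other's output: the mask never feeds into the action logits, which matches the equations described in the question.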

bhkim94 commented 4 years ago

Thanks for replying.

I don't quite understand your answer. Does it mean that interactive actions are applied to the masks? If so, how do we decide which pixel of the mask is selected for the interaction? If not, can actions be predicted without relying on masks (e.g. masks without any labels)?

Thanks!

MohitShridhar commented 4 years ago

All interactive actions (Pickup, Put, ToggleOn etc.) need a mask. The evaluation API selects the object that contains the most pixels inside the predicted mask, and then applies the action to that object.

You can predict actions without masks, but a Pickup action by itself is not useful unless you can somehow indicate what to pick up.
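The selection rule described above (pick the object with the most pixels inside the predicted mask) could be sketched as follows. This is a minimal illustration, not the repository's evaluation code: the function name `select_object`, the integer instance-id map, and the use of `0` for background are assumptions.

```python
import numpy as np


def select_object(pred_mask, instance_ids):
    """Return the id of the object covering the most pixels inside pred_mask.

    pred_mask:    bool array of shape (H, W), the predicted interaction mask.
    instance_ids: int array of shape (H, W), one object id per pixel,
                  with 0 meaning background (an assumption of this sketch).
    Returns None when the mask covers no object pixels.
    """
    ids_in_mask = instance_ids[pred_mask]
    ids_in_mask = ids_in_mask[ids_in_mask != 0]  # ignore background pixels
    if ids_in_mask.size == 0:
        return None
    ids, counts = np.unique(ids_in_mask, return_counts=True)
    return int(ids[np.argmax(counts)])


# Toy frame with two objects (ids 1 and 2) and some background (0).
instance_ids = np.array([
    [1, 1, 2, 2],
    [1, 1, 2, 2],
    [0, 0, 2, 2],
])
# The predicted mask overlaps 2 pixels of object 1 and 4 pixels of object 2.
pred_mask = np.array([
    [False, True, True, True],
    [False, True, True, True],
    [False, False, False, False],
])
selected = select_object(pred_mask, instance_ids)  # object 2 wins
```

Under this rule the mask does not need to be pixel-perfect; it only needs to overlap the intended object more than any other.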

bhkim94 commented 4 years ago

Thank you. I've found that part in the code.

I have one more question. Is it possible to give the agent ground-truth masks during validation?

I'm trying to figure out how much segmentation affects a model's accuracy, but I can't find an option for it.

Are there any options for this?

Again, thanks for your continued help.

MohitShridhar commented 4 years ago

During validation, if you follow the exact sequence of expert actions, then you can use the ground-truth masks provided in the dataset. But if you deviate from this sequence (quite likely), the masks won't be helpful. It's hard to obtain ground-truth masks in that case, because we can't tell what the model wants to interact with.

On a side note, generating masks is fundamental to solving ALFRED tasks. Typically in navigation tasks like VLN, you don't have to interact with objects in the scene, so you never have to worry about this. But here, you have to.
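One way to act on the point about expert sequences is to substitute the ground-truth mask only while the rollout still matches the expert, and fall back to the predicted mask after the first deviation. The sketch below is a hypothetical helper, not an option in the repository: `choose_mask`, the action strings, and the per-step `gt_masks` list (with `None` for non-interactive steps) are all assumptions.

```python
def choose_mask(step, predicted_actions, expert_actions, gt_masks, predicted_mask):
    """Use the ground-truth mask only while the rollout still matches the expert.

    predicted_actions / expert_actions: lists of action names, one per step.
    gt_masks: list aligned with expert_actions; None for non-interactive steps.
    Returns (mask, source), where source is "ground_truth" or "predicted".
    """
    on_expert_path = (
        step < len(expert_actions)
        and predicted_actions[: step + 1] == expert_actions[: step + 1]
    )
    if on_expert_path and gt_masks[step] is not None:
        return gt_masks[step], "ground_truth"
    return predicted_mask, "predicted"


# Placeholder action names and string stand-ins for mask arrays.
expert = ["MoveAhead", "Pickup", "MoveAhead", "Put"]
gt_masks = [None, "gt_mask_1", None, "gt_mask_3"]
rollout = ["MoveAhead", "Pickup", "RotateLeft", "Put"]  # deviates at step 2

at_step_1 = choose_mask(1, rollout, expert, gt_masks, "pred_mask_1")
at_step_3 = choose_mask(3, rollout, expert, gt_masks, "pred_mask_3")
```

Here `at_step_1` uses the ground-truth mask, while `at_step_3` falls back to the prediction because the rollout left the expert path at step 2, even though the action at step 3 happens to match.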