Closed: bhkim94 closed this issue 4 years ago
Hi @bhkim94,
The mask prediction is an output of the model, not an input. The mask complements an interactive action like Pickup by indicating what to pick up in the visual frame, and it is conditioned on the same inputs as the action.
Thanks for replying.
I don't quite understand your answer. Does it mean that interactive actions are applied to the masks? If so, how do we decide which pixels of the mask to interact with? If not, can actions be predicted without masks (e.g., masks without any labels)?
Thanks!
All interactive actions (Pickup, Put, ToggleOn, etc.) need a mask. The evaluation API selects the object that contains the most pixels inside the predicted mask, and then applies the action to that object.
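In other words, object selection is a pixel-overlap argmax over the scene's instance segmentations. A minimal sketch (the function name, arguments, and shapes here are my own for illustration, not the actual ALFRED evaluation API):

```python
import numpy as np

def select_object(pred_mask, instance_masks):
    """Return the id of the object whose instance segmentation shares
    the most pixels with the predicted interaction mask.

    pred_mask: (H, W) boolean array produced by the model.
    instance_masks: dict mapping object_id -> (H, W) boolean array.
    (Hypothetical signature for illustration only.)
    """
    best_id, best_overlap = None, 0
    for obj_id, inst_mask in instance_masks.items():
        overlap = int(np.logical_and(pred_mask, inst_mask).sum())
        if overlap > best_overlap:
            best_id, best_overlap = obj_id, overlap
    return best_id
```

So even a rough mask works, as long as the majority of its pixels fall on the intended object.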
You can predict actions without masks, but a Pickup action by itself is not useful unless you can somehow indicate what to pick up.
Thank you. I've found that part in the code.
There is one more question: is it possible to give an agent ground-truth masks during validation?
I'm trying to figure out how much segmentation affects a model's accuracy, but I can't find any option for it.
Is there such an option?
Again, thanks for your continued help.
During validation, if you follow the exact sequence of expert actions, then you can use the provided ground-truth masks in the dataset. But if you deviate from this sequence (quite likely), the masks won't be helpful. It's hard to get masks in this context, because we can't tell what the model wants to interact with.
On a side note, generating masks is fundamental to solving ALFRED tasks. Typically in navigation tasks like VLN, you don't have to interact with objects in the scene, so you never have to worry about this. But here, you do.
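One way to run such an oracle-mask ablation (a hypothetical helper of my own, not an option in the ALFRED repo) is to substitute the dataset's ground-truth mask only while the agent is still on the expert trajectory:

```python
def choose_mask(t, taken_actions, expert_actions, gt_masks, pred_masks):
    """Return the ground-truth mask for step t while the actions taken so
    far exactly match the expert trajectory; once the agent deviates, the
    dataset masks no longer correspond to what the model wants to interact
    with, so fall back to the predicted mask.
    (Hypothetical helper for an ablation; not part of the ALFRED code.)
    """
    on_expert_path = taken_actions[:t + 1] == expert_actions[:t + 1]
    return gt_masks[t] if on_expert_path else pred_masks[t]
```

This gives an upper bound on how much better segmentation could help, without having to guess masks after the agent leaves the expert path.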
Hi, I'm trying to figure out how masks affect action prediction.
I couldn't find any equation in the paper that shows the relationship between actions and masks. According to the paper, the next action is determined by visual and linguistic features, the previous action, the previous hidden state, and learnable parameters, but no mask is used for action prediction.
So I'm now going through your code to see how masks affect action prediction, but I can't find it.
If I've misunderstood the paper, could you help me understand how a mask affects the prediction of an action and what a mask is used for?
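For what it's worth, the conditioning described earlier in this thread can be sketched as two independent output heads reading the same decoder hidden state: the mask is conditioned on the same inputs as the action, but does not feed into action prediction. A toy numpy sketch with made-up sizes (not the authors' model):

```python
import numpy as np

rng = np.random.default_rng(0)
HIDDEN, N_ACTIONS, H, W = 16, 5, 30, 30  # toy sizes, not the real model

# Two independent heads over the same hidden state h_t, which the decoder
# computes from visual/linguistic features, the previous action, and the
# previous hidden state.
W_action = rng.normal(size=(HIDDEN, N_ACTIONS))
W_mask = rng.normal(size=(HIDDEN, H * W))

def decode_step(h_t):
    action_logits = h_t @ W_action               # action head
    mask_logits = (h_t @ W_mask).reshape(H, W)   # mask head, same conditioning
    return action_logits, mask_logits
```

This would explain why no mask term appears in the action equation: the mask is a sibling output, not an input.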
Thanks for replying!