askforalfred / alfred

ALFRED - A Benchmark for Interpreting Grounded Instructions for Everyday Tasks
MIT License
360 stars · 77 forks

Why repeat the same frame to predict <<stop>>? #53

Closed — ryokan0123 closed this issue 3 years ago

ryokan0123 commented 3 years ago

Hi,

I have a question about this line. https://github.com/askforalfred/alfred/blob/6d78d8085699da35371c5f65339a470ffcf89a3e/models/model/seq2seq_im_mask.py#L121

If I understand correctly, `frames` contains the image that conditions the prediction of the `action_low` at the same index. I assume that `im[-1]` in this line is the image captured while executing the last actual action (e.g., slicing an object), but to predict the stop action, it would seem natural to use a new image taken after that last action. Why does it repeat the same frame? Or is my understanding incorrect? Thanks!
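For reference, a minimal sketch of what I understand the linked line to do (numpy stands in for torch here, and the dimensions are made up): pad the frame sequence by repeating its last element so there is one frame per prediction step, including the extra step for <<stop>>.

```python
import numpy as np

# Hypothetical stand-in for the loaded ResNet features: n frames,
# one per low-level action, each a d-dimensional feature vector.
n, d = 4, 512
im = np.arange(n * d, dtype=float).reshape(n, d)

# As I read the linked line: repeat the last frame so the sequence
# has one frame per predicted action, including <<stop>>.
frames = np.concatenate([im, im[-1:]], axis=0)

assert frames.shape == (n + 1, d)
assert np.array_equal(frames[-1], frames[-2])  # last frame is a copy
```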

MohitShridhar commented 3 years ago

@Ryou0634, during generation, we saved 10 extra frames after the last action (see this). So technically `im[-1]` should be from after the last action.

ryokan0123 commented 3 years ago

OK, I see that 10 extra frames are saved, but what about the generated ResNet features in `json_feat_2.1.0`?

Here’s what I’m seeing. I’m reading data from `json_feat_2.1.0`. `ex['plan']['low_actions']` contains n actions (a{0}, …, a{n-1}), and `im` contains n frame features (f{0}, …, f{n-1}), just before this line.

My understanding is that f{t} is used to predict a{t}, so we need an additional frame f{n} to predict \<stop>, like this:

Executed Action:  \<go>, a{0}, …, a{n-2}, a{n-1}
Frames:           f{0}, f{1}, …, f{n-1}, f{n}
Predicted Action: a{0}, a{1}, …, a{n-1}, \<stop>

So, if `im[-1]` corresponds to f{n-1}, it does not make sense to me to copy f{n-1} as f{n}... Any clarification would be helpful!

MohitShridhar commented 3 years ago

Hmm... sorry, maybe I am misunderstanding something, but in your description, if f{n} is the last frame, then `im[-1]` corresponds to f{n}, not f{n-1}, right? Because `im[-1]` is the last element of the list? Or am I missing something?

ryokan0123 commented 3 years ago

Thank you for your patient response. Let me clarify my understanding.

I see that `ex['plan']['low_actions']` contains only actual actions, without <stop> (e.g., [LookDown, ..., PutObject]). `len(ex['plan']['low_actions'])` and `im.shape[0]` are the same number, so I assume we need one extra frame to predict <stop>. However, `im[-1]` is actually the frame used to predict the last actual action (e.g., PutObject), which changes the state; the frame used to predict <stop> should be a new frame, not `im[-1]`, I think.

davidnvq commented 3 years ago

@Ryou0634 What you think is correct. To predict <stop>, we should use a new frame captured after all the actions finish. As Mohit clarified, the last 10 saved frames are essentially identical, and they contain exactly the new frame you are looking for. Say we have n low actions (including the <stop> action). To predict low action i, we use frame i-1, and to predict the stop action, we use frame n-1. I believe the last 10 frames here are effectively frame n-1. Since the <stop> action makes no change to the environment, frame n-1, frame n, and frame n+1 all look the same. @MohitShridhar, correct me if I'm wrong. I actually re-generated the frames myself using this intuition.
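To make the indexing above concrete, here is a toy alignment of frames to prediction targets (action names are hypothetical; frame t conditions the prediction of action t, and the first post-action frame conditions <stop>):

```python
# Toy example of the alignment described above.
actions = ["LookDown", "MoveAhead", "PutObject"]  # n = 3 low actions
frames = ["f0", "f1", "f2", "f3"]                 # n + 1 frames
targets = actions + ["<stop>"]

pairs = list(zip(frames, targets))
# → [('f0', 'LookDown'), ('f1', 'MoveAhead'),
#    ('f2', 'PutObject'), ('f3', '<stop>')]
```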

MohitShridhar commented 3 years ago

@davidnvq thanks! Yes, this is correct with the full `feat_conv.pt` files from the Full Dataset.

@Ryou0634 I think you might be right. Thanks for looking into this! I'll look at it in depth soon. A quick fix might be to simply append the last frame feat from the Full Dataset.
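A sketch of that quick fix, under some assumptions: numpy stands in for the tensor actually loaded from `feat_conv.pt` (which would use `torch.load`), and the dimensions and variable names are made up.

```python
import numpy as np

# Simulated Full Dataset feat_conv.pt: it stores extra frames beyond
# the last action, so the post-action frame is available.
full_feats = np.random.rand(10, 512)
n_actions = 6  # stand-in for len(ex['plan']['low_actions'])

# json_feat_2.1.0 effectively holds only the first n_actions frames.
json_feats = full_feats[:n_actions]

# Quick fix: append the first post-action frame so <stop> is predicted
# from a frame captured after the final action, not a copy of im[-1].
stop_frame = full_feats[n_actions:n_actions + 1]
fixed = np.concatenate([json_feats, stop_frame], axis=0)

assert fixed.shape[0] == n_actions + 1
assert np.array_equal(fixed[-1], full_feats[n_actions])
```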

ryokan0123 commented 3 years ago

@MohitShridhar Sure, thanks! Meanwhile, I'll generate a feature for the last frame for my own purposes :) @davidnvq Thank you!

MohitShridhar commented 3 years ago

OK, fixed the `feat_conv.pt` files in the Modeling Quickstart and the loader here.