askforalfred / alfred

ALFRED - A Benchmark for Interpreting Grounded Instructions for Everyday Tasks

Are there missing objects in GT segmentation? #131

Open TopCoder2K opened 1 year ago

TopCoder2K commented 1 year ago

Hi, @MohitShridhar!

When I was debugging my model, I noticed that it can't take the Knife here: 57 although the mask seems to be correct: mask_57 I checked that the distance is correct and that the Knife's 'visible' property equals True, but the interaction fails with "CounterTop|+00.09|+00.89|-01.52 is not visible". Then I decided to visualize the GT segmentation: gt_sem_seg_57 and there is no knife! One could think that it simply has the same color as the CounterTop, but I checked that instance_counter inside thor_env.py indeed finds only one object there --- the CounterTop...
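For reference, this is roughly the check I ran (a minimal sketch; the attribute names `object_id_to_color` and `instance_segmentation_frame` are taken from the AI2-THOR event API as I understand it and may differ slightly in the version ALFRED pins):

```python
import numpy as np

def object_in_gt_segmentation(event, object_id):
    """Rough check: does `object_id` occupy any pixels in the GT instance
    segmentation of the current frame? (attribute names assumed from the
    AI2-THOR event API)"""
    color = event.object_id_to_color.get(object_id)  # instance color assigned by the simulator
    if color is None:
        return False
    seg = event.instance_segmentation_frame  # H x W x 3 array of per-instance colors
    mask = np.all(seg == np.asarray(color, dtype=seg.dtype), axis=-1)
    return bool(mask.any())
```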

Is this a real issue, or is there something I don't understand? Because if it is real, we have to somehow estimate the number of such cases and maybe even recalculate the leaderboard results after fixing this.

TopCoder2K commented 1 year ago

Also, by the way, it is strange that, judging by the color, the bottom of the frying pan does not belong to the frying pan.

thomason-jesse commented 1 year ago

Can you identify the trajectory in the ALFRED dataset to which this frame belongs? We can confirm using the replay scripts and original video whether the knife is interactable in that case. There is some stochasticity in the AI2-THOR simulator we are aware of that can cause objects to kind of "blink" like this, but it's not always replicable.

TopCoder2K commented 1 year ago

@thomason-jesse, thank you for your fast answer!

Can you identify the trajectory in the ALFRED dataset to which this frame belongs?

Sorry, what do you mean by 'identify'? Should I send the trajectory in the format of the evaluation server, or will it be enough to send the actions the model took? It is the 10th episode of the val_seen split ('pick_clean_then_place_in_recep-ButterKnife-None-Drawer-30/trial_T20190908_052007_212776'). The exact trajectory can be found in the 'Action' column of the log file 10.txt. I can also send the trajectory video and the trajectory data.

There is some stochasticity in the AI2-THOR simulator

Wow, I didn't know that! How can this manifest itself, and how often does it happen? Can it also affect rendering? I haven't managed to achieve determinism of the model execution: I fixed all the seeds, set torch to fully deterministic mode, and even fixed 'PYTHONHASHSEED', but the execution of an episode is still not deterministic.
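For completeness, this is roughly what I do on the model side (a sketch of my own setup; the seed value and the `use_deterministic_algorithms` call are just what I happen to use, not anything prescribed by ALFRED):

```python
import os
import random

import numpy as np
import torch

def make_deterministic(seed: int = 42):
    # pins down the model-side randomness; the simulator can still
    # introduce its own nondeterminism
    os.environ['PYTHONHASHSEED'] = str(seed)
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    torch.use_deterministic_algorithms(True)
```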

thomason-jesse commented 1 year ago

To clarify: are these actions prescribed in the training data trajectory or actions your model has inferred separately? If you check out the execution video for the trajectory you named above (https://askforalfred.com/?vid=21032), it looks like the PDDL-planner-generated actions went for a different knife that might not exhibit this blinking/disappeared segmentation issue.

AI2THOR has a few non-deterministic quirks, as we note in a few of our FAQs and paper discussion on why even perfect replay from the PDDL-generated actions doesn't always result in 100% success rate. The idea of "fixing this" and re-doing leaderboard calculations is definitely out of scope.

Anyway, short answer: the segmentation mask on that knife in that scene configuration might just be bad and there's not much we can do about it 🤷.

TopCoder2K commented 1 year ago

To clarify: are these actions prescribed in the training data trajectory or actions your model has inferred separately?

These are actions the model inferred separately.

If you check out the execution video for the trajectory you named above (https://askforalfred.com/?vid=21032) <...>

Unfortunately, I can't see the video (I don't know why): Screenshot from 2023-01-24 15-49-55 But the trajectory may well be different, since the model predicted these actions itself rather than taking them from ALFRED. The problem is that this knife increased the number of the agent's failed actions and confused it.

as we note in a few of our FAQs and paper discussion on why even perfect replay from the PDDL-generated actions doesn't always result in 100% success rate

Hmmm, the end of the sentence seems familiar to me, but I don't remember seeing it in the ALFRED article... Anyway, I had already forgotten about it, so thank you for pointing it out :+1:

The idea of "fixing this" and re-doing leaderboard calculations is definitely out of scope.

I see. But can we guarantee that the number of such objects is very small for the test splits (e.g. that they occur in only 2-3 episodes)? If not, the leaderboard results may be biased...
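If it helps, here is a rough sketch of how such cases could be counted during a replay of the test trajectories (purely illustrative: how the events and target object ids are collected is left out, and the `event` attribute names are assumed from the AI2-THOR API):

```python
import numpy as np

def target_missing_from_segmentation(event, target_id):
    """True if the target object is reported visible but has no pixels in the
    GT instance segmentation (attribute names assumed from the AI2-THOR API)."""
    obj = next(o for o in event.metadata['objects'] if o['objectId'] == target_id)
    if not obj['visible']:
        return False
    color = event.object_id_to_color.get(target_id)
    if color is None:
        return True
    seg = event.instance_segmentation_frame
    return not np.any(np.all(seg == np.asarray(color, dtype=seg.dtype), axis=-1))

def count_affected_frames(frames):
    """`frames` is an iterable of (event, target_object_id) pairs gathered
    while replaying trajectories; returns how many frames show the issue."""
    return sum(target_missing_from_segmentation(event, tid) for event, tid in frames)
```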