It seems the evaluation might have some bugs

askforalfred / alfred

ALFRED - A Benchmark for Interpreting Grounded Instructions for Everyday Tasks

MIT License


It seems the evaluation might have some bugs #92

Closed nikepupu closed 3 years ago

nikepupu commented 3 years ago

This is returning False, but it should be True. It comes from: pick_clean_then_place_in_recep-AppleSliced-None-DiningTable-27/trial_T20190907_151802_277016

nikepupu commented 3 years ago

This is from subgoal evaluation. The sequence of actions from my model: [('PutObject', 'sink'), ('ToggleObjectOn', 'faucet'), ('ToggleObjectOff', 'faucet'), ('PickupObject', 'apple'), ('<>', 'END')]

nikepupu commented 3 years ago

Image sequence before each action.

MohitShridhar commented 3 years ago

@nikepupu thanks for looking into this!

The cleaned state is maintained here. Sounds like something weird is happening with the objectType of sliced objects, and that's messing this up.

Are you using ai2thor==2.1.0?

nikepupu commented 3 years ago

Just checked again. I am using ai2thor==2.1.0. Thanks for the quick reply.

MohitShridhar commented 3 years ago

@nikepupu I'll look into this later this week when I get some time.

For the time being, I would suggest putting a breakpoint here and checking if the objectType is somehow wrong.

nikepupu commented 3 years ago

Some additional information for a related issue, traj_id: pick_heat_then_place_in_recep-PotatoSliced-None-SinkBasin-13/trial_T20190909_115736_122556

I think a better approach might be to compare objectType first and then compare object states (isSliced, isPickedUp, etc.). objectId seems to be unreliable.

Here the provided evaluation method forces the model to pick up one specific slice of the object. However, all the slices are stuck together, so it does not make sense to require one particular slice among them.
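
A minimal sketch of the type-then-state comparison suggested above, assuming AI2-THOR-style object metadata dicts with `objectType` and `isPickedUp` keys. The helper name `matches_by_type_and_state` and the sample entries are hypothetical and are not part of the ALFRED code:

```python
def matches_by_type_and_state(obj, target_type, required_state):
    """Return True if obj has the target type and every required state flag.

    obj is assumed to be one entry of an AI2-THOR-style metadata["objects"]
    list; required_state maps state keys (e.g. "isPickedUp") to expected values.
    """
    if obj.get("objectType") != target_type:
        return False
    return all(obj.get(key) == value for key, value in required_state.items())


# Hypothetical metadata entries, for illustration only.
objects = [
    {"objectId": "slice_a", "objectType": "PotatoSliced", "isPickedUp": False},
    {"objectId": "slice_b", "objectType": "PotatoSliced", "isPickedUp": True},
]

# Any picked-up slice satisfies the check, regardless of its objectId.
picked_up_slice = any(
    matches_by_type_and_state(o, "PotatoSliced", {"isPickedUp": True})
    for o in objects
)
```

With a check like this, the subgoal would be satisfied by whichever slice the agent holds, at the cost of ignoring which specific instance was picked.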

MohitShridhar commented 3 years ago

Thanks @nikepupu! Fixed in https://github.com/askforalfred/alfred/commit/904e1b025a626d201123a898f837e4edfeea741e.

The specific slice doesn't matter, but the objectId might still be relevant for some referring expressions, e.g. "the apple on the left". So I added a simple fix to remove Sliced... from the ID string.
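
For readers who want a rough picture of that kind of fix, here is a sketch that is not the code from the commit. It assumes sliced objects carry an extra `|`-separated trailing component containing `Sliced` (for example `Apple|-01.23|+00.90|+02.10|AppleSliced_1`); that ID format and both helper names are assumptions made for illustration:

```python
def strip_slice_suffix(object_id):
    """Drop a trailing slice component from an objectId, if one is present.

    Assumes IDs are '|'-separated and that a slice adds a final component
    containing 'Sliced'; IDs without such a component are returned unchanged.
    """
    parts = object_id.split("|")
    if len(parts) > 1 and "Sliced" in parts[-1]:
        parts = parts[:-1]
    return "|".join(parts)


def same_parent_object(id_a, id_b):
    """Compare two objectIds while ignoring which slice each one refers to."""
    return strip_slice_suffix(id_a) == strip_slice_suffix(id_b)
```

Comparing the stripped IDs keeps the parent object's identity (useful for expressions like "the apple on the left") while treating all of its slices as interchangeable.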

Fortunately, this reward function is not used anywhere in the leaderboard evaluation script, so existing and future submissions are unaffected.