askforalfred / alfred

ALFRED - A Benchmark for Interpreting Grounded Instructions for Everyday Tasks

Note on Human Annotations #54

Closed SouLeo closed 3 years ago

SouLeo commented 3 years ago

Hello, I've been going through the human annotations provided in the ALFRED corpus for my own purposes. With the 589 annotations I have collected so far, primarily from the "examine in light" data section (I'm going through the training data sequentially), I have noticed a few things:

1) The human annotators are often confused about what the object of interest in the scene is. They frequently refer to an object like a clock as a "brown object", a "paperweight", etc.
2) Far more descriptive instruction is provided for the navigation aspects of the dataset than for the visual component.
3) Strange characters, such as parentheses and question marks, are left in some human annotations when annotators are confused by what they are viewing.
4) There are several spelling errors, such as "off" vs. "of" and "close" vs. "closet" vs. "closest".
5) Many high-level annotations show that the AMT annotators themselves do not understand the task. In the examine task, quite a few summarize it as "turn on the light", or use awkward phrasing such as "carry the clock to the light", without realizing the task is to view the clock itself. Basically, a very unnatural way for humans to communicate goals to an agent.

This is a very preliminary study, and I understand noise is typical in datasets. I will update this thread (if there is interest) with more evaluations of this dataset noise and human-focused specificity. But at a minimum, I wanted the authors to be aware of these issues so that greater prefiltering may be applied before training on the dataset. I will also admit that "examine in light" was still one of the best-performing tasks in the dataset; however, if this noise and annotator confusion carries over into more complex, longer-horizon tasks, it could contribute to the low performance of current models.
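For concreteness, here is a minimal sketch of the kind of prefilter I have in mind. It assumes the usual traj_data.json layout, where annotations live under `turk_annotations["anns"]` with a `task_desc` goal and per-subgoal `high_descs`; the paths, regex, and word lists are just illustrative assumptions, not anything from the ALFRED codebase.

```python
import json
import re
from pathlib import Path

# Characters that often signal annotator confusion (parentheses and question marks, as noted above).
SUSPICIOUS = re.compile(r"[()?]")

# A tiny hand-picked confusion list based on the spelling errors noted above; extend as needed.
CONFUSABLE_WORDS = {"off", "of", "close", "closet", "closest"}


def flag_annotation(text):
    """Return a list of reasons this instruction looks noisy (empty if it seems fine)."""
    reasons = []
    if SUSPICIOUS.search(text):
        reasons.append("stray parentheses/question marks")
    words = set(re.findall(r"[a-z']+", text.lower()))
    # Very rough heuristic: two or more confusable words in one instruction is worth a manual look.
    if len(words & CONFUSABLE_WORDS) >= 2:
        reasons.append("possible off/of or close/closet/closest confusion")
    return reasons


def scan_split(split_dir):
    """Walk a split directory and print annotations that trip any of the heuristics."""
    for traj_path in Path(split_dir).rglob("traj_data.json"):
        with open(traj_path) as f:
            traj = json.load(f)
        for ann in traj.get("turk_annotations", {}).get("anns", []):
            for text in [ann.get("task_desc", "")] + ann.get("high_descs", []):
                reasons = flag_annotation(text)
                if reasons:
                    print(f"{traj_path}: {text!r} -> {', '.join(reasons)}")


if __name__ == "__main__":
    scan_split("data/json_2.1.0/train")  # path is an assumption; point it at your local split
```

This only flags annotations for review rather than dropping them, since some of the "noise" (e.g. property words like "brown object") may actually be useful training signal.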

thomason-jesse commented 3 years ago

This is really interesting! We are aware that our validation method did not force annotations to perfectly align with videos; this is also reflected in human performance on the task being below 100%. As you point out, this is somewhat expected in natural language annotations. Of the categories above:
1) This asks models to learn property words as well as category words, which seems mostly good. "Brown object" is fun since "object" is vacuous except as something to be interacted with, so the model will have to learn the color word instead of just doing object detection for the "statue" (or whatever the referent is).
2) This may be a function of how much more time the agent spends taking navigation actions, but it's interesting to note, for sure.
3) I'd be interested to see these, if you have them indexed.
4) Spelling errors are sort of okay, I think. A "real" system taking typed or ASR input from a person will have to contend with spelling/word errors.
5) Agreed, I'm not crazy about these slipping through our validation. I think the "look at object in light" task probably has the highest rate of them just because it's a weird / non-canonical behavior for people.

We appreciate you pointing this out! We'd be happy to hear more if you find it useful for data filtering / finer-grained curriculum learning for models.

SouLeo commented 3 years ago

Awesome. I'm going to keep chugging through these, but if you want a good laugh, you should check the "examine in light" tasks involving the basketball object. Some people were either really confused by the task, or noticed a trash can in the room and told the agent to sink a basket in the trash can.

MohitShridhar commented 3 years ago

Closing the issue. But feel free to continue the discussion.