Generate trajectories (suggest additional ones in comments)

Eugleo commented 7 months ago

Prerequisite: #6.

~~My current plan is to generate a around 3 "positive" trajectories and 3 "negative" trajectories for all of the tasks below.~~

My current plan is to generate 3 trajectories and 3 alternative descriptions per task.

Priorities:

[x] Detecting the type of room the agent is in
[x] Detecting presence of an object in the frame (window?)
[x] Videos from modeled scenes instead of the photorealistic ones
[x] Walking up and down stairs (video understanding)
[x] Walking through objects (video understanding)
[x] Throwing something on the ground vs it already being on the ground (video understanding)
[ ] Object recognition for out-of-distribution objects (e.g. random objects in a room)
[ ] Spatial reasoning
[ ] Temporal reasoning (this with the one above could be one set of test trajectories)

Ideas:

Different, more out-of-distribution angles
Detecting proximity to an object (ideally in a scene where we also have GT for this)
Dropping something the agent held on the ground.
Toppling something over.
Pushing something from the top of the table to make it fall.
Natural-looking movement (e.g. walking straight vs strafing to one side to get somewhere)

Dont-Care-Didnt-Ask commented 7 months ago

How will "negative" trajectories look like?

It seems to me, that we can generally use positives from other tasks as negatives. So I would rather propose to make 6 diverse positives for each task. Positive descriptions also work well for "all-versus-all" evaluation, which I outlined in #4.

Eugleo commented 7 months ago

It seems that by the two different setups we answer two slightly different questions:

All-v-all: We assume all we care about are the different tasks we measure. Then we ask: Can the VLM distinguish those from each other?

Pos+Neg examples: Assuming the neg examples are good (e.g. you almost see a window but not quite), we answer the question: Can the model recognize this task by itself, reliably?

All in all is in some way easier for us, because thinking about what would be a good negative example (and trying to get a lot of them) is a futile task.

However, pos+neg is easier in other ways — namely, it might be hard to have trajectories that can only have one label in this env (e.g. you're looking at a window but also inadvertently getting closer to a vase).

Maybe I can try doing all-v-all, and if the task overlap is hard to get rid of I'll switch to pos+neg?

Dont-Care-Didnt-Ask commented 7 months ago

Yes, this sounds reasonable. I agree that trajectories from other tasks will not necessarily be the hardest negatives, but the hope is that at least we'll have a lot of "medium-hard" negatives.

We can think about specialized, good negative examples as an extension of benchmark -- the hard version (and therefore we should focus on them later).

evgunter commented 7 months ago

based on the clip spatial reasoning article from slack (https://medium.com/@hendrik.suvalov/evaluating-clip-for-spatial-reasoning-7ffcc8e00f82) it seems like it could be good to have the same task from the clearest possible camera angle and from a more oblique camera angle (to benchmark the extent to which a model has robust spatial reasoning)

Eugleo / vlmrm

Generate trajectories (suggest additional ones in comments) #3