
Generate trajectories (suggest additional ones in comments) #3

Open Eugleo opened 7 months ago

Eugleo commented 7 months ago

Prerequisite: #6.

My current plan is to generate around 3 "positive" trajectories and 3 "negative" trajectories for each of the tasks below.

My current plan is to generate 3 trajectories and 3 alternative descriptions per task.

Priorities:

Ideas:

Dont-Care-Didnt-Ask commented 7 months ago

What will the "negative" trajectories look like?

It seems to me that we can generally use positives from other tasks as negatives, so I would instead propose making 6 diverse positives for each task. Positive descriptions also work well for the "all-versus-all" evaluation I outlined in #4, as in the sketch below.
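Roughly what I have in mind, as a minimal sketch: score every trajectory against every task description and check that the diagonal wins. This assumes an off-the-shelf CLIP from Hugging Face transformers and reduces each trajectory to a single representative frame; both are just simplifications for illustration.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def all_vs_all_accuracy(frames, descriptions):
    # frames[i]: a representative PIL image from task i's trajectory;
    # descriptions[i]: the text description of task i.
    inputs = processor(text=descriptions, images=frames,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        # (n_trajectories, n_descriptions) similarity matrix
        logits = model(**inputs).logits_per_image
    # A trajectory counts as recognized if its own task's description
    # scores highest among all the descriptions.
    preds = logits.argmax(dim=-1)
    return (preds == torch.arange(len(frames))).float().mean().item()
```

With 6 positives per task we would average this over all trajectories of each task rather than over one frame each.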

Eugleo commented 7 months ago

It seems that with the two different setups we answer two slightly different questions:

All-v-all: We assume all we care about are the different tasks we measure. Then we ask: Can the VLM distinguish those from each other?

Pos+Neg examples: Assuming the neg examples are good (e.g. you almost see a window but not quite), we answer the question: Can the model recognize this task by itself, reliably?
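For concreteness, a sketch of how the pos+neg question could be scored per task, assuming some frame-text similarity function score_fn(frame, description) (a hypothetical stand-in, e.g. the CLIP logits from the sketch above):

```python
def pos_neg_auc(score_fn, pos_frames, neg_frames, description):
    # score_fn is a hypothetical frame-text similarity function.
    pos = [score_fn(f, description) for f in pos_frames]
    neg = [score_fn(f, description) for f in neg_frames]
    # Fraction of (positive, negative) pairs where the positive
    # trajectory scores higher against the task description;
    # this is exactly the AUROC of the scores for this task.
    wins = sum(p > n for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```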

All-v-all is in some ways easier for us, because thinking up good negative examples (and trying to get a lot of them) is a futile task.

However, pos+neg is easier in other ways: for all-v-all it might be hard to produce trajectories that admit only one label in this env (e.g. you're looking at a window but also inadvertently getting closer to a vase).

Maybe I can try doing all-v-all, and if the task overlap is hard to get rid of I'll switch to pos+neg?

Dont-Care-Didnt-Ask commented 7 months ago

Yes, this sounds reasonable. I agree that trajectories from other tasks will not necessarily be the hardest negatives, but the hope is that at least we'll have a lot of "medium-hard" negatives.

We can think of specialized, well-crafted negative examples as an extension of the benchmark: the hard version (and therefore we should focus on them later).

evgunter commented 7 months ago

Based on the CLIP spatial reasoning article from Slack (https://medium.com/@hendrik.suvalov/evaluating-clip-for-spatial-reasoning-7ffcc8e00f82), it seems like it could be good to have the same task filmed both from the clearest possible camera angle and from a more oblique one, to benchmark the extent to which a model's spatial reasoning is robust.
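One way this could be quantified, as a sketch reusing the same hypothetical score_fn(frame, description) similarity function as in the earlier sketches:

```python
def angle_robustness_gap(score_fn, clear_frames, oblique_frames, description):
    # Frames of the same trajectory rendered from a clear camera angle
    # versus a more oblique one, scored against the task description.
    clear = [score_fn(f, description) for f in clear_frames]
    oblique = [score_fn(f, description) for f in oblique_frames]
    # A large positive gap suggests the model's spatial reasoning
    # degrades when the viewpoint is less canonical.
    return sum(clear) / len(clear) - sum(oblique) / len(oblique)
```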