OSU-NLP-Group / SeeAct

[ICML'24] SeeAct is a system for generalist web agents that autonomously carry out tasks on any given website, with a focus on large multimodal models (LMMs) such as GPT-4V(ision).
https://osu-nlp-group.github.io/SeeAct/

What is the meaning of an empty `pos_candidate`? #43

Open liaopeiyuan opened 2 months ago

liaopeiyuan commented 2 months ago

There are 761 rows in the HuggingFace dataset osunlp/Multimodal-Mind2Web that have an empty pos_candidate.

These rows span 497 tasks, broken down by split as follows:

{'test_domain': 164, 'test_task': 47, 'test_website': 34, 'train': 252}

Here's a sample task that has an empty pos_candidate in one of the steps: https://huggingface.co/datasets/osunlp/Multimodal-Mind2Web/viewer/default/train?q=6687eb6c-7154-4176-83a8-e841f78089d9 (row=1659)
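A count like the one above could be reproduced roughly as follows. This is only a minimal sketch: it assumes each row is a dict-like record with a `pos_candidates` list and an `annotation_id` task identifier (field names taken from the dataset viewer, so treat them as assumptions), and `count_empty_pos` is a hypothetical helper, not something from the SeeAct codebase.

```python
from collections import defaultdict


def count_empty_pos(rows_by_split):
    """Count rows, and distinct tasks per split, with no positive candidate.

    rows_by_split: mapping of split name -> iterable of dict-like rows.
    Assumed fields: "pos_candidates" (list) and "annotation_id" (task id).
    """
    empty_rows = 0
    tasks_per_split = defaultdict(set)
    for split, rows in rows_by_split.items():
        for row in rows:
            if not row["pos_candidates"]:  # empty list means no ground-truth element
                empty_rows += 1
                tasks_per_split[split].add(row["annotation_id"])
    return empty_rows, {split: len(ids) for split, ids in tasks_per_split.items()}


# Toy rows standing in for the real dataset (values are made up).
toy_rows = {
    "train": [
        {"annotation_id": "t1", "pos_candidates": ["el_1"]},
        {"annotation_id": "t1", "pos_candidates": []},
        {"annotation_id": "t2", "pos_candidates": []},
    ],
    "test_task": [
        {"annotation_id": "t3", "pos_candidates": []},
        {"annotation_id": "t3", "pos_candidates": []},
    ],
}
toy_empty_rows, toy_tasks = count_empty_pos(toy_rows)
print(toy_empty_rows, toy_tasks)
```

With the `datasets` library, the mapping passed in could presumably be the result of `load_dataset("osunlp/Multimodal-Mind2Web")`, although that also downloads the screenshots.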

It appears that src/data_utils/evaluation_utils.py and src/offline_experiments/screenshot_generation/*.py treat an empty pos_candidates as a failure of the agent. Since "A task is regarded as successful only if all steps have succeeded," any task containing such a step can never count as successful, which makes it unclear how to interpret the accuracy gap in the "whole success rate" in Table 4.
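To make the concern concrete, here is a small sketch of the "all steps must succeed" rule and of the ceiling it implies when steps with an empty `pos_candidates` are forced failures. The function names and the toy data are hypothetical and only illustrate the rule as quoted; they are not the repository's evaluation code.

```python
def whole_task_success_rate(step_results):
    """step_results: mapping task_id -> list of per-step success booleans."""
    if not step_results:
        return 0.0
    successes = sum(1 for steps in step_results.values() if all(steps))
    return successes / len(step_results)


def success_ceiling(task_rows):
    """Best achievable whole-task rate if empty-candidate steps always fail.

    task_rows: mapping task_id -> list of rows with a "pos_candidates" list.
    """
    if not task_rows:
        return 0.0
    solvable = sum(
        1 for rows in task_rows.values()
        if all(row["pos_candidates"] for row in rows)
    )
    return solvable / len(task_rows)


# Toy example: task "t2" has one step without any ground-truth element.
toy_tasks = {
    "t1": [{"pos_candidates": ["a"]}, {"pos_candidates": ["b"]}],
    "t2": [{"pos_candidates": ["c"]}, {"pos_candidates": []}],
}
ceiling = success_ceiling(toy_tasks)

# Even a perfect agent fails "t2" under the forced-failure convention.
perfect_agent = {"t1": [True, True], "t2": [True, False]}
rate = whole_task_success_rate(perfect_agent)
print(ceiling, rate)
```

Under these assumptions, part of any gap in the whole success rate comes from the data rather than from the agent.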

boyugou commented 3 weeks ago

Hi Peiyuan,

Sorry for the late reply.

Mind2Web applied some preprocessing, which may have filtered out the ground-truth elements. That is the root cause of the empty pos_candidates. (You can check the original Mind2Web paper for more details.)

Indeed, for these cases, a reasonable approach is to label these substeps as failures, although this lowers the whole success rate.

I have noticed that some works directly count them as failures, while others still evaluate them, in which case the Element Acc is of course 0. (Evaluating them does improve OP scores, though, since the operation can be scored independently of the element.)
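The two conventions could be sketched as below. This is an illustration under assumptions, not the code of any particular paper: the per-step fields `element_correct` and `op_f1` and the flag `score_op_on_empty` are hypothetical names.

```python
def aggregate(steps, score_op_on_empty):
    """Average element accuracy and operation score over steps.

    steps: list of dicts with "pos_candidates" (list), "element_correct"
    (bool) and "op_f1" (float); all field names are assumptions.
    score_op_on_empty: if True, steps without a ground-truth element still
    contribute their operation score; if False, they count as full failures.
    """
    if not steps:
        return {"element_acc": 0.0, "op_f1": 0.0}
    element_total = 0.0
    op_total = 0.0
    for step in steps:
        empty = not step["pos_candidates"]
        # No ground-truth element exists, so element accuracy is always 0 here.
        element_total += 0.0 if empty else float(step["element_correct"])
        if not empty or score_op_on_empty:
            op_total += step["op_f1"]
    n = len(steps)
    return {"element_acc": element_total / n, "op_f1": op_total / n}


toy_steps = [
    {"pos_candidates": ["el_1"], "element_correct": True, "op_f1": 1.0},
    {"pos_candidates": [], "element_correct": False, "op_f1": 0.5},
]
strict = aggregate(toy_steps, score_op_on_empty=False)
lenient = aggregate(toy_steps, score_op_on_empty=True)
print(strict, lenient)
```

In this toy case the element accuracy is the same under both conventions, and only the operation score changes.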

If you have any other questions, feel free to let us know.