Understanding processing of Mind2Web dataset for Lumos grounding

DanielRoeder1 commented 4 months ago

Hello,

I am trying to map the Lumos WebAgent grounding dataset onto the original Mind2Web dataset. Unfortunately, the ids (annotation_id, action_uid) were removed in the Lumos version, but via query extraction and matching I can match 1001/1009 samples to their corresponding Mind2Web entries.
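For reference, the query matching can be sketched roughly as below. This is only a minimal sketch under assumptions: `lumos_queries` is a hypothetical list of task strings already extracted from the Lumos samples, and each Mind2Web entry is assumed to be a dict carrying `confirmed_task` and `annotation_id`. The helper names `normalize` and `match_by_query` are made up for illustration.

```python
import re


def normalize(text):
    """Lowercase and replace every run of non-alphanumeric characters with one space."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def match_by_query(lumos_queries, mind2web_tasks):
    """Map each Lumos sample index to a Mind2Web annotation_id via its task query.

    Assumption: each Mind2Web task is a dict with "confirmed_task" and
    "annotation_id". Queries with zero or several candidates are reported
    as unmatched instead of being guessed.
    """
    index = {}
    for task in mind2web_tasks:
        key = normalize(task["confirmed_task"])
        index.setdefault(key, []).append(task["annotation_id"])

    matched, unmatched = {}, []
    for i, query in enumerate(lumos_queries):
        candidates = index.get(normalize(query), [])
        if len(candidates) == 1:
            matched[i] = candidates[0]
        else:
            unmatched.append(i)
    return matched, unmatched
```

Exact matching after normalization is deliberately strict; samples that end up in `unmatched` can then be inspected by hand or retried with a fuzzier comparison.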

But the problem I am facing now is that Lumos must have done some processing on the actions themselves. Lumos appears to have sometimes more, sometimes fewer actions (i.e. user msgs defining a grounding sentence). Why is this the case? Which processing was applied?

For my work I need a mapping of the Lumos grounding steps (that is the user msgs in the Lumos dataset) to the html_source code found in Mind2Web.

Happy to receive any guidance or advice, and thanks for the great open-source work!

yuchenlin commented 1 month ago

@WadeYin9712 could you please take a look at this issue?

WadeYin9712 commented 1 month ago

Hi Daniel,

Sorry for the late reply! I was pretty busy working on another ongoing project.

The mismatch might be due to the annotation conversion process: sometimes the LLM outputs something in an invalid format, and those outputs are arbitrarily discarded (you can take a look at prompt_convertion.py in the data folder). I wasn't aware of the issue with extra actions, though. It should be simple to filter these out by matching the actions against the original ones in Mind2Web: if an action doesn't appear in Mind2Web, something must have gone wrong, so feel free to remove it.
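A rough sketch of that filtering step, under assumptions: both action lists are assumed to have already been converted into comparable strings (that conversion depends on the two formats and is not shown), and `normalize_action` and `filter_actions` are hypothetical helper names, not part of the Lumos code.

```python
def normalize_action(action):
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return " ".join(action.lower().split())


def filter_actions(lumos_actions, mind2web_actions):
    """Split Lumos actions into those found in Mind2Web and those that are not.

    Assumption: both inputs are lists of strings in a comparable form.
    Returns (kept, dropped), each preserving the original Lumos order.
    """
    reference = {normalize_action(a) for a in mind2web_actions}
    kept, dropped = [], []
    for action in lumos_actions:
        if normalize_action(action) in reference:
            kept.append(action)
        else:
            dropped.append(action)
    return kept, dropped
```

Returning the dropped actions as well makes it easy to check how many extra actions there are before removing them.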

Let me know if you have further questions!

allenai / lumos

Understanding processing of Mind2Web dataset for Lumos grounding #5