Farama-Foundation / miniwob-plusplus

MiniWoB++: a web interaction benchmark for reinforcement learning
https://miniwob.farama.org/
MIT License

[Question] The format of 12K human demonstration #86

Closed njucckevin closed 4 months ago

njucckevin commented 11 months ago

Question

Hi, I'm confused by the human demonstrations provided in https://github.com/stanfordnlp/miniwob-plusplus-demos. These demonstrations seem messy: a single trajectory can contain dozens (e.g. 20+) of states with raw mouse up/down and keyboard up/down events. Is there any way to get cleaned or simplified actions, e.g. {'action': 'click', 'ref': '6'}, {'action': 'type', 'ref': '10', 'typed_text': 'John'}? I want to use these 12k demonstrations to supervised-finetune my own model.

Thanks a lot!

ppasupat commented 11 months ago

The demonstrations in that repository record the raw JavaScript events. A single mouse click, for example, is recorded as a separate mouse down and mouse up.

In the project I was involved in (Workflow-Guided Exploration), we converted the MiniWoB demonstrations into a graph structure. The method _parse_raw_demo_original is probably close to what you want (though it probably won't work out of the box; the code is pretty old).
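As a starting point before the official conversion code lands, raw event streams like these can be collapsed heuristically: merge each mousedown/mouseup pair into one click, and fold consecutive keypress events into a single type action. The sketch below assumes each event is a dict with `type`, `x`, `y`, and `charCode` fields; the actual field names in the demo JSON may differ, so verify against the files before relying on it.

```python
# Hedged sketch: collapse raw browser events into simplified actions
# such as {'action': 'click', ...} and {'action': 'type', ...}.
# Field names ('type', 'x', 'y', 'charCode') are assumptions about the
# raw demonstration format -- check them against the actual JSON files.

def simplify_events(events):
    """Collapse raw mouse/keyboard events into high-level actions."""
    actions = []
    typed = []  # buffer of characters from consecutive keypress events

    def flush_typed():
        if typed:
            actions.append({"action": "type", "text": "".join(typed)})
            typed.clear()

    for ev in events:
        etype = ev.get("type")
        if etype == "keypress" and "charCode" in ev:
            typed.append(chr(ev["charCode"]))
        elif etype == "mousedown":
            flush_typed()
            # A mousedown + mouseup pair becomes one click; we emit the
            # click on mousedown and simply drop the matching mouseup.
            actions.append({"action": "click",
                            "x": ev.get("x"), "y": ev.get("y")})
        # mouseup, keydown, keyup, mousemove, etc. add no information
        # to this simplified view, so they are ignored.
    flush_typed()
    return actions
```

This only recovers coordinate-level clicks and typed text; mapping clicks back to DOM element refs (as in the question's desired format) additionally requires matching `(x, y)` against the recorded DOM snapshot for that state.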

There is also the paper Understanding HTML with Large Language Models, whose authors trained a model using the demonstrations, though I don't know where their code is.

In any case, I have created a feature request for the conversion code (#87).

jkterry1 commented 4 months ago

Closing in favor of #87