GUI-World: A Dataset for GUI-Orientated Multimodal Large Language Models [![Paper](https://img.shields.io/badge/Paper-%F0%9F%8E%93-lightgrey?style=flat-square)](https://arxiv.org/abs/2406.10819) [![Dataset](https://img.shields.io/badge/Dataset-%F0%9F%92%BE-green?style=flat-square)](https://huggingface.co/datasets/shuaishuaicdp/GUi-World) [![Website](https://img.shields.io/badge/Website-%F0%9F%90%BE-green?style=flat-square)](https://gui-world.github.io/)

Updates & News

We will release our benchmark code soon.

[16/06/2024] 📄 Our paper has been released on arXiv!

Contents

- Updates & News
- Dataset: GUI-World
  - Overview
  - How to use GUI-World
- GUI-Vid: A GUI-Oriented VideoLLM
- Contribution
- Acknowledgments
- Citation

Dataset: GUI-World

Overview

GUI-World is a comprehensive benchmark for evaluating MLLMs in dynamic and complex GUI environments. It features extensive annotations covering six GUI scenarios and eight types of GUI-oriented questions. We use it to assess state-of-the-art ImageLLMs and VideoLLMs, and the results highlight their limitations on dynamic and multi-step tasks. The dataset offers a foundation for future research on improving how MLLMs understand and interact with dynamic GUI content, with the aim of advancing robust GUI agents that can perceive and interact with both static and dynamic GUI elements.

How to use GUI-World

GUI-World is split into a train set and a test set, both of which can be accessed from Hugging Face (https://huggingface.co/datasets/shuaishuaicdp/GUi-World).
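One way to fetch the dataset is through the huggingface_hub package. The sketch below is only an illustration: it assumes the repository id shuaishuaicdp/GUi-World (taken from the dataset badge at the top of this README), and the helper names download_gui_world and count_files_by_suffix are hypothetical, not part of this repository. Because the on-disk layout is not described here, the second helper only counts files by extension as a sanity check after the download.

```python
from collections import Counter
from pathlib import Path

# Repository id taken from the dataset link at the top of this README.
REPO_ID = "shuaishuaicdp/GUi-World"


def download_gui_world(local_dir: str) -> str:
    """Download a snapshot of the dataset repo and return its local path."""
    # Imported lazily so the rest of this module works without huggingface_hub.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=REPO_ID, repo_type="dataset", local_dir=local_dir)


def count_files_by_suffix(root: str) -> Counter:
    """Count files under root by lowercase extension (e.g. '.json', '.mp4')."""
    return Counter(p.suffix.lower() for p in Path(root).rglob("*") if p.is_file())


# Example usage (the target directory name is an assumption):
#   root = download_gui_world("./GUI-World-data")
#   print(count_files_by_suffix(root).most_common())
```

The returned path is the directory you would then use as the dataset root when configuring finetuning.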

GUI-Vid: A GUI-Oriented VideoLLM

GUI-Vid is a VideoLLM finetuned from Videochat2. You can reproduce our experimental results by following the instructions below.

Prepare the Environment

git clone https://github.com/Dongping-Chen/GUI-World.git
cd GUI-World/GUI_Vid
conda create -n gui python=3.9
conda activate gui
pip install -r requirements.txt

GUI-Oriented Finetuning

Download GUI-World and set the root path in GUI_Vid/configs/instruction_data.py to the root directory of your downloaded copy of GUI-World.
Set vit_blip_model_path, llama_model_path and videochat2_model_path in scripts/videochat_vicuna/config_7b_stage3.py. These checkpoints can be downloaded from GUI-Vid.

# Vicuna
bash scripts/videochat_vicuna/run_7b_stage3.sh
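Before launching the training script, it can help to confirm that the three checkpoint paths actually exist. The following is a minimal sketch under stated assumptions: the setting names come from the step above, but the path values are placeholders, and the helpers missing_checkpoints and report are hypothetical and do not read the real config file.

```python
from pathlib import Path

# The three settings named in the finetuning step; the values are placeholders
# that you would replace with the paths you put in config_7b_stage3.py.
CHECKPOINTS = {
    "vit_blip_model_path": "/path/to/vit_blip_checkpoint",
    "llama_model_path": "/path/to/llama_checkpoint",
    "videochat2_model_path": "/path/to/videochat2_checkpoint",
}


def missing_checkpoints(paths: dict) -> list:
    """Return the names of settings whose path does not exist on disk."""
    return [name for name, p in paths.items() if not Path(p).exists()]


def report(paths: dict) -> str:
    """Summarize which settings, if any, point at a missing path."""
    missing = missing_checkpoints(paths)
    if missing:
        return "Missing: " + ", ".join(missing)
    return "All checkpoint paths exist."


print(report(CHECKPOINTS))
```

Running this check first surfaces a mistyped path immediately, before the training job has started loading models.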

Inference with GUI-Vid

First, download the checkpoint from Hugging Face. You also need to set up the config following the guidance in Videochat2. Then set the model_path in scripts/demo_local.py. Use the following script to run inference on a GUI video:

python demo_local.py \
--ckpt_path <path to GUI-Vid> \
--keyframe 8 \
--video_path <path to your video> \
--qs <your query>
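Queries usually contain spaces and punctuation, so they need quoting when passed on the command line. As a hedged sketch, the same invocation can be assembled in Python as an argument list, which avoids shell quoting issues. The helper build_demo_command and the example paths are hypothetical; only the flag names come from the command above.

```python
import shlex


def build_demo_command(ckpt_path: str, video_path: str, query: str, keyframe: int = 8) -> list:
    """Assemble the demo_local.py invocation as an argument list."""
    return [
        "python", "demo_local.py",
        "--ckpt_path", ckpt_path,
        "--keyframe", str(keyframe),
        "--video_path", video_path,
        "--qs", query,  # passed as one argument, even if it contains spaces
    ]


# Placeholder paths and query, for illustration only.
cmd = build_demo_command("ckpt/gui_vid.pth", "videos/demo.mp4", "What did the user click?")

# shlex.join renders a copy-pasteable, correctly quoted shell command.
print(shlex.join(cmd))

# To execute it directly instead of printing:
#   import subprocess
#   subprocess.run(cmd, check=True)
```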

Contribution

Contributions to this project are welcome. Please consider the following ways to contribute:

- Proposing new features or improvements
- Benchmarking other mainstream MLLMs

Acknowledgments

Many thanks to Yinuo Liu, Zhengyan Fu, Shilin Zhang, Yu, Tianhe Gu, Haokuan Yuan, and Junqi Wang for their invaluable effort in this project. This project is based on the methodologies and code presented in Videochat2.

Citation

@misc{chen2024guiworld,
      title={GUI-WORLD: A Dataset for GUI-oriented Multimodal LLM-based Agents}, 
      author={Dongping Chen and Yue Huang and Siyuan Wu and Jingyu Tang and Liuyi Chen and Yilin Bai and Zhigang He and Chenlong Wang and Huichi Zhou and Yiqiang Li and Tianshuo Zhou and Yue Yu and Chujie Gao and Qihui Zhang and Yi Gui and Zhen Li and Yao Wan and Pan Zhou and Jianfeng Gao and Lichao Sun},
      year={2024},
      eprint={2406.10819},
      archivePrefix={arXiv},
}
