cooelf / Auto-GUI

Official implementation for "You Only Look at Screens: Multimodal Chain-of-Action Agents" (Findings of ACL 2024)
https://arxiv.org/abs/2309.11436
Apache License 2.0

any inference code or something to check the model #8

Open Occupying-Mars opened 9 months ago

truebit commented 9 months ago

After some investigation, I replicated the inference code using the same goal with one or more supplied screenshots and the action history. It does not work very well in zero-shot situations.
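For reference, here is a rough sketch of how the text input could be assembled from the goal and the action history. The `Previous Actions: ... Goal: ...` template and the AITW-style action fields below are assumptions based on the dataset format, not the repo's exact prompt; check the `fetch_dataset` scripts for the real template.

```python
# Hypothetical sketch of prompt construction: serialize the goal and the
# previous actions into a single text input. Field names follow the AITW
# action format; the exact template used by Auto-UI may differ.

def build_prompt(goal, history):
    """Serialize the goal and previous actions into one text prompt."""
    previous = []
    for act in history:
        previous.append(
            f"action_type: {act['action_type']}, "
            f"touch_point: {act.get('touch_point')}, "
            f"lift_point: {act.get('lift_point')}, "
            f"typed_text: {act.get('typed_text', '')}"
        )
    history_str = " ".join(previous) if previous else "None"
    return f"Previous Actions: {history_str} Goal: {goal}"


# Example usage with a dummy one-step history:
prompt = build_prompt(
    "How to login?",
    [{"action_type": "DUAL_POINT", "touch_point": "[0.5, 0.5]",
      "lift_point": "[0.5, 0.5]", "typed_text": ""}],
)
print(prompt)
```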

YiDa858 commented 8 months ago

@truebit Can you publish your inference code? I would appreciate it!

kirtishrinkhala commented 8 months ago

@truebit Please share the inference code if possible.

kirtishrinkhala commented 8 months ago

I have been working on writing the inference code; here is what I have achieved so far. I wrote a function to produce the processed input for an image and the goal. However, I am now unsure how to use that as input to a pretrained model.

This is the code that I wrote to process the image file and the goal:

```python
import argparse
import json
import pickle

import torch
from PIL import Image
from transformers import AutoProcessor, Blip2Model

# Load the BLIP-2 vision model used to extract image features for each screenshot.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = Blip2Model.from_pretrained("Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16)
model.to(device)
processor = AutoProcessor.from_pretrained("Salesforce/blip2-opt-2.7b")


def parse_image(image_file_path):
    """Build a single-episode record containing the goal and BLIP-2 image features."""
    goal = "How to login?"
    step_id = "123"
    output_ep = {
        "goal": goal,
        "step_id": step_id,
    }

    img = Image.open(image_file_path)
    image_height = img.height  # kept for reference; not used further here
    image_width = img.width

    # Extract pooled BLIP-2 image features without tracking gradients.
    with torch.no_grad():
        inputs = processor(images=img, return_tensors="pt").to(device, torch.float16)
        image_features = model.get_image_features(**inputs).pooler_output[0]
        image_features = image_features.detach().cpu()
    output_ep["image"] = image_features

    output = [output_ep]
    parsed_episode = [{"episode_id": 123, "data": output}]
    return parsed_episode


def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument('--dataset', type=str, default='general')
    parser.add_argument('--split_file', type=str, default='dataset/general_texts_splits.json')
    parser.add_argument('--output_dir', type=str, default='dataset')
    parser.add_argument('--get_images', default=True, action='store_true')
    parser.add_argument('--get_annotations', default=True, action='store_true')
    parser.add_argument('--get_actions', default=True, action='store_true')
    parser.add_argument('--file_path', type=str, default='sample.png')

    args = parser.parse_args()
    return args


if __name__ == '__main__':
    args = parse_args()
    print('====Input Arguments====')
    print(json.dumps(vars(args), indent=2, sort_keys=False))

    all_parsed_episode = parse_image(args.file_path)

    # Pickle the parsed episode in the same layout as the dataset files.
    with open(f"{args.output_dir}_test_val.obj", "wb") as wp:
        pickle.dump(all_parsed_episode, wp)
```
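For the missing step (feeding the processed record into a pretrained model), a rough sketch might look like the following. The `T5ForMultimodalGeneration` class name comes from this repo's `model.py`, but the checkpoint path, the prompt template, and the `image_ids` keyword are assumptions on my part and may need adjusting against the actual training code.

```python
# Hypothetical sketch: load an Auto-UI checkpoint and decode an action for the
# record produced by parse_image() above. Checkpoint path, constructor
# arguments, and the image feature keyword are placeholders.
import torch
from transformers import AutoTokenizer

from model import T5ForMultimodalGeneration  # assumed to be this repo's model.py

checkpoint_dir = "path/to/Auto-UI-Base"      # placeholder: your downloaded checkpoint

tokenizer = AutoTokenizer.from_pretrained(checkpoint_dir)
# The real constructor may require extra arguments (e.g. image feature dim);
# check how the training script instantiates the model.
auto_ui = T5ForMultimodalGeneration.from_pretrained(checkpoint_dir).eval()

episode = parse_image("sample.png")[0]["data"][0]
prompt = f"Goal: {episode['goal']}"          # plus serialized action history, if any

inputs = tokenizer(prompt, return_tensors="pt")
# Pooled BLIP-2 features are float16 here; cast if the checkpoint expects float32.
image_features = episode["image"].unsqueeze(0).float()

with torch.no_grad():
    generated = auto_ui.generate(
        input_ids=inputs.input_ids,
        attention_mask=inputs.attention_mask,
        image_ids=image_features,            # keyword name is a guess; check model.py
        max_length=128,
    )
print(tokenizer.decode(generated[0], skip_special_tokens=True))
```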

Jiayi-Pan commented 5 months ago

Hi friends,

We’ve got AutoUI running and tested its end-to-end performance in our recent paper. You can find the inference code here

https://github.com/Berkeley-NLP/Agent-Eval-Refine/tree/main/exps/android_exp/models/Auto-UI

Yingrjimsch commented 4 months ago

> Hi friends,
>
> We’ve got AutoUI running and tested its end-to-end performance in our recent paper. You can find the inference code here
>
> https://github.com/Berkeley-NLP/Agent-Eval-Refine/tree/main/exps/android_exp/models/Auto-UI

Great job, thanks, I will try that 👍 Any insights into how well it works for zero-shot approaches?