cooelf / Auto-GUI

Official implementation for "You Only Look at Screens: Multimodal Chain-of-Action Agents" (Findings of ACL 2024)
https://arxiv.org/abs/2309.11436
Apache License 2.0

What are the BLIP-2 feature extractor details? #9

Closed · truebit closed this 9 months ago

truebit commented 9 months ago

Thanks for the work. I'd like to run inference with this model on custom images and goals, so I tried to write the inference code myself.

However, I found that the obj file unpickles each image as a tensor, so I'd like to know what conversion method is used to load the images.

According to utils_data.py, image_ids is loaded via image_ids = torch.tensor(source_image).squeeze(). According to the paper: "Given a screenshot $X_{\text{screen}} \in \mathbb{R}^{h \times w \times 3}$ with height $h$ and width $w$ at step $t \in [1, k]$, we first feed it to a frozen image encoder (e.g., BLIP-2 (Li et al., 2023)) and extract vision features $H_{\text{screen}} \in \mathbb{R}^{1 \times d_s}$, where $d_s$ is the dimension of the vision features."

So I believe the images are pickled after their vision features have been extracted into a tensor. But there are no details on how this is done, nor on which BLIP-2 model is used for feature extraction.
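For anyone trying the same thing on custom screenshots, here is a minimal sketch of what such a pre-extraction step could look like. It assumes the Hugging Face Salesforce/blip2-opt-2.7b checkpoint and mean pooling of the frozen vision encoder's outputs; both are my assumptions, not the repo's actual fetch_features.py:

```python
# Sketch only (not the repo's fetch_features.py): extract a per-screenshot
# feature tensor with a frozen BLIP-2 vision encoder and pickle it.
# The checkpoint and the mean pooling are assumptions.
import pickle

import torch
from PIL import Image
from transformers import Blip2Model, Blip2Processor

device = "cuda" if torch.cuda.is_available() else "cpu"
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2Model.from_pretrained("Salesforce/blip2-opt-2.7b").to(device).eval()

image = Image.open("screenshot.png").convert("RGB")  # hypothetical input path
inputs = processor(images=image, return_tensors="pt").to(device)

with torch.no_grad():
    vision_out = model.get_image_features(**inputs)        # frozen ViT encoder
    feats = vision_out.last_hidden_state.mean(dim=1)       # shape (1, d_s) after pooling

# Pickled features can then be reloaded the way utils_data.py does:
#   image_ids = torch.tensor(source_image).squeeze()
with open("screenshot_feats.obj", "wb") as f:
    pickle.dump(feats.cpu().numpy(), f)
```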

truebit commented 9 months ago

Found it in fetch_features.py. But the model used in the code (Salesforce/blip2-opt-2.7b) is not the one the paper states:

The vision features are obtained by the frozen BLIP-2 encoder (Li et al., 2023) (version: blip2_t5_instruct).
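If anyone wants to follow the paper's wording instead, a rough sketch of loading blip2_t5_instruct through LAVIS and pooling the frozen ViT features is below; the model_type ("flant5xl") and the pooling are my guesses and are not confirmed by the repo:

```python
# Sketch only: load the paper's stated encoder (blip2_t5_instruct) via LAVIS
# instead of the HF Salesforce/blip2-opt-2.7b checkpoint used in fetch_features.py.
# model_type "flant5xl" and the mean pooling are assumptions.
import torch
from PIL import Image
from lavis.models import load_model_and_preprocess

device = "cuda" if torch.cuda.is_available() else "cpu"
model, vis_processors, _ = load_model_and_preprocess(
    name="blip2_t5_instruct", model_type="flant5xl", is_eval=True, device=device
)

image = Image.open("screenshot.png").convert("RGB")  # hypothetical input path
image_tensor = vis_processors["eval"](image).unsqueeze(0).to(device)

# maybe_autocast handles the fp16 ViT that LAVIS loads by default on GPU.
with torch.no_grad(), model.maybe_autocast():
    vit_feats = model.ln_vision(model.visual_encoder(image_tensor))  # (1, patches, d)
    feats = vit_feats.mean(dim=1)                                    # (1, d_s)
```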