cooelf / Auto-GUI

Official implementation for "You Only Look at Screens: Multimodal Chain-of-Action Agents" (Findings of ACL 2024)
https://arxiv.org/abs/2309.11436
Apache License 2.0

What are the BLIP-2 feature extractor details? #9

Closed · truebit closed this 9 months ago

truebit commented 9 months ago

Thanks for the work. I'd like to run inference with this model on custom images and goals, so I tried to write the inference code myself.

However, I found that the obj file unpickles each image as a tensor, so I'd like to know what conversion method is used to load the images.

According to utils_data.py, image_ids is loaded via image_ids = torch.tensor(source_image).squeeze(). According to the paper: "Given a screenshot $X_{\text{screen}} \in \mathbb{R}^{h \times w \times 3}$ with height $h$ and width $w$ at step $t \in [1, k]$, we first feed it to a frozen image encoder (e.g., BLIP-2 (Li et al., 2023)) and extract vision features $H_{\text{screen}} \in \mathbb{R}^{1 \times d_s}$, where $d_s$ is the dimension of the vision features."

So I believe the images are pickled after their vision features have been extracted into a tensor. But there are no details on how this is done, nor on which BLIP-2 model is used for feature extraction.
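For anyone trying the same thing on custom screenshots, here is a minimal sketch of what such a pre-extraction step could look like. It assumes the Hugging Face Salesforce/blip2-opt-2.7b checkpoint and mean pooling of the frozen vision encoder's outputs; both are my assumptions, not the repo's actual fetch_features.py:

```python
# Sketch only (not the repo's fetch_features.py): extract a per-screenshot
# feature tensor with a frozen BLIP-2 vision encoder and pickle it.
# The checkpoint and the mean pooling are assumptions.
import pickle

import torch
from PIL import Image
from transformers import Blip2Model, Blip2Processor

device = "cuda" if torch.cuda.is_available() else "cpu"
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2Model.from_pretrained("Salesforce/blip2-opt-2.7b").to(device).eval()

image = Image.open("screenshot.png").convert("RGB")  # hypothetical input path
inputs = processor(images=image, return_tensors="pt").to(device)

with torch.no_grad():
    vision_out = model.get_image_features(**inputs)        # frozen ViT encoder
    feats = vision_out.last_hidden_state.mean(dim=1)       # shape (1, d_s) after pooling

# Pickled features can then be reloaded the way utils_data.py does:
#   image_ids = torch.tensor(source_image).squeeze()
with open("screenshot_feats.obj", "wb") as f:
    pickle.dump(feats.cpu().numpy(), f)
```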

truebit commented 9 months ago

Found it in fetch_features.py. But the model used in the code (Salesforce/blip2-opt-2.7b) is not the one the paper states:

The vision features are obtained by the frozen BLIP-2 encoder (Li et al., 2023) (version: blip2_t5_instruct).
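If anyone wants to follow the paper's wording instead, a rough sketch of loading blip2_t5_instruct through LAVIS and pooling the frozen ViT features is below; the model_type ("flant5xl") and the pooling are my guesses and are not confirmed by the repo:

```python
# Sketch only: load the paper's stated encoder (blip2_t5_instruct) via LAVIS
# instead of the HF Salesforce/blip2-opt-2.7b checkpoint used in fetch_features.py.
# model_type "flant5xl" and the mean pooling are assumptions.
import torch
from PIL import Image
from lavis.models import load_model_and_preprocess

device = "cuda" if torch.cuda.is_available() else "cpu"
model, vis_processors, _ = load_model_and_preprocess(
    name="blip2_t5_instruct", model_type="flant5xl", is_eval=True, device=device
)

image = Image.open("screenshot.png").convert("RGB")  # hypothetical input path
image_tensor = vis_processors["eval"](image).unsqueeze(0).to(device)

# maybe_autocast handles the fp16 ViT that LAVIS loads by default on GPU.
with torch.no_grad(), model.maybe_autocast():
    vit_feats = model.ln_vision(model.visual_encoder(image_tensor))  # (1, patches, d)
    feats = vit_feats.mean(dim=1)                                    # (1, d_s)
```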