Thanks for your attention to our work.
We used the PLIP model code directly for feature extraction and did not do any dimension mapping.
In my actual processing, I found that PLIP produces 512-dimensional image features. Here is the relevant part of the code and the logs:
extract_features_fp.py:
# Load the PLIP wrapper and report the trainable parameter count (in millions)
model = PLIPM('vinid/plip')
n_parameters = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(str(n_parameters / 1000**2))

# CLIP/PLIP normalization statistics and the evaluation transforms
mean, std = (0.48145466, 0.4578275, 0.40821073), (0.26862954, 0.26130258, 0.27577711)
eval_transforms = transforms.Compose([transforms.ToTensor(), transforms.Normalize(mean=mean, std=std)])
plip.py:
import torch
from transformers import CLIPModel

# Thin wrapper that returns PLIP's projected image features
class PLIPM(torch.nn.Module):
    def __init__(self, model_name):
        super(PLIPM, self).__init__()
        self.model = CLIPModel.from_pretrained(model_name, use_auth_token=None)

    def forward(self, input):
        # vision tower + visual_projection -> (batch, 512)
        return self.model.get_image_features(input)
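As a quick sanity check of this wrapper, one can push a dummy patch through eval_transforms and PLIPM and confirm the 512-dim output reported in the logs below. A minimal sketch (the blank PIL image is only a stand-in for a real tissue patch):

from PIL import Image
import torch

plip = PLIPM('vinid/plip')
plip.eval()
patch = Image.new('RGB', (224, 224))           # stand-in for a real 224x224 patch
batch = eval_transforms(patch).unsqueeze(0)    # shape: (1, 3, 224, 224)
with torch.no_grad():
    feats = plip(batch)
print(feats.shape)  # expected: torch.Size([1, 512])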
Logs:
feature extraction settings
target patch size: (224, 224)
pretrained: True
transformations: Compose(
ToTensor()
Normalize(mean=(0.48145466, 0.4578275, 0.40821073), std=(0.26862954, 0.26130258, 0.27577711))
)
processing /data/tangwenhao/tcga/brca/patches/TCGA-3C-AALI-01Z-00-DX1.F6E9A5DF-D8FB-45CF-B4BD-C6B76294C291.h5: total of 10 batches
batch 0/10, 0 files processed
computing features for /data/tangwenhao/tcga/brca/feats/convnext/h5_files/TCGA-3C-AALI-01Z-00-DX1.F6E9A5DF-D8FB-45CF-B4BD-C6B76294C291.h5 took 131.58712887763977 s
features size: (10126, 512)
coordinates size: (10126, 2)
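If it helps, the stored dimensionality can also be double-checked directly from the saved h5 file. A small sketch, assuming the file uses 'features' and 'coords' dataset names (the names are an assumption here, suggested but not confirmed by the log output):

import h5py

# NOTE: the dataset names 'features' and 'coords' are assumed, not confirmed
path = '/data/tangwenhao/tcga/brca/feats/convnext/h5_files/TCGA-3C-AALI-01Z-00-DX1.F6E9A5DF-D8FB-45CF-B4BD-C6B76294C291.h5'
with h5py.File(path, 'r') as f:
    print(f['features'].shape)  # expected: (10126, 512)
    print(f['coords'].shape)    # expected: (10126, 2)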
Thank you.
I understand now. You used the features after the pretrained projection layer.
When using the HuggingFace API, the PLIP vision model (with projection) can be loaded as follows:
from transformers import CLIPVisionModelWithProjection
model = CLIPVisionModelWithProjection.from_pretrained("vinid/plip")
image_features = model(batch_input).image_embeds
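For reference, the 768-to-512 mapping lives in the checkpoint's visual_projection layer. A minimal sketch to confirm this, assuming the vinid/plip config follows the standard HF CLIP layout (the values in the comments are the expected outputs):

from transformers import CLIPModel

model = CLIPModel.from_pretrained("vinid/plip")
print(model.config.vision_config.hidden_size)  # 768: pooled ViT output size
print(model.config.projection_dim)             # 512: shared embedding dimension
print(model.visual_projection)                 # Linear(in_features=768, out_features=512, bias=False)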
Thank you very much. But I wonder: what is the difference between this way of loading and the following code provided in the official repository?
from transformers import CLIPModel

model = CLIPModel.from_pretrained(name, use_auth_token=auth_token)
image_features = model.get_image_features(**batch)
They are similar. My code is more concise, as it loads only the vision module, while the default code loads both the vision and text modules.
You can verify this in the library code at transformers/models/clip/modeling_clip.py.
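A rough sanity check, assuming both routes load the same vinid/plip checkpoint: they should yield the same 512-dim projected image features.

import torch
from transformers import CLIPModel, CLIPVisionModelWithProjection

full = CLIPModel.from_pretrained("vinid/plip")                              # vision + text towers
vision_only = CLIPVisionModelWithProjection.from_pretrained("vinid/plip")   # vision tower only

dummy = torch.randn(1, 3, 224, 224)  # stand-in for a normalized patch batch
with torch.no_grad():
    a = full.get_image_features(pixel_values=dummy)
    b = vision_only(pixel_values=dummy).image_embeds
print(a.shape, torch.allclose(a, b, atol=1e-5))  # expected: torch.Size([1, 512]) True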
Got it, thanks a lot.
Hello. Thank you for your amazing work.
Originally, the feature from the PLIP vision model is 768-dimensional. How did you map it to 512 dimensions?
Thank you.