DearCaat / RRT-MIL

[CVPR 2024] Feature Re-Embedding: Towards Foundation Model-Level Performance in Computational Pathology

PLIP feature dimension #5

Closed: anhtienng closed this issue 8 months ago

anhtienng commented 8 months ago

Hello. Thank you for your amazing work.

Originally, the features from the PLIP vision model are 768-dimensional. How did you map them to 512 dimensions?

Thank you.

DearCaat commented 8 months ago

Thanks for your attention to our work.

We used the PLIP model code directly for feature extraction and did not apply any dimension mapping. In my own processing, the image features extracted by PLIP are 512-dimensional. Here is the relevant part of the code and logs:

extract_features_fp.py:

from torchvision import transforms
from plip import PLIPM  # assuming the wrapper class shown below lives in plip.py

model = PLIPM('vinid/plip')
# report the number of trainable parameters (in millions)
n_parameters = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(str(n_parameters/1000**2))
# standard CLIP normalization statistics, also used by PLIP
mean, std = (0.48145466, 0.4578275, 0.40821073), (0.26862954, 0.26130258, 0.27577711)
eval_transforms = transforms.Compose([transforms.ToTensor(), transforms.Normalize(mean=mean, std=std)])

plip.py:

import torch
from transformers import CLIPModel

class PLIPM(torch.nn.Module):
    def __init__(self, model_name):
        super(PLIPM, self).__init__()
        # load the full PLIP (CLIP) checkpoint from the HuggingFace Hub
        self.model = CLIPModel.from_pretrained(model_name, use_auth_token=None)

    def forward(self, input):
        # get_image_features returns the projected image embeddings
        return self.model.get_image_features(input)
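
For context, here is a quick shape check of this wrapper with a dummy batch (a minimal sketch; in the actual pipeline the patches come from the H5 loader and the eval_transforms above):

import torch
from plip import PLIPM  # the wrapper class shown above

model = PLIPM('vinid/plip')
dummy_batch = torch.randn(4, 3, 224, 224)  # 4 preprocessed 224x224 patches
with torch.no_grad():
    feats = model(dummy_batch)
print(feats.shape)  # torch.Size([4, 512]), i.e. the projected CLIP embedding size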

Logs:

feature extraction settings
target patch size:  (224, 224)
pretrained:  True
transformations:  Compose(
    ToTensor()
    Normalize(mean=(0.48145466, 0.4578275, 0.40821073), std=(0.26862954, 0.26130258, 0.27577711))
)
processing /data/tangwenhao/tcga/brca/patches/TCGA-3C-AALI-01Z-00-DX1.F6E9A5DF-D8FB-45CF-B4BD-C6B76294C291.h5: total of 10 batches
batch 0/10, 0 files processed

computing features for /data/tangwenhao/tcga/brca/feats/convnext/h5_files/TCGA-3C-AALI-01Z-00-DX1.F6E9A5DF-D8FB-45CF-B4BD-C6B76294C291.h5 took 131.58712887763977 s
features size:  (10126, 512)
coordinates size:  (10126, 2)
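
If you want to double-check the saved features yourself, the H5 file can be inspected directly (a minimal sketch assuming the CLAM-style dataset keys 'features' and 'coords'; the path is illustrative):

import h5py

# hypothetical path to one of the extracted slide files
with h5py.File("/path/to/feats/h5_files/slide.h5", "r") as f:
    print(f["features"].shape)  # e.g. (10126, 512)
    print(f["coords"].shape)    # e.g. (10126, 2)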

anhtienng commented 8 months ago

Thank you.

I understand it now. You used the features after the pretrained projection layer.
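
For anyone else reading this: get_image_features runs the 768-dim vision tower and then applies the learned visual_projection layer, which maps into the 512-dim shared embedding space. A minimal sketch of what happens internally (the random input is only for illustration):

import torch
from transformers import CLIPModel

model = CLIPModel.from_pretrained("vinid/plip")
pixel_values = torch.randn(2, 3, 224, 224)  # dummy preprocessed batch

with torch.no_grad():
    pooled = model.vision_model(pixel_values=pixel_values).pooler_output  # (2, 768)
    projected = model.visual_projection(pooled)                           # (2, 512)
    same = model.get_image_features(pixel_values=pixel_values)            # (2, 512)

print(pooled.shape, projected.shape, torch.allclose(projected, same))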

anhtienng commented 8 months ago

When using the HuggingFace API, the PLIP vision model (with projection) can be loaded as follows:

from transformers import CLIPVisionModelWithProjection

model = CLIPVisionModelWithProjection.from_pretrained("vinid/plip")
image_features = model(batch_input).image_embeds

DearCaat commented 8 months ago

> When using the HuggingFace API, the PLIP vision model (with projection) can be loaded as follows:
>
> from transformers import CLIPVisionModelWithProjection
>
> model = CLIPVisionModelWithProjection.from_pretrained("vinid/plip")
> image_features = model(batch_input).image_embeds

Thank you very much. But I wonder: what is the difference between this way of loading and the following code provided in the official repository?

from transformers import CLIPModel

model = CLIPModel.from_pretrained(name, use_auth_token=auth_token)
image_features = model.get_image_features(**batch)

anhtienng commented 8 months ago

They are equivalent in terms of the extracted features. My code is just more concise, loading only the vision module, while the default code loads both the vision and text modules. You can verify this in the library code at transformers/models/clip/modeling_clip.py.
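
As a quick sanity check (a minimal sketch; the random input is only for illustration), the two loading paths produce identical 512-dim embeddings:

import torch
from transformers import CLIPModel, CLIPVisionModelWithProjection

pixel_values = torch.randn(1, 3, 224, 224)  # dummy preprocessed batch

full_model = CLIPModel.from_pretrained("vinid/plip")                        # vision + text towers
vision_only = CLIPVisionModelWithProjection.from_pretrained("vinid/plip")   # vision tower only

with torch.no_grad():
    feats_full = full_model.get_image_features(pixel_values=pixel_values)   # (1, 512)
    feats_vision = vision_only(pixel_values=pixel_values).image_embeds      # (1, 512)

print(torch.allclose(feats_full, feats_vision, atol=1e-6))  # True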

DearCaat commented 8 months ago

Got it, thanks a lot.