The official PyTorch implementation for MSCPT.
[Question] #2

bryanwong17 opened 1 week ago

bryanwong17 commented 1 week ago

Hi, I would like to know, when calculating the similarity matrix for image-text similarity in Graph Prompt Tuning, is the similarity score calculated between each patch and individual sub-descriptions (e.g., p1 vs D1, p1 vs D2, p1 vs D3) or between each patch and the concatenated descriptions (e.g., p1 vs D, where D is the concatenation of D1, D2, and D3)? Thanks!

Hanminghao commented 1 week ago

Assuming there are $K$ cancer subtypes, each with $C$ descriptions, we calculate the similarity of each patch to all $K \times C$ descriptions. This results in a similarity matrix $S$ with dimensions $M \times (K \times C)$, where $M$ is the number of patches. So your first interpretation is correct: each patch is compared with each individual description, not with a concatenation.
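For readers following along, here is a minimal sketch of that computation. It is not taken from the MSCPT code: the function name, the random features, and the sizes are placeholders, and it assumes cosine similarity over L2-normalized features that already share one embedding dimension.

```python
import torch
import torch.nn.functional as F

def similarity_matrix(patch_feats: torch.Tensor, desc_feats: torch.Tensor) -> torch.Tensor:
    """Cosine similarity of every patch against every individual description.

    patch_feats: (M, d) patch embeddings
    desc_feats:  (K, C, d) embeddings of C descriptions for each of K subtypes
    returns:     (M, K * C) similarity matrix
    """
    K, C, d = desc_feats.shape
    patches = F.normalize(patch_feats, dim=-1)
    # Flatten subtypes and descriptions: column k * C + c is description c of subtype k.
    descs = F.normalize(desc_feats.reshape(K * C, d), dim=-1)
    return patches @ descs.T

torch.manual_seed(0)
M, K, C, d = 5, 2, 3, 8  # hypothetical sizes, for illustration only
S = similarity_matrix(torch.randn(M, d), torch.randn(K, C, d))
print(S.shape)  # torch.Size([5, 6])
```

Keeping the descriptions separate, rather than concatenating them into one text per subtype, is what gives the matrix its $K \times C$ columns.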

bryanwong17 commented 1 week ago

Got it. Thanks for the clarification! I also wonder how you make the dimensions of the visual and text embeddings match when calculating the similarity.

Hanminghao commented 1 week ago

CLIP and PLIP each have a projection head that maps the visual and text embeddings into a shared space. See lines 188 and 234 of the mscpt.py file.
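As a rough illustration of the idea (not the code at those lines), the sketch below projects visual and text features of different widths into one shared space before taking cosine similarity. The class name and the dimensions 768, 512, and 512 are assumptions chosen for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProjectionHeads(nn.Module):
    """Hypothetical CLIP-style heads mapping both modalities to one embedding size."""

    def __init__(self, vis_dim: int = 768, txt_dim: int = 512, embed_dim: int = 512):
        super().__init__()
        self.visual_proj = nn.Linear(vis_dim, embed_dim, bias=False)
        self.text_proj = nn.Linear(txt_dim, embed_dim, bias=False)

    def forward(self, vis_feats: torch.Tensor, txt_feats: torch.Tensor) -> torch.Tensor:
        # Project, then L2-normalize so the dot product is a cosine similarity.
        v = F.normalize(self.visual_proj(vis_feats), dim=-1)
        t = F.normalize(self.text_proj(txt_feats), dim=-1)
        return v @ t.T  # (num_patches, num_texts)

heads = ProjectionHeads()
sim = heads(torch.randn(4, 768), torch.randn(6, 512))
print(sim.shape)  # torch.Size([4, 6])
```

In a pretrained CLIP or PLIP checkpoint these projection weights are already learned, so the two encoders' outputs are directly comparable after projection.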

bryanwong17 commented 1 week ago

Got it! From my understanding, you applied zero-shot classification to filter out patches that do not correspond to the class/category. How many patches did you select at low and high magnification? I could only find in the paper that you selected 30 patches at 5x magnification. Another question: did you apply zero-shot filtering only during training, or during both training and inference? Thank you!

Hanminghao commented 6 days ago

At low magnification, we used the zero-shot capability of the model to select 30 patches for each cancer subtype, while at high magnification we used all patches. As for your other question, we applied zero-shot filtering during both training and inference.

bryanwong17 commented 2 days ago

Hi, thank you for the confirmation! After reviewing your code in select_5x_pic.py, particularly the section below, is it correct that you retrieve the top-K patches for all WSI classes? For example, if WSI A is LUAD, would you retrieve the top-K patches for both LUAD and LUSC, or only for LUAD? Thanks!

logits = text_feats @ img_feats.T
if args.top_k > img_feats.shape[0]:
    # Fewer patches than top_k: take all of them
    topk_values, topk_indices = torch.topk(logits, img_feats.shape[0], dim=1)
else:
    topk_values, topk_indices = torch.topk(logits, args.top_k, dim=1)
pred = topk_values.sum(dim=1).argmax().cpu().item()
select_id = topk_indices.flatten().cpu().numpy()
coord = coords[select_id]
df.loc[idx, 'pred'] = 1 if pred == category else 0  # assign the value
for idx, (x, y) in enumerate(coord):
    big_img = wsi.read_region((x, y), patch_level, (patch_size, patch_size)).convert('RGB')
    big_img = big_img.resize((224, 224))
    big_img.save(os.path.join(save_path, f"{idx}_{x}_{y}.png"))

Hanminghao commented 2 days ago

We will retrieve the top-K patches for all cancer subtypes, because at this time we do not know what type the current WSI is.
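A compact sketch of that per-subtype selection is shown below. It follows the shape of the snippet quoted earlier in this thread, but the function name and the random features are placeholders, and `text_feats` is assumed to hold one row per cancer subtype.

```python
import torch

def select_topk_per_class(text_feats: torch.Tensor, img_feats: torch.Tensor, top_k: int):
    """Pick the top_k most similar patches for every class, without using the WSI label."""
    logits = text_feats @ img_feats.T        # (num_classes, num_patches)
    k = min(top_k, img_feats.shape[0])       # clamp when the slide has fewer patches
    topk_values, topk_indices = torch.topk(logits, k, dim=1)  # each (num_classes, k)
    # Zero-shot prediction: the class whose top patches score highest in total.
    pred = topk_values.sum(dim=1).argmax().item()
    return topk_indices, pred

torch.manual_seed(0)
text_feats = torch.randn(2, 16)   # e.g. LUAD and LUSC prompts (random stand-ins)
img_feats = torch.randn(10, 16)   # 10 patches from one slide (random stand-ins)
idx, pred = select_topk_per_class(text_feats, img_feats, top_k=30)
print(idx.shape)  # torch.Size([2, 10])
```

Because the selection runs for every class row, the same procedure can be applied at inference time, when the slide label is unknown.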

bryanwong17 commented 6 hours ago

I see. Does this mean that for a LUAD WSI, you select the top-K patches for both LUAD and LUSC? If so, doesn't a LUAD WSI contain only LUAD and normal patches (no LUSC)? I assume the top-K patches most similar to the LUSC class text would then be normal patches?

Also, have you noticed that the coordinates of the top-K patches overlap between categories? Let's say we take the top 30 patches for each of LUAD and LUSC in every WSI. After removing duplicates, the total could be fewer than 60.
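To make the overlap concrete, here is a small sketch with made-up patch indices. It merges the per-class selections while dropping repeats; the helper name and the index values are hypothetical and not from the repository.

```python
def merge_selected(indices_per_class):
    """Union of per-class patch indices, keeping first-seen order."""
    seen = set()
    merged = []
    for class_indices in indices_per_class:
        for i in class_indices:
            if i not in seen:
                seen.add(i)
                merged.append(i)
    return merged

luad_top = [3, 7, 1, 9]  # hypothetical top-4 patch indices for LUAD
lusc_top = [7, 2, 3, 5]  # hypothetical top-4 patch indices for LUSC
merged = merge_selected([luad_top, lusc_top])
print(merged)       # [3, 7, 1, 9, 2, 5]
print(len(merged))  # 6, fewer than 4 + 4 = 8
```

Patches 3 and 7 rank in the top 4 for both classes here, so the two lists of four yield only six unique patches.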