Assuming there are $K$ subtypes of cancer, each with $C$ descriptions, we calculate the similarity of each patch to all $K \times C$ descriptions. This yields a similarity matrix $S$ of dimensions $M \times (K \times C)$, where $M$ is the number of patches. So your first idea was right.
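For concreteness, here is a minimal sketch of that computation (the tensor names `patch_feats` and `text_feats` are illustrative, not the repository's actual variables); it assumes both sets of features have already been projected to the same embedding dimension:

```python
import torch

M, K, C, D = 500, 2, 5, 512  # patches, subtypes, descriptions per subtype, embedding dim

patch_feats = torch.randn(M, D)      # patch embeddings (illustrative)
text_feats = torch.randn(K * C, D)   # one embedding per description

# cosine similarity: L2-normalize, then take dot products
patch_feats = patch_feats / patch_feats.norm(dim=-1, keepdim=True)
text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)

S = patch_feats @ text_feats.T       # shape: (M, K * C)
print(S.shape)                       # torch.Size([500, 10])
```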
Got it. Thanks for the clarification! I also wonder how the dimensions of the visual and text embeddings are made to match when calculating the similarity?
CLIP and PLIP have projection heads that align the visual and text embeddings. See lines 188 and 234 of the mscpt.py file.
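As a rough sketch of what such a projection head does (the layer names and dimensions below are illustrative assumptions, not the actual code in mscpt.py):

```python
import torch
import torch.nn as nn

# Illustrative dimensions: an image encoder with 768-d output and a text
# encoder with 512-d output, both projected into a shared 512-d space.
image_proj = nn.Linear(768, 512, bias=False)
text_proj = nn.Linear(512, 512, bias=False)

img_feat = image_proj(torch.randn(1, 768))  # -> (1, 512)
txt_feat = text_proj(torch.randn(1, 512))   # -> (1, 512)

# Once both live in the same space, cosine similarity is well-defined.
sim = torch.cosine_similarity(img_feat, txt_feat)
```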
Got it! From my understanding, you applied zero-shot filtering to remove patches that do not correspond to the class/category. How many patches did you select at low and high magnifications? I could only find in the paper that you selected 30 patches at 5x magnification. Another question: did you apply zero-shot filtering only during training, or during both training and inference? Thank you!
At low magnification we used the zero-shot capability of the model to select 30 patches for each cancer subtype, while at high magnification we used all patches. As for the other question, we applied zero-shot filtering during both training and inference.
Hi, thank you for the confirmation! After reviewing your code in select_5x_pic.py, particularly the section below, is it correct that you retrieve the top-K patches for all WSI classes? For example, if WSI A is LUAD, would you retrieve the top-K patches for both LUAD and LUSC, or only LUAD? Thanks!
```python
logits = text_feats @ img_feats.T
# If the slide has fewer patches than top_k, take all of them
if args.top_k > img_feats.shape[0]:
    topk_values, topk_indices = torch.topk(logits, img_feats.shape[0], dim=1)
else:
    topk_values, topk_indices = torch.topk(logits, args.top_k, dim=1)
pred = topk_values.sum(dim=1).argmax().cpu().item()  # zero-shot subtype prediction
select_id = topk_indices.flatten().cpu().numpy()     # selected patch indices across all subtypes
coord = coords[select_id]
df.loc[idx, 'pred'] = 1 if pred == category else 0   # record whether the zero-shot prediction is correct
for idx, (x, y) in enumerate(coord):
    big_img = wsi.read_region((x, y), patch_level, (patch_size, patch_size)).convert('RGB')
    big_img = big_img.resize((224, 224))
    big_img.save(os.path.join(save_path, f"{idx}_{x}_{y}.png"))
```
We retrieve the top-K patches for all cancer subtypes, because at this stage we do not yet know which subtype the current WSI belongs to.
I see. Does this mean that for a LUAD WSI, you select the top-K patches for both LUAD and LUSC? If so, doesn't a LUAD WSI contain only LUAD and normal patches (without LUSC)? I assume the top-K patches most similar to the LUSC class text would then return normal patches?
Also, have you noticed that the coordinates of the top-K patches overlap between categories? Say we take the top 30 for each of LUAD and LUSC per WSI; in total, we could end up with fewer than 60 unique patches.
Sorry for the late reply; we just finished our National Day holiday. Regarding your first question, a LUAD WSI indeed contains only LUAD and normal patches. However, zero-shot retrieval with CLIP and PLIP for both LUAD and LUSC regions typically still localizes to the tumor areas within the WSI. Therefore, to increase robustness, we retrieve all possible subtypes for the WSI.
As for the second question, retrieving the top 30 patches for each of LUAD and LUSC from a WSI may indeed yield fewer than 60 unique patches due to duplication. However, this does not affect the subsequent pipeline: you can either feed all 60 patches (with duplicates) into the model, or remove the duplicates before feeding them in.
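If you prefer the de-duplicated option, a minimal sketch (variable names follow the snippet above, where `topk_indices` is the per-subtype top-K index matrix) could look like:

```python
import torch

# topk_indices has shape (num_subtypes, top_k); patches selected by more than
# one subtype appear multiple times once flattened.
select_id = torch.unique(topk_indices.flatten()).cpu().numpy()  # keep each patch once
coord = coords[select_id]
```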
Hi, I would like to know, when calculating the similarity matrix for image-text similarity in Graph Prompt Tuning, is the similarity score calculated between each patch and individual sub-descriptions (e.g., p1 vs D1, p1 vs D2, p1 vs D3) or between each patch and the concatenated descriptions (e.g., p1 vs D, where D is the concatenation of D1, D2, and D3)? Thanks!
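For reference, here is a minimal sketch contrasting the two options described in the question (names are illustrative, not the repository's code; Option 2 approximates a single combined description embedding by pooling, whereas the actual alternative would concatenate the text before encoding):

```python
import torch

M, D = 500, 512
p = torch.randn(M, D)                                # patch embeddings
d1, d2, d3 = (torch.randn(D) for _ in range(3))      # three sub-description embeddings

# Option 1: similarity against each sub-description separately -> (M, 3)
desc_stack = torch.stack([d1, d2, d3])               # (3, D)
sim_individual = p @ desc_stack.T

# Option 2: similarity against one combined description embedding -> (M,)
d_combined = desc_stack.mean(dim=0)                  # pooled stand-in for "D1 + D2 + D3"
sim_combined = p @ d_combined
```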