berkeley-hipie / HIPIE

[NeurIPS2023] Code release for "Hierarchical Open-vocabulary Universal Image Segmentation"
https://people.eecs.berkeley.edu/~xdwang/projects/HIPIE/
MIT License

How to calculate the mean similarity #4

Open shipengai opened 1 year ago

shipengai commented 1 year ago

Hello, is there code to calculate the mean similarity mentioned in the paper?

jacklishufan commented 1 year ago

Hi! Currently, we do not have plans to release this code; we might release it after the training and evaluation code and more checkpoints are out. However, I can provide a basic overview of our pipeline.

To obtain the text similarity, you can use:

import torch

# get_openseg_labels and create_queries_and_maps are helpers from the HIPIE
# codebase; demo is the demo predictor wrapper used in the repo's demo code.
test_categories = get_openseg_labels("coco_panoptic", prompt_engineered=False)
expression, positive_map_idx_token = create_queries_and_maps(test_categories, demo.predictor.tokenizer)
with torch.no_grad():
    text_features = demo.predictor.model.forward_text([expression], 'cuda')

# Average the token embeddings belonging to each class name, then L2-normalize.
text_feature_words = []
for k, v in positive_map_idx_token.items():
    text_feature_words.append(text_features['hidden'][0, v, :].detach().cpu().mean(0))
text_feature_words = torch.stack(text_feature_words)
text_feature_words = torch.nn.functional.normalize(text_feature_words, dim=-1)

# For unit vectors, ||A - B||^2 = 2 - 2 A.B, so cosine similarity = 0.5 * (2 - d^2).
# Note: torch.cdist returns the distance d, not d^2, hence the square below.
dist_text = torch.cdist(text_feature_words, text_feature_words)
dist_text = 0.5 * (2.0 - dist_text ** 2)

Then you can visualize dist_text, which has shape N_CLS x N_CLS.
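
For reference, a minimal way to plot this matrix as a heatmap (this assumes matplotlib is available and that each entry of test_categories is a dict with a 'name' field; neither is guaranteed by the snippet above):

import matplotlib.pyplot as plt

names = [c['name'] for c in test_categories]  # assumed 'name' field per category
plt.figure(figsize=(12, 12))
plt.imshow(dist_text.numpy(), cmap='viridis')
plt.colorbar(label='cosine similarity')
plt.xticks(range(len(names)), names, rotation=90, fontsize=4)
plt.yticks(range(len(names)), names, fontsize=4)
plt.tight_layout()
plt.show()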

Extracting the visual features is more involved: it requires considerable hacking into the data loading and model inference process.

The first step is to sample N annotations for each class (a minimal sampling sketch follows the code below). Then, for each image, the following code extracts its feature map:

import torch

# mapper is a DatasetMapper instance; nested_tensor_from_tensor_list is the
# DETR-style utility used by the HIPIE codebase.
batch = mapper(batch)
samples = demo.predictor.model.preprocess_image([batch])
samples = nested_tensor_from_tensor_list(samples, size_divisibility=32)
with torch.no_grad():
    features, _ = demo.predictor.model.detr.detr.backbone(samples)
img_features, mask = features[-1].decompose()
img_features = img_features.cpu()  # 1 x C x H x W
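
The per-class sampling itself is not spelled out above. A minimal sketch, assuming detectron2's builtin COCO panoptic registration, where each dataset dict carries a segments_info list with id and category_id fields (the dataset name and N here are placeholders):

import random
from collections import defaultdict

from detectron2.data import DatasetCatalog

N = 50  # placeholder: number of annotations sampled per class
dataset_dicts = DatasetCatalog.get('coco_2017_val_panoptic')  # placeholder dataset name

# Group (record, segment id) pairs by category.
by_class = defaultdict(list)
for record in dataset_dicts:
    for seg in record['segments_info']:
        by_class[seg['category_id']].append((record, seg['id']))

# Sample at most N pairs per class; each record is then run through the
# feature-extraction code above, with the segment id used as instance_id below.
sampled = {cat: random.sample(pairs, min(N, len(pairs))) for cat, pairs in by_class.items()}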

Then you want to get the ground-truth mask and resize it to the same size as the feature map:

import torch.nn.functional as F

msk = batch['pan_seg_gt'] == instance_id  # H x W
mask_up = F.interpolate(msk.float()[None, None], img_features.shape[-2:], mode='area')  # 1 x 1 x H x W

The final feature for this mask can then be obtained through mask pooling:

# Normalize so that the pooling below computes an average over the masked region.
mask_up = mask_up / mask_up.sum()
out = torch.einsum('bchw,bdhw->bdc', img_features, mask_up)[0][0]  # final pooled feature of shape C

Then you need to save out for each selected annotation, average by class, and visualize, just as with the text features.
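
A minimal sketch of this aggregation, assuming a hypothetical feats_by_class dict that maps each category id to its list of pooled out vectors:

import torch

# feats_by_class is a hypothetical dict: {category_id: [tensor of shape (C,), ...]};
# it would be filled by running the extraction code above on each sampled annotation.
class_ids = sorted(feats_by_class.keys())
vis_feature_words = torch.stack(
    [torch.stack(feats_by_class[c]).mean(0) for c in class_ids]
)
vis_feature_words = torch.nn.functional.normalize(vis_feature_words, dim=-1)

# Same unit-vector identity as for the text features: cos = 0.5 * (2 - d^2).
dist_vis = torch.cdist(vis_feature_words, vis_feature_words)
dist_vis = 0.5 * (2.0 - dist_vis ** 2)  # N_CLS x N_CLS visual similarity matrix

Let me know if you have more questions.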

shipengai commented 1 year ago

Thanks for your reply! I will try it.