IDEA-Research / Grounded-Segment-Anything

Grounded SAM: Marrying Grounding DINO with Segment Anything & Stable Diffusion & Recognize Anything - Automatically Detect, Segment and Generate Anything
https://arxiv.org/abs/2401.14159
Apache License 2.0
14.88k stars 1.38k forks

automatic_label_simple_demo.py RuntimeError: cannot reshape tensor of 0 elements into shape [0, -1, 256, 256] because the unspecified dimension size -1 can be any value and is ambiguous #423

Open BeijingBlueSky opened 9 months ago

BeijingBlueSky commented 9 months ago

Hi, I got the following error:

Traceback (most recent call last):
  File "automatic_label_ramdemo.py", line 303, in <module>
    masks, _, _ = predictor.predict_torch(
  File "/home/anaconda3/envs/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/segment-anything-main/segment_anything/predictor.py", line 229, in predict_torch
    low_res_masks, iou_predictions = self.model.mask_decoder(
  File "/home/anaconda3/envs/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/segment-anything-main/segment_anything/modeling/mask_decoder.py", line 94, in forward
    masks, iou_pred = self.predict_masks(
  File "/home/segment-anything-main/segment_anything/modeling/mask_decoder.py", line 144, in predict_masks
    masks = (hyper_in @ upscaled_embedding.view(b, c, h * w)).view(b, -1, h, w)
RuntimeError: cannot reshape tensor of 0 elements into shape [0, -1, 256, 256] because the unspecified dimension size -1 can be any value and is ambiguous
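The failing step is the final `.view(b, -1, h, w)` with `b = 0`: a tensor with zero elements cannot determine the size of the `-1` dimension, because any value multiplied by 0 * 256 * 256 still gives 0. The sketch below mimics that inference rule in plain Python. `infer_missing_dim` is a hypothetical helper written for illustration and is not PyTorch's actual implementation; the channel count of 4 in the first call is likewise just an example value.

```python
from math import prod


def infer_missing_dim(numel, shape):
    """Return `shape` with its single -1 entry replaced by the inferred size.

    Illustrative stand-in for the rule a reshape/view applies; not PyTorch code.
    """
    if shape.count(-1) != 1:
        raise ValueError("expected exactly one -1 in shape")
    known = prod(d for d in shape if d != -1)
    if known == 0:
        # Any size for the -1 dimension yields 0 elements, so no unique answer exists.
        raise ValueError(f"cannot infer -1 in {shape}: known dimensions multiply to 0")
    if numel % known != 0:
        raise ValueError(f"{numel} elements do not fit shape {shape}")
    return [numel // known if d == -1 else d for d in shape]


# A batch of one prompt with 4 channels of 256x256 resolves cleanly.
print(infer_missing_dim(1 * 4 * 256 * 256, [1, -1, 256, 256]))

# Zero prompts: the same failure mode as in the traceback.
try:
    infer_missing_dim(0, [0, -1, 256, 256])
except ValueError as err:
    print(err)
```

In other words, the error is a symptom: the mask decoder received an empty batch of box prompts, so the real question is why no boxes were detected.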

When running automatic_label_simple_demo.py, boxes_filt and logits_filt inside the function get_grounding_output are both tensor([], size=(0,255)). The function is called as:

boxes_filt, scores, pred_phrases = get_grounding_output(
    model, image, tags, box_threshold, text_threshold, device=device
)

print(f"Before NMS: {boxes_filt.shape[0]} boxes")  # --> get 0 boxes
nms_idx = torchvision.ops.nms(boxes_filt, scores, iou_threshold).numpy().tolist()
boxes_filt = boxes_filt[nms_idx]
pred_phrases = [pred_phrases[idx] for idx in nms_idx]
print(f"After NMS: {boxes_filt.shape[0]} boxes")  # --> get 0 boxes
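Since zero boxes survive here, the crash itself can be avoided by skipping the SAM call when there is nothing to segment. A minimal sketch, assuming the boxes object supports `len()` (a `torch.Tensor` of shape (N, 4) does, returning N). `segment_if_any` and `segment_fn` are hypothetical names, with `segment_fn` standing in for the `predictor.predict_torch(...)` call.

```python
def segment_if_any(boxes, segment_fn):
    """Call segment_fn(boxes) only when at least one box was detected.

    Returns None for an empty detection set so the caller can skip the image
    instead of passing zero prompts to the mask decoder.
    """
    if len(boxes) == 0:
        return None
    return segment_fn(boxes)


# Stand-in segmenter for illustration: one placeholder "mask" per box.
def fake_segment(boxes):
    return [f"mask for {box}" for box in boxes]


print(segment_if_any([], fake_segment))                  # None
print(segment_if_any([[10, 20, 50, 80]], fake_segment))  # ['mask for [10, 20, 50, 80]']
```

This only prevents the RuntimeError; it does not explain why the grounding model found no boxes for this image.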

----------- more information -------------------------
There are many -inf values in the output of the grounding model:

with torch.no_grad():
    outputs = model(image[None], captions=[caption])

where caption is 'building, person, illuminate, man, neon light, night, night view, red, retail, sign, signage, store, storefront, writing.'

outputs["pred_logits"].cpu()
tensor([[[-4.1339, -3.2799, -4.7899, ..., -inf, -inf, -inf],
         [-4.1563, -3.3208, -4.8095, ..., -inf, -inf, -inf],
         [-4.1277, -3.3192, -4.7960, ..., -inf, -inf, -inf],
         ...,
         [-4.2636, -3.5951, -4.8640, ..., -inf, -inf, -inf],
         [-4.2648, -3.5952, -4.8737, ..., -inf, -inf, -inf],
         [-4.3287, -3.7804, -4.9183, ..., -inf, -inf, -inf]]])

outputs["pred_logits"].cpu().sigmoid()[0]
tensor([[0.0158, 0.0363, 0.0082, ..., 0.0000, 0.0000, 0.0000],
        [0.0154, 0.0349, 0.0081, ..., 0.0000, 0.0000, 0.0000],
        [0.0159, 0.0349, 0.0082, ..., 0.0000, 0.0000, 0.0000],
        ...,
        [0.0139, 0.0267, 0.0077, ..., 0.0000, 0.0000, 0.0000],
        [0.0139, 0.0267, 0.0076, ..., 0.0000, 0.0000, 0.0000],
        [0.0130, 0.0223, 0.0073, ..., 0.0000, 0.0000, 0.0000]])
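These scores would explain the empty result: a query is kept only when its highest per-token sigmoid score exceeds box_threshold, and the largest value visible in the printout is about 0.036. The sketch below redoes that arithmetic on the first printed row in plain Python. The threshold of 0.25 is an assumed value for illustration, not one reported in this issue, and the elided middle columns are not included.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function; maps -inf to 0.0."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


# Visible logits from the first row of pred_logits (elided values omitted).
row = [-4.1339, -3.2799, -4.7899, float("-inf"), float("-inf"), float("-inf")]
scores = [sigmoid(v) for v in row]
print([round(s, 4) for s in scores])  # [0.0158, 0.0363, 0.0082, 0.0, 0.0, 0.0]

box_threshold = 0.25  # assumed value for illustration
keep = max(scores) > box_threshold
print(keep)  # False
```

The -inf entries by themselves are harmless here, since they simply become a score of 0; the low finite scores are what leave every query under the threshold.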

outputs["pred_boxes"].cpu()
tensor([[[0.4588, 0.6766, 0.0010, 0.0010],
         [0.5302, 0.6559, 0.0010, 0.0010],
         [0.5568, 0.6461, 0.0010, 0.0010],
         ...,
         [0.8817, 0.3877, 0.0010, 0.0010],
         [0.1955, 0.2195, 0.0010, 0.0010],
         [0.2651, 0.5368, 0.0010, 0.0010]]])
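Every visible box has a width and height of 0.0010, which is effectively a single point. Assuming the boxes are normalized (cx, cy, w, h) values, converting the first one to pixel corners shows how small it is. The 1000 x 1000 image size is a made-up value for illustration, and `cxcywh_to_xyxy_pixels` is a hypothetical helper, not a function from the repository.

```python
def cxcywh_to_xyxy_pixels(box, width, height):
    """Convert a normalized (cx, cy, w, h) box to pixel (x0, y0, x1, y1)."""
    cx, cy, w, h = box
    x0 = (cx - w / 2) * width
    y0 = (cy - h / 2) * height
    x1 = (cx + w / 2) * width
    y1 = (cy + h / 2) * height
    return x0, y0, x1, y1


# First box from pred_boxes, on an assumed 1000 x 1000 image.
x0, y0, x1, y1 = cxcywh_to_xyxy_pixels([0.4588, 0.6766, 0.0010, 0.0010], 1000, 1000)
print(f"{x1 - x0:.3f} x {y1 - y0:.3f} pixels")  # 1.000 x 1.000 pixels
```

Boxes that are all about one pixel wide, combined with uniformly low scores, suggest the model is not producing meaningful predictions for this input at all.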

--------------- These are the models that I used: --------------------------
parser.add_argument(
    "--ram_checkpoint", type=str, default="./models/ram_swin_large_14m.pth", help="path to checkpoint file"
)
parser.add_argument(
    "--grounded_checkpoint", default="./models/groundingdino_swint_ogc.pth", type=str, help="path to checkpoint file"
)
parser.add_argument(
    "--sam_checkpoint", default="./models/sam_vit_h_4b8939.pth", type=str, help="path to checkpoint file"
)
parser.add_argument(
    "--sam_hq_checkpoint", type=str, default=None, help="path to sam-hq checkpoint file"
)
parser.add_argument(
    "--use_sam_hq", default=False, action="store_true", help="using sam-hq for prediction"
)