IDEA-Research / GroundingDINO

[ECCV 2024] Official implementation of the paper "Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection"
https://arxiv.org/abs/2303.05499
Apache License 2.0

GroundingDINO module needs to be built for every prediction request? #333

Open · aganesh9 opened this issue 5 months ago

aganesh9 commented 5 months ago

Hi, I am following the example code here to set up GroundingDINO inference in Triton. I am trying to run this line just once in my deployment, since the model shouldn't have to be built for every single prediction request. However, inference only works correctly if I rebuild the model for each request. If I build it just once, the first request succeeds, but subsequent requests return invalid bounding boxes.

What is happening inside the build_model method that makes the model unusable for the next request? It doesn't look like it takes any input-specific parameters.
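
From what I can tell, build_model is just a registry dispatch on the config; here is a paraphrased sketch of groundingdino/models/__init__.py (treat the details as an approximation, not the exact source):

    # Roughly what groundingdino.models.build_model does (paraphrased):
    def build_model(args):
        from .registry import MODULE_BUILD_FUNCS
        # Look up the constructor registered under args.modelname
        # (e.g. "groundingdino") and build the model from the config.
        assert args.modelname in MODULE_BUILD_FUNCS._module_dict
        build_func = MODULE_BUILD_FUNCS.get(args.modelname)
        return build_func(args)

So it just constructs a fresh model from the config; no per-request state is involved at build time.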

Snippet I am using for reference:

    # Imports used by this snippet (module-level in the Triton model.py):
    import torch
    from groundingdino.models import build_model
    from groundingdino.models.GroundingDINO.bertwarper import (
        generate_masks_with_special_tokens_and_transfer_map,
    )
    from groundingdino.util.slconfig import SLConfig
    from groundingdino.util.utils import clean_state_dict

    def __init__(self):
        self.cpu_only = False
        self.model_config_path = current_directory + '/GroundingDINO/groundingdino/config/GroundingDINO_SwinB_cfg.py'
        self.model_checkpoint_path = current_directory + '/GroundingDINO/weights/groundingdino_swinb_cogcoor.pth'
        self.model = None
        self.device = 'cuda'

    def initialize(self, args):
        # Build the model once from the config, then load the checkpoint.
        cfg = SLConfig.fromfile(self.model_config_path)
        cfg.device = "cuda" if not self.cpu_only else "cpu"
        self.model = build_model(cfg)
        checkpoint = torch.load(self.model_checkpoint_path, map_location=self.device)
        load_res = self.model.load_state_dict(clean_state_dict(checkpoint["model"]), strict=False)
        _ = self.model.eval()
        self.model = self.model.to(self.device)

    def execute(self, requests):
        for request in requests:
            # Parsing the request into `image` and `captions` elided...

            # Encode texts: tokenize the captions and build self-attention
            # masks / position ids around the special tokens.
            tokenized = self.model.tokenizer(captions, padding="longest", return_tensors="pt").to(self.device)
            special_tokens = self.model.tokenizer.convert_tokens_to_ids(["[CLS]", "[SEP]", ".", "?"])

            (
                text_self_attention_masks,
                position_ids,
                cate_to_token_mask_list,
            ) = generate_masks_with_special_tokens_and_transfer_map(
                tokenized, special_tokens, self.model.tokenizer
            )

            # Truncate everything to the model's maximum text length.
            if text_self_attention_masks.shape[1] > self.model.max_text_len:
                text_self_attention_masks = text_self_attention_masks[
                    :, : self.model.max_text_len, : self.model.max_text_len
                ]
                position_ids = position_ids[:, : self.model.max_text_len]
                tokenized["input_ids"] = tokenized["input_ids"][:, : self.model.max_text_len]
                tokenized["attention_mask"] = tokenized["attention_mask"][:, : self.model.max_text_len]
                tokenized["token_type_ids"] = tokenized["token_type_ids"][:, : self.model.max_text_len]

            with torch.no_grad():
                outputs = self.model(
                    image[None],
                    tokenized["input_ids"],
                    tokenized["attention_mask"],
                    position_ids,
                    tokenized["token_type_ids"],
                    text_self_attention_masks,
                )
            # Continue processing...
NielsRogge commented 5 months ago

Hi,

See https://github.com/IDEA-Research/GroundingDINO/issues/321 for easy inference; it also supports batched inference.
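
For context, here is a minimal sketch of batched inference with the Hugging Face Transformers port of Grounding DINO (the checkpoint id and image paths are assumptions, and the post-processing signature varies slightly across transformers versions):

    from PIL import Image
    import torch
    from transformers import AutoProcessor, AutoModelForZeroShotObjectDetection

    # Checkpoint id is an assumption; any Grounding DINO checkpoint on the Hub works.
    model_id = "IDEA-Research/grounding-dino-base"
    processor = AutoProcessor.from_pretrained(model_id)
    model = AutoModelForZeroShotObjectDetection.from_pretrained(model_id).to("cuda").eval()

    images = [Image.open("cat.jpg"), Image.open("dog.jpg")]  # any PIL images
    # One caption per image; lowercase object classes separated by periods.
    texts = ["a cat. a remote control.", "a dog."]

    inputs = processor(images=images, text=texts, return_tensors="pt").to("cuda")
    with torch.no_grad():
        outputs = model(**inputs)

    # Post-process into boxes/scores/labels per image.
    results = processor.post_process_grounded_object_detection(
        outputs,
        inputs.input_ids,
        target_sizes=[img.size[::-1] for img in images],
    )

Because the processor re-encodes the image and text on every call, nothing is cached between requests here.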

zfigov commented 5 months ago

I think that this may be due to a bug in groundingdino.py (see the attached screenshot): the features of the image are set the first time. The next time the features already exist, and therefore the new image isn't used.
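
If that is the cause, a possible workaround, assuming the caching behavior described above (backbone features stored on the module between forward calls), is to clear the cached features after every request; a sketch:

    # Sketch of a per-request cleanup, assuming the installed groundingdino.py
    # caches backbone features on the module between forward calls.
    def _clear_cached_image_features(model):
        # Newer checkouts expose an explicit helper for this.
        if hasattr(model, "unset_image_tensor"):
            model.unset_image_tensor()
            return
        # Otherwise delete the cached attributes directly (attribute names
        # 'features' and 'poss' are assumptions based on the caching code).
        for attr in ("features", "poss"):
            if hasattr(model, attr):
                delattr(model, attr)

    # In execute(), after computing `outputs` for a request:
    #     _clear_cached_image_features(self.model)

This way the model is still built once in initialize(), and only the stale per-image state is dropped between requests.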