IDEA-Research / GroundingDINO

[ECCV 2024] Official implementation of the paper "Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection"
https://arxiv.org/abs/2303.05499
Apache License 2.0

Can multiple images/caption be input into the model? #102

Open WalterWangRevo opened 1 year ago

WalterWangRevo commented 1 year ago

Hello, I find that the model's forward call is "outputs = model(image[None], captions=[caption])", which means the image has shape [1,C,H,W] and captions is a single-element list [str]. But when I try to stack multiple images into [N,C,H,W], it does not work. I also tried making captions [str1, str2], and, not surprisingly, that does not work either. It seems the released code only accepts one image and one caption per forward pass. In my situation, if I want to test M images, each with N captions, I need to run M×N forward passes, which is time consuming.

rentainhe commented 1 year ago

> In my situation, if I want to test M images, each with N captions, I need to run M×N forward passes, which is time consuming.

We will try to update a batch inference API for the users

ashrielbrian commented 1 year ago

Seconding this - our use case would be to perform offline batch inference and maximise GPU throughput for M images with the same caption.

@rentainhe is there a timeline for this?

Edit:

I managed to get a batch script to work in a somewhat hacky way:

Stack the images into a batch:

images = torch.stack([load_image(img)[1] for img in img_paths])
boxes, logits, phrases = predict_batch(
        model=model,
        images=images,
        caption=TEXT_PROMPT,
        box_threshold=BOX_TRESHOLD,
        text_threshold=TEXT_TRESHOLD
    )

You'll need to update the load_image function so it does not use RandomResize. Inside datasets/transforms.py, add this class:

class Resize(object):
    # Deterministic fixed-size resize, reusing the module-level resize() helper.
    def __init__(self, size):
        assert isinstance(size, (list, tuple))
        self.size = size

    def __call__(self, img, target=None):
        return resize(img, target, self.size)

Inside load_image in inference.py, I hardcoded the resize to ensure every image in the batch is of the same size. This is hacky and probably (definitely) results in poorer performance.

transform = T.Compose(
        [
            # T.RandomResize([800], max_size=1333),
            # Added T.Resize to fix the resized image during batch inference
            T.Resize((800, 1200)),
            T.ToTensor(),
            T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
        ]
    )
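
The resulting load_image ends up looking roughly like this (just a sketch of how I wired it up; it mirrors what the original function in inference.py returns, i.e. the raw image array plus the transformed tensor):

from PIL import Image
import numpy as np
import groundingdino.datasets.transforms as T

def load_image(image_path):
    # Same fixed-size transform as above, so every tensor in the batch has the same shape
    transform = T.Compose(
        [
            T.Resize((800, 1200)),
            T.ToTensor(),
            T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
        ]
    )
    image_source = Image.open(image_path).convert("RGB")
    image_transformed, _ = transform(image_source, None)
    return np.asarray(image_source), image_transformed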

Adapting the existing predict function:

def predict_batch(
        model,
        images: torch.Tensor,
        caption: str,
        box_threshold: float,
        text_threshold: float,
        device: str = "cuda"
) -> Tuple[torch.Tensor, torch.Tensor, List[str]]:
    caption = preprocess_caption(caption=caption)

    model = model.to(device)
    image = images.to(device)

    print(f"Image shape: {image.shape}") # Image shape: torch.Size([num_batch, 3, 800, 1200])
    with torch.no_grad():
        outputs = model(image, captions=[caption for _ in range(len(images))]) # <------- I use the same caption for all the images for my use-case

    print(f'{outputs["pred_logits"].shape}') # torch.Size([num_batch, 900, 256]) 
    print(f'{outputs["pred_boxes"].shape}') # torch.Size([num_batch, 900, 4])
    prediction_logits = outputs["pred_logits"].cpu().sigmoid()[0]  # prediction_logits.shape = (nq, 256)
    prediction_boxes = outputs["pred_boxes"].cpu()[0]  # prediction_boxes.shape = (nq, 4)

    mask = prediction_logits.max(dim=1)[0] > box_threshold
    logits = prediction_logits[mask]  # logits.shape = (n, 256)
    boxes = prediction_boxes[mask]  # boxes.shape = (n, 4)

    tokenizer = model.tokenizer
    tokenized = tokenizer(caption)

    phrases = [
        get_phrases_from_posmap(logit > text_threshold, tokenized, tokenizer).replace('.', '')
        for logit
        in logits
    ]

    return boxes, logits.max(dim=1)[0], phrases

This gave me roughly an 18% latency improvement over single-image inference for a batch of 16 images.

andyoung009 commented 1 year ago

Hello, this is interesting work. Due to the complex model architecture, how can we quickly and intuitively obtain the overall model architecture DAG and the dimensional information of the input and output features? I tried TensorBoard but did not succeed because multiple inputs are involved. Thank you! @ashrielbrian @rentainhe
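
What I am after is something along the lines of the rough sketch below (register_shape_hooks is just an illustrative name, not something in the repo), but covering the full nested architecture with both image and text inputs:

import torch

def register_shape_hooks(model):
    # Attach forward hooks that print output tensor shapes for each top-level submodule.
    handles = []

    def make_hook(name):
        def hook(module, inputs, output):
            if isinstance(output, torch.Tensor):
                print(f"{name}: {tuple(output.shape)}")
            elif isinstance(output, (list, tuple)):
                print(f"{name}: {[tuple(o.shape) for o in output if isinstance(o, torch.Tensor)]}")
        return hook

    for name, module in model.named_children():
        handles.append(module.register_forward_hook(make_hook(name)))
    return handles

# Usage, reusing model / image / caption from the snippets above:
# handles = register_shape_hooks(model)
# with torch.no_grad():
#     _ = model(image[None], captions=[caption])
# for h in handles:
#     h.remove()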

yuwenmichael commented 1 year ago

> Seconding this - our use case would be to perform offline batch inference and maximise GPU throughput for M images with the same caption.

I successfully implemented the method you provided; if anyone needs it, you can refer to this link.

KunmingS commented 11 months ago

> I managed to get a batch script to work in a somewhat hacky way: [...] This gave me roughly an 18% latency improvement over single-image inference for a batch of 16 images.
>
> I successfully implemented the method you provided; if anyone needs it, you can refer to this link.

Hi @yuwenmichael @ashrielbrian, I used this hacky way to implement batch inference with GroundingDINO, but I was wondering why you still index at 0 in these two lines:

prediction_logits = outputs["pred_logits"].cpu().sigmoid()[0]  # prediction_logits.shape = (nq, 256)
prediction_boxes = outputs["pred_boxes"].cpu()[0]  # prediction_boxes.shape = (nq, 4)

In my case, this only gives me the result for the first image. To get all results, I yield each image's result from the model outputs in the predict function, like this:

for idx in range(outputs["pred_boxes"].shape[0]):
        prediction_logits = outputs["pred_logits"].cpu().sigmoid()[idx]  # prediction_logits.shape = (nq, 256)
        prediction_boxes = outputs["pred_boxes"].cpu()[idx]  # prediction_boxes.shape = (nq, 4)

        mask = prediction_logits.max(dim=1)[0] > box_threshold
        logits = prediction_logits[mask]  # logits.shape = (n, 256)
        boxes = prediction_boxes[mask]  # boxes.shape = (n, 4)

        tokenizer = model.tokenizer
        tokenized = tokenizer(caption)

        if remove_combined:
            sep_idx = [i for i in range(len(tokenized['input_ids'])) if tokenized['input_ids'][i] in [101, 102, 1012]]

            phrases = []
            for logit in logits:
                max_idx = logit.argmax()
                insert_idx = bisect.bisect_left(sep_idx, max_idx)
                right_idx = sep_idx[insert_idx]
                left_idx = sep_idx[insert_idx - 1]
                phrases.append(get_phrases_from_posmap(logit > text_threshold, tokenized, tokenizer, left_idx, right_idx).replace('.', ''))
        else:
            phrases = [
                get_phrases_from_posmap(logit > text_threshold, tokenized, tokenizer).replace('.', '')
                for logit
                in logits
            ]

        yield boxes, logits.max(dim=1)[0], phrases
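
The generator can then be consumed one image at a time, e.g. (a sketch reusing img_paths, TEXT_PROMPT and the thresholds from the snippet above; results come back in the same order as img_paths):

images = torch.stack([load_image(img)[1] for img in img_paths])

all_results = []
for boxes, logits, phrases in predict_batch(
        model=model,
        images=images,
        caption=TEXT_PROMPT,
        box_threshold=BOX_TRESHOLD,
        text_threshold=TEXT_TRESHOLD):
    # One (boxes, logits, phrases) triple per image in the batch
    all_results.append((boxes, logits, phrases))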

Could you please check whether your results are aligned with the input images?

Andyyoung0507 commented 9 months ago

@ashrielbrian @Kun-Ming @rentainhe @WalterWangRevo @yuwenmichael, thanks for all your contributions. I also tried to run inference on multiple images with this code, and I have a question: how do I tell which image the generated boxes belong to, given that different images produce different numbers of boxes? How can this be handled when annotating objects across multiple images?

KunmingS commented 9 months ago

Hi @Andyyoung0507, I think you could try my code above. I return the prediction results via a for...yield generator, so each iteration returns the result for one image.

Andyyoung0507 commented 9 months ago

> Hi @Andyyoung0507, I think you could try my code above. I return the prediction results via a for...yield generator, so each iteration returns the result for one image.

Thank you! That is a good idea. Multi-image inference really introduces some difficulties that I couldn't deal with on my own.

FiorenzoParascandolo1 commented 9 months ago

> I managed to get a batch script to work in a somewhat hacky way: [...] This gave me roughly an 18% latency improvement over single-image inference for a batch of 16 images.

Hi everyone, this allows you to run inference with batch size > 1, but I have a problem with reproducibility. If a given (image, text) pair sits at a different position in the batch, the output changes. It seems the elements of the batch are not processed independently...

{0: {'seat': 0.4972114861011505, 'stall': 0.332093745470047, 'toilet': 0.783160388469696}, 1: {'cat': 0.45665207505226135, 'monitor': 0.5673973560333252, 'picture': 0.20609396696090698, 'that': 0.05899302288889885}, 2: {'dress': 0.777637779712677, 'man': 0.8794577121734619, 'woman': 0.723700225353241}, 3: {'grass': 0.7136312127113342, 'truck': 0.7716715335845947}, 4: {'couch': 0.7266196608543396, 'dogs': 0.6962682008743286}, 5: {'man': 0.8588370084762573, 'rail': 0.35756751894950867, 'snowy hill': 0.4643862843513489}, 6: {'back': 0.2656899094581604, 'cat': 0.843508780002594, 'sofa': 0.6899327039718628, 'window': 0.6876299977302551}, 7: {'cat': 0.46739253401756287, 'mirror': 0.4958229959011078, 'paw': 0.2748376727104187}, 8: {'kitten': 0.26511871814727783, 'mirror': 0.44254618883132935, 'paw': 0.28917092084884644, 'reflection': 0.22627031803131104}, 9: {'children': 0.2913839817047119, 'photo': 0.7339288592338562}, 10: {'chairs': 0.39893320202827454, 'dining table': 0.40886712074279785, 'room': 0.6335236430168152, 'window': 0.40863069891929626}, 11: {'pole': 0.49428969621658325, 'sign': 0.4169153571128845, 'stop lights': 0.41868656873703003}, 12: {'books': 0.3859530985355377, 'kitchen': 0.6933691501617432, 'shelf': 0.35644665360450745, 'stove': 0.6413100361824036, 'teapot': 0.8949454426765442}, 13: {'air': 0.11777491867542267, 'airplane': 0.9213575720787048, 'day': 0.3377494215965271}, 14: {'bed': 0.7950222492218018, 'bottle': 0.42745649814605713, 'child': 0.648800253868103, 'milk': 0.45026877522468567}, 15: {'dog stand': 0.5395100116729736, 'horse': 0.8022459149360657, 'meadow': 0.6759784817695618}}

{0: {'seat': 0.49454519152641296, 'stall': 0.33220207691192627, 'toilet': 0.7824963331222534}, 1: {'cat': 0.44893959164619446, 'monitor': 0.5316653251647949, 'picture': 0.14438998699188232, 'that': 0.01652955636382103}, 2: {'dress': 0.7803278565406799, 'man': 0.8828325867652893, 'woman': 0.7285552024841309}, 3: {'grass': 0.7145284414291382, 'truck': 0.7652426362037659}, 4: {'dogs': 0.6253237724304199, 'leather couch': 0.5814560651779175}, 5: {'man': 0.8741846680641174, 'snowy hill': 0.45280832052230835, 'zigzag rail': 0.4841887056827545}, 6: {'cat': 0.8463238477706909, 'sofa': 0.7171109318733215, 'velvety': 0.29014575481414795, 'window': 0.7127322554588318}, 7: {'cat': 0.4669063687324524, 'mirror': 0.4929966330528259, 'paw': 0.2748376727104187}, 8: {'kitten': 0.2649284899234772, 'mirror': 0.441763311624527, 'paw': 0.29310086369514465, 'reflection': 0.23704566061496735}, 9: {'children': 0.29624688625335693, 'photo': 0.7339288592338562}, 10: {'chairs': 0.39262789487838745, 'dining glass table': 0.4174637198448181, 'room': 0.6102792620658875, 'window': 0.4397774040699005}, 11: {'pole': 0.4657823145389557, 'sign': 0.42167091369628906, 'stop lights': 0.41197091341018677}, 12: {'books': 0.3869951069355011, 'kitchen': 0.6946134567260742, 'shelf': 0.3565586805343628, 'stove': 0.6319350600242615, 'teapot': 0.8900787830352783}, 13: {'air': 0.11124119907617569, 'airplane': 0.9212159514427185, 'day': 0.34466397762298584}, 14: {'bed': 0.7652426362037659, 'child': 0.641871452331543, 'frothy bottle': 0.5389838218688965, 'milk': 0.4690186083316803}, 15: {'dog': 0.8864204287528992, 'horse': 0.8353492617607117, 'meadow': 0.6918097138404846}}

These are examples of class prompts following the recommended format "class1 . class2 .". The same (image, prompt) pairs were fed to the two forward passes. If you consider the first pair:

(img, prompt1), where prompt1 = "seat . stall . toilet ."
(img, prompt2), where prompt2 = "seat . stall . toilet ."

(that is, prompt1 == prompt2 and the image is identical)

the probabilities are slightly different:

{'seat': 0.4972114861011505, 'stall': 0.332093745470047, 'toilet': 0.783160388469696} {'seat': 0.49454519152641296, 'stall': 0.33220207691192627, 'toilet': 0.7824963331222534}
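
A quick way to see the effect (just a sketch, reusing the model(images, captions=...) call from the snippets above; check_position_invariance is not part of the repo) is to run the same pair both batched and alone and compare the raw logits:

import torch

def check_position_invariance(model, images, captions, device="cuda"):
    # Compare the outputs for pair 0 when it is run inside the batch vs. on its own.
    model = model.to(device).eval()
    images = images.to(device)

    with torch.no_grad():
        batched = model(images, captions=captions)
        single = model(images[0:1], captions=captions[0:1])

    diff = (batched["pred_logits"][0] - single["pred_logits"][0]).abs().max()
    print(f"max |logit difference| for pair 0: {diff.item():.6f}")
    # A clearly non-zero value here means the other batch elements influence the result.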

SinanAkkoyun commented 6 months ago

@rentainhe
> We will try to update a batch inference API for the users

How does one run GroundingDINO batched?