WalterWangRevo opened this issue 1 year ago
Hello, I find that the model's forward call is "outputs = model(image[None], captions=[caption])", which means the image has shape [1, C, H, W] and captions is a single-element list [str]. When I stack multiple images into [N, C, H, W], it does not work, and passing captions as [str1, str2] does not work either. It seems the released code only accepts one image and one caption per forward pass. In my situation, where I want to test M images, each with N captions, I need to run M x N forward passes, which is time consuming.
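To make the cost concrete, here is a minimal sketch of the M x N loop described above, using the repo's predict helper; image_paths, captions, the thresholds and the config/weight paths are placeholders:

from groundingdino.util.inference import load_model, load_image, predict

model = load_model("groundingdino/config/GroundingDINO_SwinT_OGC.py", "weights/groundingdino_swint_ogc.pth")

results = []
for img_path in image_paths:              # M images
    image_source, image = load_image(img_path)
    for caption in captions:              # N captions -> M x N forward passes in total
        boxes, logits, phrases = predict(
            model=model,
            image=image,
            caption=caption,
            box_threshold=0.35,
            text_threshold=0.25,
        )
        results.append((img_path, caption, boxes, logits, phrases))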
We will try to update a batch inference API for the users
Seconding this - our use case would be to perform offline batch inference and maximise GPU throughput for M images with the same caption.
@rentainhe is there a timeline for this?
Edit:
I managed to get a batch script to work in a somewhat hacky way:
Stack the images into a batch:
images = torch.stack([load_image(img)[1] for img in img_paths])

boxes, logits, phrases = predict_batch(
    model=model,
    images=images,
    caption=TEXT_PROMPT,
    box_threshold=BOX_TRESHOLD,
    text_threshold=TEXT_TRESHOLD
)
You'll need to update the load_image func to not use the RandomResize. Inside datasets/transforms.py, add this class:
class Resize(object):
    def __init__(self, size):
        assert isinstance(size, (list, tuple))
        self.size = size

    def __call__(self, img, target=None):
        return resize(img, target, self.size)
Inside load_image in inference.py, I hardcoded the resize to ensure every image in the batch is of the same size. This is hacky and probably (definitely) results in poorer performance.
transform = T.Compose(
    [
        # T.RandomResize([800], max_size=1333),
        # Added T.Resize to fix the resized image during batch inference
        T.Resize((800, 1200)),
        T.ToTensor(),
        T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
    ]
)
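For reference, the modified load_image then looks roughly like the sketch below. It mirrors the structure of the existing function in inference.py; only the transform is changed, and the exact output size is whatever the fixed Resize produces.

import numpy as np
from PIL import Image
import groundingdino.datasets.transforms as T

def load_image(image_path):
    # fixed-size resize so every tensor in the batch has the same spatial shape
    transform = T.Compose(
        [
            T.Resize((800, 1200)),
            T.ToTensor(),
            T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
        ]
    )
    image_source = Image.open(image_path).convert("RGB")
    image = np.asarray(image_source)                      # raw image, still used for drawing boxes
    image_transformed, _ = transform(image_source, None)  # tensor with a fixed spatial size for all images
    return image, image_transformed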
Adapting the existing predict function:
def predict_batch(
        model,
        images: torch.Tensor,
        caption: str,
        box_threshold: float,
        text_threshold: float,
        device: str = "cuda"
) -> Tuple[torch.Tensor, torch.Tensor, List[str]]:
    caption = preprocess_caption(caption=caption)

    model = model.to(device)
    image = images.to(device)
    print(f"Image shape: {image.shape}")  # Image shape: torch.Size([num_batch, 3, 800, 1200])

    with torch.no_grad():
        outputs = model(image, captions=[caption for _ in range(len(images))])  # <------- I use the same caption for all the images for my use-case

    print(f'{outputs["pred_logits"].shape}')  # torch.Size([num_batch, 900, 256])
    print(f'{outputs["pred_boxes"].shape}')   # torch.Size([num_batch, 900, 4])

    prediction_logits = outputs["pred_logits"].cpu().sigmoid()[0]  # prediction_logits.shape = (nq, 256)
    prediction_boxes = outputs["pred_boxes"].cpu()[0]              # prediction_boxes.shape = (nq, 4)

    mask = prediction_logits.max(dim=1)[0] > box_threshold
    logits = prediction_logits[mask]  # logits.shape = (n, 256)
    boxes = prediction_boxes[mask]    # boxes.shape = (n, 4)

    tokenizer = model.tokenizer
    tokenized = tokenizer(caption)

    phrases = [
        get_phrases_from_posmap(logit > text_threshold, tokenized, tokenizer).replace('.', '')
        for logit
        in logits
    ]

    return boxes, logits.max(dim=1)[0], phrases
This gave me roughly an 18% latency improvement over single-image inference for a batch of 16 images.
Hello, this is interesting work. Due to the complex model architecture, how can we quickly and intuitively obtain the overall model architecture as a DAG, along with the dimensional information of the input and output features? I tried TensorBoard but did not succeed because multiple inputs are involved. Thank you! @ashrielbrian @rentainhe
> Seconding this - our use case would be to perform offline batch inference and maximise GPU throughput for M images with the same caption.

I successfully implemented the method you provided; if anyone needs it, you can reference it from this link.
Hi @yuwenmichael @ashrielbrian, I used this hacky way to implement batch inference for GroundingDINO, but I was wondering why you still index at 0 in these two lines:
prediction_logits = outputs["pred_logits"].cpu().sigmoid()[0] # prediction_logits.shape = (nq, 256)
prediction_boxes = outputs["pred_boxes"].cpu()[0] # prediction_boxes.shape = (nq, 4)
In my case, this only gives me the result of the first image. In order to get all results, I need to yield every result from the model outputs in the predict function, like this:
for idx, _ in enumerate(range(outputs["pred_boxes"].shape[0])):
    prediction_logits = outputs["pred_logits"].cpu().sigmoid()[idx]  # prediction_logits.shape = (nq, 256)
    prediction_boxes = outputs["pred_boxes"].cpu()[idx]              # prediction_boxes.shape = (nq, 4)

    mask = prediction_logits.max(dim=1)[0] > box_threshold
    logits = prediction_logits[mask]  # logits.shape = (n, 256)
    boxes = prediction_boxes[mask]    # boxes.shape = (n, 4)

    tokenizer = model.tokenizer
    tokenized = tokenizer(caption)

    if remove_combined:
        sep_idx = [i for i in range(len(tokenized['input_ids'])) if tokenized['input_ids'][i] in [101, 102, 1012]]

        phrases = []
        for logit in logits:
            max_idx = logit.argmax()
            insert_idx = bisect.bisect_left(sep_idx, max_idx)
            right_idx = sep_idx[insert_idx]
            left_idx = sep_idx[insert_idx - 1]
            phrases.append(get_phrases_from_posmap(logit > text_threshold, tokenized, tokenizer, left_idx, right_idx).replace('.', ''))
    else:
        phrases = [
            get_phrases_from_posmap(logit > text_threshold, tokenized, tokenizer).replace('.', '')
            for logit
            in logits
        ]

    yield boxes, logits.max(dim=1)[0], phrases
Could you please check if your result is aligned with the input?
@ashrielbrian @Kun-Ming @rentainhe @WalterWangRevo @yuwenmichael, thanks for all your contributions. I also tried to run inference on multiple images with this code, and I have a question about this issue: how do I distinguish which picture the generated boxes belong to, given that different pictures have different numbers of bounding boxes? How can this be handled when annotating objects across multiple images?
Hi @Andyyoung0507, I think you could try my code above. I return the prediction results with a for/yield, so each iteration yields the result for one image.
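For example, something like the sketch below keeps the mapping between images and detections; here img_paths is the same list used to build the batch, and predict_batch is the generator version above with otherwise unchanged arguments (remove_combined assumed to default to False):

images = torch.stack([load_image(p)[1] for p in img_paths])

per_image = {}
for path, (boxes, logits, phrases) in zip(
    img_paths,
    predict_batch(
        model=model,
        images=images,
        caption=TEXT_PROMPT,
        box_threshold=BOX_TRESHOLD,
        text_threshold=TEXT_TRESHOLD,
    ),
):
    # the i-th yielded result corresponds to img_paths[i]
    per_image[path] = {"boxes": boxes, "scores": logits, "phrases": phrases}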
> Hi @Andyyoung0507, I think you could try my code above. I return the prediction results with a for/yield, so each iteration yields the result for one image.

Thank you! That is a good idea. Multi-image inference really brings some difficulties that I couldn't deal with.
Hi everyone, the predict_batch approach above allows you to run inference with batch size > 1. However, I have a problem with reproducibility: if a certain (image, text) pair is placed at a different position in the batch, the output is different. It seems that the elements of the batch are not processed independently...
{0: {'seat': 0.4972114861011505, 'stall': 0.332093745470047, 'toilet': 0.783160388469696}, 1: {'cat': 0.45665207505226135, 'monitor': 0.5673973560333252, 'picture': 0.20609396696090698, 'that': 0.05899302288889885}, 2: {'dress': 0.777637779712677, 'man': 0.8794577121734619, 'woman': 0.723700225353241}, 3: {'grass': 0.7136312127113342, 'truck': 0.7716715335845947}, 4: {'couch': 0.7266196608543396, 'dogs': 0.6962682008743286}, 5: {'man': 0.8588370084762573, 'rail': 0.35756751894950867, 'snowy hill': 0.4643862843513489}, 6: {'back': 0.2656899094581604, 'cat': 0.843508780002594, 'sofa': 0.6899327039718628, 'window': 0.6876299977302551}, 7: {'cat': 0.46739253401756287, 'mirror': 0.4958229959011078, 'paw': 0.2748376727104187}, 8: {'kitten': 0.26511871814727783, 'mirror': 0.44254618883132935, 'paw': 0.28917092084884644, 'reflection': 0.22627031803131104}, 9: {'children': 0.2913839817047119, 'photo': 0.7339288592338562}, 10: {'chairs': 0.39893320202827454, 'dining table': 0.40886712074279785, 'room': 0.6335236430168152, 'window': 0.40863069891929626}, 11: {'pole': 0.49428969621658325, 'sign': 0.4169153571128845, 'stop lights': 0.41868656873703003}, 12: {'books': 0.3859530985355377, 'kitchen': 0.6933691501617432, 'shelf': 0.35644665360450745, 'stove': 0.6413100361824036, 'teapot': 0.8949454426765442}, 13: {'air': 0.11777491867542267, 'airplane': 0.9213575720787048, 'day': 0.3377494215965271}, 14: {'bed': 0.7950222492218018, 'bottle': 0.42745649814605713, 'child': 0.648800253868103, 'milk': 0.45026877522468567}, 15: {'dog stand': 0.5395100116729736, 'horse': 0.8022459149360657, 'meadow': 0.6759784817695618}}
{0: {'seat': 0.49454519152641296, 'stall': 0.33220207691192627, 'toilet': 0.7824963331222534}, 1: {'cat': 0.44893959164619446, 'monitor': 0.5316653251647949, 'picture': 0.14438998699188232, 'that': 0.01652955636382103}, 2: {'dress': 0.7803278565406799, 'man': 0.8828325867652893, 'woman': 0.7285552024841309}, 3: {'grass': 0.7145284414291382, 'truck': 0.7652426362037659}, 4: {'dogs': 0.6253237724304199, 'leather couch': 0.5814560651779175}, 5: {'man': 0.8741846680641174, 'snowy hill': 0.45280832052230835, 'zigzag rail': 0.4841887056827545}, 6: {'cat': 0.8463238477706909, 'sofa': 0.7171109318733215, 'velvety': 0.29014575481414795, 'window': 0.7127322554588318}, 7: {'cat': 0.4669063687324524, 'mirror': 0.4929966330528259, 'paw': 0.2748376727104187}, 8: {'kitten': 0.2649284899234772, 'mirror': 0.441763311624527, 'paw': 0.29310086369514465, 'reflection': 0.23704566061496735}, 9: {'children': 0.29624688625335693, 'photo': 0.7339288592338562}, 10: {'chairs': 0.39262789487838745, 'dining glass table': 0.4174637198448181, 'room': 0.6102792620658875, 'window': 0.4397774040699005}, 11: {'pole': 0.4657823145389557, 'sign': 0.42167091369628906, 'stop lights': 0.41197091341018677}, 12: {'books': 0.3869951069355011, 'kitchen': 0.6946134567260742, 'shelf': 0.3565586805343628, 'stove': 0.6319350600242615, 'teapot': 0.8900787830352783}, 13: {'air': 0.11124119907617569, 'airplane': 0.9212159514427185, 'day': 0.34466397762298584}, 14: {'bed': 0.7652426362037659, 'child': 0.641871452331543, 'frothy bottle': 0.5389838218688965, 'milk': 0.4690186083316803}, 15: {'dog': 0.8864204287528992, 'horse': 0.8353492617607117, 'meadow': 0.6918097138404846}}
The two dictionaries above come from two forward passes over the same images, with prompts built following the recommended format "class1 . class2 .". If you consider the first pair:

(img, prompt1), where prompt1 = "seat . stall . toilet ."
(img, prompt2), where prompt2 = "seat . stall . toilet ."

(namely prompt1 == prompt2 and the image is identical)

the probabilities are a little different:

{'seat': 0.4972114861011505, 'stall': 0.332093745470047, 'toilet': 0.783160388469696}
{'seat': 0.49454519152641296, 'stall': 0.33220207691192627, 'toilet': 0.7824963331222534}
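A minimal way to check this, for anyone who wants to reproduce it (a sketch: img_a and img_b are already-transformed image tensors, caption_a and caption_b are the corresponding preprocessed prompts, and device is e.g. "cuda"):

# two batches containing the same (image, caption) pair at different positions
batch_1 = torch.stack([img_a, img_b]).to(device)   # pair A at position 0
batch_2 = torch.stack([img_b, img_a]).to(device)   # pair A at position 1

with torch.no_grad():
    out_1 = model(batch_1, captions=[caption_a, caption_b])
    out_2 = model(batch_2, captions=[caption_b, caption_a])

# if batch elements were processed independently, these should agree up to float noise
diff = (out_1["pred_logits"][0] - out_2["pred_logits"][1]).abs().max()
print(f"max |logit diff| for the same (image, caption) pair: {diff.item():.6f}")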
@rentainhe

> We will try to update a batch inference API for the users

How does one run GroundingDINO batched?