IDEA-Research / GroundingDINO

[ECCV 2024] Official implementation of the paper "Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection"
https://arxiv.org/abs/2303.05499
Apache License 2.0

GroundingDINO Python Package #88

Open FANGAreNotGnu opened 1 year ago

FANGAreNotGnu commented 1 year ago

Hi, thanks for the great work! Is there a plan for the official pypi release?

rentainhe commented 1 year ago

Hi, thanks for the great work! Is there a plan for the official pypi release?

Sure! For more convenient usage, we will try to publish a PyPI version in a future release.

tonyhoo commented 1 year ago

Would like to have this available in PyPI as well for easy installation

giswqs commented 1 year ago

I have added the package to PyPI and will try to get it on conda-forge as well. I would be happy to add maintainers to the package if anyone is interested.

PyPI: https://pypi.org/project/groundingdino-py GitHub: https://github.com/giswqs/GroundingDINO

pip install groundingdino-py

PS: There are some other packages on PyPI with groundingdino in the name, so I had to use the alternative name groundingdino-py, as PyPI does not allow registering the name groundingdino.

I wanted to add groundingdino to PyPI for the downstream package segment-geospatial https://github.com/opengeos/segment-geospatial/issues/62#issuecomment-1557617029

yeldarby commented 1 year ago

I have added the package to PyPI and will try to get it on conda-forge as well. I would be happy to add maintainers to the package if anyone is interested.

@giswqs - looks like that strips out the CUDA stuff so only runs on CPU (but with no warning); is that correct?

giswqs commented 1 year ago

@yeldarby I think it can still utilize GPU. The GPU installation is handled by torch-gpu, so GroundingDINO does not have to handle it.

I learned it from @darshats at https://github.com/IDEA-Research/GroundingDINO/issues/8?utm_source=pocket_saves#issuecomment-1555930299 and his repo https://github.com/IDEA-Research/GroundingDINO/compare/main...darshats:GroundingDINO:main

I have been using GroundingDINO with the samgeo-geospatial package, and it seems to work fine. https://samgeo.gishub.org/examples/text_prompts/

yeldarby commented 1 year ago

The GPU installation is handled by torch-gpu, so GroundingDINO does not have to handle it.

GroundingDINO has these custom C++ and CUDA files: https://github.com/IDEA-Research/GroundingDINO/tree/main/groundingdino/models/GroundingDINO/csrc

Are the compiled versions of those not needed to run with GPU?
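One way to answer this empirically is to check whether the compiled extension module is importable. The module name `_C` is taken from the traceback further down in this thread (`_C.ms_deform_attn_forward`); the sketch below only probes for it and does not import the package itself:

```python
import importlib.util

def has_compiled_ops(pkg: str = "groundingdino") -> bool:
    """Return True if the package's compiled extension module (`_C`) is importable.

    GroundingDINO's ms_deform_attn.py falls back to CPU-only mode when the
    `_C` extension cannot be imported, so this mirrors that check.
    """
    try:
        if importlib.util.find_spec(pkg) is None:
            return False  # the package itself is not installed
        return importlib.util.find_spec(f"{pkg}._C") is not None
    except ImportError:
        return False  # parent is not a package, or probing failed

if __name__ == "__main__":
    print("compiled C++/CUDA ops available:", has_compiled_ops())
```

If this prints False for an installed groundingdino, the deformable-attention kernels were not built and inference will run on the CPU fallback.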

giswqs commented 1 year ago

It failed to compile on my Linux machine with the latest CUDA, which is why I had to remove the CUDA extensions from GroundingDINO. After that, the installation went smoothly, and I was able to use it with SAM. It is pretty fast. See the example below. However, I am not sure whether GroundingDINO uses the GPU or not in this case, as I am not a GroundingDINO expert. It would be great if GroundingDINO could make the installation a bit smoother. There are many installation-related issues reported here, and I spent hours trying to install it.

https://github.com/IDEA-Research/GroundingDINO/assets/5016453/25f6766d-312e-47cd-820f-28f9af8ba6b6

yeldarby commented 1 year ago

It would be great if GroundingDINO could make the installation a bit smoother. There are many installation-related issues reported here, and I spent hours trying to install it.

Definitely agree! I've been trying to get generic wheels to build linked to various versions of PyTorch and CUDA with torch-extension-builder, but haven't quite been able to get it working.

I posted a bounty on Replit as a bit of an incentive if anyone wants to make the install more robust! https://replit.com/bounties/@roboflow/package-open-source

yeldarby commented 1 year ago

As a followup, I ran a benchmark:

So GPU acceleration definitely makes a big difference (and the fork appears to be running mostly on the GPU). That's probably good enough for my purposes! Besides the 15% slowdown, the only downside is needing to restart the runtime after installing, due to: The following packages were previously imported in this runtime: [cycler,pyparsing].

Do users have to supply their own config/GroundingDINO_SwinT_OGC.py to use your package from pip? Or is there an easy way to use the bundled one?

giswqs commented 1 year ago

@yeldarby Thanks for sharing the benchmark. It is great to know the pip package does run on GPU.

The pip package already includes the config files. It only removes the CUDA extensions to make installation easier; all other files remain the same as in the original GroundingDINO repo. See https://github.com/giswqs/GroundingDINO/tree/main/groundingdino/config
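As a side note, data files that ship inside an installed package can be located with the standard library, so users never have to hard-code a site-packages path. A small sketch (the groundingdino config path in the comment is an assumption about the package layout, based on the repo link above):

```python
from importlib.resources import files  # Python 3.9+

def bundled_file(package: str, *parts: str) -> str:
    """Return the filesystem path of a file shipped inside an installed package."""
    resource = files(package)
    for part in parts:
        resource = resource / part
    if not resource.is_file():
        raise FileNotFoundError(f"{'/'.join(parts)} not found in package {package!r}")
    return str(resource)

# Hypothetical usage for the pip package discussed above:
# cfg = bundled_file("groundingdino", "config", "GroundingDINO_SwinT_OGC.py")
```

This keeps notebooks portable: the same call works whether the package was installed from PyPI, conda-forge, or an editable checkout.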

rohit901 commented 1 year ago

Thanks a lot for providing the pip package @giswqs. I also faced a lot of issues while trying to compile/install this package on my remote machine, and the pip package seems to work properly. Could the authors @rentainhe @SlongLiu add this pip installation to the README?

giswqs commented 1 year ago

If anyone wants to be a maintainer of the pypi package, please let me know. I would be happy to add maintainers or transfer ownership.

rentainhe commented 1 year ago

Thanks a lot for providing the pip package @giswqs. I also faced a lot of issues while trying to compile/install this package on my remote machine, and the pip package seems to work properly. Could the authors @rentainhe @SlongLiu add this pip installation to the README?

Sure! Thank you so much for providing this! We will highlight it in the README and pin this issue. You can refine the issue title to let more people know about this update~

rohit901 commented 1 year ago

An update on the pip package, @giswqs @rentainhe: yesterday I did not actually run inference with the pip installation; I had only imported the module, and since it loaded fine I assumed it was working.

However, today I tried to run inference on the GPU after installing the library from the pypi package provided by @giswqs. The moment I import the library

from groundingdino.util.inference import load_model, load_image, predict, annotate

I get the following warning that it failed to load the custom C++ ops:

/home/rohit.bharadwaj/Documents/Projects/GroundingDINO/groundingdino/models/GroundingDINO/ms_deform_attn.py:31: UserWarning: Failed to load custom C++ ops. Running on CPU mode Only!
  warnings.warn("Failed to load custom C++ ops. Running on CPU mode Only!")

Further, when trying to run inference by passing images to the model on the GPU, I get the following error and the code does not work:

NameError: name '_C' is not defined. Complete logs:

8 with torch.no_grad():
----> 9     output = model(image, captions = TEXT_PROMPT_LIST)
     11 prediction_logits = output["pred_logits"].cpu().sigmoid()  # prediction_logits.shape = (batch, nq, 256)
     12 prediction_boxes = output["pred_boxes"].cpu() # prediction_boxes.shape = (batch, nq, 4)

File ~/.conda/envs/RNCDL/lib/python3.8/site-packages/torch/nn/modules/module.py:1102, in Module._call_impl(self, *input, **kwargs)
   1098 # If we don't have any hooks, we want to skip the rest of the logic in
   1099 # this function, and just call forward.
   1100 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1101         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1102     return forward_call(*input, **kwargs)
   1103 # Do not call functions when jit is used
   1104 full_backward_hooks, non_full_backward_hooks = [], []

File ~/Documents/Projects/GroundingDINO/groundingdino/models/GroundingDINO/groundingdino.py:313, in GroundingDINO.forward(self, samples, targets, **kw)
    310         poss.append(pos_l)
    312 input_query_bbox = input_query_label = attn_mask = dn_meta = None
--> 313 hs, reference, hs_enc, ref_enc, init_box_proposal = self.transformer(
    314     srcs, masks, input_query_bbox, poss, input_query_label, attn_mask, text_dict
    315 )
    317 # deformable-detr-like anchor update
    318 outputs_coord_list = []

File ~/.conda/envs/RNCDL/lib/python3.8/site-packages/torch/nn/modules/module.py:1102, in Module._call_impl(self, *input, **kwargs)
   1098 # If we don't have any hooks, we want to skip the rest of the logic in
   1099 # this function, and just call forward.
   1100 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1101         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1102     return forward_call(*input, **kwargs)
   1103 # Do not call functions when jit is used
   1104 full_backward_hooks, non_full_backward_hooks = [], []

File ~/Documents/Projects/GroundingDINO/groundingdino/models/GroundingDINO/transformer.py:258, in Transformer.forward(self, srcs, masks, refpoint_embed, pos_embeds, tgt, attn_mask, text_dict)
    253 enc_topk_proposals = enc_refpoint_embed = None
    255 #########################################################
    256 # Begin Encoder
    257 #########################################################
--> 258 memory, memory_text = self.encoder(
    259     src_flatten,
    260     pos=lvl_pos_embed_flatten,
    261     level_start_index=level_start_index,
    262     spatial_shapes=spatial_shapes,
    263     valid_ratios=valid_ratios,
    264     key_padding_mask=mask_flatten,
    265     memory_text=text_dict["encoded_text"],
    266     text_attention_mask=~text_dict["text_token_mask"],
    267     # we ~ the mask . False means use the token; True means pad the token
    268     position_ids=text_dict["position_ids"],
    269     text_self_attention_masks=text_dict["text_self_attention_masks"],
    270 )
    271 #########################################################
    272 # End Encoder
    273 # - memory: bs, \sum{hw}, c
   (...)
    277 # - enc_intermediate_refpoints: None or (nenc+1, bs, nq, c) or (nenc, bs, nq, c)
    278 #########################################################
    279 text_dict["encoded_text"] = memory_text

File ~/.conda/envs/RNCDL/lib/python3.8/site-packages/torch/nn/modules/module.py:1102, in Module._call_impl(self, *input, **kwargs)
   1098 # If we don't have any hooks, we want to skip the rest of the logic in
   1099 # this function, and just call forward.
   1100 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1101         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1102     return forward_call(*input, **kwargs)
   1103 # Do not call functions when jit is used
   1104 full_backward_hooks, non_full_backward_hooks = [], []

File ~/Documents/Projects/GroundingDINO/groundingdino/models/GroundingDINO/transformer.py:576, in TransformerEncoder.forward(self, src, pos, spatial_shapes, level_start_index, valid_ratios, key_padding_mask, memory_text, text_attention_mask, pos_text, text_self_attention_masks, position_ids)
    574 # main process
    575 if self.use_transformer_ckpt:
--> 576     output = checkpoint.checkpoint(
    577         layer,
    578         output,
    579         pos,
    580         reference_points,
    581         spatial_shapes,
    582         level_start_index,
    583         key_padding_mask,
    584     )
    585 else:
    586     output = layer(
    587         src=output,
    588         pos=pos,
   (...)
    592         key_padding_mask=key_padding_mask,
    593     )

File ~/.conda/envs/RNCDL/lib/python3.8/site-packages/torch/utils/checkpoint.py:211, in checkpoint(function, *args, **kwargs)
    208 if kwargs:
    209     raise ValueError("Unexpected keyword arguments: " + ",".join(arg for arg in kwargs))
--> 211 return CheckpointFunction.apply(function, preserve, *args)

File ~/.conda/envs/RNCDL/lib/python3.8/site-packages/torch/utils/checkpoint.py:90, in CheckpointFunction.forward(ctx, run_function, preserve_rng_state, *args)
     87 ctx.save_for_backward(*tensor_inputs)
     89 with torch.no_grad():
---> 90     outputs = run_function(*args)
     91 return outputs

File ~/.conda/envs/RNCDL/lib/python3.8/site-packages/torch/nn/modules/module.py:1102, in Module._call_impl(self, *input, **kwargs)
   1098 # If we don't have any hooks, we want to skip the rest of the logic in
   1099 # this function, and just call forward.
   1100 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1101         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1102     return forward_call(*input, **kwargs)
   1103 # Do not call functions when jit is used
   1104 full_backward_hooks, non_full_backward_hooks = [], []

File ~/Documents/Projects/GroundingDINO/groundingdino/models/GroundingDINO/transformer.py:785, in DeformableTransformerEncoderLayer.forward(self, src, pos, reference_points, spatial_shapes, level_start_index, key_padding_mask)
    780 def forward(
    781     self, src, pos, reference_points, spatial_shapes, level_start_index, key_padding_mask=None
    782 ):
    783     # self attention
    784     # import ipdb; ipdb.set_trace()
--> 785     src2 = self.self_attn(
    786         query=self.with_pos_embed(src, pos),
    787         reference_points=reference_points,
    788         value=src,
    789         spatial_shapes=spatial_shapes,
    790         level_start_index=level_start_index,
    791         key_padding_mask=key_padding_mask,
    792     )
    793     src = src + self.dropout1(src2)
    794     src = self.norm1(src)

File ~/.conda/envs/RNCDL/lib/python3.8/site-packages/torch/nn/modules/module.py:1102, in Module._call_impl(self, *input, **kwargs)
   1098 # If we don't have any hooks, we want to skip the rest of the logic in
   1099 # this function, and just call forward.
   1100 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1101         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1102     return forward_call(*input, **kwargs)
   1103 # Do not call functions when jit is used
   1104 full_backward_hooks, non_full_backward_hooks = [], []

File ~/Documents/Projects/GroundingDINO/groundingdino/models/GroundingDINO/ms_deform_attn.py:338, in MultiScaleDeformableAttention.forward(self, query, key, value, query_pos, key_padding_mask, reference_points, spatial_shapes, level_start_index, **kwargs)
    335     sampling_locations = sampling_locations.float()
    336     attention_weights = attention_weights.float()
--> 338 output = MultiScaleDeformableAttnFunction.apply(
    339     value,
    340     spatial_shapes,
    341     level_start_index,
    342     sampling_locations,
    343     attention_weights,
    344     self.im2col_step,
    345 )
    347 if halffloat:
    348     output = output.half()

File ~/Documents/Projects/GroundingDINO/groundingdino/models/GroundingDINO/ms_deform_attn.py:53, in MultiScaleDeformableAttnFunction.forward(ctx, value, value_spatial_shapes, value_level_start_index, sampling_locations, attention_weights, im2col_step)
     42 @staticmethod
     43 def forward(
     44     ctx,
   (...)
     50     im2col_step,
     51 ):
     52     ctx.im2col_step = im2col_step
---> 53     output = _C.ms_deform_attn_forward(
     54         value,
     55         value_spatial_shapes,
     56         value_level_start_index,
     57         sampling_locations,
     58         attention_weights,
     59         ctx.im2col_step,
     60     )
     61     ctx.save_for_backward(
     62         value,
     63         value_spatial_shapes,
   (...)
     66         attention_weights,
     67     )
     68     return output

NameError: name '_C' is not defined

~~Perhaps this is because some of the CUDA-related files were removed while compiling the pypi package, @giswqs? Is it possible for someone to provide an updated pypi package compiled with these CUDA files as well?~~

rohit901 commented 1 year ago

Sorry for the above comment and the confusion; I verified with inference and it seems to be working fine.

I was running the code from the wrong directory, so it was loading the modules from the current directory instead of from the pypi package.

So to confirm: the above pypi package seems to work even with the GPU. Thank you again @giswqs!
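The directory mix-up described above is easy to catch before debugging: ask Python where it would actually resolve the import from. A small diagnostic sketch (run from the repo checkout vs. anywhere else, the printed path differs):

```python
import importlib.util

def module_origin(name: str) -> str:
    """Report where a module would be imported from, to spot a local
    source tree shadowing a pip-installed package."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        return "not found"
    return spec.origin or "namespace package"

if __name__ == "__main__":
    # If this prints a path under your project directory rather than
    # site-packages, the local checkout is shadowing the pip install.
    print(module_origin("groundingdino"))
```

Running this once at the top of a notebook makes it obvious which copy of the package (and which compiled ops, if any) is in use.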

giswqs commented 1 year ago

I have added groundingdino to conda-forge. It can be easily installed with conda. Let me know if anyone is interested in becoming a maintainer of the conda-forge package.

mamba install -c conda-forge groundingdino-py

ash368 commented 1 year ago

I have added the package to PyPI and will try to get it on conda-forge as well. I would be happy to add maintainers to the package if anyone is interested.

PyPI: https://pypi.org/project/groundingdino-py GitHub: https://github.com/giswqs/GroundingDINO

pip install groundingdino-py

PS: There are some other packages on PyPI with groundingdino in the name, so I had to use the alternative name groundingdino-py, as PyPI does not allow registering the name groundingdino.

I wanted to add groundingdino to PyPI for the downstream package segment-geospatial opengeos/segment-geospatial#62 (comment)

This saved me, thank you!

MLRadfys commented 11 months ago

Hi and thanks for providing Grounding Dino as a pip package @giswqs !

I compared the inference output of the original repo with the output of the package, and they do not seem to be the same. For label inference, both the original repo and the pip package give the same result; however, the pip package seems to have issues with sentence prompts.

Has anyone else encountered this issue?

Cheers,

M

xiaobanni commented 10 months ago

Hi and thanks for providing Grounding Dino as a pip package @giswqs !

I compared the inference output of the original repo with the output of the package, and they do not seem to be the same. For label inference, both the original repo and the pip package give the same result; however, the pip package seems to have issues with sentence prompts.

Has anyone else encountered this issue?

Cheers,

M

Can you provide some specific examples for better analysis?

iorileslie commented 5 months ago

How can I install or use this on arm64?