geopavlakos / hamer

HaMeR: Reconstructing Hands in 3D with Transformers
https://geopavlakos.github.io/hamer/

Hamer demo.py bounding box speed #63

Closed: richardrl closed this issue 1 week ago

richardrl commented 3 weeks ago

Hi, I am using the detectron2 Cascade Mask R-CNN ViTDet-H checkpoint in the HaMeR demo, and inference on a single image is much slower than the detectron2 model zoo numbers: I get 0.78 s/im vs. the model zoo's ~0.2 s/im.

This is the line I timed: https://github.com/geopavlakos/hamer/blob/dc19e5686198a7c3fc3938bff3951f238a85fd11/demo.py#L81

I have an RTX A5000

Is there a setting in the config file that can speed things up?
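For reference, single-image GPU timings can be skewed by warm-up and asynchronous kernel launches. Below is a minimal timing sketch that accounts for both; wrapping the demo's detector call as time_call(detector, img_cv2) is an assumption about the surrounding code in demo.py, not a verified snippet from the repo.

```python
import time
import torch

def time_call(fn, *args, warmup=3, iters=10):
    # Measure average latency of a GPU-backed callable, with warm-up and
    # explicit CUDA synchronization so asynchronous kernels are included.
    # Intended use here (an assumption): time_call(detector, img_cv2) around
    # the demo.py line linked above.
    for _ in range(warmup):
        fn(*args)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        fn(*args)
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters
```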

geopavlakos commented 3 weeks ago

We observed that the ViTDet model has the highest accuracy and robustness across most scenarios. That said, it can be overkill in cases where the person is easy to detect. We already provide a second option for the body detector (by setting --body_detector=regnety), which speeds up detection. Other person detectors (e.g., YOLO) could work equally well in most cases while also being faster.

richardrl commented 2 weeks ago

Hi @geopavlakos thanks for your response!

Are you using person detectors because there are no good hand detectors available?

It seems the only point of the person detector is to provide a cropped image for better ViTPose hand keypoint detection.

Also, do you know of any way to leverage hand center-point labels? I am trying to label Ego4D, where the hand center point is annotated but bounding boxes (with scale) are not given.

geopavlakos commented 2 weeks ago

We experimented with both strategies for hand detection (using a dedicated hand bbox detector vs. using a whole-body keypoint detector). I had a better experience with the strategy we follow in the demo, although each one has different failure modes (please check the caption here for a more in-depth analysis).
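For concreteness, the keypoint-based route could look roughly like the sketch below: take the 21 per-hand 2D keypoints from a whole-body pose model such as ViTPose, keep the confident ones, and pad their extent into a box. The function name, confidence threshold, and padding factor are illustrative assumptions, not the demo's exact values.

```python
import numpy as np

def hand_bbox_from_keypoints(hand_keypoints, conf_thresh=0.5, pad=1.2):
    # hand_keypoints: (21, 3) array of (x, y, confidence) for one hand.
    # Illustrative sketch only; threshold and padding are not the demo's exact values.
    keyp = np.asarray(hand_keypoints, dtype=float)
    valid = keyp[:, 2] > conf_thresh
    if valid.sum() < 3:                    # too few confident keypoints to trust
        return None
    x_min, y_min = keyp[valid, :2].min(axis=0)
    x_max, y_max = keyp[valid, :2].max(axis=0)
    cx, cy = (x_min + x_max) / 2.0, (y_min + y_max) / 2.0
    half = pad * max(x_max - x_min, y_max - y_min) / 2.0
    return np.array([cx - half, cy - half, cx + half, cy + half])  # x1, y1, x2, y2
```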

Even if you have the center of the hand, you will also need to estimate the scale. The most straightforward solution would probably be to use something like what the demo does for hand detection and then just verify each detection based on the annotated hand center.
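One hedged sketch of that verification step, assuming N detected hand boxes in (x1, y1, x2, y2) format and one annotated center per hand; the helper name and distance threshold are hypothetical, not part of the HaMeR demo:

```python
import numpy as np

def match_detection_to_center(bboxes, center, max_dist=None):
    # bboxes: (N, 4) detected hand boxes as (x1, y1, x2, y2);
    # center: annotated (x, y) hand center, e.g. from an Ego4D label.
    # Hypothetical helper for illustration, not part of the HaMeR demo.
    bboxes = np.asarray(bboxes, dtype=float)
    box_centers = np.stack([(bboxes[:, 0] + bboxes[:, 2]) / 2.0,
                            (bboxes[:, 1] + bboxes[:, 3]) / 2.0], axis=1)
    dists = np.linalg.norm(box_centers - np.asarray(center, dtype=float), axis=1)
    best = int(dists.argmin())
    if max_dist is not None and dists[best] > max_dist:
        return None                        # no detection close enough to the label
    return best                            # index of the matching detection
```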