richardrl closed this issue 1 week ago
We observed that the ViTDet model has the highest accuracy and robustness across most scenarios. That said, it can be overkill in cases where the person is easy to detect.
We already provide a second option for the body detector (set `--body_detector regnety`), which can accelerate detection. Other person detectors (e.g., YOLO) could work equally well in most cases, while also being faster.
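For reference, switching detectors is just a flag on the demo script (folder names here are placeholders; check the repo's README for the full set of demo arguments):

```shell
# Faster RegNetY-based person detector instead of the default ViTDet
python demo.py --img_folder example_data --out_folder demo_out --body_detector regnety
```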
Hi @geopavlakos thanks for your response!
Are you using person detectors because there are no good hand detectors available?
It seems the only purpose of the person detector is to provide a cropped image for better ViTPose hand keypoint detection.
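That cropping step can be sketched as follows (function name and padding factor are illustrative, not the repo's actual code): expand the person box slightly, clamp it to the image, and hand the crop to the keypoint detector.

```python
import numpy as np

def crop_person(image: np.ndarray, bbox, pad: float = 0.1) -> np.ndarray:
    """Crop a person bbox (x1, y1, x2, y2) with padding, clamped to the image."""
    x1, y1, x2, y2 = bbox
    w, h = x2 - x1, y2 - y1
    x1 = max(int(x1 - pad * w), 0)
    y1 = max(int(y1 - pad * h), 0)
    x2 = min(int(x2 + pad * w), image.shape[1])
    y2 = min(int(y2 + pad * h), image.shape[0])
    # The keypoint detector then runs on this crop instead of the full frame.
    return image[y1:y2, x1:x2]
```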
Also, do you know of any way to leverage the hand center-point labels? I am trying to label Ego4D, where hand center points are annotated but bounding boxes (with scale) are not.
We experimented with both strategies for hand detection (a hand bbox detector vs. a whole-body keypoint detector). I had a better experience with the strategy we follow in the demo, although each has different failure modes (please check the caption here for a more in-depth analysis).
Even if you have the center of the hand, you will still need to estimate the scale. The most straightforward solution would probably be to run something like what the demo does for hand detection, and then verify each detection against the annotated hand center.
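A minimal sketch of that verification idea (all names, the fixed `scale`, and the distance threshold are hypothetical choices, not anything from the repo or Ego4D tooling):

```python
import numpy as np

def bbox_from_center(cx: float, cy: float, scale: float = 200.0):
    """Hypothetical fallback: build a square bbox around a labeled hand
    center. `scale` (box side in pixels) has to be guessed or estimated,
    since Ego4D labels give only the center."""
    half = scale / 2.0
    return (cx - half, cy - half, cx + half, cy + half)

def keep_detections_near_centers(boxes, centers, max_dist: float = 50.0):
    """Keep detected boxes whose center lies within max_dist pixels of an
    annotated hand center (threshold is illustrative)."""
    kept = []
    for (x1, y1, x2, y2) in boxes:
        bx, by = (x1 + x2) / 2.0, (y1 + y2) / 2.0
        if any(np.hypot(bx - cx, by - cy) <= max_dist for cx, cy in centers):
            kept.append((x1, y1, x2, y2))
    return kept
```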
Hi, I am using the detectron2 Cascade Mask R-CNN ViTDet-H checkpoint in the demo for HaMeR, and single-image inference is much slower than the detectron2 model zoo numbers: I get 0.78 s/im vs. the model zoo's ~0.2 s/im.
This is the line I timed: https://github.com/geopavlakos/hamer/blob/dc19e5686198a7c3fc3938bff3951f238a85fd11/demo.py#L81
I have an RTX A5000
Is there a setting in the config file that can speed things up?
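One common source of such gaps is the measurement itself rather than a config setting: the first call pays one-off warm-up cost, GPU execution is asynchronous so unsynchronized timing can be misleading, and model zoo numbers are measured on different hardware and input sizes. A generic timing helper along these lines (names are my own; for a CUDA model you would pass `sync=torch.cuda.synchronize`) separates warm-up from the measured runs:

```python
import time

def benchmark(fn, n_warmup: int = 3, n_runs: int = 10, sync=None) -> float:
    """Return mean seconds per call of fn(), excluding warm-up runs.
    `sync` is an optional barrier (e.g. torch.cuda.synchronize) so that
    pending GPU work is included in the measured interval."""
    for _ in range(n_warmup):
        fn()
    if sync:
        sync()
    start = time.perf_counter()
    for _ in range(n_runs):
        fn()
    if sync:
        sync()
    return (time.perf_counter() - start) / n_runs
```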