Junhua-Liao / Light-ASD

The repository for the IEEE CVPR 2023 paper "A Light Weight Model for Active Speaker Detection".
MIT License

Test code for LoCoNet using Columbia dataset #6

Closed: syl4356 closed this issue 1 year ago

syl4356 commented 1 year ago

Hello, thank you for sharing your incredible work!

I'm now trying to reproduce the test performance of LoCoNet and Light-ASD on the Columbia dataset, using the code that you provided. The LoCoNet performance I measured was lower than reported in your paper, so I have two questions about the test process.

  1. How did you select the context speakers (the two faces other than the target speaker)? If there were more than 3 faces in the same frame, I randomly selected two faces from those detected; if there were not enough faces in the scene, I repeated the target speaker's face. With this setup, I could only achieve an average mAP of 50.22 using the pretrained LoCoNet model (trained on AVA only) provided in their GitHub. Is there any difference between your evaluation method and mine?

  2. Do the hyperparameters (facedetScale, minTrack, numFailedDet, minFaceSize, cropScale, ...) need to be tuned for the Columbia dataset? Looking at the visualized output of the test code, I noticed that some bounding boxes for small faces don't show up. Do I need to change the hyperparameters in the parser to detect these, or is it okay to leave them as they are?
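For reference, my selection scheme from question 1 (random sampling when there are surplus faces, repeating the target speaker's face when there are too few) looks roughly like this. The function name, arguments, and the per-frame face list are my own illustrative names, not code from either repository:

```python
import random

def select_context_faces(faces, target_idx, num_context=2):
    """Pick `num_context` context faces for the target speaker in one frame.

    faces:      list of face crops (or IDs) detected in the frame
    target_idx: index of the target speaker within `faces`
    """
    # All faces in the frame except the target speaker are candidates.
    candidates = [f for i, f in enumerate(faces) if i != target_idx]
    if len(candidates) >= num_context:
        # Surplus faces: sample the context speakers at random.
        return random.sample(candidates, num_context)
    # Too few faces: pad by repeating the target speaker's own face.
    padding = [faces[target_idx]] * (num_context - len(candidates))
    return candidates + padding
```

So for a frame with only the target speaker, the context slots are both filled with the target's own face.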

Thank you for your answer in advance!

Junhua-Liao commented 1 year ago

Thank you for your interest in our work.

  1. When we were writing the paper, the LoCoNet source code had not yet been open-sourced, so the LoCoNet results in the paper were copied directly from the original paper.
  2. We adopted the testing code from TalkNet for our experiments on the Columbia dataset, so I believe it is appropriate to keep the hyperparameters unchanged.
syl4356 commented 1 year ago

Thank you for your reply 😄