Hello, thank you for sharing your incredible work!
I'm now trying to reproduce the test performance of LoCoNet and Light-ASD on the Columbia dataset using the code you provided. The LoCoNet performance I measured was lower than what you reported in the paper, so I have two questions about the test process.
How did you select the context speakers (the two faces other than the target speaker)?
When there were more than three faces in a frame, I randomly selected two of the detected non-target faces. When there were not enough faces in the scene, I repeated the target speaker's face to fill the context.
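For clarity, here is a minimal sketch of my selection logic (the function name and structure are my own, not taken from your code):

```python
import random

def select_context_faces(faces, target_idx, num_context=2):
    """Pick context faces for a single frame.

    Samples two non-target faces when enough are detected; otherwise
    pads the context with copies of the target speaker's face.
    """
    others = [face for i, face in enumerate(faces) if i != target_idx]
    if len(others) >= num_context:
        return random.sample(others, num_context)
    # Not enough other faces in the frame: repeat the target's face.
    return others + [faces[target_idx]] * (num_context - len(others))
```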
With this approach, I could only achieve an average mAP of 50.22 using the pretrained LoCoNet model (trained on AVA only) provided in their GitHub repository.
Is there any difference between your evaluation method and mine?
Do the hyperparameters (facedetScale, minTrack, numFailedDet, minFaceSize, cropScale, ...) need to be tuned for the Columbia dataset?
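For reference, I left the parser arguments at what I believe are the TalkNet-style demo defaults (the values below are my assumption of those defaults; please correct me if yours differ):

```python
import argparse

# TalkNet-style preprocessing arguments. The default values here are my
# assumption of the demo defaults, not confirmed from this repository.
parser = argparse.ArgumentParser()
parser.add_argument('--facedetScale', type=float, default=0.25,
                    help='Frames are downscaled by this factor for face detection')
parser.add_argument('--minTrack', type=int, default=10,
                    help='Minimum number of frames for a face track')
parser.add_argument('--numFailedDet', type=int, default=10,
                    help='Missed detections tolerated before a track is dropped')
parser.add_argument('--minFaceSize', type=int, default=1,
                    help='Minimum face size in pixels')
parser.add_argument('--cropScale', type=float, default=0.40,
                    help='Padding ratio around the detected face crop')
args = parser.parse_args()
```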
By looking at the visualized results of the test code, I noticed that bounding boxes for some small faces don't show up. Do I need to change the parser hyperparameters to detect these faces, or is it okay to leave them as they are?

Thank you for your answer in advance!
When we were writing the paper, the LoCoNet source code had not yet been open-sourced, so the LoCoNet results in our paper were copied directly from the original LoCoNet paper.
We adopted the testing code from TalkNet for our experiments on the Columbia dataset, so I believe it is appropriate to keep the hyperparameters unchanged.