Wow, glad to know that this code works, since I did not check it.
I think that is fine. The reason is that I randomly split the clips into two parts, and this change point is random, so the list is not fixed when you generate it. You can see that I use many random functions in the code. I think it will not affect the result too much.
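A minimal sketch of the random change-point split described above (not the repository's exact code; `random_two_part_split` and `min_part` are hypothetical names for illustration). Because the cut point is drawn at random, regenerating the lists gives slightly different numbers each time:

```python
import random

def random_two_part_split(clip_length, min_part=1.0):
    # `min_part` is a hypothetical lower bound so neither part is too short
    change_point = random.uniform(min_part, clip_length - min_part)
    return change_point, clip_length - change_point

print(random_two_part_split(5.69))  # different on every run
```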
Note that there are some differences between the data loader for TalkSet and the one for AVA. I can share our dataloader code for your reference (the following link). Hope this can help you.
Also, some explanation: during training, I randomly select 1, 2, 4, or 6 seconds of data for training (in demoTalkNet.py you can find that I use the same method in evaluation). Also, I use MUSAN and RIR for data augmentation (in AVA, I did not use additional augmentation data). The other parts are similar, I think.
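A hedged sketch of the random-duration selection described above, assuming raw 16 kHz audio samples aligned with 25 fps video frames (the actual dataloader may operate on extracted features instead):

```python
import random

def random_crop(audio, video, fps=25, sample_rate=16000):
    """Crop aligned audio samples and video frames to a random duration."""
    length = random.choice([1, 2, 4, 6])  # seconds, as described above
    max_start = len(video) / fps - length
    start = random.uniform(0, max(max_start, 0))
    a0 = int(start * sample_rate)
    v0 = int(start * fps)
    return (audio[a0:a0 + length * sample_rate],
            video[v0:v0 + length * fps])
```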
The preparation of these two augmentation datasets can be found here: https://github.com/joonson/voxceleb_unsupervised
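For reference, a hedged sketch of the two augmentations mentioned above: additive noise from a MUSAN file at a chosen SNR, and reverberation by convolving with a room impulse response (RIR). It assumes mono numpy arrays at the same sample rate; file paths are placeholders, and the linked repository shows the full preparation pipeline:

```python
import numpy
import soundfile
from scipy import signal

def add_noise(clean, noise_path, snr_db=10):
    noise, _ = soundfile.read(noise_path)
    noise = numpy.resize(noise, clean.shape)  # loop/trim noise to match length
    clean_power = numpy.mean(clean ** 2) + 1e-8
    noise_power = numpy.mean(noise ** 2) + 1e-8
    scale = numpy.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + scale * noise

def add_reverb(clean, rir_path):
    rir, _ = soundfile.read(rir_path)
    rir = rir / (numpy.sqrt(numpy.sum(rir ** 2)) + 1e-8)  # normalize energy
    return signal.convolve(clean, rir, mode='full')[:len(clean)]
```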
Thank you for the kind explanation! I will carefully read your dataloader code and augmentation strategy!! About the gap in the TAudio labels: I read the 'generate_TAudio' function but did not find any 'random' code.
Sorry, my bad, I gave the wrong explanation.
In VoxCeleb2, I found that some video and audio are not synchronized (their lengths are very different), and there is some broken data. So I threw away the broken data and re-extracted the audio from the VoxCeleb2 videos (1,091,175 utterances) to replace the original VoxCeleb2 audio data (1,092,009 utterances). Because of that step, the length of some .wav files is slightly different from the original length.
I think you can ignore this difference. If you also run into broken videos, or cases where the audio and video have very different lengths, you can just skip that data; the amount of such data is not large.
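A sketch of the "just skip it" check described above, using ffprobe (which must be installed) to compare the audio and video stream durations of a file. The 0.5-second threshold is an assumption for illustration, not a value from the repository:

```python
import subprocess

def stream_duration(path, stream):  # stream: 'a:0' (audio) or 'v:0' (video)
    out = subprocess.run(
        ['ffprobe', '-v', 'error', '-select_streams', stream,
         '-show_entries', 'stream=duration',
         '-of', 'default=noprint_wrappers=1:nokey=1', path],
        capture_output=True, text=True)
    return float(out.stdout.strip())

def is_usable(path, tolerance=0.5):
    try:
        return abs(stream_duration(path, 'a:0')
                   - stream_duration(path, 'v:0')) < tolerance
    except (ValueError, OSError):
        return False  # broken or unreadable file: skip it
```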
Sorry, I forgot to mention it in this repository.
Thank you ✖️ 10086! : ) I will ignore this tiny amount of dirty data and try to retrain the model on the TalkSet I generated myself. I will share the result once training is over.
Sure, good luck!
I modified "batch_size = int(250 / length_total)" to "batch_size = int(60 / length_total)" to fit my 16 GB GPU; other settings were unchanged. The average F1 of my trained model is 96.06 on Columbia ASD, versus 96.04 F1 using your pretrained model, which suggests the result is reproducible by following your guidance. : )
Wow, glad to hear that. Also, you can use your trained model with demoTalkNet.py to check its performance on videos in the wild.
Good luck with your future research!
Thank you for your work and the very detailed explanation! I have downloaded VoxCeleb2 & LRS3, used your 'generate_TalkSet.py' code, and successfully generated TalkSet! But there seems to be a difference in labels between the 'TAudio.txt' I generated and the same file in 'lists_out'. Does this gap have a big impact on training an 'in the wild' model?

My labels:
TAudio id06358/1Dy3Ro1Qqbo/00003 id06358/1Dy3Ro1Qqbo/00003 5.05 0 5.05 0 0
TAudio id00903/k1UwIOqNxwc/00408 id00903/k1UwIOqNxwc/00408 5.63 0 5.63 0 0
TAudio id02728/0cy01XwS_WA/00004 id02728/0cy01XwS_WA/00004 5.37 0 5.37 0 0
TAudio id01071/m7V7Og5SU1Q/00020 id01071/m7V7Og5SU1Q/00020 5.95 0 5.95 0 0

Your labels:
TAudio id06358/1Dy3Ro1Qqbo/00003 id06358/1Dy3Ro1Qqbo/00003 5.12 0 5.12 0 0
TAudio id00903/k1UwIOqNxwc/00408 id00903/k1UwIOqNxwc/00408 5.69 0 5.69 0 0
TAudio id02728/0cy01XwS_WA/00004 id02728/0cy01XwS_WA/00004 5.37 0 5.37 0 0
TAudio id01071/m7V7Og5SU1Q/00020 id01071/m7V7Og5SU1Q/00020 6.01 0 6.01 0 0
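A minimal sketch of checking how large this label gap actually is: compare two TAudio.txt files line by line, matching text columns exactly and numeric columns within a tolerance. The 0.2-second tolerance is an assumption, and no column semantics beyond "text vs. number" are assumed here:

```python
def compare_lists(path_a, path_b, tol=0.2):
    """Count lines whose columns differ beyond `tol` between two list files."""
    mismatches = 0
    with open(path_a) as fa, open(path_b) as fb:
        for line_a, line_b in zip(fa, fb):
            for tok_a, tok_b in zip(line_a.split(), line_b.split()):
                try:
                    if abs(float(tok_a) - float(tok_b)) > tol:
                        mismatches += 1
                        break
                except ValueError:  # non-numeric column: exact match required
                    if tok_a != tok_b:
                        mismatches += 1
                        break
    return mismatches
```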