Wow, glad to know that this code works, since I did not check it.
I think that is fine. The reason is that I randomly split the clips into two parts, and this change point is random, so the list is not fixed when you generate it. You can see that I use many random functions in the code. I think it will not affect the result too much.
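A minimal sketch of the random change-point split described above (not the repository's exact code; `random_two_part_split` and `min_part` are hypothetical names for illustration). Because the cut point is drawn at random, regenerating the lists gives slightly different numbers each time:

```python
import random

def random_two_part_split(clip_length, min_part=1.0):
    # `min_part` is a hypothetical lower bound so neither part is too short
    change_point = random.uniform(min_part, clip_length - min_part)
    return change_point, clip_length - change_point

print(random_two_part_split(5.69))  # different on every run
```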
Note that there are some differences between the data loader for TalkSet and the one for AVA. I can share our dataloader code for your reference (the following link). Hope this can help you.
Also, some explanation: during training, I randomly select 1, 2, 4, or 6 seconds of data for training (in demoTalkNet.py you can find that I use the same method in evaluation). Also, I use MUSAN and RIR for data augmentation (in AVA, I did not use additional augmentation data). The other parts are similar, I think.
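A hedged sketch of the random-duration selection described above, assuming raw 16 kHz audio samples aligned with 25 fps video frames (the actual dataloader may operate on extracted features instead):

```python
import random

def random_crop(audio, video, fps=25, sample_rate=16000):
    """Crop aligned audio samples and video frames to a random duration."""
    length = random.choice([1, 2, 4, 6])  # seconds, as described above
    max_start = len(video) / fps - length
    start = random.uniform(0, max(max_start, 0))
    a0 = int(start * sample_rate)
    v0 = int(start * fps)
    return (audio[a0:a0 + length * sample_rate],
            video[v0:v0 + length * fps])
```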
The preparation of these two augmentation datasets can be found here: https://github.com/joonson/voxceleb_unsupervised
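For reference, a hedged sketch of the two augmentations mentioned above: additive noise from a MUSAN file at a chosen SNR, and reverberation by convolving with a room impulse response (RIR). It assumes mono numpy arrays at the same sample rate; file paths are placeholders, and the linked repository shows the full preparation pipeline:

```python
import numpy
import soundfile
from scipy import signal

def add_noise(clean, noise_path, snr_db=10):
    noise, _ = soundfile.read(noise_path)
    noise = numpy.resize(noise, clean.shape)  # loop/trim noise to match length
    clean_power = numpy.mean(clean ** 2) + 1e-8
    noise_power = numpy.mean(noise ** 2) + 1e-8
    scale = numpy.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + scale * noise

def add_reverb(clean, rir_path):
    rir, _ = soundfile.read(rir_path)
    rir = rir / (numpy.sqrt(numpy.sum(rir ** 2)) + 1e-8)  # normalize energy
    return signal.convolve(clean, rir, mode='full')[:len(clean)]
```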
Thank you for the kind explanation! I will carefully read your dataloader code and augmentation strategy!! About the gap in the TAudio labels: I read the 'generate_TAudio' function but did not find any 'random' code.
Sorry, my bad, I gave the wrong explanation.
In VoxCeleb2, I found that some video and audio are not synchronized (their lengths are very different), and there is some broken data. So I threw away the broken data and re-extracted the audio from the VoxCeleb2 videos (1,091,175 utterances) to replace the original VoxCeleb2 audio data (1,092,009 utterances). Because of that step, the length of some .wav files is slightly different from the original length.
I think you can ignore this difference. If you also run into broken videos, or cases where the audio and video have very different lengths, you can just skip that data; the amount of such data is not large.
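A sketch of the "just skip it" check described above, using ffprobe (which must be installed) to compare the audio and video stream durations of a file. The 0.5-second threshold is an assumption for illustration, not a value from the repository:

```python
import subprocess

def stream_duration(path, stream):  # stream: 'a:0' (audio) or 'v:0' (video)
    out = subprocess.run(
        ['ffprobe', '-v', 'error', '-select_streams', stream,
         '-show_entries', 'stream=duration',
         '-of', 'default=noprint_wrappers=1:nokey=1', path],
        capture_output=True, text=True)
    return float(out.stdout.strip())

def is_usable(path, tolerance=0.5):
    try:
        return abs(stream_duration(path, 'a:0')
                   - stream_duration(path, 'v:0')) < tolerance
    except (ValueError, OSError):
        return False  # broken or unreadable file: skip it
```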
Sorry, I forgot to mention it in this repository.
Thank you ✖️ 10086! : ) I will ignore this tiny amount of dirty data and try to retrain the model on the TalkSet I generated myself. I will share the result once training is over.
Sure, good luck!
I modified "batch_size = int(250 / length_total)" to "batch_size = int(60 / length_total)" to fit my 16 GB GPU; other settings were unchanged. The average F1 of my trained model is 96.06 on Columbia ASD, versus 96.04 F1 using your pretrained model, which suggests the result is reproducible by following your guidance. : )
Wow, glad to hear that. Also, you can use your trained model with demoTalkNet.py to check its performance on videos in the wild.
Good luck with your future research!
Thank you for your work and the very detailed explanation! I have downloaded VoxCeleb2 & LRS3, used your 'generate_TalkSet.py' code, and successfully generated TalkSet! But there seems to be a difference in labels between the 'TAudio.txt' I generated and the same file in 'lists_out'. Does this gap have a big impact on training an 'in the wild' model?

My labels:
TAudio id06358/1Dy3Ro1Qqbo/00003 id06358/1Dy3Ro1Qqbo/00003 5.05 0 5.05 0 0
TAudio id00903/k1UwIOqNxwc/00408 id00903/k1UwIOqNxwc/00408 5.63 0 5.63 0 0
TAudio id02728/0cy01XwS_WA/00004 id02728/0cy01XwS_WA/00004 5.37 0 5.37 0 0
TAudio id01071/m7V7Og5SU1Q/00020 id01071/m7V7Og5SU1Q/00020 5.95 0 5.95 0 0

Your labels:
TAudio id06358/1Dy3Ro1Qqbo/00003 id06358/1Dy3Ro1Qqbo/00003 5.12 0 5.12 0 0
TAudio id00903/k1UwIOqNxwc/00408 id00903/k1UwIOqNxwc/00408 5.69 0 5.69 0 0
TAudio id02728/0cy01XwS_WA/00004 id02728/0cy01XwS_WA/00004 5.37 0 5.37 0 0
TAudio id01071/m7V7Og5SU1Q/00020 id01071/m7V7Og5SU1Q/00020 6.01 0 6.01 0 0
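A minimal sketch of checking how large this label gap actually is: compare two TAudio.txt files line by line, matching text columns exactly and numeric columns within a tolerance. The 0.2-second tolerance is an assumption, and no column semantics beyond "text vs. number" are assumed here:

```python
def compare_lists(path_a, path_b, tol=0.2):
    """Count lines whose columns differ beyond `tol` between two list files."""
    mismatches = 0
    with open(path_a) as fa, open(path_b) as fb:
        for line_a, line_b in zip(fa, fb):
            for tok_a, tok_b in zip(line_a.split(), line_b.split()):
                try:
                    if abs(float(tok_a) - float(tok_b)) > tol:
                        mismatches += 1
                        break
                except ValueError:  # non-numeric column: exact match required
                    if tok_a != tok_b:
                        mismatches += 1
                        break
    return mismatches
```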