astorfi / lip-reading-deeplearning

:unlock: Lip Reading - Cross Audio-Visual Recognition using 3D Architectures

Online pair selection? #11

Closed bg193 closed 6 years ago

bg193 commented 6 years ago

Hi, in your paper the pair selection algorithm selects the main contributing impostor pairs with imp_dis < (max_gen + margin). Could you clarify why these are the main contributing impostor pairs?

astorfi commented 6 years ago

@xuehui Intuitively, if the distance (in the output feature space) between the two elements of an impostor pair is greater than a defined threshold (here we selected max_gen + margin, in which max_gen is the maximum distance over all genuine pairs in the mini-batch), then technically there is no need for that pair to contribute to the gradient update, because it already satisfies the desired distance metric. In other words, the pair is already far enough apart!

Empirically, we found this method to improve the results, because it picks the samples that can actually contribute to the training optimization.
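A minimal NumPy sketch of that selection rule (the function and variable names here are illustrative, not the repository's code):

```python
import numpy as np

def select_contributing_impostors(gen_dist, imp_dist, margin):
    """Keep only impostor pairs close enough to still push the loss.

    gen_dist : distances of genuine pairs in the minibatch
    imp_dist : distances of impostor pairs in the minibatch
    margin   : contrastive-loss margin
    """
    max_gen = gen_dist.max()        # largest genuine distance in the minibatch
    threshold = max_gen + margin    # adaptive threshold described in the paper
    mask = imp_dist < threshold     # impostor pairs not yet "far enough"
    return imp_dist[mask], mask

# toy usage
gen = np.array([0.4, 0.9, 1.2])
imp = np.array([0.8, 2.5, 1.5, 4.0])
contributing, mask = select_contributing_impostors(gen, imp, margin=1.0)
print(contributing)   # impostor distances below 1.2 + 1.0 = 2.2 -> [0.8, 1.5]
```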

bg193 commented 6 years ago

@astorfi Thank you very much for your response! I found that train.py sets hard_margin=10, so there is no adaptive threshold as described in your paper?

Also, I tried my own data (300 training and 100 testing samples), and the trace looks like this:

Epoch 1, Minibatch 1 of 9 , Minibatch Loss= 2279.155273, EER= 0.44444, AUC= 0.70238, AP= 0.67171, contrib = 32 pairs
Epoch 1, Minibatch 2 of 9 , Minibatch Loss= 1624.106689, EER= 0.00000, AUC= 1.00000, AP= 1.00000, contrib = 32 pairs
Epoch 1, Minibatch 3 of 9 , Minibatch Loss= 657.279663, EER= 0.00000, AUC= 1.00000, AP= 1.00000, contrib = 32 pairs
Epoch 1, Minibatch 4 of 9 , Minibatch Loss= 315.285553, EER= 0.00000, AUC= 1.00000, AP= 1.00000, contrib = 21 pairs
Epoch 1, Minibatch 5 of 9 , Minibatch Loss= 317.505005, EER= 0.00000, AUC= 1.00000, AP= 1.00000, contrib = 20 pairs
Epoch 1, Minibatch 6 of 9 , Minibatch Loss= 120.285881, EER= 0.00000, AUC= 1.00000, AP= 1.00000, contrib = 22 pairs
Epoch 1, Minibatch 7 of 9 , Minibatch Loss= 105.461502, EER= 0.00000, AUC= 1.00000, AP= 1.00000, contrib = 20 pairs
/home/bd/anaconda3/envs/avlr/lib/python3.6/site-packages/sklearn/metrics/ranking.py:539: UndefinedMetricWarning: No negative samples in y_true, false positive value should be meaningless
  UndefinedMetricWarning)
Error: <class 'ValueError'>
Error: <class 'ValueError'>
TESTING: Epoch 1, Minibatch 1 of 3
TESTING: Epoch 1, Minibatch 2 of 3
TESTING: Epoch 1, Minibatch 3 of 3
TESTING: Epoch 1, EER= [ 0.], AUC= [ 1.], AP= [ 1.]

It is very weird that the second minibatch already gets "EER= 0.00000, AUC= 1.00000, AP= 1.00000".

astorfi commented 6 years ago

On what dataset are you training your model? Pair selection starts at line 551 of the train.py file.

bg193 commented 6 years ago

@astorfi I'm just wondering about the pair selection algorithm; there is no dynamic adaptive threshold in the train.py file.

astorfi commented 6 years ago

@xuehui Yes, I confirm ... The pair selection is there, but the dynamic pair selection was developed more recently and only appears in the paper; this repository has not been updated accordingly. Sorry for the inconvenience. However, the code is there and can be modified for the dynamic method. Please feel free to modify it, or create a pull request in case you do. I no longer work on this project, so I no longer have access to the data to repeat the experiments. Thanks for pointing it out.
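For reference, a hedged sketch of how a fixed hard_margin could be swapped for the dynamic per-minibatch threshold (max_gen + margin); this is not the repository's actual code, and the loss normalization and names are illustrative:

```python
import numpy as np

def selection_threshold(distances, labels, margin, dynamic=True, hard_margin=10.0):
    """Distance threshold below which impostor pairs are kept as contributing.

    distances : pairwise distances for the current minibatch
    labels    : 1 for genuine pairs, 0 for impostor pairs
    """
    if dynamic:
        max_gen = distances[labels == 1].max()   # adapts to each minibatch
        return max_gen + margin
    return hard_margin                           # fixed threshold, as in train.py

def contrastive_loss(distances, labels, margin, threshold):
    """Loss over genuine pairs plus contributing impostor pairs only."""
    gen = labels == 1
    imp = (labels == 0) & (distances < threshold)   # contributing impostors
    loss = np.sum(distances[gen] ** 2) + np.sum(np.maximum(0.0, margin - distances[imp]) ** 2)
    return loss / max(gen.sum() + imp.sum(), 1)

# toy usage
d = np.array([0.5, 1.0, 2.0, 0.8, 3.5])
y = np.array([1, 1, 0, 0, 0])
thr = selection_threshold(d, y, margin=1.0)      # max_gen (1.0) + margin (1.0) = 2.0
print(thr, contrastive_loss(d, y, margin=1.0, threshold=thr))
```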

astorfi commented 6 years ago

@xuehui However, your results are a little bit weird! It shouldn't be like that. Your dataset is pretty small (300 training, 100 testing samples), so the model may quickly overfit, and the scores may look perfect on such a small test set. Make sure that you do not do any pair selection at test time! It should only operate in the training phase. I will personally check the code for such an issue. Let me know if you find anything.

For training, that error makes sense: after some stage there are no contributing impostor pairs left, and scikit-learn may throw an error for not having any negative pairs. The code used to have exception handling for that, but apparently it was lost in the last updates. I will try to fix that part. The code also needs to be upgraded to a recent TensorFlow version.
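One possible guard against that scikit-learn failure (a sketch, not the repository's actual fix) is to skip the metric computation whenever the minibatch labels contain only one class:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

def safe_metrics(y_true, y_score):
    """Return (AUC, AP), or (None, None) when only one class is present."""
    y_true = np.asarray(y_true)
    if np.unique(y_true).size < 2:
        # roc_auc_score raises ValueError and average_precision_score is
        # meaningless without both positive and negative samples
        return None, None
    return roc_auc_score(y_true, y_score), average_precision_score(y_true, y_score)

# usage: all-positive labels simply yield (None, None) instead of an error
print(safe_metrics([1, 1, 1, 1], [0.9, 0.8, 0.7, 0.95]))
print(safe_metrics([1, 0, 1, 0], [0.9, 0.2, 0.7, 0.4]))
```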

bg193 commented 6 years ago

@astorfi I suspect my method of generating data is not correct, and I have the following questions about it. When producing the training/testing data, do I need to shuffle the order of genuine pairs and impostor pairs? And should standardization be computed over all the data or per batch? Thanks.

astorfi commented 6 years ago

Which method of generating data? Would you please be more specific? 1- What do you mean by shuffling the order of genuine pairs and impostor pairs? 2- There is no standardization; as you can see, it is commented out in the code. If we did use standardization, one way would be to calculate the mean and variance over the whole training data and apply them to the test data. In-place standardization can be applied as well.
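For example, fitting the statistics on the training set only and reusing them on the test set could look like this (a generic sketch, not tied to this repository's data loader; the shapes are placeholders):

```python
import numpy as np

def fit_standardizer(train_features):
    """Compute mean/std over the whole training set, per feature dimension."""
    mean = train_features.mean(axis=0)
    std = train_features.std(axis=0) + 1e-8   # avoid division by zero
    return mean, std

def apply_standardizer(features, mean, std):
    """Apply the training statistics to any split (train, val, or test)."""
    return (features - mean) / std

# usage with toy data: 300 training samples, 100 test samples, 15 features each
train = np.random.randn(300, 15)
test = np.random.randn(100, 15)
mean, std = fit_standardizer(train)
train_std = apply_standardizer(train, mean, std)
test_std = apply_standardizer(test, mean, std)   # no leakage of test statistics
```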

bg193 commented 6 years ago

I use the following process to generate training/test data:

  1. Produce genuine data from a video in which the video-audio pair (9 images and 15 MFEC features) is synced.
  2. Produce impostor data from the same video, with the audio shifted by 1 second.
  3. Mix the genuine/impostor data together and shuffle.

Is this the correct way to generate training/test data?
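For illustration, here is a toy sketch of those three steps (the array shapes, the 50-frames-per-second MFEC rate implied by 20 ms non-overlapping windows, and all names are assumptions, not the repository's pipeline):

```python
import random
import numpy as np

def make_pairs(video_frames, audio_features, shift_seconds=1.0, feat_rate=50):
    """Build one genuine and one impostor pair from a single synced clip.

    video_frames   : (num_frames, H, W) mouth-region frames
    audio_features : (num_feature_frames, num_coeffs) MFEC frames
    feat_rate      : MFEC frames per second (50 for 20 ms non-overlapping windows)
    """
    # step 1: genuine pair -> 9 video frames with their 15 synced MFEC frames
    genuine = (video_frames[:9], audio_features[:15], 1)

    # step 2: impostor pair -> same video frames, audio shifted by `shift_seconds`
    offset = int(shift_seconds * feat_rate)
    impostor = (video_frames[:9], audio_features[offset:offset + 15], 0)
    return [genuine, impostor]

# step 3: mix pairs from several clips and shuffle (random data stands in for real clips)
rng = np.random.default_rng(0)
clips = [(rng.standard_normal((9, 47, 73)), rng.standard_normal((200, 40))) for _ in range(4)]
pairs = [p for video, audio in clips for p in make_pairs(video, audio)]
random.shuffle(pairs)
```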

astorfi commented 6 years ago

@xuehui

I believe it is correct if you are sure that the 9 images and 15 MFEC features are exactly synced. For example, no Voice Activity Detection should have been performed on the sound data. Moreover, a very important point is the use of non-overlapping Hamming windows when generating the speech features. Please make sure the numbers of genuine and impostor pairs are roughly the same. Besides, the 1-second time shift can also be decreased to make the scenario more challenging.
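As an illustration of the non-overlapping-window point, here is a sketch assuming the python_speech_features package (the repository may use a different feature extractor): setting the frame step equal to the frame length makes the windows non-overlapping, so 0.3 s of audio yields 15 frames of 20 ms each.

```python
import numpy as np
from python_speech_features import logfbank

def extract_mfec(signal, sample_rate=16000, win_len=0.020, num_filters=40):
    """Log filterbank energies (MFEC-like features) with non-overlapping windows."""
    return logfbank(signal, samplerate=sample_rate,
                    winlen=win_len, winstep=win_len,   # step == length -> no overlap
                    nfilt=num_filters)

# toy usage: 0.3 s of random audio -> roughly 15 feature frames of 40 coefficients
signal = np.random.randn(int(0.3 * 16000))
features = extract_mfec(signal)
print(features.shape)   # ~ (15, 40)
```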

bg193 commented 6 years ago

@astorfi I have a doubt about Voice Activity Detection. If the person in a video clip does not talk (so there is no sound) but the mouth shape can still be detected, can this video clip be used as effective training data?

astorfi commented 6 years ago

No, it cannot ... That's what I am getting at. The video clips and their corresponding audio clips must be synced; otherwise, the preprocessing becomes complicated. The BBC-Oxford 'Lip Reading in the Wild' (LRW) dataset is a pretty clean dataset for this task, although the sound must be extracted from the videos manually using FFmpeg or similar packages. For the scenario you mentioned, you should ignore the parts with 1) silence and 2) no lip motion.
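For example, extracting the audio track could look roughly like this (a sketch calling FFmpeg through Python's subprocess; the paths and the 16 kHz mono format are placeholders, not the repository's settings):

```python
import subprocess

def extract_audio(video_path, wav_path, sample_rate=16000):
    """Extract a mono WAV track from a video file using FFmpeg."""
    subprocess.run([
        "ffmpeg", "-y",
        "-i", video_path,         # input video
        "-vn",                    # drop the video stream
        "-ac", "1",               # mono
        "-ar", str(sample_rate),  # resample, e.g. to 16 kHz
        wav_path,
    ], check=True)

# usage (placeholder file names)
# extract_audio("clip_0001.mp4", "clip_0001.wav")
```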

bg193 commented 6 years ago

@astorfi Thank you !