dessa-oss / fake-voice-detection

Using temporal convolution to detect Audio Deepfakes
http://www.atlas.dessa.com
Apache License 2.0

All real/authentic audio files in the 'real' subfolder are classified as 'fake' by the pre-trained model #10


irdance commented 4 years ago

Hi, when I ran inference.py on all the audio files in the 'real' subfolder, it misclassified every one of them as 'fake'. I just wanted to check that the pre-trained model is the correct one?

headcrabz commented 4 years ago

Yeah, I got the same thing; the pre-trained model seems to be inaccurate.

ranasac19878 commented 4 years ago

Hi guys, thanks for pointing out this issue. Currently, the pretrained model only works for audio files that are in the distribution of the input data it was trained on. We deliberately provided hard, out-of-distribution audio files in the 'real' and 'fake' subfolders to show that this work is still in progress. It is very hard to train a generalized model that works on arbitrary audio files out of the box. It would be great if you could share some ideas.
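
For illustration, one rough way to check whether a new clip is in-distribution is to compare its feature statistics against the training set; here is a minimal sketch using librosa MFCCs (the feature choice, sample rate, and threshold are assumptions, not the repo's actual preprocessing):

```python
# Sketch: compare MFCC statistics of a new clip against training-set
# statistics to flag likely out-of-distribution audio. The feature choice,
# sample rate, and threshold are illustrative assumptions, not the repo's
# actual preprocessing.
import numpy as np
import librosa

def mfcc_stats(path, sr=16000, n_mfcc=20):
    """Mean MFCC vector for one clip."""
    y, _ = librosa.load(path, sr=sr)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).mean(axis=1)

def is_out_of_distribution(path, train_mean, train_std, z_threshold=3.0):
    """Flag a clip whose average per-coefficient z-score is large."""
    z = np.abs((mfcc_stats(path) - train_mean) / (train_std + 1e-8))
    return z.mean() > z_threshold

# train_mean / train_std are computed once over the training clips, e.g.:
#   stats = np.stack([mfcc_stats(p) for p in train_paths])
#   train_mean, train_std = stats.mean(axis=0), stats.std(axis=0)
```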

Thanks, Sachin

irdance commented 4 years ago

Thanks @ranasac19878. Just for clarity, was the pre-trained model trained on the test dataset? The model does quite well on the test dataset, and the test set does contain 'out of distribution' audio files, since some of the fake clips in it are generated by different deepfake audio models.

My hunch is that the variety of accents in the dataset (train + test) is limited, so the model may not work well with different accents.

ranasac19878 commented 4 years ago

@irdance the model was not trained on the test dataset, but the test set was used as a validation set to tune the neural network's hyperparameters. That is not technically correct, but it was otherwise very difficult to get the model to perform well on the test set, since its distribution differs from the validation set's.
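
For context, the standard protocol keeps train, validation, and test sets disjoint, so hyperparameters are never tuned on the final evaluation data; a generic sketch (the split fractions are arbitrary, and a real ASVspoof-style setup would split by attack type rather than randomly):

```python
# Sketch: standard disjoint train/validation/test split, so hyperparameters
# are tuned only on the validation set. Split fractions are arbitrary.
import random

def three_way_split(paths, val_frac=0.15, test_frac=0.15, seed=0):
    paths = list(paths)
    random.Random(seed).shuffle(paths)
    n_test = int(len(paths) * test_frac)
    n_val = int(len(paths) * val_frac)
    test = paths[:n_test]
    val = paths[n_test:n_test + n_val]
    train = paths[n_test + n_val:]
    return train, val, test
```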

Going forward, we will be working to make the model more resilient using adversarial training and other data augmentation techniques.
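
As a sketch of the kind of waveform-level augmentation meant here (the specific perturbations and parameter ranges are illustrative, not the team's confirmed plan):

```python
# Sketch of simple waveform-level augmentations; the actual techniques and
# parameter ranges the team will use are not specified in this thread.
import numpy as np
import librosa

def augment(y, sr):
    """Return a randomly perturbed copy of waveform y."""
    # Additive Gaussian noise at a random low level.
    y = y + np.random.uniform(0.001, 0.01) * np.random.randn(len(y))
    # Slight random time stretch (speed up or slow down).
    y = librosa.effects.time_stretch(y, rate=np.random.uniform(0.9, 1.1))
    # Random pitch shift of up to +/- 2 semitones.
    y = librosa.effects.pitch_shift(y, sr=sr, n_steps=np.random.uniform(-2, 2))
    return y
```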

Yes, speech accent is definitely one indication of the distribution difference, but there may be other small differences, such as the number of pauses, the time between pauses, etc., that the model might have overfit on given the training data.
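
Such pause-related cues are easy to measure; a hedged sketch of per-clip silence statistics using librosa (the top_db value and the chosen summary features are illustrative choices, not the repo's actual features):

```python
# Sketch: per-clip pause statistics that a model could latch onto.
# top_db and the chosen summary features are illustrative.
import numpy as np
import librosa

def pause_stats(path, sr=16000, top_db=30):
    y, _ = librosa.load(path, sr=sr)
    # Non-silent intervals as (start, end) sample indices.
    intervals = librosa.effects.split(y, top_db=top_db)
    # Gaps between consecutive non-silent intervals are the pauses.
    gaps = (intervals[1:, 0] - intervals[:-1, 1]) / sr
    return {
        "num_pauses": len(gaps),
        "mean_pause_s": float(gaps.mean()) if len(gaps) else 0.0,
        "speech_fraction": float((intervals[:, 1] - intervals[:, 0]).sum() / len(y)),
    }
```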

Sachin

thaya-k commented 4 years ago

Hi, I placed all my audio files (both natural and synthesized; 280 in total) in "/data/inference_data/unlabeled" and used the pre-trained model for classification. Since I am running from the terminal (Ubuntu), I can't see the "print out with information on predictions of the model, the accuracy of the model on your provided data." However, the output shows likelihood values (correct me if I'm wrong) along with the sentence "The probability of the clip being real is: 0.00%". How should I interpret the results? P.S. I have attached the results as a graph of the likelihood values.

[Screenshot: graph of per-clip likelihood values]

ranasac19878 commented 4 years ago

Hi Thaya,

Thanks for the info. Currently, the pretrained model works well only for data it was trained/validated on. If the data distribution changes, the model defaults its prediction to 'fake', since the original data had a 1:9 ratio of real to fake audio clips. We are working on training another model that handles out-of-distribution audio clips in the coming months.
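
For reference, a common way to keep a 1:9 imbalance from biasing a classifier toward the majority class is per-class loss weighting; a minimal sketch (assuming labels 1 = real, 0 = fake; this is not the repo's actual training code):

```python
# Sketch: counteract a 1:9 real-to-fake imbalance with per-class loss
# weights (labels assumed: 1 = real, 0 = fake). Not the repo's training code.
import numpy as np

def class_weights(labels):
    """Weight each class inversely to its frequency."""
    labels = np.asarray(labels)
    n = len(labels)
    n_real = (labels == 1).sum()
    n_fake = (labels == 0).sum()
    return {0: n / (2.0 * n_fake), 1: n / (2.0 * n_real)}

# With a 1:9 ratio, the rare 'real' class gets ~9x the weight of 'fake':
print(class_weights([1] * 100 + [0] * 900))  # {0: ~0.56, 1: 5.0}
# e.g. in Keras: model.fit(..., class_weight=class_weights(train_labels))
```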

The likelihood value is the model's propensity score for a clip being real or not.
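
In other words, the printed percentage can be read as P(real) and thresholded to obtain a label; a trivial sketch (the 0.5 cutoff is an assumed default, not something the repo states):

```python
# Sketch: turning the printed "probability of the clip being real" into a
# label. The 0.5 cutoff is an assumed default, not something the repo states.
def label_clip(p_real, threshold=0.5):
    return "real" if p_real >= threshold else "fake"

print(label_clip(0.00))  # -> 'fake', matching the 0.00% output above
```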

Thanks, Sachin

yzslry commented 2 years ago

The link to download the ASV data in this project seems to be invalid. Can you provide the data or an updated link in the project?