PalaashAgrawal / v2-Fake-Speech-Detection-using-Fequency-Analysis

A CNN that classifies whether an audio segment is spoken by Donald Trump himself or by an impersonator.

Data Extraction #2

Open CO18325 opened 3 years ago

CO18325 commented 3 years ago

Hey, can you give a little explanation of the data extraction method:

  1. What length of each audio file did you take? (Each file is about an hour long.)
  2. How many images did you take for the dataset?
PalaashAgrawal commented 3 years ago

Hey Inderpreet, I first downloaded each audio file (roughly an hour long) separately. Then I manually snipped out multiple 5-second audio clips where there was noticeably less noise (such as audience cheering, wind, or feedback) and the clip contained only the speaker's voice for the most part. There is no restriction that a clip must be exactly 5 seconds long; it can be of any duration, since my goal is to convert it into a frequency plot anyway (frequency plots do not depend on the time duration of the input). I collected about 200 such audio clips (across both classes) from all the audio files I had initially downloaded.
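For example, a snip like that can be cut programmatically even when the selection itself is manual. This is just a minimal sketch; the file name and the 12-minute offset are placeholders, and librosa/soundfile are only one option for doing it:

```python
import librosa
import soundfile as sf

# Placeholder path; each source recording is roughly an hour long.
# Loading MP3s this way needs an ffmpeg/audioread backend installed.
y, sr = librosa.load("trump_rally_full.mp3", sr=None,
                     offset=12 * 60, duration=5.0)  # 5 s starting at the 12-minute mark
sf.write("clip_001.wav", y, sr)
```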

And then I converted each of these audio clips into a frequency plot (the code is included in the repository). Then I manually cropped the images above 9 kHz, because audio engineers usually set a high-frequency cut well above 9 kHz, and this varies with the surroundings; to keep everything uniform, I simply ignore any frequencies above 9 kHz. The natural human voice sits well under 9 kHz anyway.
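The actual conversion code is in the repository; purely as a rough illustration of the idea (with a placeholder clip name), plotting a magnitude spectrum capped at 9 kHz looks something like this:

```python
import numpy as np
import librosa
import matplotlib.pyplot as plt

# Placeholder clip path; the real plotting code lives in the repository.
y, sr = librosa.load("clip_001.wav", sr=None)

# Magnitude spectrum of the whole clip (the clip's duration does not matter here).
spectrum = np.abs(np.fft.rfft(y))
freqs = np.fft.rfftfreq(len(y), d=1.0 / sr)

# Keep only frequencies up to 9 kHz, mirroring the manual crop.
keep = freqs <= 9000
plt.plot(freqs[keep], spectrum[keep])
plt.xlabel("Frequency (Hz)")
plt.ylabel("Magnitude")
plt.savefig("clip_001_spectrum.png")
```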

See the final processed dataset here: https://github.com/PalaashAgrawal/v1-Fake-Speech-Detection-using-Frequncy-analysis Hope that makes it clearer.

CO18325 commented 3 years ago

Okay, thanks mate!! That's great implementation logic. Have you thought about any approach for automated extraction of the dataset images?

PalaashAgrawal commented 3 years ago

I haven't written any code to do that, but it's very much possible, though it would be prone to some noise, because you can't really determine the quality of an audio clip just by analysing its frequency plot; listening to the audio is still the best way to judge its quality. I used the aforementioned approach because I have some prior knowledge of sound processing and engineering.

But either way, you could try something like a Kalman-filter-based approach, which, to be honest, even I would have to research and study. It's a complex idea, but nonetheless possible.
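As a much simpler, purely illustrative starting point (a frame-level RMS-energy heuristic, not the Kalman filter idea, and with placeholder names throughout), automated candidate selection could look something like this:

```python
import numpy as np
import librosa

def steadiest_windows(path, win_s=5.0, top_k=10):
    """Rank fixed-length windows of a long recording by how steady their loudness is.

    Heuristic only: windows whose frame-level RMS varies little are less likely
    to contain crowd bursts or feedback spikes, so they are reasonable candidates
    to listen to and keep. This does not replace actually listening to the audio.
    """
    y, sr = librosa.load(path, sr=None, mono=True)
    win = int(win_s * sr)
    candidates = []
    for start in range(0, len(y) - win, win):
        chunk = y[start:start + win]
        frame_rms = librosa.feature.rms(y=chunk)[0]  # RMS per short frame
        candidates.append((np.std(frame_rms), start / sr))
    candidates.sort()  # steadiest windows (lowest RMS variation) first
    return [start_time for _, start_time in candidates[:top_k]]

# Example (placeholder path): start times, in seconds, of the 10 steadiest 5-second windows.
# print(steadiest_windows("trump_rally_full.mp3"))
```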

CO18325 commented 3 years ago

Okay, I will study the Kalman filter approach. Thank you so much for your help!!