HPI-DeepLearning / crnn-lid

Code for the paper Language Identification Using Deep Convolutional Recurrent Neural Networks
GNU General Public License v3.0

wav_to_spectrogram.py stops converting before it should #11

Closed ibro45 closed 5 years ago

ibro45 commented 5 years ago

Hi,

I'm working with four languages, and for each I have downloaded only one video so that I can check that the scripts work as they should before running them on my cloud VM.

The issue I have is that the script wav_to_spectrogram.py acts weird with one language. The languages and the number of segmented .wav files for each are:

So, the expected result after running the script is 38 or 39 .png spectrograms for each language, since French is the language with the fewest .wav files. It does execute as it should when I run it for all the languages except English:

w/o English

But when I run the script with English, it counts only 13 English files, even though there are 42:

w/ English

I still haven't come up with an explanation for why it's happening, so any clue would be a great help!

Here's the sources.yml that I used to download the videos, in case someone prefers to check it themselves.

croatian:
  users:
    -
  playlists:
    - https://www.youtube.com/playlist?list=PLv3j2_RROTdEh39boAuP-JPeDR7dy6wih

english:
  users:
    - 
  playlists:
    - https://www.youtube.com/playlist?list=PLv3j2_RROTdHSp1oIY4L_t5xX0dFV3GMH

french:
  users:
    - 
  playlists:
    - https://www.youtube.com/playlist?list=PLv3j2_RROTdEgT-oLhk11Xjbev7Q02F3-

spanish:
  users:
    - 
  playlists:
    - https://www.youtube.com/playlist?list=PLv3j2_RROTdHpKmps4DaomrqXd8VmZV1g

I'd also note that I'm working with 3-second segments, so if someone is recreating what I am doing, it is important to change the number of seconds into which the files are split. That's on line 66 in download_youtube.py, from:

command = ["ffmpeg", "-y", "-i", f, "-map", "0", "-ac", "1", "-ar", "16000", "-f", "segment", "-segment_time", "10", output_filename]

to:

command = ["ffmpeg", "-y", "-i", f, "-map", "0", "-ac", "1", "-ar", "16000", "-f", "segment", "-segment_time", "3", output_filename]

For the same reason, it is necessary to change the size of the output spectrogram on line 70 in wav_to_spectrogram.py from:

parser.add_argument('--shape', dest='shape', default=[129, 500, 1], type=int, nargs=3)

to:

parser.add_argument('--shape', dest='shape', default=[129, 150, 1], type=int, nargs=3)
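For reference, here's the quick sanity check I did for the --shape value (just my own helper snippet, not part of the repo): the width should simply be the segment length in seconds times the pixels-per-second value, so 3-second segments at 50 px/s give a width of 150.

# rough sanity check for the --shape default (my own helper, not repo code)
segment_seconds = 3       # value passed to ffmpeg's -segment_time above
pixels_per_second = 50    # the pixel_per_second value used when generating spectrograms
height = 129              # number of frequency bins, left unchanged

expected_width = segment_seconds * pixels_per_second
print([height, expected_width, 1])   # -> [129, 150, 1]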

Thank you!

ibro45 commented 5 years ago

Since I have mentioned that I'm using 3-second segments, I'm interested in what you think about increasing pixel_per_second from 50 to 100. Then I'd have 129x300x1 spectrograms, which might make it easier for the C(R)NN to detect patterns, wouldn't it? I'm still a newbie at this, sorry!

Bartzi commented 5 years ago

Hmm, interesting behaviour... could it be that your English samples contain lots of silence? Have a look at this line of code. Everything that contains silence is just skipped.
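Roughly, the idea behind that check is something like this (a simplified sketch only, not the exact lines from the script): if a segment's spectrogram carries almost no energy, it is treated as silence and no .png is written for it.

import numpy as np

def is_mostly_silent(spectrogram, threshold=1e-3):
    # simplified idea: a (normalised) spectrogram with almost no energy
    # means the underlying audio segment is essentially silence
    return np.abs(spectrogram).mean() < threshold

# usage sketch: an all-zero "spectrogram" stands in for a silent segment
dummy = np.zeros((129, 150))
if is_mostly_silent(dummy):
    pass  # such a segment is simply skipped, so it never shows up in the .png count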

To your second question: that depends on the size of the actual regions in the voice samples. It could get better, but it might also not help... You might need to increase the size of the network's receptive field in order to capture meaningful features. But it is worth a try =)

ibro45 commented 5 years ago

I've checked the samples, and they seem to be alright. I also tried commenting out the two lines that skip samples containing lots of silence, and the same behaviour was repeated.

I also tried it on my whole dataset. The first output is from the segmentation of the files, which tells how many of them there are. The second output is from wav_to_spectrogram.py; as you can see, the same thing happened once again.

The output

And thanks for the advice regarding the 3-second segments! :)

Bartzi commented 5 years ago

I really don't know what the problem is... the iterator definitely stops when working on English for some reason... but I'm afraid I cannot help you further from this end without access to the data...

ibro45 commented 5 years ago

Thanks for replying! If you're interested in taking a look at it, I have included the sources.yml content in the initial post. Each playlist contains just one video per language, whose purpose was to test that everything behaves as it should before running on the cloud, so downloading the data won't be any trouble.

ibro45 commented 5 years ago

I seem to have figured out what was happening.

It isn't a problem isolated to these particular English samples. I eventually removed them from my big dataset, ran wav_to_spectrogram.py again, and the same thing happened with French.

Basically, when the SpectrogramGenerator is run, the segmented files are turned into spectrograms by SoX. Since it calculates the width from the -X (capital X) parameter, i.e. the pixels-per-second parameter, it sometimes, for reasons unknown to me, outputs the wrong dimensions: instead of [129, 150, 1] it produces [129, 149, 1]. (Note that I'm using 3-second segments and 50 pixels per second.)

Therefore, I tried adding the -x (small x) parameter, which sets the overall width of the spectrogram, with the appropriate value at this line.
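For illustration, the change looks roughly like this (a sketch only: the paths and variable names are my own examples, not the exact ones from SpectrogramGenerator; -X is SoX's pixels-per-second option and -x sets the overall width):

# sketch of the SoX call with both -X and -x set (paths/names are just examples)
pixels_per_second = 50
segment_seconds = 3
target_width = pixels_per_second * segment_seconds   # 150 for my 3-second segments

input_wav = "english_segment_0001.wav"    # example input path
output_png = "english_segment_0001.png"   # example output path

command = ["sox", input_wav, "-n", "spectrogram",
           "-X", str(pixels_per_second),  # pixels per second (already used before)
           "-x", str(target_width),       # pin the total width to 150 pixels
           "-y", "129",                   # spectrogram height, matching --shape
           "-m", "-r",                    # monochrome, raw (no axes or legend)
           "-o", output_png]

With the width pinned like that, the output no longer depends on SoX rounding duration times pixels-per-second down to 149, which is presumably what happens when a segment comes out a hair shorter than 3 seconds.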

It seems to have solved the issue, but I'd like to hear your thoughts on it. If that's fine, I'll make a pull request.

Thanks!

Bartzi commented 5 years ago

Hmm,

interesting problem. I'm not sure, but reading the SoX manual page, it seems that -x only sets the maximum width of the spectrogram. All in all, that should not be a problem, since the audio snippets should always have the same length, so I would be very happy to have a look at a nice PR :smile: