Closed SamahZaro closed 5 years ago
How long does it take for you? It takes around 6-7 minutes for me (and my connection is ~45 Mbps down). The download is only 61 MB, but the scripts also check each file to see whether it contains speech (using webrtcvad), and in some cases they need to convert files from MPEG-4 to WAV as well.
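For anyone curious why the per-file check adds up, here is a rough sketch of what a speech check with webrtcvad can look like. This is not the project's actual script: the function names, the 30 ms frame size, and the `min_ratio` threshold are assumptions made for illustration. It only relies on `webrtcvad.Vad(mode)` and `Vad.is_speech(frame, sample_rate)`, which expect 16-bit mono PCM in 10, 20 or 30 ms frames at 8, 16, 32 or 48 kHz.

```python
import wave

FRAME_MS = 30  # webrtcvad accepts 10, 20 or 30 ms frames


def split_frames(pcm: bytes, sample_rate: int, frame_ms: int = FRAME_MS):
    """Yield fixed-size frames of 16-bit mono PCM; a trailing partial frame is dropped."""
    frame_bytes = sample_rate * frame_ms // 1000 * 2  # 2 bytes per sample
    for start in range(0, len(pcm) - frame_bytes + 1, frame_bytes):
        yield pcm[start:start + frame_bytes]


def has_speech(wav_path: str, aggressiveness: int = 2, min_ratio: float = 0.1) -> bool:
    """Return True if at least min_ratio of the frames are classified as speech.

    min_ratio is a hypothetical threshold, not a value taken from the project.
    """
    import webrtcvad  # third-party package: pip install webrtcvad

    with wave.open(wav_path, "rb") as wf:
        if wf.getnchannels() != 1 or wf.getsampwidth() != 2:
            raise ValueError("expected 16-bit mono PCM")
        sample_rate = wf.getframerate()
        if sample_rate not in (8000, 16000, 32000, 48000):
            raise ValueError("unsupported sample rate for webrtcvad")
        pcm = wf.readframes(wf.getnframes())

    vad = webrtcvad.Vad(aggressiveness)  # 0 (least) to 3 (most) aggressive
    frames = list(split_frames(pcm, sample_rate))
    if not frames:
        return False
    voiced = sum(vad.is_speech(frame, sample_rate) for frame in frames)
    return voiced / len(frames) >= min_ratio
```

Every file has to be read and classified frame by frame, so the cost grows with the total audio duration, not just the download size.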
My download speed isn't that good. Anyway, wouldn't it be good to make a compressed version available, containing preprocessed .wav or .flac files?
The checking and conversion are what take the time. It took more than 00:15. Assuming I am going to download the whole Quran, will it take ~600 * 00:15 minutes?!
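As a back-of-the-envelope check of that worst case, assuming "00:15" means 15 minutes per unit and taking the 600 figure at face value (both are readings of the comment, not measured values):

```python
def estimate_total_minutes(units: int, minutes_per_unit: float) -> float:
    """Naive linear estimate: every unit takes the same time, with no parallelism."""
    return units * minutes_per_unit


# Assumed inputs: 600 units at 15 minutes each.
total = estimate_total_minutes(600, 15)
print(f"{total:.0f} minutes, or about {total / 60:.0f} hours")
```

A linear estimate like this is only an upper-bound guess; it ignores that units differ in length and that processing can be batched or run in parallel.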
I agree, it does take quite a long time. I think at the moment it's not such a big issue, since we don't need to use the entire dataset when figuring out initial models and architectures.
There was an idea floating around a while ago of uploading all the preprocessed MFCCs to an Amazon bucket (which would speed things up), but I don't think that's been done yet.
Good to know.
Just in case this can help someone else, I have provided the entire auto-evaluated Tarteel.v1 dataset on my drive in compressed version (Surah by Surah).
The dataset contains audio files converted to valid WAV, with VAD removed. This was done using Google Colab, and it took ~7 hours.
Wow, thank you very much for doing this, that's very useful. 7 hours, not too bad!
@SamahZaro could you please share your Colab notebook with us? Or open a PR with your code if you'd like?
JAK!
I was trying to download and preprocess Al-Fatiha. Here are my commands:
I applied this commit to fix the invalid wave header issue and make it download. However, it took a long time for one short surah! Is this normal?