TarteelAI / tarteel-ml

Pre-processing and training scripts for the Tarteel Dataset
MIT License
183 stars, 53 forks

Downloading Surah Al-Fatihah only took long time #25

Closed SamahZaro closed 5 years ago

SamahZaro commented 5 years ago

I was trying to download and preprocess Al-Fatihah. Here are my commands:

git clone https://github.com/Tarteel-io/Tarteel-ML.git
cd Tarteel-ML/
git cherry-pick 624c46b
conda env create -f environment.yml
conda activate tarteel-ml
python download.py -s 1

I applied this commit to fix the invalid wave header issue and get the download working. However, it took a long time for one short surah! Is this normal?
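
For readers who hit the same invalid wave header problem, here is a minimal sketch of one way to detect affected files using only Python's standard `wave` module. It is not necessarily what commit 624c46b does, and the function name `has_valid_wav_header` is hypothetical.

```python
import wave


def has_valid_wav_header(path):
    """Return True if the stdlib wave module can parse the file's header.

    Hypothetical helper for illustration; not part of the Tarteel-ML scripts.
    """
    try:
        with wave.open(path, "rb") as wav_file:
            # A parseable header with zero audio frames is still unusable.
            return wav_file.getnframes() > 0
    except (wave.Error, EOFError):
        # wave.Error: not a RIFF/WAVE file or an unsupported encoding.
        # EOFError: the file is empty or truncated inside the header.
        return False
```

Files that fail this check are the ones that would need to be re-encoded before further processing.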

AymenQ commented 5 years ago

How long does it take for you? It takes around 6-7 minutes for me (and my connection is ~45 Mbps down). It's only 61 MB, but the scripts also check each file to see whether it contains speech (using webrtcvad), and in some cases they need to convert files from MPEG-4 to WAV as well.
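
As a rough sanity check on the figures above, the arithmetic below estimates the ideal transfer time for 61 MB over a 45 Mbps link. It assumes 1 MB = 10^6 bytes and ignores latency and server-side limits, so it is a lower bound rather than a measurement. The function name is hypothetical.

```python
def transfer_seconds(size_megabytes, link_megabits_per_second):
    """Ideal transfer time, ignoring latency and server-side limits."""
    size_megabits = size_megabytes * 8  # 1 byte = 8 bits
    return size_megabits / link_megabits_per_second


seconds = transfer_seconds(61, 45)
print(f"{seconds:.1f} s")  # prints "10.8 s"
```

Under these assumptions the raw transfer accounts for only a few seconds of the 6-7 minutes, which is consistent with the speech checks and format conversion dominating the run time.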

SamahZaro commented 5 years ago

My download speed isn't that good. Anyway, wouldn't it be good to make a compressed version available, containing preprocessed .wav or .flac files?

SamahZaro commented 5 years ago

The checking and conversion are what take the time. It took more than 00:15 (15 minutes). Assuming I am going to download the whole Quran, will it take ~600 * 15 minutes?!

AymenQ commented 5 years ago

> The checking and conversion are what take the time. It took more than 00:15 (15 minutes). Assuming I am going to download the whole Quran, will it take ~600 * 15 minutes?!

I agree, it does take quite a long time. I think at the moment it's not such a big issue, since we don't need to use the entire dataset when figuring out initial models and architectures.
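
To put the estimate from the question into more familiar units, here is the arithmetic spelled out. Both inputs are taken at face value from the comment (a multiplier of 600, and 00:15 read as 15 minutes). It is a simple linear extrapolation and does not account for surahs differing in length or number of recordings.

```python
from datetime import timedelta

units = 600            # multiplier used in the question (assumption)
minutes_per_unit = 15  # time observed for Al-Fatihah, read as 15 minutes

total = timedelta(minutes=units * minutes_per_unit)
total_hours = total.total_seconds() / 3600

print(total_hours)       # 150.0
print(total_hours / 24)  # 6.25
```

On those assumptions the full download would take about 150 hours, or a little over six days of continuous processing.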

> My download speed isn't that good. Anyway, wouldn't it be good to make a compressed version available, containing preprocessed .wav or .flac files?

There was an idea floating around a while ago of uploading all the preprocessed MFCCs to an Amazon bucket (which would speed things up), but I don't think that's been done yet.

SamahZaro commented 5 years ago

> There was an idea floating around a while ago of uploading all the preprocessed MFCCs to an Amazon bucket (which would speed things up), but I don't think that's been done yet.

Good to know.

Just in case this can help someone else, I have provided the entire auto-evaluated Tarteel.v1 dataset on my drive in compressed form (surah by surah).

The dataset contains audio files converted to valid WAV, with recordings that failed the VAD (speech) check removed. This was done using Google Colab, and it took ~7 hours.

AymenQ commented 5 years ago

Wow, thank you very much for doing this, that's very useful. 7 hours, not too bad!

piraka9011 commented 5 years ago

@SamahZaro could you please share your colab notebook with us? Or open a PR with your code if you'd like?

JAK!