Khaled-Abdelhamid / Mobeen

MIT License

Curate a list of good Arabic Datasets (potential candidates) #4

Open linear[bot] opened 1 month ago

linear[bot] commented 1 month ago

WHI-15 Curate a list of good Arabic Datasets (potential candidates)

MahmoudAshraf97 commented 1 month ago

Dataset Name: Annotated Al Jazeera Dialectal Speech Corpus Link: https://arbml.github.io/masader/card?id=15 Volume: 57 hours Dialect: Mixed Notes: Missing/Inaccessible

Dataset Name: Multi-Genre Broadcast (MGB-2) Link: https://arabicspeech.org/resources/mgb2 Volume: 1200 hours Dialect: Mixed Notes:

Dataset Name: Multi-Genre Broadcast (MGB-3) Link: https://arabicspeech.org/resources/mgb3 Volume: 15.8 hours Dialect: Egyptian Notes:

Dataset Name: Multi-Genre Broadcast (MGB-5) Link: https://arabicspeech.org/resources/mgb5 Volume: 14 hours Dialect: Moroccan Notes:

Dataset Name: QASR Link: https://arabicspeech.org/resources/qasr Volume: 2041 hours Dialect: Mixed Notes:

Dataset Name: ESCWA-CS Link: https://arabicspeech.org/resources/escwacs Volume: 2.8 hours Dialect: Mixed Notes:

Dataset Name: Common Voice Dataset Link: https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0/ Volume: 155.8 hours Dialect: Notes:

Dataset Name: MediaSpeech Link: https://huggingface.co/datasets/arbml/MediaSpeech_ar Volume: 10 hours Dialect: Notes:

Dataset Name: WAW Corpus Link: https://alt.qcri.org/resources/wawcorpus/ Volume: 0.5 hours Dialect: Notes: Audio files missing

Dataset Name: Arab-Andalusian music corpus Link: https://zenodo.org/records/1291776#.YqTFeHZBxD9 Volume: 125 hours Dialect: Notes:

Dataset Name: MASC: Massive Arabic Speech Corpus Link: https://huggingface.co/datasets/pain/MASC Volume: 1000 hours Dialect: Mixed Notes:

Dataset Name: QAC: Qatari Arabic Corpus Link: https://web.archive.org/web/20150918002143/http://sprosig.isle.illinois.edu/corpora/1 Volume: 18.5 hours Dialect: Qatari Notes: Dataset Missing

Dataset Name: ArabCeleb Link: https://github.com/CeLuigi/ArabCeleb Volume: Dialect: Notes:

Dataset Name: Quran Speech: Imam + Users Link: https://github.com/tarekeldeeb/DeepSpeech-Quran/tree/master/data/quran Volume: Dialect: Notes:

Dataset Name: SADA Link: https://www.kaggle.com/datasets/sdaiancai/sada2022 Volume: Dialect: Notes:

Dataset Name: 400K Egyptian Arabic Lines Link: https://www.kaggle.com/datasets/fadisarwat/egyptian-arabic-lines Volume: Dialect: Notes:

Dataset Name: ASR-EGARBCSC Link: https://magichub.com/datasets/egyptian-arabic-conversational-speech-corpus/ Volume: Dialect: Notes:

Dataset Name: SciSoundArabia Link: https://www.kaggle.com/datasets/ghalebaa/scisoundarabia Volume: Dialect: Notes:

Dataset Name: FLEURS Link: https://huggingface.co/datasets/google/fleurs/viewer/ar_eg Volume: Dialect: Egyptian Notes:
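Since every entry in the list follows the same field order (Dataset Name, Link, Volume, Dialect, Notes), the list can be turned into structured records for filtering or totalling hours. The following is only a minimal sketch under that assumption; `parse_entry` and its returned keys are hypothetical names chosen for illustration, not part of the Mobeen repository.

```python
import re
from typing import Optional

# Assumed layout of one entry, taken from the list in this thread:
#   Dataset Name: <name> Link: <url> Volume: <N hours> Dialect: <dialect> Notes: <notes>
# Volume, Dialect and Notes may be empty.
_ENTRY_RE = re.compile(
    r"Dataset Name:\s*(?P<name>.*?)\s*Link:\s*(?P<link>\S*)\s*"
    r"Volume:\s*(?P<volume>.*?)\s*Dialect:\s*(?P<dialect>.*?)\s*"
    r"Notes:\s*(?P<notes>.*)"
)
_HOURS_RE = re.compile(r"([\d.]+)\s*hours")


def parse_entry(line: str) -> Optional[dict]:
    """Parse one list entry into a dict, or return None if it does not match."""
    match = _ENTRY_RE.search(line.strip())
    if match is None:
        return None
    hours_match = _HOURS_RE.match(match.group("volume"))
    return {
        "name": match.group("name"),
        "link": match.group("link"),
        # None when the Volume field was left blank in the list.
        "hours": float(hours_match.group(1)) if hours_match else None,
        # None when the Dialect field was left blank in the list.
        "dialect": match.group("dialect") or None,
        "notes": match.group("notes").strip(),
    }


if __name__ == "__main__":
    sample = (
        "Dataset Name: Multi-Genre Broadcast (MGB-3) "
        "Link: https://arabicspeech.org/resources/mgb3 "
        "Volume: 15.8 hours Dialect: Egyptian Notes:"
    )
    print(parse_entry(sample))
```

With the entries parsed this way, candidates could be narrowed down by, for example, keeping only records whose `notes` field is empty (no reported access problems) and whose `hours` value is known.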