SpeechColab / GigaSpeech2

An evolving, large-scale and multi-domain ASR corpus for low-resource languages with automated crawling, transcription and refinement
Apache License 2.0

GigaSpeech 2


This is the official repository of the GigaSpeech 2 dataset. For details on how the dataset was created, please refer to our arXiv preprint.

GigaSpeech 2 version: 2.0 (2024/06/19)

Download

Leaderboard

Contributor | Toolkit | Train Recipe | Train Data | Inference | Test CER/WER
Baseline | Icefall | Zipformer/Stateless pruned RNN-T | GigaSpeech 2.0 th | TODO | 12.46
Baseline | Icefall | Zipformer/Stateless pruned RNN-T | GigaSpeech 2.0 id | TODO | 14.92
Baseline | Icefall | Zipformer/Stateless pruned RNN-T | GigaSpeech 2.0 vi | TODO | 12.83
Baseline | ESPNet | Conformer/Transformer CTC/AED | GigaSpeech 2.0 th | TODO | 13.70
Baseline | ESPNet | Conformer/Transformer CTC/AED | GigaSpeech 2.0 id | TODO | 15.50
Baseline | ESPNet | Conformer/Transformer CTC/AED | GigaSpeech 2.0 vi | TODO | 15.60

Dataset

Audio Source

Training Subsets

Subset | Thai (hours) | Indonesian (hours) | Vietnamese (hours)
GigaSpeech 2 raw | 12901.8 | 8112.9 | 7324.0
GigaSpeech 2 refined | 10262.0 | 5714.0 | 6039.0

GigaSpeech 2 raw contains all the data from GigaSpeech 2 refined.
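
Because the refined subset is contained in the raw subset, the share of raw audio that survives refinement follows directly from the hours listed above. A small sketch of that arithmetic, with the numbers copied from the table (the names `TRAIN_HOURS` and `retained_fraction` are illustrative only and do not come from the repository):

```python
# Training hours copied from the table above: (raw, refined) per language.
TRAIN_HOURS = {
    "th": (12901.8, 10262.0),
    "id": (8112.9, 5714.0),
    "vi": (7324.0, 6039.0),
}


def retained_fraction(raw_hours: float, refined_hours: float) -> float:
    """Fraction of the raw subset that is kept in the refined subset."""
    return refined_hours / raw_hours


retained = {
    lang: retained_fraction(raw, refined)
    for lang, (raw, refined) in TRAIN_HOURS.items()
}

for lang, frac in retained.items():
    print(f"{lang}: {frac:.1%} of raw hours retained")
```

This works out to about 79.5% for Thai, 70.4% for Indonesian, and 82.5% for Vietnamese.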

Evaluation Subsets

Subset | Thai (hours) | Indonesian (hours) | Vietnamese (hours)
GigaSpeech 2 DEV | 10.0 | 10.0 | 10.2
GigaSpeech 2 TEST | 10.0 | 10.0 | 11.0

Evaluation subsets are annotated by professional human annotators.

Preparation Scripts

Coming soon in Lhotse and ESPNet.

Metadata Walkthrough

Soon available.

Audio Processing

GigaSpeech 2 audio files are resampled to 16 kHz and converted to single-channel WAV format. For detailed implementation, refer to pipeline/convert_transcribe/convert_and_transcribe.py.
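
For readers who want to bring their own audio into the same format, the conversion can be approximated with ffmpeg. The following is a minimal sketch, not the repository's implementation: it assumes ffmpeg is installed and on the PATH, and the function names `build_ffmpeg_command` and `convert_to_wav` are hypothetical.

```python
import subprocess
from typing import List


def build_ffmpeg_command(src: str, dst: str, sample_rate: int = 16000) -> List[str]:
    """Build an ffmpeg command that writes `src` as a mono WAV at `sample_rate` Hz."""
    return [
        "ffmpeg",
        "-y",                     # overwrite dst if it already exists
        "-i", str(src),           # input file in any format ffmpeg can decode
        "-ac", "1",               # downmix to a single channel
        "-ar", str(sample_rate),  # resample to the target rate
        str(dst),                 # a .wav extension selects the WAV container
    ]


def convert_to_wav(src: str, dst: str) -> None:
    """Run the conversion; raises CalledProcessError if ffmpeg fails."""
    subprocess.run(build_ffmpeg_command(src, dst), check=True)
```

Calling `convert_to_wav("input.mp4", "output.wav")` would then produce a 16 kHz single-channel WAV file, provided ffmpeg can decode the input.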

Text Pre-Processing

Transcripts are normalized by applying NFKC normalization, converting all characters to uppercase, removing punctuation, and mapping Arabic numerals to words in the respective language.
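
The steps above can be sketched as follows. This is an illustration under stated assumptions rather than the pipeline's actual code: the function name `normalize_transcript` is hypothetical, the number verbalizer is supplied by the caller because it is language-specific, and the toy Indonesian lookup table covers only two values for demonstration.

```python
import re
import string
import unicodedata
from typing import Callable


def normalize_transcript(text: str, verbalize_number: Callable[[str], str]) -> str:
    """Apply NFKC, uppercase, strip ASCII punctuation, and spell out digit runs."""
    text = unicodedata.normalize("NFKC", text)  # e.g. full-width digits become ASCII
    text = text.upper()
    text = re.sub("[{}]".format(re.escape(string.punctuation)), "", text)
    # Replace each run of ASCII digits with its spoken form (assumed integer-only;
    # decimals and dates would need a real, language-specific verbalizer).
    text = re.sub(
        r"[0-9]+",
        lambda m: " " + verbalize_number(m.group()).upper() + " ",
        text,
    )
    return " ".join(text.split())  # collapse the extra spaces introduced above


# Toy verbalizer for demonstration only; unknown numbers are left unchanged.
TOY_INDONESIAN_NUMBERS = {"2": "dua", "3": "tiga"}


def toy_verbalizer(digits: str) -> str:
    return TOY_INDONESIAN_NUMBERS.get(digits, digits)


print(normalize_transcript("Saya punya ２ kucing.", toy_verbalizer))
# SAYA PUNYA DUA KUCING
```

The example input uses a full-width digit to show why NFKC runs first: it folds such compatibility characters into their ASCII forms so that the later steps see a single representation.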

Text Post-Processing (before scoring)

Before CER/WER scoring, we standardize both the hypothesis and the reference text by applying NFKC normalization, converting all characters to uppercase, removing punctuation, and either merging consecutive whitespace or removing all whitespace. This ensures apples-to-apples performance comparisons across different toolkits and commercial services.

We also provide the following code snippet, which is used in all the experiments reported in our paper and leaderboard.

import re
import string
import unicodedata

def text_post_processing(text):
    text = unicodedata.normalize("NFKC", text)  # apply NFKC
    text = text.upper()  # convert to uppercase
    text = text.replace("-", " ")  # replace hyphens with spaces
    text = re.sub("[{}]".format(string.punctuation), "", text)  # remove ASCII punctuation
    text = re.sub(r"\s+", "", text).strip()  # remove all whitespace for Thai
    return text
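
Once both texts have been post-processed, the error rate is the edit distance between reference and hypothesis divided by the reference length. The sketch below is a minimal illustration of that computation and is not the scoring tool used for the leaderboard; the names `edit_distance` and `error_rate` are hypothetical.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance (substitutions, insertions, deletions) between token sequences."""
    prev = list(range(len(hyp) + 1))  # distances from an empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting the first i reference tokens
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def error_rate(ref: Sequence[str], hyp: Sequence[str]) -> float:
    """Edit distance normalized by the reference length."""
    if len(ref) == 0:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)
```

Passing lists of characters, for example `error_rate(list(ref_text), list(hyp_text))`, gives a character error rate, which fits text whose whitespace has been removed. Passing `ref_text.split()` and `hyp_text.split()` gives a word error rate, which requires whitespace to be merged rather than removed.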

Collaboration

We are a group of volunteers trying to make speech technologies easier to use, and we welcome contributions of any kind. Currently, we are exploring the following directions. If you are interested in one of them and think you can help, please contact gigaspeech@speechcolab.org.

Institutional Contributors

Institution Contribution
Shanghai Jiao Tong University Computing power; Data host; Researchers
The Chinese University of Hong Kong Researchers
Tsinghua University Researchers
Seasalt AI Researchers
Birch AI Researchers
Peng Cheng Laboratory Computing power
Dataocean AI Evaluation data annotation

Citation

Please cite our paper if you find this work useful:

@article{gigaspeech2,
  title={GigaSpeech 2: An Evolving, Large-Scale and Multi-domain ASR Corpus for Low-Resource Languages with Automated Crawling, Transcription and Refinement},
  author={Yifan Yang and Zheshu Song and Jianheng Zhuo and Mingyu Cui and Jinpeng Li and Bo Yang and Yexing Du and Ziyang Ma and Xunying Liu and Ziyuan Wang and Ke Li and Shuai Fan and Kai Yu and Wei-Qiang Zhang and Guoguo Chen and Xie Chen},
  journal={arXiv preprint arXiv:2406.11546},
  year={2024},
}

Contact

If you have any concerns, please contact gigaspeech@speechcolab.org.

If you have any technical problems, please contact yifanyeung@sjtu.edu.cn.

Metadata Changelog