Croatian ASR data missing half of raw transcripts and characters with diacritics

facebookresearch / voxpopuli

A large-scale multilingual speech corpus for representation learning, semi-supervised learning and interpretation

The data issue is also seen in the output of the model fine-tuned on the dataset.

Model: https://huggingface.co/facebook/wav2vec2-base-10k-voxpopuli-ft-hr

Output on a rather difficult speech sample: "da se vratimo korak u na inflacija odlikaa kad ste ve spomenuli roditelji neje uitelji isto vie polako i odustaju klincima odgovara status ko"

The words odlikaa (should be odlikaša), neje (should be neće), uitelji (should be učitelji), and vie (should be više) show that the data bug is clearly visible in the model output as well.

For comparison, another model, fine-tuned on part of parlaspeech-hr (http://hdl.handle.net/11356/1494), https://huggingface.co/classla/wav2vec2-large-slavic-parlaspeech-hr-lm, produces this: "da se vratimo korak unazad inflacija od likaša kad ste već spomenuli roditelji neće učitelji isto više polako i odustaju klincima odgovara status to".
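A quick way to make the difference concrete is to count Croatian diacritic letters in each output. The snippet below is only a minimal sketch: the helper name `diacritic_count` and the variable names are made up for illustration, and the two strings are the model outputs quoted above. The same check could in principle be run over the transcripts in the dataset, though how those are loaded is not covered here.

```python
import unicodedata

# Lowercase Croatian letters that carry diacritics.
CROATIAN_DIACRITICS = set("čćšžđ")


def diacritic_count(text: str) -> int:
    """Count Croatian diacritic letters in `text` (hypothetical helper)."""
    # Normalize to NFC so letters stored as base + combining mark are
    # compared in their precomposed form.
    normalized = unicodedata.normalize("NFC", text).lower()
    return sum(1 for ch in normalized if ch in CROATIAN_DIACRITICS)


# Output of facebook/wav2vec2-base-10k-voxpopuli-ft-hr, as quoted above.
voxpopuli_output = (
    "da se vratimo korak u na inflacija odlikaa kad ste ve spomenuli "
    "roditelji neje uitelji isto vie polako i odustaju klincima odgovara status ko"
)

# Output of classla/wav2vec2-large-slavic-parlaspeech-hr-lm, as quoted above.
parlaspeech_output = (
    "da se vratimo korak unazad inflacija od likaša kad ste već spomenuli "
    "roditelji neće učitelji isto više polako i odustaju klincima odgovara status to"
)

print("voxpopuli:", diacritic_count(voxpopuli_output))
print("parlaspeech:", diacritic_count(parlaspeech_output))
```

A count of zero across a sizeable amount of Croatian text would be a strong hint that diacritic characters were stripped somewhere in the pipeline.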

Croatian ASR data missing half of raw transcripts and characters with diacritics #36