facebookresearch / voxpopuli

A large-scale multilingual speech corpus for representation learning, semi-supervised learning and interpretation

Croatian ASR data missing half of raw transcripts and characters with diacritics #36

Open nljubesi opened 2 years ago

nljubesi commented 2 years ago

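I downloaded the Croatian ASR data and noticed two significant issues that I want to report, both named in the title: about half of the raw transcripts are missing, and characters with diacritics are missing.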

You are probably aware of a third issue: your normalisation is English-based ("Plomin 3" is normalised to "plomin three", while it should be "plomin tri"), so in most cases it does not produce useful results for Croatian.
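To make the normalisation point concrete, here is a minimal sketch of language-dependent number verbalisation. It is not the VoxPopuli normaliser: `DIGIT_WORDS` and `normalise` are hypothetical names, and the sketch only handles standalone single digits, whereas real Croatian normalisation would also need multi-digit numbers and case and gender agreement.

```python
import re

# Hypothetical per-language digit tables (assumption for illustration only).
DIGIT_WORDS = {
    "en": ["zero", "one", "two", "three", "four",
           "five", "six", "seven", "eight", "nine"],
    "hr": ["nula", "jedan", "dva", "tri", "četiri",
           "pet", "šest", "sedam", "osam", "devet"],
}


def normalise(text: str, lang: str) -> str:
    """Lowercase the text and spell out standalone single digits in `lang`."""
    words = DIGIT_WORDS[lang]
    # Only standalone ASCII digits are replaced; longer numbers are left as-is.
    spoken = re.sub(r"\b[0-9]\b", lambda m: words[int(m.group())], text)
    return spoken.lower()


print(normalise("Plomin 3", "en"))  # plomin three
print(normalise("Plomin 3", "hr"))  # plomin tri
```

The only point of the sketch is that the digit-to-word table has to be selected by language; an English table applied to Croatian text yields the "plomin three" result described above.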

I am very much open to providing further clarification, or help. (@5roop)

nljubesi commented 2 years ago

The data issue is also seen in the output of the model fine-tuned on the dataset.

Model: https://huggingface.co/facebook/wav2vec2-base-10k-voxpopuli-ft-hr

Output on a rather hard speech sample: "da se vratimo korak u na inflacija odlikaa kad ste ve spomenuli roditelji neje uitelji isto vie polako i odustaju klincima odgovara status ko"

The words odlikaa (should be odlikaša), neje (should be neće), uitelji (should be učitelji) and vie (should be više) show that the data bug is very much visible in the model output as well.
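A quick way to check for this symptom is to count the Croatian letters with diacritics in a piece of text. The sketch below is only an illustration: `CROATIAN_DIACRITICS` and `diacritic_counts` are hypothetical names, and the inputs are the model output and the four corrected words quoted above.

```python
from collections import Counter

# Croatian letters that carry diacritics (lowercase forms).
CROATIAN_DIACRITICS = set("čćđšž")


def diacritic_counts(text: str) -> Counter:
    """Count occurrences of each Croatian diacritic letter in `text`."""
    return Counter(ch for ch in text.lower() if ch in CROATIAN_DIACRITICS)


model_output = (
    "da se vratimo korak u na inflacija odlikaa kad ste ve spomenuli "
    "roditelji neje uitelji isto vie polako i odustaju klincima odgovara status ko"
)
corrected_words = "odlikaša neće učitelji više"

print(sum(diacritic_counts(model_output).values()))     # 0
print(sum(diacritic_counts(corrected_words).values()))  # 4
```

Running the same count over a transcript file would show quickly whether the diacritics are absent from the data itself; a count of zero over a large amount of Croatian text is a strong hint that they were stripped.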

Just for comparison, another model, fine-tuned on a part of parlaspeech-hr (http://hdl.handle.net/11356/1494), https://huggingface.co/classla/wav2vec2-large-slavic-parlaspeech-hr-lm, produces this: "da se vratimo korak unazad inflacija od likaša kad ste već spomenuli roditelji neće učitelji isto više polako i odustaju klincima odgovara status to".
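To see exactly where the two outputs differ, a word-level diff is enough. The following is a minimal sketch using `difflib` from the Python standard library; `word_diffs` is a hypothetical helper name, and the two strings are the outputs quoted in this thread.

```python
from difflib import SequenceMatcher

voxpopuli_output = (
    "da se vratimo korak u na inflacija odlikaa kad ste ve spomenuli "
    "roditelji neje uitelji isto vie polako i odustaju klincima odgovara status ko"
)
parlaspeech_output = (
    "da se vratimo korak unazad inflacija od likaša kad ste već spomenuli "
    "roditelji neće učitelji isto više polako i odustaju klincima odgovara status to"
)


def word_diffs(a: str, b: str) -> list[tuple[str, str]]:
    """Return (a_words, b_words) pairs for every span where the texts differ."""
    a_words, b_words = a.split(), b.split()
    matcher = SequenceMatcher(a=a_words, b=b_words)
    return [
        (" ".join(a_words[i1:i2]), " ".join(b_words[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


for left, right in word_diffs(voxpopuli_output, parlaspeech_output):
    print(f"{left!r} -> {right!r}")
```

Most of the differing spans involve words that need č, ć, š or ž, which is consistent with the diacritics being missing from the fine-tuning data rather than with ordinary recognition errors.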