Open nljubesi opened 2 years ago
The data issue is also seen in the output of the model fine-tuned on the dataset.
Model: https://huggingface.co/facebook/wav2vec2-base-10k-voxpopuli-ft-hr
Output on a rather hard speech sample: "da se vratimo korak u na inflacija odlikaa kad ste ve spomenuli roditelji neje uitelji isto vie polako i odustaju klincima odgovara status ko"
The `odlikaa` (should be `odlikaša`), `neje` (should be `neće`), `uitelji` (should be `učitelji`), and `vie` (should be `više`) show that the data bug is very much visible in the model output as well.
Just for comparison, another model fine-tuned on a part of parlaspeech-hr (http://hdl.handle.net/11356/1494), https://huggingface.co/classla/wav2vec2-large-slavic-parlaspeech-hr-lm, produces this: "da se vratimo korak unazad inflacija od likaša kad ste već spomenuli roditelji neće učitelji isto više polako i odustaju klincima odgovara status to".
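To make the comparison easier to read, here is a minimal sketch that aligns the two outputs quoted above word by word using Python's standard `difflib`. The function name `differing_spans` is made up for illustration. It treats the second output only as a rough reference, which is an assumption, since that is a model output too and not a gold transcript.

```python
import difflib

# The two model outputs quoted in this comment, copied verbatim.
voxpopuli_out = (
    "da se vratimo korak u na inflacija odlikaa kad ste ve spomenuli "
    "roditelji neje uitelji isto vie polako i odustaju klincima odgovara status ko"
)
parlaspeech_out = (
    "da se vratimo korak unazad inflacija od likaša kad ste već spomenuli "
    "roditelji neće učitelji isto više polako i odustaju klincima odgovara status to"
)


def differing_spans(hyp_a, hyp_b):
    """Return (span_in_a, span_in_b) pairs for every non-matching word span."""
    a, b = hyp_a.split(), hyp_b.split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    return [
        (" ".join(a[i1:i2]), " ".join(b[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


for left, right in differing_spans(voxpopuli_out, parlaspeech_out):
    print(f"{left!r} -> {right!r}")
```

Most of the printed pairs differ only in a letter that carries a diacritic in the second output, which is the pattern described above.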
I downloaded the Croatian ASR data and noticed two significant issues that I want to report:
1. `raw_text` transcripts
2. `normalized_text` column has characters with diacritics missing, so not just the diacritics but whole characters (`očito` written as `oito`), making these transcripts mostly useless

You are probably aware of the third issue, that your normalisation is English-based (`Plomin 3` normalised to `plomin three` while it should be `plomin tri`), so it does not produce useful results for Croatian in most cases.

I am very much open for further clarifications. Or help. (@5roop)
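For anyone who wants to check their own copy of the data for the missing-character problem, here is a minimal sketch of a sanity check. It is not part of any VoxPopuli tooling. The function name `diacritic_share` is hypothetical, and the sample strings are short illustrations built from words quoted in this thread, not rows from the dataset. The idea is that Croatian text of any real length normally contains some of the letters č, ć, š, ž, đ, so a share at or near zero over many transcripts suggests those characters were dropped.

```python
# Letters with diacritics used in the Croatian alphabet (lowercase forms).
CROATIAN_DIACRITICS = set("čćšžđ")


def diacritic_share(transcripts):
    """Return the fraction of alphabetic characters that are č, ć, š, ž or đ."""
    letters = [ch for text in transcripts for ch in text.lower() if ch.isalpha()]
    if not letters:
        return 0.0
    return sum(ch in CROATIAN_DIACRITICS for ch in letters) / len(letters)


# Illustrative samples only (assumption: real use would pass the
# normalized_text column of the downloaded data instead).
damaged = ["oito", "uitelji isto vie polako"]
intact = ["očito", "učitelji isto više polako"]

print(diacritic_share(damaged))  # 0.0
print(diacritic_share(intact))   # greater than zero
```

This check only detects the symptom. It cannot restore the dropped characters, since a word like `oito` does not say which letter was removed or where.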