ELITR / elitr-testset

ELITR collection of test sets, for ASR, MT and SLT

Czech ASR transcripts are of low quality #25

Status: Open. obo opened this issue 2 years ago

obo commented 2 years ago

I just spotted a large number of mistakes in https://github.com/ELITR/elitr-testset/blob/master/documents/czech-asr/comp_linguistics/comp_linguistics.cs.OSt, e.g. incorrect letter casing throughout, but also errors in terms like BLEU and even in regular Czech words.

obo commented 2 years ago

Another horrible "golden" transcript: https://github.com/ELITR/elitr-testset/blame/master/documents/2021-theaitre-related/robothon-debate/robothon-debate.cs.OSt

obo commented 2 years ago

@Rishu, could you pass this to some annotators?

obo commented 2 years ago

Further files to be fixed (paths from https://github.com/ELITR/elitr-testset/blob/master/documents/):
./czech-asr/snemovna/snemovna.cs.OSt
./czech-asr/comp_linguistics/comp_linguistics.cs.OSt
./czech-asr/rozhlas/rozhlas.cs.OSt
./czech-asr/europarliament/europarliament.cs.OSt
./czech-asr/wgvat/wgvat.cs.OSt

obo commented 2 years ago

What is the status of this? I see that https://github.com/ELITR/elitr-testset/blob/master/documents/2021-theaitre-related/robothon-debate/robothon-debate.cs.OSt is still horrible. Is it being processed?