klh5 opened 2 weeks ago
We are particularly interested in conversational-style and noisy/messy data (lots of filler words, differences in transcription, etc.).
It is also worth investigating the datasets used as part of the MTQE project.
I've tried to summarize the existing datasets from the literature here.
None of these fulfil all of our requirements. The WMT dataset provides a "gold standard" human score but no reference translation.
DISCO could provide an easy way to show the effect of different disfluency types. However, the dataset as distributed only provides fluent English translations, not translations of the original imperfect speech, so we would need to pick a translation model to produce translations of the disfluent source ourselves.
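As a rough illustration of what that extra step might look like, here is a minimal sketch that pairs each disfluent source utterance with its fluent reference and a machine translation of the disfluent text. Everything named here is an assumption for illustration: `DisfluencyExample`, `build_examples`, the field names, and the `translate` callable are hypothetical and do not reflect the actual DISCO file layout or any particular translation model. The sample sentences are invented too.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass
class DisfluencyExample:
    """One evaluation record (hypothetical schema, not the DISCO format)."""
    disfluent_source: str       # original imperfect speech transcript
    fluent_reference: str       # fluent English translation from the dataset
    disfluent_translation: str  # machine translation of the disfluent source


def build_examples(
    pairs: Iterable[Tuple[str, str]],
    translate: Callable[[str], str],
) -> List[DisfluencyExample]:
    """Translate each disfluent source and keep it next to its reference.

    `translate` is whatever translation model we end up picking; it is
    passed in as a plain callable so the model choice stays open.
    """
    examples = []
    for source, reference in pairs:
        examples.append(
            DisfluencyExample(
                disfluent_source=source,
                fluent_reference=reference,
                disfluent_translation=translate(source),
            )
        )
    return examples


if __name__ == "__main__":
    # Invented sample data and a stand-in "model" that just tags the input.
    sample_pairs = [("euh je je veux dire bonjour", "I mean hello")]
    placeholder_translate = lambda text: f"<translation of: {text}>"
    for example in build_examples(sample_pairs, placeholder_translate):
        print(example)
```

Keeping the translator as an injected callable means the same records could be rebuilt with several candidate models, which would let us compare how each one handles the disfluencies before committing to one.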
Are there any existing datasets derived from human speech we could use?