alan-turing-institute / ARC-m4st

Evaluating metrics for speech translation
MIT License

Research existing datasets #2

Open klh5 opened 2 weeks ago

klh5 commented 2 weeks ago

Are there any existing datasets derived from human speech we could use?

jack89roberts commented 2 weeks ago

Particularly interested in conversational style, and noisy/messy data (lots of filler words, differences in transcription etc.).

klh5 commented 1 week ago

It's also worth investigating the datasets used as part of the MTQE project.

klh5 commented 1 week ago

Other datasets that could be useful include those used for disfluency detection, in particular Disfl-QA. Each Disfl-QA example includes both an original question (derived from SQuADv2) and a human-altered version of that question containing added disfluencies.
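
As a rough sketch of how paired data like this could be used, the snippet below diffs a fluent question against its disfluent counterpart to pull out the added tokens. The example pair is made up for illustration and is not taken from Disfl-QA, and the `original`/`disfluent` key names are an assumption about how a record might be laid out. `added_tokens` is a hypothetical helper, not part of any existing tooling.

```python
from difflib import SequenceMatcher
from typing import List


def added_tokens(original: str, disfluent: str) -> List[str]:
    """Return tokens in the disfluent text that have no match in the original."""
    orig_tokens = original.lower().split()
    disf_tokens = disfluent.lower().split()
    # Token-level diff; autojunk is disabled so short inputs are compared exactly.
    matcher = SequenceMatcher(a=orig_tokens, b=disf_tokens, autojunk=False)
    added = []
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        # "insert" and "replace" spans are tokens that only appear on the disfluent side.
        if tag in ("insert", "replace"):
            added.extend(disf_tokens[j1:j2])
    return added


# Hypothetical record, invented for illustration (not real dataset content).
pair = {
    "original": "where is the station",
    "disfluent": "where is uh the station",
}
extra = added_tokens(pair["original"], pair["disfluent"])
```

A simple whitespace tokenizer is used here to keep the sketch short. Real transcripts would probably need punctuation handling, and corrections or restarts (rather than plain fillers) will show up as longer inserted spans.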

klh5 commented 6 days ago

I've tried to summarize the existing datasets from the literature here.

None of these fulfil all of our requirements. The WMT dataset provides a "gold standard" human score but no reference translation.

DISCO could provide an easy way to show the effect of different disfluency types. The dataset as distributed only provides fluent English translations, not translations of the original imperfect speech, so we would need to pick a translation model to produce translations of the disfluent speech.