elpaco-escience / scikit-talk

Scikit-talk is an open-source toolkit for processing collections of real-world conversational speech in Python. The toolkit aims to facilitate the exploration of large collections of transcriptions and annotations of conversational interaction.
Apache License 2.0

idea: corpus 'recipes' #11

Open mdingemanse opened 1 year ago

mdingemanse commented 1 year ago

We obviously cannot easily redistribute most corpora that underlie our research papers. However, given the hours we spent tracking down corpora available for research purposes, and the work @liesenf put in preparing so many of these for use, I was wondering whether we might be able to share 'recipes' for a subset of corpora.

Given a corpus, a recipe would specify:

  1. where to find the corpus
  2. the names of the files to be downloaded (quality-checked by us as including mainly conversation)
  3. which tiers hold the key turn-level annotations (for eaf files)
  4. which special tags were converted to unified [unknown], [laughter], [breath] tags
  5. which special transcription conventions or characters were converted or stripped to yield the cleaned utterance text
  6. ...more? (any special transformational steps needed?)
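To make the idea concrete, here is a minimal sketch of what a recipe could look like as a data structure. Everything in it is an assumption for illustration: the `CorpusRecipe` class, its field names, the example URL, file names, tier names, and tag mappings are hypothetical and not part of scikit-talk.

```python
from __future__ import annotations

import re
from dataclasses import dataclass, field


@dataclass
class CorpusRecipe:
    """Hypothetical container for the items listed above (not a scikit-talk API)."""
    name: str
    source: str                                             # 1. where to find the corpus
    files: list[str]                                        # 2. quality-checked file names
    turn_tiers: list[str] = field(default_factory=list)     # 3. tiers with turn-level annotations
    tag_map: dict[str, str] = field(default_factory=dict)   # 4. corpus tag -> unified tag
    char_map: dict[str, str] = field(default_factory=dict)  # 5. convention -> replacement

    def clean(self, raw: str) -> str:
        """Apply tag and character conversions, then collapse whitespace."""
        text = raw
        # Plain substring replacement; a real recipe would likely need
        # token-aware matching so tags inside words are not touched.
        for old, new in self.tag_map.items():
            text = text.replace(old, new)
        for old, new in self.char_map.items():
            text = text.replace(old, new)
        return re.sub(r"\s+", " ", text).strip()


# Made-up example values, not taken from any real corpus.
recipe = CorpusRecipe(
    name="example-corpus",
    source="https://example.org/corpus",
    files=["conv01.eaf", "conv02.eaf"],
    turn_tiers=["speaker1", "speaker2"],
    tag_map={"xxx": "[unknown]", "hhh": "[breath]"},
    char_map={"*": "", "=": ""},
)

cleaned = recipe.clean("ja  hhh dat* is xxx")
print(cleaned)
```

A recipe in this shape could be serialised to YAML or JSON so that it can be shared without shipping any corpus data.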

This would amount to a regularised version of the corpus-specific sets of code that already exist. It would be a nice service for scikit-talk users as it would help them kickstart the process of building a corpus (or collection of corpora) that is similar in format to the kind we use in our research. I imagine it would especially be useful for some widely used corpora (IFADV, CallHome, CGN?).

This is a nice-to-have rather than a need-to-have, but I'm posting it here to keep track of the idea and to gather your thoughts.

liesenf commented 1 year ago

This is an excellent idea. We can publish a short notebook per corpus that makes it easy to load, parse, and clean a given dataset. This way the user can assemble multilingual datasets in no time.
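As a rough sketch of what the loading step in such a notebook might do for eaf files, the snippet below reads turn-level annotations from selected tiers using only the standard library. The element and attribute names follow the ELAN EAF format; the `load_turns` function, the sample document, and the tier names are assumptions made up for this example, not existing scikit-talk code.

```python
import xml.etree.ElementTree as ET

# Tiny hand-written EAF-like document (illustrative only).
EAF_SAMPLE = """<ANNOTATION_DOCUMENT>
  <TIME_ORDER>
    <TIME_SLOT TIME_SLOT_ID="ts1" TIME_VALUE="0"/>
    <TIME_SLOT TIME_SLOT_ID="ts2" TIME_VALUE="1200"/>
    <TIME_SLOT TIME_SLOT_ID="ts3" TIME_VALUE="1300"/>
    <TIME_SLOT TIME_SLOT_ID="ts4" TIME_VALUE="2000"/>
  </TIME_ORDER>
  <TIER TIER_ID="B">
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a2" TIME_SLOT_REF1="ts3" TIME_SLOT_REF2="ts4">
        <ANNOTATION_VALUE>hi</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
  </TIER>
  <TIER TIER_ID="A">
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a1" TIME_SLOT_REF1="ts1" TIME_SLOT_REF2="ts2">
        <ANNOTATION_VALUE>hello there</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
  </TIER>
  <TIER TIER_ID="notes">
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a3" TIME_SLOT_REF1="ts1" TIME_SLOT_REF2="ts4">
        <ANNOTATION_VALUE>background noise</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
  </TIER>
</ANNOTATION_DOCUMENT>"""


def load_turns(eaf_xml, tiers):
    """Return turns from the given tiers, sorted by begin time (ms)."""
    root = ET.fromstring(eaf_xml)
    # Map time slot ids to millisecond values; unaligned slots have no TIME_VALUE.
    times = {
        slot.get("TIME_SLOT_ID"): int(slot.get("TIME_VALUE"))
        for slot in root.iter("TIME_SLOT")
        if slot.get("TIME_VALUE") is not None
    }
    turns = []
    for tier in root.iter("TIER"):
        if tier.get("TIER_ID") not in tiers:
            continue  # skip tiers the recipe does not list
        for ann in tier.iter("ALIGNABLE_ANNOTATION"):
            begin = times.get(ann.get("TIME_SLOT_REF1"))
            end = times.get(ann.get("TIME_SLOT_REF2"))
            if begin is None or end is None:
                continue  # skip annotations without aligned times
            turns.append({
                "tier": tier.get("TIER_ID"),
                "begin": begin,
                "end": end,
                "utterance": ann.findtext("ANNOTATION_VALUE", default="").strip(),
            })
    return sorted(turns, key=lambda t: t["begin"])


turns = load_turns(EAF_SAMPLE, tiers={"A", "B"})
for turn in turns:
    print(turn)
```

The cleaning step would then apply the corpus-specific tag and character conversions to each `utterance` value.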