idea: corpus 'recipes'

We obviously cannot easily redistribute most corpora that underlie our research papers. However, given the hours we spent tracking down corpora available for research purposes, and the work @liesenf put in preparing so many of these for use, I was wondering whether we might be able to share 'recipes' for a subset of corpora.

Given a corpus, a recipe would specify:

- where to find the corpus
- the names of the files to be downloaded (quality-checked by us as including mainly conversation)
- which tiers hold the key turn-level annotations (for eaf files)
- which special tags were converted to the unified [unknown], [laughter], [breath] tags
- which special transcription conventions or characters were converted to yield utterances
- ...more? (any special transformational steps needed?)
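To make the idea a bit more concrete, here is a rough sketch of what a recipe could look like as a data structure. Everything in it is an assumption for illustration: `CorpusRecipe`, `normalise_utterance`, the field names, the URL, the file and tier names, and the source-side tags are all made up and are not part of scikit-talk or of any real corpus.

```python
from dataclasses import dataclass, field


@dataclass
class CorpusRecipe:
    """Hypothetical recipe layout; not part of the scikit-talk API."""
    name: str
    source_url: str   # where to find the corpus
    files: list[str]  # quality-checked files to download
    tiers: list[str] = field(default_factory=list)         # turn-level tiers (eaf files)
    tag_map: dict[str, str] = field(default_factory=dict)   # corpus tag -> unified tag
    char_map: dict[str, str] = field(default_factory=dict)  # convention -> replacement


def normalise_utterance(text: str, recipe: CorpusRecipe) -> str:
    """Apply the recipe's tag and character conversions to one utterance."""
    for old, new in recipe.tag_map.items():
        text = text.replace(old, new)
    for old, new in recipe.char_map.items():
        text = text.replace(old, new)
    return " ".join(text.split())  # collapse whitespace left behind by removals


# Placeholder values only; none of these describe a real corpus.
example = CorpusRecipe(
    name="example-corpus",
    source_url="https://example.org/corpus",
    files=["session01.eaf"],
    tiers=["speakerA", "speakerB"],
    tag_map={"<breathes>": "[breath]", "xxx": "[unknown]"},
    char_map={"=": "", "(.)": ""},
)

print(normalise_utterance("well= <breathes> (.) xxx okay", example))
# prints: well [breath] [unknown] okay
```

A plain declarative structure like this could just as well live in a YAML or JSON file per corpus; the point is only that the conversion rules become data rather than corpus-specific code.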

This would amount to a regularised version of the corpus-specific sets of code that already exist. It would be a nice service for scikit-talk users as it would help them kickstart the process of building a corpus (or collection of corpora) that is similar in format to the kind we use in our research. I imagine it would especially be useful for some widely used corpora (IFADV, CallHome, CGN?).

This is not a need-to-have, more a nice-to-have, but I'm posting it here to keep track of the idea and to gather your thoughts.

elpaco-escience/scikit-talk, issue #11