Scikit-talk is an open-source toolkit for processing collections of real-world conversational speech in Python. The toolkit aims to facilitate the exploration of large collections of transcriptions and annotations of conversational interaction.
We obviously cannot easily redistribute most corpora that underlie our research papers. However, given the hours we spent tracking down corpora available for research purposes, and the work @liesenf put in preparing so many of these for use, I was wondering whether we might be able to share 'recipes' for a subset of corpora.
Given a corpus, a recipe would specify:

- where to find the corpus
- the names of the files to be downloaded (quality-checked by us as including mainly conversation)
- which tiers hold the key turn-level annotations (for .eaf files)
- which special tags were converted to unified [unknown], [laughter], [breath] tags
- which special transcription conventions or characters were converted to yield a clean `utterance`
- ...more? (any special transformational steps needed?)
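To make the idea concrete, here is a minimal sketch of what such a recipe could look like as a Python dataclass. All names here (the `CorpusRecipe` class, its fields, the example URL, file names, and tier names) are illustrative assumptions, not part of scikit-talk's actual API:

```python
from dataclasses import dataclass, field

@dataclass
class CorpusRecipe:
    """Hypothetical sketch of a corpus 'recipe'; field names are illustrative."""
    name: str                  # corpus name, e.g. "IFADV"
    source_url: str            # where to find the corpus
    files: list                # quality-checked file names to download
    turn_tiers: list           # .eaf tiers holding the key turn-level annotations
    tag_map: dict = field(default_factory=dict)   # corpus-specific tags -> unified tags
    char_map: dict = field(default_factory=dict)  # transcription conventions -> clean text

# Example (all values are placeholders, not checked against the real corpus):
ifadv = CorpusRecipe(
    name="IFADV",
    source_url="https://example.org/ifadv",   # placeholder URL
    files=["DVA1A.eaf", "DVA2B.eaf"],         # illustrative file names
    turn_tiers=["spreker1", "spreker2"],      # illustrative tier names
    tag_map={"ggg": "[laughter]", "xxx": "[unknown]"},
)
```

A plain-data structure like this (as opposed to free-form scripts) would make recipes easy to serialize, diff, and share alongside the toolkit.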
This would amount to a regularised version of the corpus-specific sets of code that already exist. It would be a nice service for scikit-talk users as it would help them kickstart the process of building a corpus (or collection of corpora) that is similar in format to the kind we use in our research. I imagine it would especially be useful for some widely used corpora (IFADV, CallHome, CGN?).
This is not a need-to-have, more of a nice-to-have, but I'm posting it here just to keep track of the idea and to garner your thoughts.
This is an excellent idea. We can publish a short notebook per corpus that makes it easy to load, parse, and clean a given dataset. This way the user can assemble multilingual datasets in no time.