elpaco-escience / scikit-talk

Scikit-talk is an open-source toolkit for processing collections of real-world conversational speech in Python. The toolkit aims to facilitate the exploration of large collections of transcriptions and annotations of conversational interaction.

Create conceptual replication of Liesenfeld & Dingemanse 2022 #24

Open mdingemanse opened 1 year ago

mdingemanse commented 1 year ago

Conceptually replicating the analysis of our Interspeech paper is a useful goal to guide scikit-talk development. To that end I'm going to create a proof of concept of our code for identifying continuers and selecting a set of utterances that can then feed into audio clip extraction and clustering analysis. I'll be using the IFADV data so that it can be fully open.

I'll start work in the playground repo and will try to get to it within the next few days. From my side, this will be based on our R code; the audio extraction and clustering will be based on the code in the existing OSF repo. Both will need some porting and editing to work with the open IFADV data.

The paper: Liesenfeld, A. & Dingemanse, M. (2022). Bottom-up discovery of structure and variation in response tokens ('backchannels') across diverse languages. Proceedings of Interspeech 2022.

mdingemanse commented 1 year ago

Alright @bvreede @n400peanuts @liesenf, the playground repo now contains a first go at a dataset similar to the one that underlies the first half of our paper, but now built using only the IFADV package.

The R code for generating this should be fairly straightforward to port to Python, and I have tried to comment it as needed. Let me know if you need any further guidance. To preview the steps:

  1. We add a column `streak` that holds a streak counter built with the `cumsum()` function. The counter increments whenever a speaker produces the same utterance in succession.
  2. We select items that occur in streaks of >2: these are our candidate continuers. (A rough Python sketch of both steps follows below.)
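Since the porting is on the table, here is a minimal pandas sketch of how these two steps might look. It is an assumption-laden outline, not the actual R code: the input file name and the `source` column (identifying which conversation a row belongs to) are placeholders, and the other column names follow the CSV preview below.

```python
import pandas as pd

# Hypothetical input: one row per utterance, in conversational order.
df = pd.read_csv("ifadv_utterances.csv")  # placeholder file name
df = df.sort_values(["source", "begin"])  # 'source' = conversation id (assumed)

def add_streaks(conv):
    # A new streak starts whenever the speaker or the stripped utterance
    # changes relative to the previous row, so cumsum() over that change
    # indicator yields one streak id per run of identical utterances.
    changed = (
        (conv["utterance_stripped"] != conv["utterance_stripped"].shift())
        | (conv["participant"] != conv["participant"].shift())
    )
    conv = conv.copy()
    conv["streak"] = changed.cumsum()
    return conv

# Step 1: streak ids, computed per conversation.
df = df.groupby("source", group_keys=False).apply(add_streaks)

# Step 2: keep utterances occurring in streaks of length > 2;
# these are the candidate continuers.
streak_len = df.groupby(["source", "streak"])["streak"].transform("size")
continuers = df[streak_len > 2]
print(continuers["utterance_stripped"].value_counts().head(3))
```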

Surprise! It so happens that in the exotic language of the IFADV dataset, the top three formats found in streaks are *ja*, *ja ja*, and *hum*, as depicted in this quick-and-dirty convplot of a few sample sequences:

[image: convplot of a few sample continuer sequences]

The selected utterances are in `continuers_in_streaks.csv`, which looks like this:

| uid | language | utterance | utterance_stripped | begin | end | participant |
|-----|----------|-----------|--------------------|------:|----:|-------------|
| dutch-01… | dutch | kch | kch | 56867 | 57227 | spreker2 [… |
| dutch-01… | dutch | ja | ja | 421869 | 422277 | spreker1 [… |
| dutch-02… | dutch | ja | ja | 121341 | 121579 | spreker2 [… |
| dutch-02… | dutch | ja [unk_… | ja | 124408 | 125008 | spreker2 [… |
| dutch-02… | dutch | ja | ja | 137234 | 137505 | spreker1 [… |
| dutch-02… | dutch | ja | ja | 141980 | 142252 | spreker2 [… |
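For a quick sanity check on that file, something like the following might help; note that treating `begin` and `end` as milliseconds is an assumption that should be verified against the IFADV annotations.

```python
import pandas as pd

# Load the selected utterances and look at clip durations.
continuers = pd.read_csv("continuers_in_streaks.csv")

# begin/end are assumed to be in milliseconds; verify against IFADV.
continuers["duration_s"] = (continuers["end"] - continuers["begin"]) / 1000
print(continuers[["uid", "utterance_stripped", "duration_s"]].head())
```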

Next steps

In the other half of the paper, Andreas takes over, roughly with the following steps (correct me if I'm wrong @liesenf):

  1. Use the `source` column to identify the corresponding audio files
  2. Use the `begin` and `end` columns to identify the positions at which to clip those audio files
  3. Use ffmpeg (or similar) to extract the audio clips
  4. Generate spectrograms for all audio clips
  5. Use UMAP from a fork of avgn to cluster the audio clips (a rough sketch of the whole pipeline follows below)
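To make those steps concrete, here is a rough end-to-end sketch. It is emphatically not the OSF/avgn code: the audio directory layout, the `source`-to-filename mapping, the millisecond time unit, and the spectrogram/UMAP parameters are all assumptions to be replaced by the real ones.

```python
import subprocess
from pathlib import Path

import numpy as np
import pandas as pd
import librosa
import umap  # umap-learn

continuers = pd.read_csv("continuers_in_streaks.csv")
clip_dir = Path("clips")
clip_dir.mkdir(exist_ok=True)

# Steps 1-3: cut one clip per selected utterance with ffmpeg.
for row in continuers.itertuples():
    src = Path("audio") / f"{row.source}.wav"  # assumed source-to-file mapping
    out = clip_dir / f"{row.uid}.wav"
    subprocess.run(
        ["ffmpeg", "-y", "-i", str(src),
         "-ss", str(row.begin / 1000),  # begin/end assumed to be in ms
         "-to", str(row.end / 1000),
         str(out)],
        check=True,
    )

# Step 4: mel spectrograms, padded or truncated to a fixed number of
# frames so that all clips stack into one feature matrix.
def mel_features(path, sr=16000, n_mels=64, frames=64):
    y, _ = librosa.load(path, sr=sr)
    S = librosa.power_to_db(
        librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels),
        ref=np.max,
    )
    if S.shape[1] < frames:
        S = np.pad(S, ((0, 0), (0, frames - S.shape[1])))
    return S[:, :frames].flatten()

X = np.stack([mel_features(p) for p in sorted(clip_dir.glob("*.wav"))])

# Step 5: 2-D UMAP embedding of the spectrogram features, ready for
# plotting or clustering.
embedding = umap.UMAP(n_neighbors=15, min_dist=0.1).fit_transform(X)
```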

All of which ultimately leads to something like this (for Dutch only):

[image: UMAP embedding of continuer audio clips (Dutch only)]