elpaco-escience / scikit-talk

Scikit-talk is an open-source toolkit for processing collections of real-world conversational speech in Python. The toolkit aims to facilitate the exploration of large collections of transcriptions and annotations of conversational interaction.

Create conceptual replication of Liesenfeld & Dingemanse 2022 #24

Open mdingemanse opened 1 year ago

mdingemanse commented 1 year ago

Conceptually replicating the analysis of our Interspeech paper is a useful goal to guide scikit-talk development. To that end I'm going to create a proof of concept of our code for identifying continuers and selecting a set of utterances that can then feed into audio clip extraction and clustering analysis. I'll be using the IFADV data so that it can be fully open.

I'll start work in the playground repo and will try to get to it within the next few days. From my side, this will be based on our R code; the audio extraction and clustering will be based on the code in the existing OSF repo. Both will need some porting and editing to work with the open IFADV data.

The paper: Liesenfeld, A. & Dingemanse, M. (2022). Bottom-up discovery of structure and variation in response tokens ('backchannels') across diverse languages. Proceedings of Interspeech 2022.

mdingemanse commented 1 year ago

Alright @bvreede @n400peanuts @liesenf, the playground repo now contains a first go at a dataset similar to the one that underlies the first half of our paper, but now built using only the IFADV package.

The R code for generating this should be fairly straightforward to port to Python, and I have tried to comment it as needed. Let me know if you need any further guidance. To preview the steps:

  1. We add a column `streak` that holds a streak counter built with the `cumsum()` function. The counter increments whenever a speaker produces the same utterance in succession.
  2. We select items that occur in streaks of >2: these are our candidate continuers. (A rough Python sketch of both steps follows below.)
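Since the porting is on the table, here is a minimal pandas sketch of how these two steps might look. It is an assumption-laden outline, not the actual R code: the input file name and the `source` column (identifying which conversation a row belongs to) are placeholders, and the other column names follow the CSV preview below.

```python
import pandas as pd

# Hypothetical input: one row per utterance, in conversational order.
df = pd.read_csv("ifadv_utterances.csv")  # placeholder file name
df = df.sort_values(["source", "begin"])  # 'source' = conversation id (assumed)

def add_streaks(conv):
    # A new streak starts whenever the speaker or the stripped utterance
    # changes relative to the previous row, so cumsum() over that change
    # indicator yields one streak id per run of identical utterances.
    changed = (
        (conv["utterance_stripped"] != conv["utterance_stripped"].shift())
        | (conv["participant"] != conv["participant"].shift())
    )
    conv = conv.copy()
    conv["streak"] = changed.cumsum()
    return conv

# Step 1: streak ids, computed per conversation.
df = df.groupby("source", group_keys=False).apply(add_streaks)

# Step 2: keep utterances occurring in streaks of length > 2;
# these are the candidate continuers.
streak_len = df.groupby(["source", "streak"])["streak"].transform("size")
continuers = df[streak_len > 2]
print(continuers["utterance_stripped"].value_counts().head(3))
```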

Surprise! It so happens that in the exotic language of the IFADV dataset, the top three formats found in streaks are *ja*, *ja ja*, and *hum*, as depicted in this quick-and-dirty convplot of a few sample sequences:

[image: convplot of a few sample continuer sequences]

The selected utterances are in `continuers_in_streaks.csv`, which looks like this:

| uid | language | utterance | utterance_stripped | begin | end | participant |
|-----|----------|-----------|--------------------|------:|----:|-------------|
| dutch-01… | dutch | kch | kch | 56867 | 57227 | spreker2 [… |
| dutch-01… | dutch | ja | ja | 421869 | 422277 | spreker1 [… |
| dutch-02… | dutch | ja | ja | 121341 | 121579 | spreker2 [… |
| dutch-02… | dutch | ja [unk_… | ja | 124408 | 125008 | spreker2 [… |
| dutch-02… | dutch | ja | ja | 137234 | 137505 | spreker1 [… |
| dutch-02… | dutch | ja | ja | 141980 | 142252 | spreker2 [… |
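For a quick sanity check on that file, something like the following might help; note that treating `begin` and `end` as milliseconds is an assumption that should be verified against the IFADV annotations.

```python
import pandas as pd

# Load the selected utterances and look at clip durations.
continuers = pd.read_csv("continuers_in_streaks.csv")

# begin/end are assumed to be in milliseconds; verify against IFADV.
continuers["duration_s"] = (continuers["end"] - continuers["begin"]) / 1000
print(continuers[["uid", "utterance_stripped", "duration_s"]].head())
```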

Next steps

In the other half of the paper, Andreas takes over, roughly with the following steps (correct me if I'm wrong @liesenf):

  1. Use the `source` column to identify the corresponding audio files
  2. Use the `begin` and `end` columns to identify the positions at which to clip those audio files
  3. Use ffmpeg (or similar) to extract the audio clips
  4. Generate spectrograms for all audio clips
  5. Use UMAP from a fork of avgn to cluster the audio clips (a rough sketch of the whole pipeline follows below)
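To make those steps concrete, here is a rough end-to-end sketch. It is emphatically not the OSF/avgn code: the audio directory layout, the `source`-to-filename mapping, the millisecond time unit, and the spectrogram/UMAP parameters are all assumptions to be replaced by the real ones.

```python
import subprocess
from pathlib import Path

import numpy as np
import pandas as pd
import librosa
import umap  # umap-learn

continuers = pd.read_csv("continuers_in_streaks.csv")
clip_dir = Path("clips")
clip_dir.mkdir(exist_ok=True)

# Steps 1-3: cut one clip per selected utterance with ffmpeg.
for row in continuers.itertuples():
    src = Path("audio") / f"{row.source}.wav"  # assumed source-to-file mapping
    out = clip_dir / f"{row.uid}.wav"
    subprocess.run(
        ["ffmpeg", "-y", "-i", str(src),
         "-ss", str(row.begin / 1000),  # begin/end assumed to be in ms
         "-to", str(row.end / 1000),
         str(out)],
        check=True,
    )

# Step 4: mel spectrograms, padded or truncated to a fixed number of
# frames so that all clips stack into one feature matrix.
def mel_features(path, sr=16000, n_mels=64, frames=64):
    y, _ = librosa.load(path, sr=sr)
    S = librosa.power_to_db(
        librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels),
        ref=np.max,
    )
    if S.shape[1] < frames:
        S = np.pad(S, ((0, 0), (0, frames - S.shape[1])))
    return S[:, :frames].flatten()

X = np.stack([mel_features(p) for p in sorted(clip_dir.glob("*.wav"))])

# Step 5: 2-D UMAP embedding of the spectrogram features, ready for
# plotting or clustering.
embedding = umap.UMAP(n_neighbors=15, min_dist=0.1).fit_transform(X)
```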

All of which ultimately leads to something like this (for Dutch only):

[image: UMAP embedding of continuer audio clips (Dutch only)]