Watts-Lab / deliberation-empirica


Explore video analysis options #695

Open JamesPHoughton opened 5 months ago

JamesPHoughton commented 5 months ago

We want to explore the various options for analyzing the audio and video recordings. I've listed a number of different possible measures below. For each of these measures:

We also want to know if there are other measures that are commonly used.

Put outputs in this spreadsheet: https://docs.google.com/spreadsheets/d/1IIvfpwrBdvJ2Szd1jCwfTHPZvTkIB2DGHqYHbniyMwA/edit#gid=0

Audio analysis

Volume analysis

Pitch

Transcript analysis

Video analysis

First steps

JamesPHoughton commented 3 months ago

Notes from Feb 5 meeting

pseudocode

entry: { start: 17834, stop: 327849, word: 'mike' }

for col in [whisper, whisperX, deepgram]:
  for i in range(1024):
    for entry in entries from that particular tool:
       if entry.start < time[i] and entry.stop >= time[i]: mark interval i as true for this tool and break out of the inner loop
    otherwise mark interval i as false
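
A minimal runnable version of that loop, as a sketch only: the names `frame_times` and `entries` are illustrative, and the word dicts are assumed to carry start/stop times in milliseconds as in the example entry above.

```python
import pandas as pd

# Sketch: `frame_times[i]` is the timestamp of frame i (ms); `entries` maps each
# tool name to its list of word dicts, e.g. {"start": 17834, "stop": 327849, "word": "mike"}.
def build_agreement_matrix(frame_times, entries, n_frames=1024):
    matrix = pd.DataFrame(0, index=range(n_frames), columns=list(entries))
    for tool, words in entries.items():
        for i in range(n_frames):
            # 1 if any word from this tool spans frame i's timestamp, else 0
            if any(w["start"] < frame_times[i] <= w["stop"] for w in words):
                matrix.loc[i, tool] = 1
    return matrix
```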

Lower priority

mxumary commented 3 months ago

Look into: librosa visualization using the wav file; figure out how to use getStart to adjust the shift for the volume visualization (see the pseudocode above). There are 1024 frames. For each frame, check whether any words fall within that time range -- each transcript entry has a start and end time, so check whether the entry overlaps the frame. Make a df of length 1024. Each column will be a binary indicator for each software, indicating whether it "agrees" with the volume.

JamesPHoughton commented 3 months ago

@mxumary I'm just going to make my clarification here instead of Slack so that we have it later for reference. =)

What I'd eventually like is to have a matrix where rows are timestamps throughout the discussion - you can break it into 1024 intervals, as you did with librosa, or (probably more intuitively) some interval that corresponds to half or a quarter of a second. You can pick whichever is easiest to implement. Then I'd like the columns to be the different algorithms we've tried (volume method, deepgram, whisperX, whisper, stableTS), and the cells should be filled in with a 1 if the column's method predicts that the participant is speaking during the row's interval, and zero otherwise.

With this matrix, we can look at the correlation between the different services.
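
A minimal sketch of that last step, using a toy stand-in for the matrix (the real 0/1 columns and per-interval volume values would come from the pipeline described above):

```python
import numpy as np
import pandas as pd

# Toy stand-in: one 0/1 speaking indicator per method, plus a per-interval mean volume.
rng = np.random.default_rng(0)
matrix = pd.DataFrame({
    "volume_method": rng.integers(0, 2, 100),
    "deepgram": rng.integers(0, 2, 100),
    "whisperX": rng.integers(0, 2, 100),
    "whisper": rng.integers(0, 2, 100),
    "stableTS": rng.integers(0, 2, 100),
})
matrix["volume"] = rng.random(100)  # stand-in for the per-interval mean volume

# Pairwise correlations between all of the methods (and the raw volume)
print(matrix.corr())
```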

mxumary commented 3 months ago

TODO (ideally by Wednesday!!!!): screenshot all of the plots and add comments to provide context. Add the Colab notebook(s) to the repo -- documentation that converts audio to text, and documentation that converts the table to data visualizations and correlation plots.

See colab notebook here: https://github.com/Watts-Lab/deliberation-video-pipeline/blob/main/updated_vizzes_with_whisper%2C_whisperx_stable_ts%2C_deepgram.ipynb

Make sure that the rolling window is centered. Make two new columns in the correlation: one with the current volume rolling average, and another with a "centered" rolling average.

This documentation shows how you can make a centered rolling average.

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rolling.html
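
For example (the column names, toy values, and window size here are just illustrative), the two columns could be built like this:

```python
import pandas as pd

df = pd.DataFrame({"volume": [0.1, 0.4, 0.3, 0.8, 0.6, 0.2, 0.5]})  # toy per-interval volumes
window = 3  # number of intervals in the rolling window (illustrative)

df["volume_rolling"] = df["volume"].rolling(window).mean()                         # trailing window
df["volume_rolling_centered"] = df["volume"].rolling(window, center=True).mean()   # centered window
print(df)
```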

The following image shows the cases that we need to cover when it comes to the correlation plot. The current code covers all of these cases.

[image]

FOR LATER: try to find a solution using `ffmpeg` that can stitch videos together. For example, if a person dropped the call and then rejoined, we want to stitch the recordings so that they sound like one cohesive audio file.
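
One possible (untested) approach is ffmpeg's concat demuxer driven from Python; the file names below are placeholders, and `-c copy` assumes the segments share the same codec settings:

```python
import subprocess
import tempfile

def stitch_segments(segment_paths, output_path):
    """Concatenate a participant's recording segments into one file."""
    # The concat demuxer reads a text file that lists the input files, one per line.
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
        for path in segment_paths:
            f.write(f"file '{path}'\n")
        list_path = f.name
    # "-c copy" avoids re-encoding; re-encode instead if the segments differ in codec settings.
    subprocess.run(
        ["ffmpeg", "-f", "concat", "-safe", "0", "-i", list_path, "-c", "copy", output_path],
        check=True,
    )

# e.g. stitch_segments(["alice_part1.webm", "alice_part2.webm"], "alice_full.webm")
```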

mxumary commented 3 months ago

Overview of the types of visualizations created

NOTE: specific code can be found here https://github.com/Watts-Lab/deliberation-video-pipeline/blob/main/updated_vizzes_with_whisper%2C_whisperx_stable_ts%2C_deepgram.ipynb

One software, all speakers: done at the word level and segment/utterance level

This helps identify whether there is overlap among speakers. [image]

One speaker, all software, volume, and threshold visualizations

Identifies potential discrepancies in transcription between software packages and compares them to the volume and to "silence" (a pre-defined threshold that can be changed by the user). [image]
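
A rough sketch of the volume-plus-threshold part with librosa (the file name and threshold value are illustrative, not the notebook's actual settings):

```python
import librosa
import matplotlib.pyplot as plt

y, sr = librosa.load("speaker1.wav", sr=None)   # placeholder wav file
rms = librosa.feature.rms(y=y)[0]               # frame-level RMS "volume"
times = librosa.times_like(rms, sr=sr)          # timestamp (s) of each frame
silence_threshold = 0.01                        # user-adjustable "silence" cutoff

plt.plot(times, rms, label="RMS volume")
plt.axhline(silence_threshold, linestyle="--", label="silence threshold")
plt.xlabel("time (s)")
plt.legend()
plt.show()
```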

Correlation heatmap

Based on a pre-defined interval (here, 0.5 seconds), we check each software's transcription output to identify whether it indicates that any words were spoken in that particular interval. If so, we mark it as a 1 in the matrix, and 0 otherwise. Next, we average the volume (including rolling averages) over each interval. Below, we find the correlations to identify which software/volume measures align best with each other. [image]
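
For reference, a small sketch of the heatmap step; the `matrix` here is a toy stand-in for the interval-level DataFrame described above (one 0/1 column per software, plus volume):

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Toy stand-in for the interval matrix (0/1 per software, plus per-interval volume).
matrix = pd.DataFrame({
    "deepgram":  [1, 1, 0, 0, 1, 0],
    "whisperX":  [1, 0, 0, 0, 1, 0],
    "stable_ts": [1, 1, 0, 1, 1, 0],
    "volume":    [0.9, 0.7, 0.1, 0.2, 0.8, 0.1],
})

sns.heatmap(matrix.corr(), annot=True, vmin=-1, vmax=1, cmap="coolwarm")
plt.title("Agreement between transcription methods and volume")
plt.show()
```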

Next steps

Figure out what software we want to use: deepgram, stable-ts, or WhisperX? (Whisper does not have word-level transcription, at least as of February 2024.) Personally, I lean towards deepgram. It's faster and more precise than WhisperX and stable-ts. There is some lack of precision, though: single words (e.g., "Yes.") are sometimes labeled with very long durations (~7+ seconds), which isn't very realistic. But in my opinion, trading some precision in word durations for fast transcription is worth it.
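
A tiny illustration of that sanity check; the word-entry format here is the same hypothetical start/stop-in-milliseconds dicts used above, not Deepgram's actual response schema, and the cutoff is arbitrary:

```python
MAX_PLAUSIBLE_WORD_SECONDS = 3.0  # illustrative cutoff for a single word

def flag_implausible_durations(words):
    """Return word entries whose reported duration looks unrealistically long."""
    return [
        w for w in words
        if (w["stop"] - w["start"]) / 1000.0 > MAX_PLAUSIBLE_WORD_SECONDS
    ]

# e.g. flag_implausible_durations([{"start": 17834, "stop": 327849, "word": "mike"}])
```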