Watts-Lab / deliberation-empirica


Explore video analysis options #695

Open JamesPHoughton opened 5 months ago

JamesPHoughton commented 5 months ago

We want to explore the various options for analyzing the audio and video recordings. I've listed a number of different possible measures below. For each of these measures:

We also want to know if there are other measures that are commonly used.

Put outputs in this spreadsheet: https://docs.google.com/spreadsheets/d/1IIvfpwrBdvJ2Szd1jCwfTHPZvTkIB2DGHqYHbniyMwA/edit#gid=0

Audio analysis

Volume analysis

Pitch

Transcript analysis

Video analysis

First steps

JamesPHoughton commented 3 months ago

Notes from Feb 5 meeting

pseudocode

entry: { start: 17834, stop: 327849, word: 'mike' }

for col in [whisper, whisperX, deepgram]:
  for i in range(1024):
    for entry in entries from that particular tool:
       if entry.start < time[i] and entry.stop >= time[i]: mark interval i as true for this tool and break out of the inner loop
    otherwise mark interval i as false
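
A minimal runnable version of that loop, as a sketch only: the names `frame_times` and `entries` are illustrative, and the word dicts are assumed to carry start/stop times in milliseconds as in the example entry above.

```python
import pandas as pd

# Sketch: `frame_times[i]` is the timestamp of frame i (ms); `entries` maps each
# tool name to its list of word dicts, e.g. {"start": 17834, "stop": 327849, "word": "mike"}.
def build_agreement_matrix(frame_times, entries, n_frames=1024):
    matrix = pd.DataFrame(0, index=range(n_frames), columns=list(entries))
    for tool, words in entries.items():
        for i in range(n_frames):
            # 1 if any word from this tool spans frame i's timestamp, else 0
            if any(w["start"] < frame_times[i] <= w["stop"] for w in words):
                matrix.loc[i, tool] = 1
    return matrix
```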

Lower priority

mxumary commented 3 months ago

Look into: librosa visualization using the wav file; figure out how to use getStart to adjust the shift for the volume visualization (see the pseudocode above). There are 1024 frames. For each frame, check whether any words fall within that time range -- each transcript entry has a start and end time, so check whether the entry overlaps the frame. Make a df of length 1024. Each column will be a binary indicator for each software, indicating whether it "agrees" with the volume.

JamesPHoughton commented 3 months ago

@mxumary I'm just going to make my clarification here instead of Slack so that we have it later for reference. =)

What I'd eventually like is to have a matrix where rows are timestamps throughout the discussion - you can break it into 1024 intervals, as you did with librosa, or (probably more intuitively) some interval that corresponds to half or a quarter of a second. You can pick whichever is easiest to implement. Then I'd like the columns to be the different algorithms we've tried (volume method, deepgram, whisperX, whisper, stableTS), and the cells should be filled in with a 1 if the column's method predicts that the participant is speaking during the row's interval, and zero otherwise.

With this matrix, we can look at the correlation between the different services.
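
A minimal sketch of that last step, using a toy stand-in for the matrix (the real 0/1 columns and per-interval volume values would come from the pipeline described above):

```python
import numpy as np
import pandas as pd

# Toy stand-in: one 0/1 speaking indicator per method, plus a per-interval mean volume.
rng = np.random.default_rng(0)
matrix = pd.DataFrame({
    "volume_method": rng.integers(0, 2, 100),
    "deepgram": rng.integers(0, 2, 100),
    "whisperX": rng.integers(0, 2, 100),
    "whisper": rng.integers(0, 2, 100),
    "stableTS": rng.integers(0, 2, 100),
})
matrix["volume"] = rng.random(100)  # stand-in for the per-interval mean volume

# Pairwise correlations between all of the methods (and the raw volume)
print(matrix.corr())
```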

mxumary commented 3 months ago

TODO (ideally by Wednesday!!!!): screenshot all of the plots and add comments to provide context. Add the Colab notebook(s) to the repo -- documentation that converts audio to text, and documentation that converts the table to data visualizations and correlation plots.

See colab notebook here: https://github.com/Watts-Lab/deliberation-video-pipeline/blob/main/updated_vizzes_with_whisper%2C_whisperx_stable_ts%2C_deepgram.ipynb

Make sure that the rolling window is centered. Make two new columns in the correlation: one with the current volume rolling average, and another with a "centered" rolling average.

This documentation shows how you can make a centered rolling average.

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rolling.html
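
For example (the column names, toy values, and window size here are just illustrative), the two columns could be built like this:

```python
import pandas as pd

df = pd.DataFrame({"volume": [0.1, 0.4, 0.3, 0.8, 0.6, 0.2, 0.5]})  # toy per-interval volumes
window = 3  # number of intervals in the rolling window (illustrative)

df["volume_rolling"] = df["volume"].rolling(window).mean()                         # trailing window
df["volume_rolling_centered"] = df["volume"].rolling(window, center=True).mean()   # centered window
print(df)
```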

The following image shows the cases that we need to cover when it comes to the correlation plot. The current code covers all of these cases.

[image]

FOR LATER: try to find a solution using `ffmpeg` that can stitch videos together. For example, if a person dropped the call and then rejoined, we want to stitch the recordings so that they sound like one cohesive audio file.
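
One possible (untested) approach is ffmpeg's concat demuxer driven from Python; the file names below are placeholders, and `-c copy` assumes the segments share the same codec settings:

```python
import subprocess
import tempfile

def stitch_segments(segment_paths, output_path):
    """Concatenate a participant's recording segments into one file."""
    # The concat demuxer reads a text file that lists the input files, one per line.
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
        for path in segment_paths:
            f.write(f"file '{path}'\n")
        list_path = f.name
    # "-c copy" avoids re-encoding; re-encode instead if the segments differ in codec settings.
    subprocess.run(
        ["ffmpeg", "-f", "concat", "-safe", "0", "-i", list_path, "-c", "copy", output_path],
        check=True,
    )

# e.g. stitch_segments(["alice_part1.webm", "alice_part2.webm"], "alice_full.webm")
```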

mxumary commented 3 months ago

Overview of the types of visualizations created

NOTE: specific code can be found here https://github.com/Watts-Lab/deliberation-video-pipeline/blob/main/updated_vizzes_with_whisper%2C_whisperx_stable_ts%2C_deepgram.ipynb

One software, all speakers: done at the word level and segment/utterance level

This helps identify whether there is overlap among speakers. [image]

One speaker, all software, volume, and threshold visualizations

Identifies potential discrepancies in transcription between software packages and compares them to the volume and to "silence" (a pre-defined threshold that can be changed by the user). [image]
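
A rough sketch of the volume-plus-threshold part with librosa (the file name and threshold value are illustrative, not the notebook's actual settings):

```python
import librosa
import matplotlib.pyplot as plt

y, sr = librosa.load("speaker1.wav", sr=None)   # placeholder wav file
rms = librosa.feature.rms(y=y)[0]               # frame-level RMS "volume"
times = librosa.times_like(rms, sr=sr)          # timestamp (s) of each frame
silence_threshold = 0.01                        # user-adjustable "silence" cutoff

plt.plot(times, rms, label="RMS volume")
plt.axhline(silence_threshold, linestyle="--", label="silence threshold")
plt.xlabel("time (s)")
plt.legend()
plt.show()
```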

Correlation heatmap

Based on a pre-defined interval (here, 0.5 seconds), we check each software's transcription output to identify whether it indicates that any words were spoken in that particular interval. If so, we mark it as a 1 in the matrix, and 0 otherwise. Next, we average the volume (including rolling averages) over each interval. Below, we find the correlations to identify which software/volume measures align best with each other. [image]
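
For reference, a small sketch of the heatmap step; the `matrix` here is a toy stand-in for the interval-level DataFrame described above (one 0/1 column per software, plus volume):

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Toy stand-in for the interval matrix (0/1 per software, plus per-interval volume).
matrix = pd.DataFrame({
    "deepgram":  [1, 1, 0, 0, 1, 0],
    "whisperX":  [1, 0, 0, 0, 1, 0],
    "stable_ts": [1, 1, 0, 1, 1, 0],
    "volume":    [0.9, 0.7, 0.1, 0.2, 0.8, 0.1],
})

sns.heatmap(matrix.corr(), annot=True, vmin=-1, vmax=1, cmap="coolwarm")
plt.title("Agreement between transcription methods and volume")
plt.show()
```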

Next steps

Figure out what software we want to use: deepgram, stable-ts, or WhisperX? (Whisper does not have word-level transcription, at least as of February 2024.) Personally, I lean towards deepgram. It's faster and more precise than WhisperX and stable-ts. There is some lack of precision, though: single words (e.g., "Yes.") are sometimes labeled with very long durations (~7+ seconds), which isn't very realistic. But in my opinion, trading some precision in word durations for fast transcription is worth it.
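
A tiny illustration of that sanity check; the word-entry format here is the same hypothetical start/stop-in-milliseconds dicts used above, not Deepgram's actual response schema, and the cutoff is arbitrary:

```python
MAX_PLAUSIBLE_WORD_SECONDS = 3.0  # illustrative cutoff for a single word

def flag_implausible_durations(words):
    """Return word entries whose reported duration looks unrealistically long."""
    return [
        w for w in words
        if (w["stop"] - w["start"]) / 1000.0 > MAX_PLAUSIBLE_WORD_SECONDS
    ]

# e.g. flag_implausible_durations([{"start": 17834, "stop": 327849, "word": "mike"}])
```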