JamesPHoughton opened 5 months ago
shift Deepgram and volume over by the tstart
update label order on images
Add a line to the bottom part of this chart indicating whether the volume is above some threshold.
Create a dataframe that has 1024 rows, where the columns are the different methods.
pseudocode:
entry: { start: 17834, stop: 327849, word: 'mike' }
for col in [whisper, whisperX, deepgram]:
    for i in range(1024):
        for entry in entries from that particular tool:
            if entry.start < time[i] and entry.stop >= time[i]: mark true, break out of loop
        otherwise mark false
Look into: librosa visualization using the wav file; figure out how to use `getStart` to adjust the shift for the volume visualization.
(see the pseudocode above for the following steps)
There are 1024 frames. For each frame, check whether any words fall within that time range: each transcript entry has a start and end time, so check whether the entry overlaps the frame.
Make a df of length 1024.
Each column will be a binary indicator for one piece of software, indicating whether it "agrees" with the volume.
@mxumary just going to make my clarification here instead of Slack so that we have it later for reference. =)
What I'd eventually like is to have a matrix where rows are timestamps throughout the discussion - you can break it into 1024 intervals, as you did with librosa, or (probably more intuitively) some interval that corresponds to half or a quarter second. You can pick whatever is easiest to implement. Then I'd like the columns to be the different algorithms we've tried (volume method, deepgram, whisperX, whisper, stableTS), and the cells should be filled in with a 1 if the column's method predicts that the participant is speaking during the row's interval, otherwise zero.
With this matrix, we can look at the correlation between the different services.
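A minimal sketch of building that matrix in pandas, assuming each method's output has already been reduced to a list of `(start, stop)` word timestamps in seconds (the function name and input shape here are illustrative, not the actual pipeline code):

```python
import numpy as np
import pandas as pd

def speaking_matrix(methods, total_secs, interval=0.5):
    """methods: dict mapping method name -> list of (start, stop) word times in seconds."""
    n = int(np.ceil(total_secs / interval))
    t = np.arange(n) * interval  # start time of each interval (the row labels)
    df = pd.DataFrame(0, index=t, columns=list(methods))
    for name, words in methods.items():
        for start, stop in words:
            # mark every interval that overlaps [start, stop] with a 1
            i0, i1 = int(start // interval), int(stop // interval)
            df.iloc[i0 : i1 + 1, df.columns.get_loc(name)] = 1
    return df
```

With the matrix in hand, `df.corr()` gives the pairwise correlations between methods directly.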
TODO (ideally by Wednesday!!!!): screenshot all of the plots and add comments to provide context. Add the colab notebook(s) to the repo -- documentation that converts audio to text, and documentation that converts the table to data visualizations and correlation plots.
See colab notebook here: https://github.com/Watts-Lab/deliberation-video-pipeline/blob/main/updated_vizzes_with_whisper%2C_whisperx_stable_ts%2C_deepgram.ipynb
Make sure that the rolling window is centered: make two new columns in the correlation, one with the current volume rolling average and another with a "centered" rolling average.
This documentation shows how you can make a centered rolling average.
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rolling.html
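For reference, a quick sketch of the difference between the two windows (toy numbers, not real volume data):

```python
import pandas as pd

vol = pd.Series([0.1, 0.4, 0.9, 0.7, 0.2, 0.1])

# Trailing window (default): the window ends at the current row
trailing = vol.rolling(window=3).mean()

# Centered window: the window is centered on the current row,
# so the smoothed value is not shifted later in time
centered = vol.rolling(window=3, center=True).mean()
```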
The following image shows the cases that we need to cover when it comes to the correlation plot. The current code covers all of these cases.
FOR LATER: try to find a solution using `ffmpeg` that can stitch videos together. For example, if a person dropped the call and then rejoined, we want to stitch the audio so that it sounds like one cohesive audio file.
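One possible approach, sketched with ffmpeg's concat demuxer (the helper name and file paths here are hypothetical; the command itself is standard ffmpeg usage):

```python
import subprocess

def build_stitch_command(segments, out_path, list_path="segments.txt"):
    # The concat demuxer reads a text file listing the input segments,
    # then concatenates them without re-encoding (-c copy).
    with open(list_path, "w") as f:
        for seg in segments:
            f.write(f"file '{seg}'\n")
    return ["ffmpeg", "-y", "-f", "concat", "-safe", "0",
            "-i", list_path, "-c", "copy", out_path]

# To actually run it:
# subprocess.run(build_stitch_command(["part1.wav", "part2.wav"], "full.wav"),
#                check=True)
```

Note this only concatenates; if we want the rejoin to land at the correct wall-clock offset, we'd need to insert silence for the gap first.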
NOTE: specific code can be found here https://github.com/Watts-Lab/deliberation-video-pipeline/blob/main/updated_vizzes_with_whisper%2C_whisperx_stable_ts%2C_deepgram.ipynb
This helps identify if there is an overlap among speakers.
Identifies potential discrepancies in transcription between software packages and compares them to volume and "silence" (a pre-defined threshold that the user can change).
Based on a pre-defined interval (here, 0.5 seconds), we check each software's transcription output to determine whether any words were spoken in that particular interval. If so, we mark it as a 1 in the matrix, 0 otherwise. Next, we average volume (including rolling averages) over each interval. Below, we find the correlations to identify which software/volume signals align with each other best.
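The final step is just a pairwise correlation over the interval matrix; a toy example of what that looks like (made-up values, not real pipeline output):

```python
import pandas as pd

# One row per 0.5 s interval; 1 = the method says someone is speaking
df = pd.DataFrame({
    "volume":   [1, 1, 0, 0, 1, 0],
    "deepgram": [1, 1, 0, 1, 1, 0],
    "whisperX": [1, 0, 0, 0, 1, 0],
})

# Pairwise Pearson correlation between the methods
# (on binary columns this is the phi coefficient)
corr = df.corr()
```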
Figure out what software we want to use: deepgram, stable-ts, WhisperX? (Whisper does not have word-level timestamps, at least as of February 2024.) Personally, I lean towards deepgram: it's faster and more precise than WhisperX and stable-ts. There is some lack of precision, though. A single short word (e.g., "Yes.") is sometimes labeled with a very long duration (~7+ seconds), which isn't realistic. But the tradeoff between precise durations and fast transcription is worth it, in my opinion.
We want to explore the various options for analyzing the audio and video recordings. I've listed a number of different possible measures below. For each of these measures:
We also want to know if there are other measures that are commonly used.
Put outputs in this spreadsheet: https://docs.google.com/spreadsheets/d/1IIvfpwrBdvJ2Szd1jCwfTHPZvTkIB2DGHqYHbniyMwA/edit#gid=0
Audio analysis
Volume analysis
Pitch
Transcript analysis
Video analysis
First steps