bklynhlth / openwillis

Python library for digital measurement of health

Issues with Speaker Separation with Labels and Speech Characteristics #114

Closed: reemTamimi closed this issue 1 month ago

reemTamimi commented 2 months ago

I was able to run Speech Transcription with WhisperX, which output a JSON and a text file. Now I want to use Speaker Separation with Labels (since I am using WhisperX). It runs, but it does not seem to output any audio files separated by speaker. I define the output path when calling the to_audio function after creating the dict, but nothing happens and I do not receive an error.

Also, for the Speech Characteristics function, your sample code says it needs "transcript_json" as its argument, but when I try to run that I get an error. I ultimately concluded that it needs "json_conf" instead, which I tried using by passing the path to my output JSON file, but it errors out and says that 'str' object does not support item assignment (screenshot below).

[Screenshot: error traceback showing "'str' object does not support item assignment"]

Let me know if you need me to provide any more information

GeorgeEfstathiadis commented 2 months ago

Hi Reem, thank you for pointing out the discrepancy in our docs. I've gone ahead and fixed it; the argument the Speech Characteristics function requires is indeed json_conf.

So I believe the issue, in both Speaker Separation and Speech Characteristics, is that the transcription JSON required as input is not the file path to the JSON but the actual JSON loaded into memory as a dictionary. For example, instead of setting json_conf in speech_characteristics to the file path, do this:

import json
import openwillis as ow

filepath = '' # JSON transcription location

with open(filepath) as f:
    json_conf = json.load(f)

# pass json_conf to function e.g.
words, turns, summary = ow.speech_characteristics(json_conf, language='en', speaker_label='speaker_1')

reemTamimi commented 2 months ago

I tested that out and that works! However, the speaker separation with labels function still returns an empty dictionary, hence no output audio file.

GeorgeEfstathiadis commented 2 months ago

Can you provide the exact code you run for speaker separation with labels, so I can identify the issue? Here's an example use case, so you can make sure your code is structured the same way:

import openwillis as ow
import json

json_filepath = '' # JSON transcription location
audio_filepath = '' # audio file location

with open(json_filepath, 'r') as f:
    transcript_json = json.load(f)

speaker_dict = ow.speaker_separation_labels(filepath = audio_filepath, transcript_json = transcript_json)
# then to save the files
ow.to_audio(filepath = audio_filepath, speaker_dict = speaker_dict, output_dir = 'path/to/save/audios')

reemTamimi commented 2 months ago

I see, I did not have the json load part. I added that and my code looks like this:

import json
import openwillis as ow

filepath = 'json/3002_Interview_Recording_LG.json'

with open(filepath) as f:
    transcript_json = json.load(f)
speaker_dict = ow.speaker_separation_labels(filepath='audio/3002_Interview_Recording_LG.mp3', transcript_json=transcript_json)

ow.to_audio(filepath='audio/3002_Interview_Recording_LG.mp3', speaker_dict=speaker_dict, output_dir='speaker_separate/')

but now I am receiving this error:

[Screenshot: error traceback]

GeorgeEfstathiadis commented 2 months ago

The issue is with the audio file format. Currently the function only works with wav audio files. You can easily convert your mp3 to wav with a variety of tools, and there are also free online converters (e.g. https://cloudconvert.com/mp3-to-wav).
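If you prefer converting locally, a minimal sketch using the third-party pydub library (pip install pydub; it needs ffmpeg installed and is not part of OpenWillis), reusing the file path from this thread:

import os
from pydub import AudioSegment  # third-party, requires ffmpeg

mp3_path = 'audio/3002_Interview_Recording_LG.mp3'
wav_path = os.path.splitext(mp3_path)[0] + '.wav'

# decode the mp3 and re-encode it as wav next to the original
AudioSegment.from_mp3(mp3_path).export(wav_path, format='wav')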

reemTamimi commented 2 months ago

I converted it to .wav and that worked! I have one more question. I am using the WhisperX speech transcription and it is doing a great job, except for a few words throughout that it misinterprets, probably because our interviews are with children. I was planning to go back and manually check the audio files against the text files to make sure the correct words are detected. How does this change the metrics in the JSON file? Would I simply change the word that was detected?

GeorgeEfstathiadis commented 2 months ago

If you want to manually edit the JSON file extracted from the WhisperX transcription you can definitely do that, although we don't recommend it. To edit the transcribed words, you would need to go into the segments section of the JSON and then the words subsection of each segment.
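For illustration, here is a minimal sketch of such an edit, assuming the usual WhisperX output layout (a top-level segments list whose entries each carry a words list); the segment and word indices and the replacement word are hypothetical:

import json

filepath = 'json/3002_Interview_Recording_LG.json'  # path from this thread

with open(filepath) as f:
    transcript_json = json.load(f)

# hypothetical fix: suppose the third word of the first segment was misheard
word_entry = transcript_json['segments'][0]['words'][2]
word_entry['word'] = 'from'  # e.g. replacing a misrecognized 'for'

# keep the segment-level text consistent with the corrected words
# (approximate; WhisperX may format segment text slightly differently)
segment = transcript_json['segments'][0]
segment['text'] = ' '.join(w['word'] for w in segment['words'])

with open(filepath.replace('.json', '_edited.json'), 'w') as f:
    json.dump(transcript_json, f)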

If you are talking about only editing the transcribed word, e.g. 'for' instead of 'from', there are a few things to consider. If your language is not English, the only affected measures in Speech Characteristics will be those related to speech coherence. If your language is English, the affected measures will be those related to speech coherence, sentiment, and parts of speech.

reemTamimi commented 2 months ago

I understand. It is in English; however, children tend not to speak very clearly, so that is probably why there are some discrepancies. Would you recommend changing only the text transcription and keeping the JSON file as is, in order to preserve the speech characteristics?

GeorgeEfstathiadis commented 2 months ago

I'm not sure what you're referring to with 'text transcription'. I suggest not editing the JSON file manually; if the discrepancies are small, the difference in the Speech Characteristics will be negligible.

But if you do want to edit the JSON file to improve the accuracy of your analysis, you can follow the steps described above: go into the segments section of the JSON and edit the words subsection of each segment, keeping in mind that we don't recommend it.

reemTamimi commented 2 months ago

Thank you, I understand the considerations around editing the JSON file. By 'text transcription' I mean the text file that is one of the outputs of the WhisperX speech transcription function.

My overall concern is how to deal with speech transcription discrepancies so that I get the most accurate metrics for the audio file.

anzarabbas commented 2 months ago

@reemTamimi my understanding is that you can indeed edit the words in the JSON if you're willing to do so. I agree with @GeorgeEfstathiadis that, depending on how many errors you're seeing, it may not have a huge impact on downstream analysis. It's important to note that many of the measures in the speech characteristics function do not depend on the language content itself but rather on the timings around it (e.g. word start/end times), so most of the measures won't be affected by fixing the transcription. There is also a set of language-dependent measures (e.g. sentiment, the semantic ones) that are indeed affected by transcription accuracy. If you think those measures would benefit from fixing transcription errors, then you can definitely do it. @GeorgeEfstathiadis any thoughts?

GeorgeEfstathiadis commented 2 months ago

No thoughts, that covers it.

But just to clarify, @reemTamimi: the 'transcription text' output of the WhisperX transcription is there only for your own benefit, so you can review the transcribed text and judge whether the performance is decent. It is not an input to the speech characteristics function, so editing it won't influence your downstream analysis.

reemTamimi commented 2 months ago

Thank you both for your input. I will play around and see which approach benefits me the most.

reemTamimi commented 2 months ago

I am experiencing some discrepancies with the Speaker Separation with Labels function. The JSON file is relatively accurate in identifying the speakers. However, when I plug it into the speaker separation function, the output audio files are not accurate: a lot of audio is missing or attributed to the wrong speaker. I also had a case where there were three speakers in the original audio, but the function only returned the first two speakers it recognized. I would like to use this function successfully so I can make a separate audio file of the interviewee and then run Vocal Acoustics on it, without having to do it manually.

GeorgeEfstathiadis commented 2 months ago

In terms of speaker separation with labels performance, there's not much we can do on our end. The function simply uses the timestamps in the transcription JSON to filter the audio, so if the transcription JSON is not very accurate, the separated audio files won't be either.

In terms of audio files with more than 2 speakers, we don't support this in the current version of the function: speaker separation only works for transcriptions with 2 speakers.
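For intuition, here is a minimal sketch of what timestamp-based separation amounts to; this is not OpenWillis's actual implementation, and it assumes WhisperX-style segments carrying speaker, start, and end fields (it reuses transcript_json from above and the third-party pydub library):

import os
from pydub import AudioSegment  # third-party: pip install pydub, needs ffmpeg

audio = AudioSegment.from_wav('audio/3002_Interview_Recording_LG.wav')

# collect the audio spans attributed to each speaker label
speakers = {}
for seg in transcript_json['segments']:
    label = seg.get('speaker')
    if label is None:
        continue  # segments without a speaker label are simply dropped
    # pydub slices in milliseconds; mistimed or mislabeled segments
    # end up missing or attributed to the wrong speaker in the output
    clip = audio[int(seg['start'] * 1000):int(seg['end'] * 1000)]
    speakers[label] = speakers.get(label, AudioSegment.empty()) + clip

os.makedirs('speaker_separate', exist_ok=True)
for label, clip in speakers.items():
    clip.export(f'speaker_separate/{label}.wav', format='wav')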