MahmoudAshraf97 / whisper-diarization

Automatic Speech Recognition with Speaker Diarization based on OpenAI Whisper
BSD 2-Clause "Simplified" License

Json output? #132

Closed vladgrand2 closed 5 months ago

vladgrand2 commented 10 months ago

Is it possible to add a parameter to diarize.py to get a JSON output file with speakers, like in whisperx?

bap-development commented 10 months ago

I need help with this as well. It would be good if the result could be written as JSON rather than a txt file.

MahmoudAshraf97 commented 10 months ago

Can you post an example JSON so I can replicate the schema?

vladgrand2 commented 10 months ago

test (whisperx_output).json test(after script).json

I wrote some temporary code to convert the SRT files from diarize.py to JSON like whisperx's output, which I need for my work and for further processing.


import json
import re
import sys
from pathlib import Path

# Matches a leading "Speaker N:" label in the cue text.
SPEAKER_RE = re.compile(r'^Speaker (\d+):\s*')


def convert_time_to_seconds(time):
    # SRT timestamps look like HH:MM:SS,mmm
    h, m, s = time.split(':')
    s, ms = s.split(',')
    return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000


def convert_srt_to_json(srt_file):
    segments = []
    speaker = ''

    # utf-8-sig reads plain UTF-8 and also tolerates a leading BOM
    with open(srt_file, 'r', encoding='utf-8-sig') as file:
        lines = file.readlines()

    for i in range(len(lines) - 2):
        line = lines[i].strip()

        # A cue starts with a numeric index followed by a timestamp line
        if line.isdigit() and ' --> ' in lines[i + 1]:
            start, end = lines[i + 1].strip().split(' --> ')
            text = lines[i + 2].strip()

            # Check whether the text starts with "Speaker N:" and map it
            # to the whisperx-style name, e.g. "SPEAKER_01" or "SPEAKER_00"
            match = SPEAKER_RE.match(text)
            if match:
                speaker = f'SPEAKER_{int(match.group(1)):02d}'
                text = text[match.end():].strip()

            segments.append({
                'start': convert_time_to_seconds(start),
                'end': convert_time_to_seconds(end),
                'text': text,
                'words': [{}],  # placeholder: the SRT has no word-level timings
                'speaker': speaker
            })

    json_data = {'segments': segments}
    json_file = str(Path(srt_file).with_suffix('.json'))

    with open(json_file, 'w', encoding='utf-8') as file:
        json.dump(json_data, file, indent=4, ensure_ascii=False)

    print(f"Successfully converted SRT to JSON: {json_file}")


if __name__ == '__main__':
    convert_srt_to_json(sys.argv[1])
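
For anyone adapting the script, here is a minimal sketch of a variant that splits the SRT on blank lines, so cue text that spans several lines is kept too. It assumes cues are separated by blank lines and that speaker labels look like "Speaker N:" at the start of the text. The names `parse_srt_cues`, `to_seconds` and the sample text are made up for illustration and are not part of diarize.py or whisperx.

```python
import re

# Assumption: labels such as "Speaker 0:" appear at the start of the cue text.
SPEAKER_RE = re.compile(r"^Speaker (\d+):\s*")
TIME_RE = re.compile(r"(\d+):(\d+):(\d+),(\d+)")


def to_seconds(stamp):
    """Convert an SRT timestamp (HH:MM:SS,mmm) to seconds."""
    h, m, s, ms = (int(part) for part in TIME_RE.fullmatch(stamp.strip()).groups())
    return h * 3600 + m * 60 + s + ms / 1000


def parse_srt_cues(srt_text):
    """Hypothetical helper: turn SRT text into whisperx-like segment dicts."""
    segments = []
    speaker = ""
    # Cues are separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", srt_text.strip()):
        lines = block.strip().splitlines()
        if len(lines) < 3 or " --> " not in lines[1]:
            continue  # skip malformed cues
        start, end = lines[1].split(" --> ")
        # Join every text line of the cue, not only the first one.
        text = " ".join(line.strip() for line in lines[2:])
        match = SPEAKER_RE.match(text)
        if match:
            speaker = f"SPEAKER_{int(match.group(1)):02d}"
            text = text[match.end():].strip()
        segments.append({
            "start": to_seconds(start),
            "end": to_seconds(end),
            "text": text,
            "speaker": speaker,
        })
    return segments


# Made-up sample input for illustration only.
SAMPLE = """1
00:00:01,500 --> 00:00:03,000
Speaker 0: Hello there.

2
00:00:03,250 --> 00:01:05,000
Speaker 1: Hi,
how are you?
"""

segments = parse_srt_cues(SAMPLE)
```

A cue without a label keeps the speaker of the previous cue, the same as in the script above.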

I also recommend adding this to diarize.py:

parser.add_argument(
    "--language",
    dest="language",
    default=None,
    help="language spoken in the audio, specify None to perform language detection",
)

This is needed for correct transcription. By default whisperx tries to detect the language, and in many cases that ruins the transcription.
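
A small standalone sketch of how the suggested flag behaves with argparse. It reuses only the `add_argument` call above; the `describe_language` helper is hypothetical, and nothing here shows how diarize.py or whisperx would actually consume the value.

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--language",
    dest="language",
    default=None,
    help="language spoken in the audio, specify None to perform language detection",
)


def describe_language(args):
    """Hypothetical helper: report what the parsed flag would request."""
    if args.language is None:
        return "auto-detect"
    return f"forced: {args.language}"


# Flag omitted: the default None is used.
auto = parser.parse_args([])
# Flag given: the value arrives as a plain string.
forced = parser.parse_args(["--language", "ru"])

print(describe_language(auto))
print(describe_language(forced))
```

Note that typing `--language None` on the command line gives the string "None", not the Python value None, so in this sketch the flag has to be left out to get detection.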

I also noticed that on Windows diarize.py doesn't work with large-v3, but on Linux everything is OK. The original whisperx works fine with large-v3 on Windows after the update.