MahmoudAshraf97 / whisper-diarization

Automatic Speech Recognition with Speaker Diarization based on OpenAI Whisper
BSD 2-Clause "Simplified" License

Output format #187

Open famda opened 1 month ago

famda commented 1 month ago

Hey! Awesome work on this!

Is it possible to transcribe/diarize and get a JSON file as the output? That would be a nice feature to have.

MahmoudAshraf97 commented 1 month ago

Thanks! Yes, it's possible, and there's an example in one of the branches if you want to try it. I haven't added it to the main branch because, when it comes to JSON, everyone has their own schema and a universal one won't cut it. Happy to hear your suggestions, though.

famda commented 1 month ago

I understand. I think it's just a matter of giving the response some structure, something that can be deserialized. I was also testing this, which is kind of a wrapper API around Whisper. That API lets you choose the format you want to receive (text, json, ...).

It could work by passing an argument like --output_format [json, srt, text, or whatever].
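To make the idea concrete, here is a minimal sketch of how such a flag could dispatch to different writers. It is not taken from the repository: `build_parser`, `render`, `srt_time`, and the segment dictionary layout are all hypothetical names chosen for illustration, and only the `--output_format` flag itself comes from the suggestion above.

```python
import argparse
import json


def srt_time(seconds):
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    ms = round(seconds * 1000)
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def render(segments, output_format):
    """Serialize speaker-labelled segments (hypothetical layout) in the requested format."""
    if output_format == "json":
        return json.dumps({"segments": segments}, indent=4)
    if output_format == "srt":
        blocks = [
            f"{i}\n{srt_time(seg['start'])} --> {srt_time(seg['end'])}\n"
            f"{seg['speaker']}: {seg['text']}\n"
            for i, seg in enumerate(segments, start=1)
        ]
        return "\n".join(blocks)
    if output_format == "text":
        return "\n".join(f"{seg['speaker']}: {seg['text']}" for seg in segments)
    raise ValueError(f"unsupported output format: {output_format}")


def build_parser():
    """Build a parser exposing the suggested --output_format argument."""
    parser = argparse.ArgumentParser(description="Output format sketch")
    parser.add_argument(
        "--output_format",
        choices=["json", "srt", "text"],
        default="text",
        help="format of the result file",
    )
    return parser


# Example with an explicit argument list instead of the real command line.
example_segments = [
    {"speaker": "Speaker 0", "start": 0.0, "end": 5.4, "text": "Hi, my name is Test."}
]
args = build_parser().parse_args(["--output_format", "srt"])
print(render(example_segments, args.output_format))
```

Keeping the serialization in one `render` function means a new format only needs one more branch, and the transcription and diarization steps stay unaware of how the result is written.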

My idea was to have something like this (just a suggestion if it makes sense):

{
    "text": "Hi, my name is Test.",
    "speaker": "Speaker 0",
    "segments": [
        {
            "id": 0,
            "seek": 0,
            "start": 0.0,
            "end": 5.4,
            "text": "Hi, my name is Test.",
            "tokens": [
                double array
            ],
            "temperature": 0.0,
            "avg_logprob": -0.19734466075897217,
            "compression_ratio": 1.7903780068728523,
            "no_speech_prob": 0.1006949171423912,
            "words": [
                {
                    "word": " Hi,",
                    "start": 0.0,
                    "end": 0.64,
                    "probability": 0.7109836935997009
                },
                {
                    "word": " my",
                    "start": 0.88,
                    "end": 1.08,
                    "probability": 0.9681467413902283
                },
                {
                    "word": " name",
                    "start": 1.08,
                    "end": 1.22,
                    "probability": 0.9989060163497925
                },
                {
                    "word": " is",
                    "start": 1.22,
                    "end": 1.38,
                    "probability": 0.9960727691650391
                },
                {
                    "word": " Test.",
                    "start": 1.38,
                    "end": 1.62,
                    "probability": 0.8055099844932556
                }
            ]
        }
    ],
    "language": "en"
}
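As a rough sketch of how a structure like this could be assembled, the snippet below groups speaker-labelled words into segments and serializes them with the standard `json` module. Everything here is an assumption for illustration: `build_result` and the input word layout are hypothetical, the Whisper-specific fields (`tokens`, `avg_logprob`, and so on) are left out, and the speaker label is placed on each segment rather than at the top level so that recordings with more than one speaker can be represented.

```python
import json


def build_result(words, language):
    """Group consecutive words by speaker into segments (hypothetical helper).

    Each input word is a dict with "word", "start", "end", and "speaker" keys.
    """
    segments = []
    for w in words:
        # Start a new segment whenever the speaker changes.
        if not segments or segments[-1]["speaker"] != w["speaker"]:
            segments.append(
                {
                    "id": len(segments),
                    "speaker": w["speaker"],
                    "start": w["start"],
                    "end": w["end"],
                    "text": "",
                    "words": [],
                }
            )
        seg = segments[-1]
        seg["end"] = w["end"]
        seg["text"] += w["word"]
        seg["words"].append({"word": w["word"], "start": w["start"], "end": w["end"]})

    # Words carry a leading space, so trim it from each segment's text.
    for seg in segments:
        seg["text"] = seg["text"].strip()

    return {
        "text": " ".join(seg["text"] for seg in segments),
        "segments": segments,
        "language": language,
    }


example_words = [
    {"word": " Hi,", "start": 0.0, "end": 0.64, "speaker": "Speaker 0"},
    {"word": " my", "start": 0.88, "end": 1.08, "speaker": "Speaker 0"},
    {"word": " name", "start": 1.08, "end": 1.22, "speaker": "Speaker 0"},
    {"word": " is", "start": 1.22, "end": 1.38, "speaker": "Speaker 0"},
    {"word": " Test.", "start": 1.38, "end": 1.62, "speaker": "Speaker 0"},
]
print(json.dumps(build_result(example_words, "en"), indent=4))
```

Because the result is a plain dictionary of lists, strings, and numbers, it can be deserialized on the consumer side with any standard JSON library.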

What do you think of this?

MahmoudAshraf97 commented 1 month ago

Sounds reasonable. I'll work on it when I have the time, or maybe open a PR if possible 😁