OpenPecha / stt_create_conversation_data

MIT License

STT0050: Creating conversation data #1

Open gangagyatso4364 opened 1 month ago

gangagyatso4364 commented 1 month ago

Description:

We need to create conversation data from NS audio using speaker diarization and the timestamps of the existing STT data. Use an existing speaker diarization model from pyannote.audio. The expected output is a JSON file:

{
  "conversations": [
    {
      "conversation_id": 1,
      "participants": ["Speaker One", "Speaker Two"],
      "dialogue": [
        {
          "speaker": "Speaker One",
          "text": "Hello, how are you?"
        },
        {
          "speaker": "Speaker Two",
          "text": "I'm good, thank you! How about you?"
        }
      ]
    },
    {
      "conversation_id": 2,
      "participants": ["Speaker One", "Speaker Two", "Speaker Three"],
      "dialogue": [
        {
          "speaker": "Speaker One",
          "text": "Are we meeting tomorrow?"
        },
        {
          "speaker": "Speaker Three",
          "text": "Yes, let's meet at 10 AM."
        },
        {
          "speaker": "Speaker Two",
          "text": "Sounds good to me."
        }
      ]
    }
  ]
}

Implementation:

  1. Extract speaker information from the audio using a speaker diarization model, which will provide time intervals and speaker identities.
  2. Align the speaker segments with STT transcriptions to assign the correct speaker to each text based on timestamp matching.
  3. Organize the speaker dialogues into a structured format, identifying participants in each conversation and compiling their dialogues sequentially.
  4. Output the conversation data into a JSON format where each conversation includes a unique conversation_id, the participants, and the dialogue.
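
Step 2 above could be sketched roughly as follows. This is only an illustration, assuming the diarization output has already been reduced to plain `(start, end, speaker)` tuples in seconds and the STT data to `(start, end, text)` tuples. The function names `overlap` and `assign_speakers` and the "Unknown" fallback label are hypothetical, not part of the existing code base.

```python
from collections import defaultdict


def overlap(a_start, a_end, b_start, b_end):
    """Length of the intersection of two time intervals (0.0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(turns, segments):
    """Label each STT segment with the speaker whose turns overlap it most.

    turns:    list of (start, end, speaker) tuples, in seconds (assumed shape)
    segments: list of (start, end, text) tuples, in seconds (assumed shape)
    """
    dialogue = []
    for seg_start, seg_end, text in sorted(segments):
        # Sum the overlap per speaker, since one speaker may have several turns.
        totals = defaultdict(float)
        for t_start, t_end, speaker in turns:
            totals[speaker] += overlap(seg_start, seg_end, t_start, t_end)
        best = max(totals, key=totals.get) if totals else None
        if best is None or totals[best] == 0.0:
            best = "Unknown"  # hypothetical label for segments no turn covers
        dialogue.append({"speaker": best, "text": text})
    return dialogue


turns = [(0.0, 2.0, "Speaker One"), (2.0, 5.0, "Speaker Two")]
segments = [
    (0.2, 1.8, "Hello, how are you?"),
    (2.1, 4.7, "I'm good, thank you! How about you?"),
]
dialogue = assign_speakers(turns, segments)
print(dialogue)
```

Picking the speaker with the largest total overlap is one possible matching rule; a segment that straddles two speakers will still get a single label, so such cases may need manual review.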

Subtasks:

  1. Transcription Alignment:

    • [ ] Parse the existing STT transcriptions with timestamps.
    • [ ] Match transcription timestamps with speaker intervals from the diarization step.
    • [ ] Assign each transcription to the appropriate speaker.
  2. Data Structuring:

    • [ ] Organize the speaker and transcription data into conversation blocks.
    • [ ] Identify participants in each conversation.
    • [ ] Ensure proper sequencing of dialogues for a coherent conversation flow.
  3. JSON Output Generation:

    • [ ] Create a function to compile the conversations into a JSON structure as shown in the example.
    • [ ] Export the structured conversations into a JSON file.
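
A possible shape for the data structuring and JSON export subtasks is sketched below. It assumes each utterance is already a dict with `start`, `end`, `speaker` and `text` keys, and that a new conversation begins after a long silence. The `max_gap` threshold of 30 seconds and the names `build_conversations` and `export_conversations` are assumptions for illustration; the issue does not say how conversations should be delimited.

```python
import json


def build_conversations(utterances, max_gap=30.0):
    """Group utterances into conversations, splitting at silences > max_gap.

    max_gap is a hypothetical threshold in seconds, not a value from the issue.
    """
    blocks, current = [], []
    for utt in sorted(utterances, key=lambda u: u["start"]):
        if current and utt["start"] - current[-1]["end"] > max_gap:
            blocks.append(current)
            current = []
        current.append(utt)
    if current:
        blocks.append(current)

    result = {"conversations": []}
    for idx, block in enumerate(blocks, start=1):
        # dict.fromkeys keeps speakers in order of first appearance.
        participants = list(dict.fromkeys(u["speaker"] for u in block))
        result["conversations"].append({
            "conversation_id": idx,
            "participants": participants,
            "dialogue": [
                {"speaker": u["speaker"], "text": u["text"]} for u in block
            ],
        })
    return result


def export_conversations(data, path):
    """Write the structure to a UTF-8 JSON file without escaping non-ASCII text."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(data, f, ensure_ascii=False, indent=2)


utterances = [
    {"start": 0.0, "end": 1.5, "speaker": "Speaker One", "text": "Are we meeting tomorrow?"},
    {"start": 1.8, "end": 3.0, "speaker": "Speaker Three", "text": "Yes, let's meet at 10 AM."},
    {"start": 120.0, "end": 121.0, "speaker": "Speaker Two", "text": "Sounds good to me."},
]
data = build_conversations(utterances)
print(json.dumps(data, ensure_ascii=False, indent=2))
```

`ensure_ascii=False` is used so that non-Latin transcriptions stay readable in the output file instead of being written as `\u` escapes.
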
gangagyatso4364 commented 1 month ago

The function used to get the time span from an STT file name:

def get_time_span(filename):
    """Return the time span encoded in an STT segment file name (0 on failure)."""
    # Strip the audio extension, lower or upper case.
    for ext in (".wav", ".WAV", ".mp3", ".MP3"):
        filename = filename.replace(ext, "")
    try:
        if "_to_" in filename:
            # "..._<start>_to_<end>...": the difference is divided by 1000.
            start, end = filename.split("_to_")
            start = float(start.split("_")[-1])
            end = float(end.split("_")[0])
            return (end - start) / 1000
        else:
            # "..._<start>-<end>_...": the difference is returned as-is.
            start, end = filename.split("-")
            start = float(start.split("_")[-1])
            end = float(end.split("_")[0])
            return abs(end - start)
    except Exception as err:
        print(f"filename is:'{filename}'. Could not parse to get time span. ({err})")
        return 0
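
For matching against diarization intervals, the start and end of each segment are needed, not just the duration. A minimal sketch of a companion helper is shown below. It only covers the `_to_` naming pattern and assumes those values are milliseconds, which the division by 1000 above suggests but does not confirm. The name `get_segment_bounds` and the example file name are hypothetical.

```python
import re

# Matches "_<start>_to_<end>" where both values are integers or decimals.
_TO_PATTERN = re.compile(r"_(\d+(?:\.\d+)?)_to_(\d+(?:\.\d+)?)")


def get_segment_bounds(filename):
    """Return (start, end) in seconds, or None if the name has no '_to_' span.

    Assumes the values in the file name are milliseconds.
    """
    match = _TO_PATTERN.search(filename)
    if match is None:
        return None
    start_ms, end_ms = (float(group) for group in match.groups())
    return start_ms / 1000, end_ms / 1000


# Hypothetical file name, used only to illustrate the pattern.
print(get_segment_bounds("STT_NS_0001_12000_to_15500.wav"))
```

File names that use the `-` separator are left out here because the unit of their values is not stated in this thread.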