MahmoudAshraf97 / whisper-diarization

Automatic Speech Recognition with Speaker Diarization based on OpenAI Whisper
BSD 2-Clause "Simplified" License

Hyper Parameter Tuning for MMSD model #123

Closed v-nhandt21 closed 7 months ago

v-nhandt21 commented 8 months ago

[screenshots: diarization output with the speaker labels fragmented into many short segments]

The speaker labels for the audio are segmented into many small chunks; how can I reduce this fragmentation?

P.S.: the speaker prediction itself is quite good.

MahmoudAshraf97 commented 7 months ago

Can you elaborate more on what you are trying to achieve?

v-nhandt21 commented 7 months ago

> Can you elaborate more on what you are trying to achieve?

[screenshot: desired merged segments (top) vs. current fragmented output (bottom)]

I am trying to turn the bottom image into the one on top, i.e. merge consecutive segments that belong to the same speaker. I have done it with:

```python
def concate_continues(rttm_file, new_rttm_file, time_distance=0.5):
    """Merge consecutive RTTM segments of the same speaker whose gap is
    at most `time_distance` seconds."""
    with open(rttm_file, "r", encoding="utf-8") as f:
        rttm_entries = f.read().splitlines()

    prev_speaker = None
    prev_end = None
    merged = []

    for rttm_entry in rttm_entries:
        fields = rttm_entry.split()
        start_time = float(fields[3])             # field 4: onset in seconds
        end_time = start_time + float(fields[4])  # field 5: duration in seconds
        speaker_label = fields[7]                 # field 8: speaker name

        if speaker_label == prev_speaker and start_time - prev_end <= time_distance:
            # Extend the previous segment. Rebuild the duration field
            # explicitly rather than using str.replace(): the replace
            # approach silently fails when the file stores "1.20" but
            # str(float(...)) yields "1.2", and it can also clobber an
            # identical substring in another field.
            prev_fields = merged.pop().split()
            prev_start = float(prev_fields[3])
            prev_fields[4] = str(end_time - prev_start)
            merged.append(" ".join(prev_fields))
        else:
            merged.append(rttm_entry)

        prev_speaker = speaker_label
        prev_end = end_time

    with open(new_rttm_file, "w", encoding="utf-8") as fw:
        fw.write("\n".join(merged) + "\n")

    return new_rttm_file
```
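To make the merging idea concrete, here is a minimal self-contained sketch that operates on RTTM lines in memory instead of files (the function is inlined so it runs standalone; the sample speaker labels and timestamps are made up for illustration):

```python
def merge_segments(lines, time_distance=0.5):
    """Merge consecutive same-speaker RTTM lines whose gap <= time_distance."""
    merged = []
    prev_speaker, prev_end = None, None
    for line in lines:
        fields = line.split()
        start, dur, speaker = float(fields[3]), float(fields[4]), fields[7]
        end = start + dur
        if speaker == prev_speaker and start - prev_end <= time_distance:
            # Extend the previous segment by rebuilding its duration field.
            prev_fields = merged.pop().split()
            prev_fields[4] = str(end - float(prev_fields[3]))
            merged.append(" ".join(prev_fields))
        else:
            merged.append(line)
        prev_speaker, prev_end = speaker, end
    return merged

# Hypothetical RTTM lines: fields are
# SPEAKER <file> <chan> <onset> <duration> <NA> <NA> <speaker> <NA> <NA>
sample = [
    "SPEAKER demo 1 0.00 1.20 <NA> <NA> speaker_0 <NA> <NA>",
    "SPEAKER demo 1 1.40 0.80 <NA> <NA> speaker_0 <NA> <NA>",  # 0.2 s gap -> merged
    "SPEAKER demo 1 3.50 2.00 <NA> <NA> speaker_1 <NA> <NA>",  # new speaker -> kept
]
out = merge_segments(sample)
print(out)
```

The first two lines collapse into one speaker_0 segment spanning 0.00 to 2.20 s, while the speaker_1 segment is left untouched.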