v-nhandt21 closed this 7 months ago
Can you elaborate more on what you are trying to achieve?
I am trying to turn the segmentation in the bottom image into the one in the top image. I did it with:
    def concate_continues(rttm_file, new_rttm_file, time_distance=0.5):
        """Merge consecutive same-speaker RTTM segments whose gap is <= time_distance seconds."""
        with open(rttm_file, "r", encoding="utf-8") as f:
            rttm_entries = [line for line in f.read().splitlines() if line.strip()]
        prev_speaker = None
        prev_end = None
        STACK = []
        for rttm_entry in rttm_entries:
            fields = rttm_entry.split()
            start_time = float(fields[3])             # RTTM field 4: onset
            end_time = start_time + float(fields[4])  # RTTM field 5: duration
            speaker_label = fields[7]                 # RTTM field 8: speaker name
            if prev_speaker == speaker_label and start_time - prev_end <= time_distance:
                # Extend the previous segment. Rewrite only the duration field by index:
                # a str.replace() on the whole line can miss (e.g. "1.50" vs "1.5")
                # or hit another field that happens to hold the same number.
                prev_fields = STACK.pop().split()
                end_time = max(end_time, prev_end)
                prev_fields[4] = f"{end_time - float(prev_fields[3]):.3f}"
                STACK.append(" ".join(prev_fields))
            else:
                STACK.append(rttm_entry)
            prev_speaker = speaker_label
            prev_end = end_time
        with open(new_rttm_file, "w", encoding="utf-8") as fw:
            for s in STACK:
                fw.write(s + "\n")
        return new_rttm_file
The labels for the audio are split into many small chunks. How can I reduce this fragmentation?
P.S. The speaker prediction itself is quite good.
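For reference, the merging idea from the function above can be tried on in-memory segments before touching any RTTM file. This is only a minimal sketch: the `(onset, duration, speaker)` tuple layout, the name `merge_close_segments`, and the `max_gap` parameter are made up for illustration and do not come from any library. Like the original function, it only compares each segment with the one immediately before it, so it will not bridge a gap that another speaker's segment falls into.

```python
from typing import List, Tuple

# Hypothetical in-memory layout: (onset_seconds, duration_seconds, speaker_label)
Segment = Tuple[float, float, str]


def merge_close_segments(segments: List[Segment], max_gap: float = 0.5) -> List[Segment]:
    """Merge consecutive same-speaker segments separated by at most max_gap seconds."""
    merged: List[Segment] = []
    for onset, duration, speaker in sorted(segments):
        end = onset + duration
        if merged:
            prev_onset, prev_duration, prev_speaker = merged[-1]
            prev_end = prev_onset + prev_duration
            if speaker == prev_speaker and onset - prev_end <= max_gap:
                # Extend the previous segment instead of starting a new one.
                new_end = max(prev_end, end)
                merged[-1] = (prev_onset, new_end - prev_onset, speaker)
                continue
        merged.append((onset, duration, speaker))
    return merged


segments = [
    (0.0, 1.0, "spk0"),
    (1.2, 0.8, "spk0"),  # 0.2 s after the previous spk0 segment: merged
    (2.5, 1.0, "spk1"),  # speaker change: kept as a new segment
    (4.5, 0.5, "spk1"),  # 1.0 s gap, larger than max_gap: kept separate
]
result = merge_close_segments(segments)
for seg in result:
    print(seg)
```

Raising `max_gap` merges more aggressively, at the cost of swallowing real pauses, so it is worth checking the result against the audio for a few files.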