In the tutorial notebook, I run:

alignment_durations, _, tokenized_text_tokens = alignment_extractor.extract_alignment("LJ037-0171_sr16k.wav", en_transcription, plot=True, add_trailing_silence=False)

Both `alignment_durations` and the second return value (`_`) are 1×160 matrices of the form `[[x, x, x, ..., x]]`. I expected something like SRT/VTT subtitles, with explicit start and end times. Why are the results returned this way, and how can I convert them into start/end times?
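To illustrate what I'm after, here is a minimal sketch of the conversion I expected. It assumes each entry in `alignment_durations` is a per-token duration measured in spectrogram frames; the `hop_length` and `sample_rate` values and the `durations_to_times` helper are hypothetical placeholders, since the actual frame parameters depend on the model config:

```python
import numpy as np

def durations_to_times(durations, tokens, hop_length=256, sample_rate=16000):
    """Turn per-token frame durations into (token, start_s, end_s) tuples.

    Assumes `durations` is a 1xN matrix of frame counts, as returned above,
    and that hop_length / sample_rate converts frames to seconds.
    """
    durations = np.asarray(durations).flatten()        # (1, N) -> (N,)
    ends = np.cumsum(durations) * hop_length / sample_rate
    starts = np.concatenate(([0.0], ends[:-1]))        # each start = previous end
    return list(zip(tokens, starts, ends))

# Toy example with made-up tokens and durations:
tokens = ["HH", "AH", "L", "OW"]
durations = [[10, 8, 12, 20]]
for token, start, end in durations_to_times(durations, tokens):
    print(f"{token}: {start:.3f}s - {end:.3f}s")
```

Is a cumulative sum like this the intended way to recover timestamps, and what are the correct hop length and sample rate to use here?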