EbaraKoji / youtube-downloader


caption_to_sentences() generates too big caption item #10

Closed EbaraKoji closed 3 months ago

EbaraKoji commented 3 months ago

When consecutive transcribed caption items do not start at sentence boundaries, converting them into sentences results in very large items.

python src/main.py 5h-JBkySK34 -m audio --transcribe 1 --translate 1

transcribed_raw.vtt

WEBVTT

1
00:00:00.000 --> 00:00:06.000
Hello everyone. Today I want to talk about Lenggraph, a new library that we're

2
00:00:06.000 --> 00:00:11.000
releasing. So Lenggraph builds on top of Lengchain and makes it really easy to

3
00:00:11.000 --> 00:00:17.000
create agents and agent runtimes. So what exactly is an agent and an agent

4
00:00:17.000 --> 00:00:22.000
runtime? So in Lengchain we define an agent as a system powered by a language
...

transcribed.vtt

WEBVTT

1
00:00:00.000 --> 00:02:59.000
Hello everyone. Today I want to talk about Lenggraph, a new library that we're releasing. So Lenggraph builds on top of Lengchain and makes it really easy to...

We may need to split large caption items into appropriately sized ones, but determining the duration of each item is not easy. One option is to split by sentence length; that may work fine in many cases, but it does not guarantee accurate durations.
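To illustrate the length-based option, here is a minimal sketch (not code from this repository; split_cue_by_length and its argument layout are hypothetical) that distributes a long cue's time span proportionally to each sentence's character count. The resulting timings are only estimates, which is exactly why this approach cannot guarantee accurate durations:

from datetime import timedelta

def split_cue_by_length(start: timedelta, end: timedelta, sentences: list[str]):
    """Split one long cue into per-sentence cues, allocating time
    proportionally to each sentence's character count (estimate only)."""
    total_chars = sum(len(s) for s in sentences) or 1
    total_duration = end - start
    cues = []
    cursor = start
    for sentence in sentences:
        share = total_duration * (len(sentence) / total_chars)
        cue_end = min(cursor + share, end)
        cues.append((cursor, cue_end, sentence))
        cursor = cue_end
    return cues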

EbaraKoji commented 3 months ago

Possible procedures

EbaraKoji commented 3 months ago

Although caption_to_sentences() itself has not been changed, word_timestamp_to_caption() has been added, and it seems to have solved the problem.
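For context, word-level timestamps make this tractable because each word carries its own start and end time, so cue boundaries no longer have to be guessed. The following is only an illustrative sketch of that idea; the function name and the word-dict shape are assumptions, not the actual word_timestamp_to_caption() implementation:

def words_to_sentence_cues(words: list[dict]):
    """Group word-level timestamps into sentence-sized cues.
    Each word is assumed to look like {"word": " Hello", "start": 0.0, "end": 0.4}.
    A cue is closed when a word ends with sentence-final punctuation."""
    cues = []
    current = []
    for w in words:
        current.append(w)
        if w["word"].rstrip().endswith((".", "?", "!")):
            text = "".join(x["word"] for x in current).strip()
            cues.append((current[0]["start"], current[-1]["end"], text))
            current = []
    if current:  # flush trailing words that never hit sentence-final punctuation
        text = "".join(x["word"] for x in current).strip()
        cues.append((current[0]["start"], current[-1]["end"], text))
    return cues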