Chunks of YouTube transcripts should include their timestamp offsets

FullFact / health-misinfo-shared

Raphael health misinformation project, shared by Full Fact and Google

MIT License


dcorney closed this issue 2 months ago

dcorney commented 2 months ago

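Currently, we get and store the offset (in seconds) with each piece of transcript text. But when we combine those pieces into chunks of text to pass to an LLM, we discard the offsets.

Proposal: track the offset of each chunk, i.e. where in the video the chunk starts.

One caveat: if the chunks are long, the chunk's offset might be quite a long way before the point in the video where a claim within the chunk is actually made.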
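As a rough illustration of the idea (not the project's actual code), the sketch below keeps the start offset of each chunk and also records where each original piece of text begins inside the chunk, so that a claim found deep in a long chunk can be mapped back to a closer timestamp. The input shape (a list of dicts with `text` and `start` keys), the `Chunk` class, and the `form_chunks` and `offset_for_position` helpers are all assumptions made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    text: str
    start_offset: float  # offset (seconds) of the first piece in the chunk
    # (character position within text, offset in seconds) for each piece
    piece_offsets: list = field(default_factory=list)


def form_chunks(pieces, max_chars=1000):
    """Join transcript pieces into chunks of at most max_chars characters.

    `pieces` is assumed to be a list of {"text": str, "start": float} dicts.
    A single piece longer than max_chars becomes a chunk on its own.
    """
    chunks = []
    current = None
    for piece in pieces:
        text = piece["text"].strip()
        if not text:
            continue
        # Close the current chunk if adding this piece would make it too long.
        if current is not None and len(current.text) + 1 + len(text) > max_chars:
            chunks.append(current)
            current = None
        if current is None:
            current = Chunk(text, piece["start"], [(0, piece["start"])])
        else:
            position = len(current.text) + 1  # +1 for the joining space
            current.text += " " + text
            current.piece_offsets.append((position, piece["start"]))
    if current is not None:
        chunks.append(current)
    return chunks


def offset_for_position(chunk, char_pos):
    """Return the offset of the piece that contains character char_pos."""
    best = chunk.start_offset
    for position, offset in chunk.piece_offsets:
        if position > char_pos:
            break
        best = offset
    return best
```

If the LLM returns a claim as a quoted string, something like `offset_for_position(chunk, chunk.text.find(claim))` would give a timestamp nearer to the claim than the chunk's own start offset, provided the quoted claim appears verbatim in the chunk.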

dcorney commented 2 months ago

I closed this prematurely: the "extra" copies of youtube.py and vertex.py need addressing too. See #33