FullFact / health-misinfo-shared

Raphael health misinformation project, shared by Full Fact and Google
MIT License

YouTube chunking timestamps are not quite right #86

Closed andylolz closed 2 months ago

andylolz commented 2 months ago

The way YouTube chunking is done means that the timestamps are often only approximately correct.

In particular, when we carry the end of one chunk over as the start of the next chunk, we don’t rewind the start offset to match. This means an extracted claim might have come from text that falls outside the chunk’s start and end timestamps. It also means that when we jump to the timestamp in the original audio, the text and the audio don’t quite match up, which makes it hard for the user to orient themselves.

It’s much easier to navigate if the timestamps are exactly correct, which I think is possible if we keep track of sentences alongside their original timestamps.
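
A rough sketch of what that could look like, not the project's actual code: the `Sentence` and `Chunk` types, the `chunk_sentences` function, and the `size` and `overlap` parameters are all hypothetical names, and I'm assuming we can get a start and end time (in seconds) for each sentence from the transcript. The point is that each chunk takes its timestamps from the sentences it actually contains, so the overlap rewinds the start automatically.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Sentence:
    text: str
    start: float  # seconds from the start of the video (assumed available)
    end: float


@dataclass
class Chunk:
    text: str
    start: float
    end: float


def chunk_sentences(sentences: List[Sentence], size: int = 4, overlap: int = 1) -> List[Chunk]:
    """Group sentences into overlapping chunks with exact timestamps.

    Hypothetical helper: `size` is sentences per chunk, `overlap` is how
    many trailing sentences are repeated at the start of the next chunk.
    """
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")

    step = size - overlap
    chunks: List[Chunk] = []
    for i in range(0, len(sentences), step):
        window = sentences[i:i + size]
        chunks.append(
            Chunk(
                text=" ".join(s.text for s in window),
                # Timestamps come from the first and last sentence in the
                # window, including any overlapped sentence at the front.
                start=window[0].start,
                end=window[-1].end,
            )
        )
        if i + size >= len(sentences):
            break  # the last sentence is already covered
    return chunks
```

With this shape, any claim extracted from a chunk's text is guaranteed to lie between that chunk's `start` and `end`, because both the text and the timestamps are derived from the same list of sentences.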