FullFact / health-misinfo-shared

Raphael health misinformation project, shared by Full Fact and Google
MIT License

fix: more accurate chunk start and end offsets #87

Closed andylolz closed 2 months ago

andylolz commented 2 months ago

Fixes #86.

One problem with keeping track of the running sentence text: when we discard everything but the last 500 characters (in order to form overlapping blocks), we lose the timestamp offset information for the text we keep.

Here, I’ve changed the approach so that timestamp offset information is kept together with the sentence text.

One thing that is potentially weird here: the final chunk will not have an end offset, because we have no way of knowing it. The end of the final chunk is the end of the video, so if we knew the length of the video, we could set it to that. I’ve set it to None here, but we’ll need to amend other code to deal with this.
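
For illustration, here is a minimal sketch of the idea described above. It is not the code in this PR: the names `Sentence`, `Chunk`, `form_chunks`, `max_chars` and `overlap_chars` are hypothetical, and it assumes the transcript arrives as sentences that each carry a start offset in seconds. The overlap is kept as whole sentences (roughly the last 500 characters) rather than as a raw string, so each chunk can still report an accurate start offset, and the final chunk gets an end offset of `None`.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List, Optional


@dataclass
class Sentence:
    text: str
    start: float  # offset in seconds from the start of the video (assumed unit)


@dataclass
class Chunk:
    text: str
    start: float
    end: Optional[float]  # None for the final chunk: video length is unknown


def form_chunks(
    sentences: Iterable[Sentence],
    max_chars: int = 1000,      # hypothetical chunk size
    overlap_chars: int = 500,   # must be smaller than max_chars
) -> Iterator[Chunk]:
    """Yield overlapping chunks, keeping offsets attached to each sentence."""
    window: List[Sentence] = []
    for sentence in sentences:
        if window and sum(len(s.text) for s in window) >= max_chars:
            # The chunk ends where the next (not yet included) sentence starts.
            yield Chunk(
                text=" ".join(s.text for s in window),
                start=window[0].start,
                end=sentence.start,
            )
            # Keep only the trailing sentences that fit in the overlap budget.
            # Because whole Sentence objects are kept, their offsets survive.
            kept: List[Sentence] = []
            size = 0
            for s in reversed(window):
                if size + len(s.text) > overlap_chars:
                    break
                kept.insert(0, s)
                size += len(s.text)
            window = kept
        window.append(sentence)
    if window:
        # Final chunk: there is no following sentence, so no end offset.
        yield Chunk(
            text=" ".join(s.text for s in window),
            start=window[0].start,
            end=None,
        )
```

Callers that need a numeric end for the final chunk would have to substitute the video length where it is known, or handle `None` explicitly.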


Pull request checklist