cs50 / video.cs50.net

13 stars 6 forks source link

Be sure to parse VTT files that don't have identifiers #117

Closed coltonoscopy closed 7 years ago

coltonoscopy commented 7 years ago

via @dmalan and @dlloyd09

In case we ditch in the future :)

lukejacksonn commented 7 years ago

Given that they VTT files are keyless data structures, would it not make more sense to try keep the schema consistent across all files of that type, that is, decide on wether to go ahead and use ID's on all VTT files or none now. The integer IDs holds little value at the moment as they always correlate with the count/order of the items and we don't use them for anything; if that helps influence a decision at all.

Also, if https://github.com/cs50/video50/issues/91 is ever actioned then JSON50 would handle the conversion when generating the episode JSON for each episode (this would drastically decrease the complexity of the application state and eradicate some quite expensive fetching/parsing operations).

dmalan commented 7 years ago

We're going to phase out the above-timecode identifiers as unnecessary, since we don't use them for styling, and just having numbers isn't really useful anyway. So just want to ensure the parser doesn't break when our next batch of VTTs lacks identifiers. We'll likely go back and regenerate all existing VTTs soon too once the parser can accommodate.

lukejacksonn commented 7 years ago

The changes here https://github.com/cs50/video50/pull/121 means that the parsing function can handle the existing schema:

WEBVTT

1
00:00:00.000 --> 00:01:58.000
This is CS50.

2
00:01:58.000 --> 00:03:37.000
Course Overview

...

and the proposed schema (with no identifiers):

WEBVTT

00:00:00.000 --> 00:01:58.000
This is CS50.

00:01:58.000 --> 00:03:37.000
Course Overview

...

It does this by detecting the length of each entry after splitting by \n. If the length is equal to 3 [id,time,title] then it uses the old parser, if it is anything but 3 it will use the new parser (expecting a length of two [time,title])

dmalan commented 7 years ago

This doesn't sound safe, actually, since the payload of captions can technically contain \n. Can we instead split on \n\n and then, for each set of strings in the resulting array, check if the first such string is a time code (as defined by timecode --> timecode) or not? If it's not, you can then assume that the first such string is a unique identifier.

lukejacksonn commented 7 years ago

Sure, that does sound safer. Implemented here https://github.com/cs50/video50/pull/124.