byroot / pysrt

Python parser for SubRip (srt) files
GNU General Public License v3.0

Captions whose text begins with Line Separator character are parsed as blank string #87

Open ontl opened 3 years ago

ontl commented 3 years ago

I occasionally see SRTs in which one or two captions begin with the Line Separator character, \u2028 (U+2028). Those captions are incorrectly parsed as blank.

I believe the character originates in Word and is carried over when a transcript is copy-pasted into YouTube to use YouTube's transcript auto-timing function.

This character seems to act as a normal line break when in the middle or end of a caption; the issue only arises when it is the first character of the caption.
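For anyone trying to reproduce this outside of pysrt: the snippet below only shows how plain Python strings treat U+2028. It is a sketch of a possible cause, not a confirmed diagnosis; I have not verified which of these calls (if any) pysrt uses on this code path, and the variable names are made up for illustration.

```python
# Illustration only: how built-in str methods treat U+2028 (LINE SEPARATOR).
# Whether pysrt relies on these calls here is an assumption, not a verified fact.
LINE_SEPARATOR = "\u2028"

# A caption body that starts with U+2028 and also contains one mid-text.
caption_text = LINE_SEPARATOR + "First line" + LINE_SEPARATOR + "second line"

# str.splitlines() treats U+2028 as a line boundary, so a leading
# separator produces an empty first "line".
lines = caption_text.splitlines()
print(lines)

# U+2028 also counts as whitespace, so a string holding only that
# character strips down to an empty string.
print(LINE_SEPARATOR.isspace())
print(repr(LINE_SEPARATOR.strip()))
```

If a parser decides that a caption is over when it meets an empty or whitespace-only line, a leading U+2028 could plausibly look like that terminator, which would match the blank result described here.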

I think the parser ought to ignore this character.

VLC, for the record, ignores it and displays the caption normally.

Gotchas: It may make sense to pre-process the file, replacing \u2028 with a more compatible line break such as \n. We should be careful, though, not to inadvertently trigger the blank-caption behavior outlined in Issue 71 by having a caption start with \n.

Example SRT that exhibits this problem:

1
00:00:08,330 --> 00:00:13,653

This caption starts with the character
u2028, which causes PySRT to see it as blank.

2
00:00:13,653 --> 00:00:18,305
This caption has a u2028 here:
 which does not cause issues.

3
00:00:18,305 --> 00:00:22,906

This caption starts with a normal line break; VLC
and PySRT show it as blank as per Issue 71.

Output:

ontl commented 3 years ago

After some poking around, I've had success preprocessing my SRT files with .replace('\n\u2028', '\n'), which drops a Line Separator only when it immediately follows a line break.
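Here is that workaround as a small standalone helper, in case it is useful to others in the meantime. It is a minimal sketch, not a proposed patch: the function name strip_leading_line_separators is hypothetical, and handling \r\n line endings and runs of several separators is my own assumption about what real files may contain.

```python
import re

# Hypothetical helper, not part of pysrt: removes U+2028 characters that sit
# directly after a line break, i.e. at the very start of a line.
# Separators in the middle of a line are left untouched.
_LEADING_SEPARATORS = re.compile("(\r?\n)\u2028+")


def strip_leading_line_separators(text: str) -> str:
    """Return text with line-initial U+2028 characters removed."""
    # Keep the original line break (group 1) and drop only the separators.
    return _LEADING_SEPARATORS.sub(r"\1", text)


sample = (
    "1\n"
    "00:00:08,330 --> 00:00:13,653\n"
    "\u2028This caption starts with the character\n"
    "2\n"
    "00:00:13,653 --> 00:00:18,305\n"
    "This caption has one here:\u2028which does not cause issues.\n"
)

cleaned = strip_leading_line_separators(sample)
print(repr(cleaned))
```

Because the separator is deleted rather than replaced with \n, this should not create a caption that starts with a normal line break, so it avoids the Issue 71 case. The cleaned text can then be handed to the parser as usual.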

Will look through the pysrt code and submit a PR if I can find the best place/method to do this. Suggestions welcome.