Open ontl opened 3 years ago
After some poking around, I've had success preprocessing my srt files with .replace('\n\u2028', '\n')
Will look through the pysrt code and submit a PR if I can find the best place/method to do this. Suggestions welcome.
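For anyone who wants to try the workaround, here is a minimal sketch of that preprocessing step using only the standard library. The function name and the sample subtitle text are hypothetical, made up for illustration; only the `.replace('\n\u2028', '\n')` call comes from the comment above.

```python
LINE_SEPARATOR = "\u2028"  # Unicode Line Separator


def strip_leading_line_separators(srt_text: str) -> str:
    """Drop U+2028 where it directly follows a newline; leave it elsewhere."""
    return srt_text.replace("\n" + LINE_SEPARATOR, "\n")


# Hypothetical sample: caption 1 starts with U+2028, caption 2 has it mid-text.
sample = (
    "1\n"
    "00:00:01,000 --> 00:00:02,000\n"
    "\u2028Hello there\n"
    "\n"
    "2\n"
    "00:00:03,000 --> 00:00:04,000\n"
    "General\u2028Kenobi\n"
)

cleaned = strip_leading_line_separators(sample)
```

The cleaned string could then be handed to the parser, for example through `pysrt.from_string`, assuming the text has already been read and decoded. Note that this only touches a U+2028 that sits directly after a newline, so mid-caption occurrences are left alone.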
I occasionally see SRTs in which 1 or 2 captions begin with the Line Separator character, u2028. Those captions get incorrectly parsed as blank.
I believe the character originates in Word, and is carried over when a transcript is copy-pasted to YouTube to use YouTube's transcript auto-timing function.
This character seems to act as a normal line break when in the middle or end of a caption; the issue only arises when it is the first character of the caption.
I think the parser should ignore this character.
VLC, for the record, ignores it and displays the caption normally.
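The symptoms described here (a normal line break mid-caption, a blank caption when the character comes first) match how Python's own `str.splitlines()` treats U+2028. Whether pysrt's parsing actually goes through a code path like this is an assumption and has not been checked against its source; the snippet below only shows the standard-library behavior, with made-up caption text.

```python
# Hypothetical caption bodies, for illustration only.
caption_leading = "\u2028Hello there"  # U+2028 is the first character
caption_middle = "Hello\u2028there"    # U+2028 sits between two words

# str.splitlines() counts U+2028 as a line boundary.
leading_lines = caption_leading.splitlines()
middle_lines = caption_middle.splitlines()

print(leading_lines)  # the first element is an empty string
print(middle_lines)   # two non-empty lines
```

If a parser treats an empty line as the end of a subtitle block, the empty first element in the leading case would end the caption before any text is read, which would explain the blank result.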
Gotchas: It may make sense to pre-process the file, replacing u2028 with a more compatible line break like \n. We should be careful, though, not to inadvertently trigger the blank line state outlined in Issue 71 by having a caption start with \n.
Example SRT that exhibits this problem:
Output: