cdown / srt

A simple library and set of tools for parsing, modifying, and composing SRT files.
MIT License
473 stars 44 forks source link

SRT tooling should allow using non-system encodings #27

Closed adrianholovaty closed 7 years ago

adrianholovaty commented 8 years ago

Hi there,

Just a heads up — I just tried using this library on a collection of SRT files I have, and I'm getting an SRTParseError exception on each file. I've attached one of the files here:

srttest.txt

cdown commented 8 years ago

Hey, sorry, I didn't see this until now.

The reason you're getting an SRTParseError is because this file contains a UTF-8 BOM, which is unsupported (srt tools open files with encoding utf-8, not utf-8-sig).

I've been considering adding some way for people to flag other encodings with the command line tools, let me look into it.

cdown commented 8 years ago

(Specifically, it is because argparse.FileType() does not accept an encoding parameter)

cdown commented 8 years ago

I think the most reasonable way to solve this is to set your encoding correctly in your environment, to be honest.

Here is a half-baked attempt at solving the issue with an -e parameter, but doing this for output is a bit more ugly (requiring some copying of code for now):

commit 32063fa63de57a9a7a452a907161ac3636a03067
Author: Chris Down <chris@chrisdown.name>
Date:   Tue Sep 27 00:22:36 2016 +0100

    Add -e/--encoding

diff --git srt_tools/srt-fix-subtitle-indexing srt_tools/srt-fix-subtitle-indexing
index 2c44419..a01d6c0 100755
--- srt_tools/srt-fix-subtitle-indexing
+++ srt_tools/srt-fix-subtitle-indexing
@@ -6,8 +6,8 @@ import srt

 def main():
     args = srt_tools.utils.basic_parser().parse_args()
-    subtitles_in = srt.parse(args.input.read())
-    output = srt.compose(subtitles_in, strict=args.strict)
+    srt_tools.utils.set_basic_args(args)
+    output = srt.compose(args.input, strict=args.strict)
     args.output.write(output)

diff --git srt_tools/utils.py srt_tools/utils.py
index f08a5d6..13602a4 100755
--- srt_tools/utils.py
+++ srt_tools/utils.py
@@ -1,32 +1,39 @@
 #!/usr/bin/env python

 import argparse
+import srt
 import sys
 import logging
 import itertools
+import collections
+
+
+log = logging.getLogger(__name__)

 def basic_parser(multi_input=False):
     parser = argparse.ArgumentParser(description=__doc__)

+    # Cannot use argparse.FileType as we need to know the encoding from the
+    # args
+
     if multi_input:
         parser.add_argument(
             '--input', '-i', metavar='FILE',
-            action='append', type=argparse.FileType('r'),
+            action='append',
             help='the files to process',
             required=True,
         )
     else:
         parser.add_argument(
             '--input', '-i', metavar='FILE',
-            default=sys.stdin, type=argparse.FileType('r'),
+            default=sys.stdin,
             help='the file to process (default: stdin)',
         )

     parser.add_argument(
         '--output', '-o', metavar='FILE',
         default=sys.stdout,
-        type=argparse.FileType('w'),
         help='the file to write to (default: stdout)',
     )
     parser.add_argument(
@@ -40,9 +47,26 @@ def basic_parser(multi_input=False):
         const=logging.DEBUG, default=logging.WARNING,
         help='enable debug logging',
     )
+    parser.add_argument(
+        '--encoding', '-e',
+        help='the encoding to read/write files in (default: '
+             'from your environment)',
+    )
     return parser

+def set_basic_args(args):
+    if isinstance(args.input, collections.MutableSequence):
+        for i, input_filename in enumerate(args.input):
+            with open(input_filename, encoding=args.encoding) as input_f:
+                args.input[i] = srt.parse(input_f.read())
+    elif args.input is sys.stdin:
+        log.warning('-e/--encoding has no effect on %s', stream)
+    else:
+        with open(args.input, encoding=args.encoding) as input_f:
+            args.input = srt.parse(input_f.read())
+
+
 def sliding_window(seq, width=2):
     seq_iter = iter(seq)
     sliced = tuple(itertools.islice(seq_iter, width))
cdown commented 7 years ago

Added in dea16c2. Confirmed it works with your test input:

srt encoding % PYTHONPATH=. srt_tools/srt-fixed-timeshift --seconds 5 -i ~/Downloads/srttest.txt -o /tmp/srttest-out.srt -e utf-8-sig
srt encoding % diff -u ~/Downloads/srttest.txt /tmp/srttest-out.srt | head -20                                                       
--- /home/cdown/Downloads/srttest.txt   2016-12-25 09:44:45.677276478 +0000
+++ /tmp/srttest-out.srt    2016-12-25 09:44:55.577107368 +0000
@@ -1,169 +1,169 @@
 1
-00:00:04,500 --> 00:00:07,250
+00:00:09,500 --> 00:00:12,250
 Hello, my name is Yorgui Loeffler, and I'm in Quebec, dang it!

 2
-00:00:07,670 --> 00:00:10,930
+00:00:12,670 --> 00:00:15,930
 In this lesson, I will be showing you a few things that are important to me.

 3
-00:00:13,720 --> 00:00:15,870
+00:00:18,720 --> 00:00:20,870
 Always scratch with a pick!

 4
-00:00:45,200 --> 00:00:49,290

Merry Christmas! :-)