cdown / srt

A simple library and set of tools for parsing, modifying, and composing SRT files.
MIT License

Add support for multithreaded parsing #42

cdown closed this 6 years ago

cdown commented 6 years ago
In [6]: %lprun -f srt.parse list(srt.parse(srt_data))
Timer unit: 1e-06 s

Total time: 0.023734 s
File: /home/cdown/git/srt/srt.py
Function: parse at line 288

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   288                                           def parse(srt):
   289                                               r'''
   290                                               Convert an SRT formatted string (in Python 2, a :class:`unicode` object) to
   291                                               a :term:`generator` of Subtitle objects.
   292                                           
   293                                               This function works around bugs present in many SRT files, most notably
   294                                               that it is designed to not bork when presented with a blank line as part of
   295                                               a subtitle's content.
   296                                           
   297                                               .. doctest::
   298                                           
   299                                                   >>> subs = parse("""\
   300                                                   ... 422
   301                                                   ... 00:31:39,931 --> 00:31:41,931
   302                                                   ... Using mainly spoons,
   303                                                   ...
   304                                                   ... 423
   305                                                   ... 00:31:41,933 --> 00:31:43,435
   306                                                   ... we dig a tunnel under the city and release it into the wild.
   307                                                   ...
   308                                                   ... """)
   309                                                   >>> list(subs)  # doctest: +ELLIPSIS
   310                                                   [Subtitle(...index=422...), Subtitle(...index=423...)]
   311                                           
   312                                               :param str srt: Subtitles in SRT format
   313                                               :returns: The subtitles contained in the SRT file as py:class:`Subtitle`
   314                                                         objects
   315                                               :rtype: :term:`generator` of :py:class:`Subtitle` objects
   316                                               '''
   317                                           
   318         1          2.0      2.0      0.0      expected_start = 0
   319                                           
   320      1117       2901.0      2.6     12.2      for match in SRT_REGEX.finditer(srt):
   321      1116        826.0      0.7      3.5          actual_start = match.start()
   322      1116       1033.0      0.9      4.4          _raise_if_not_contiguous(srt, expected_start, actual_start)
   323                                           
   324      1116       1119.0      1.0      4.7          raw_index, raw_start, raw_end, proprietary, content = match.groups()
   325      1116        579.0      0.5      2.4          yield Subtitle(
   326      1116       7408.0      6.6     31.2              index=int(raw_index), start=srt_timestamp_to_timedelta(raw_start),
   327      1116       6455.0      5.8     27.2              end=srt_timestamp_to_timedelta(raw_end),
   328      1116       2641.0      2.4     11.1              content=content.replace('\r\n', '\n'), proprietary=proprietary,
   329                                                   )
   330                                           
   331      1116        768.0      0.7      3.2          expected_start = match.end()
   332                                           
   333         1          2.0      2.0      0.0      _raise_if_not_contiguous(srt, expected_start, len(srt))

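This shows that building each Subtitle (lines 325-328) accounts for roughly 70% of the runtime, mostly on the two lines that convert timestamps, so we could probably gain a lot by doing the per-iteration work in parallel.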

cdown commented 6 years ago

Seems the workload per worker is way too small to pay for the multiprocessing overhead:

Before:

In [3]: %timeit -n 100 list(srt.parse(srt_data))
12.3 ms ± 68.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

After:

In [15]: %timeit -n 100 list(srt.parse(srt_data))
32.7 ms ± 93 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

srt develop % gd
diff --git srt.py srt.py
index 2d620ae..764f80a 100755
--- srt.py
+++ srt.py
@@ -7,6 +7,7 @@ import functools
 import re
 from datetime import timedelta
 import logging
+import multiprocessing

 log = logging.getLogger(__name__)
@@ -317,20 +318,29 @@ def parse(srt):

     expected_start = 0

-    for match in SRT_REGEX.finditer(srt):
-        actual_start = match.start()
-        _raise_if_not_contiguous(srt, expected_start, actual_start)
+    # _sre.SRE_Match objects are not serialisable by pickle
+    match_iter = (match.groups() for match in SRT_REGEX.finditer(srt))
+
+    pool = multiprocessing.Pool(processes=multiprocessing.cpu_count())
+    subs = pool.imap(_parse_single, match_iter, chunksize=10)
+    pool.close()
+    return subs

-        raw_index, raw_start, raw_end, proprietary, content = match.groups()
-        yield Subtitle(
-            index=int(raw_index), start=srt_timestamp_to_timedelta(raw_start),
-            end=srt_timestamp_to_timedelta(raw_end),
-            content=content.replace('\r\n', '\n'), proprietary=proprietary,
-        )

-        expected_start = match.end()
+def _parse_single(match):
+    r'''
+    Given the groups of a regex match of an SRT block, convert them to a :py:class:`Subtitle`.

-    _raise_if_not_contiguous(srt, expected_start, len(srt))
+    :param tuple match: The groups of a regex match of an SRT block
+    :returns: The parsed subtitle
+    :rtype: :py:class:`Subtitle`
+    '''
+    raw_index, raw_start, raw_end, proprietary, content = match
+    return Subtitle(
+        index=int(raw_index), start=srt_timestamp_to_timedelta(raw_start),
+        end=srt_timestamp_to_timedelta(raw_end),
+        content=content.replace('\r\n', '\n'), proprietary=proprietary,
+    )

 def _raise_if_not_contiguous(srt, expected_start, actual_start):
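
Note that the diff above drops the `_raise_if_not_contiguous` checks. A minimal sketch of one way to keep them: do the contiguity check serially while generating the `match.groups()` tuples, and hand only the per-block conversion to the mapper. `BLOCK_RE`, `Subtitle`, `to_timedelta`, `parse_groups`, and `iter_checked_groups` here are hypothetical, simplified stand-ins, not the definitions in srt.py, and the `mapper` parameter is an assumption made for illustration.

```python
import re
from collections import namedtuple
from datetime import timedelta

# Hypothetical, simplified stand-ins for srt.py's Subtitle and SRT_REGEX.
# This regex requires every block to end with a blank line.
Subtitle = namedtuple("Subtitle", "index start end content")
BLOCK_RE = re.compile(r"(\d+)\n(\S+) --> (\S+)\n(.*?)\n\n", re.DOTALL)


def to_timedelta(timestamp):
    """Convert 'HH:MM:SS,mmm' to a timedelta."""
    hours, minutes, rest = timestamp.split(":")
    seconds, millis = rest.split(",")
    return timedelta(
        hours=int(hours), minutes=int(minutes),
        seconds=int(seconds), milliseconds=int(millis),
    )


def parse_groups(groups):
    """Per-block work: takes a plain tuple, so it can be pickled."""
    raw_index, raw_start, raw_end, content = groups
    return Subtitle(
        int(raw_index), to_timedelta(raw_start), to_timedelta(raw_end),
        content.replace("\r\n", "\n"),
    )


def iter_checked_groups(srt):
    """Yield match.groups() tuples, raising if any input is left unmatched."""
    expected_start = 0
    for match in BLOCK_RE.finditer(srt):
        if match.start() != expected_start:
            raise ValueError("unparseable text at offset %d" % expected_start)
        expected_start = match.end()
        yield match.groups()
    if expected_start != len(srt):
        raise ValueError("unparseable text at offset %d" % expected_start)


def parse(srt, mapper=map):
    """Apply parse_groups to each block using the given map-like callable."""
    return mapper(parse_groups, iter_checked_groups(srt))
```

With the default `mapper=map` this behaves serially. Something like `functools.partial(pool.imap, chunksize=200)` could be passed instead to reproduce the experiment; how an exception raised inside the generator surfaces through a pool is not exercised here.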

cdown commented 6 years ago

Better with chunksize=200, but still not worth it:

In [3]: %timeit -n 100 list(srt.parse(srt_data))
26.3 ms ± 49.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
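
A rough way to see why a larger chunksize helps only so much: it reduces the number of tasks dispatched, but every argument tuple and every result still has to be pickled and unpickled. The sketch below compares the per-item conversion on its own against the same conversion wrapped in the pickle round trips a process pool would need. `parse_groups` and `round_trip` are hypothetical helpers written for this comparison, and the measurement ignores pipe writes, locking, and scheduling, so it understates the real overhead.

```python
import pickle
import timeit
from datetime import timedelta

# One block's worth of regex groups, as in the diff above:
# (raw_index, raw_start, raw_end, proprietary, content)
groups = ("422", "00:31:39,931", "00:31:41,931", "", "Using mainly spoons,")


def to_timedelta(timestamp):
    """Convert 'HH:MM:SS,mmm' to a timedelta (simplified stand-in)."""
    hours, minutes, rest = timestamp.split(":")
    seconds, millis = rest.split(",")
    return timedelta(
        hours=int(hours), minutes=int(minutes),
        seconds=int(seconds), milliseconds=int(millis),
    )


def parse_groups(g):
    """The per-item work a worker would do; returns a plain tuple."""
    raw_index, raw_start, raw_end, proprietary, content = g
    return (
        int(raw_index), to_timedelta(raw_start), to_timedelta(raw_end),
        content.replace("\r\n", "\n"), proprietary,
    )


def round_trip(g):
    """Same work plus the serialisation a process pool needs per item."""
    args = pickle.loads(pickle.dumps(g))       # parent -> worker
    result = parse_groups(args)
    return pickle.loads(pickle.dumps(result))  # worker -> parent


n = 10000
direct = timeit.timeit(lambda: parse_groups(groups), number=n)
with_pickle = timeit.timeit(lambda: round_trip(groups), number=n)
print("direct: %.1f ms, with pickling: %.1f ms" % (direct * 1e3, with_pickle * 1e3))
```

If the two numbers are of the same order, the serialisation alone costs about as much as the work being distributed, which would be consistent with the timings above.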

cdown commented 6 years ago

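Not doing this, based on the results above: even with chunksize=200, the multiprocessing version is about twice as slow as the original (26.3 ms vs 12.3 ms).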