In the beginning, optimization wasn't the focus; the goal was simply automating spell checking. I honestly didn't expect people to find it and for it to become as popular as it has 🙂. I am aware of the performance impact, as I do have a few projects that are quite large, and yes, I notice the speed issue.
I should note that the built-in Python filters do not have to be used. If you are using Aspell, and the native Aspell filters are sufficient, you can just put Aspell into the correct mode and omit the Python filters entirely; this could potentially make things much faster. With that said, I often use the Python filters because PySpelling solves two issues I had: 1) I wanted to automate all the spell checking in CI, and 2) I felt that, at times, Aspell's native filter modes were not sufficient for omitting certain content, at least for me.
PySpelling makes setting up automated spelling easier and in addition, allows you to use custom Python filters, but yeah, the generic interface and Python filtering process can be slower than the optimized Aspell filters written in C.
I imagine there are a few areas of possible improvement. One is how often content gets encoded and decoded to and from Unicode.
In general, we pass bytes content to the spell checkers, and content is read from files as bytes as well. Filters processing content for the first time need to work on bytes because we don't know what the encoding is yet, and the encoding can be specified in headers for a given file type in a special way. For simplicity, all filters just take bytes buffers, but not all filters actually require bytes input to parse the data once the encoding is known.
Some filters can handle Unicode fine and could be made Unicode aware. What I mean by this is that, once the encoding is known, they would skip handling the buffer as bytes, and we could keep buffers as Unicode most of the time. Filters would have to advertise that they are Unicode ready and provide an entry point for Unicode processing (assuming the encoding has already been identified by whatever the first filter was), and they would of course need to return content as Unicode as well. I imagine this could reduce the encoding/decoding we do internally and give a speed boost. It does add complexity, but it could yield a noticeable improvement.
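To sketch the idea (a rough illustration only; the class, attribute, and method names here are made up for the example, not the actual filter API):

```python
# Purely illustrative sketch; not PySpelling's actual filter API.

class UnicodeAwareFilter:
    """A filter that advertises it can work on str once the encoding is known."""

    UNICODE_READY = True  # advertise Unicode support to the pipeline

    def filter_unicode(self, text: str) -> str:
        # Operate directly on str; no encode/decode round trip needed.
        return text.replace('\r\n', '\n')


def run_pipeline(raw: bytes, encoding: str, filters):
    """Decode once, then keep the buffer as str for Unicode-ready filters."""
    text = raw.decode(encoding)
    for f in filters:
        if getattr(f, 'UNICODE_READY', False):
            text = f.filter_unicode(text)
        else:
            # Legacy filter: fall back to a bytes interface (hypothetical
            # `filter_bytes`), paying the encode/decode round trip.
            text = f.filter_bytes(text.encode(encoding)).decode(encoding)
    return text
```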
Another area that may cause overhead is the fact that PySpelling provides specific context for specific blocks. This may give you the function under which a docstring is found, or the HTML element under which some text is found. Because we are often filtering out blocks of content, information such as line numbers is lost, so we provide what context we can; but in order to do this, each block is processed separately, and more calls to the spell checker means more overhead. We could add an option to group all content for a given file and make a single call at the end. You'd lose context, but it would potentially be faster; the sketch below illustrates the trade-off.
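For illustration only (not the actual internals), assuming a hunspell-style list-mode invocation where `-l` prints misspelled words read from stdin:

```python
# Illustration of the per-block vs. per-file trade-off, not PySpelling's code.
import subprocess

HUNSPELL = ['hunspell', '-l', '-i', 'utf-8', '-d', 'en_US']

def check(content: bytes) -> set:
    """Run one spell-checker process over a chunk of content."""
    result = subprocess.run(HUNSPELL, input=content, capture_output=True)
    return set(result.stdout.split())

blocks = [
    (b'Some text flga', 'html>body>ol>li>p'),
    (b'helo', 'html>body>ol>li'),
]

# Per block: one spell-checker process per block, so each misspelling can be
# reported with its context, but the process startup cost is paid repeatedly.
for content, context in blocks:
    for word in check(content):
        print(f'{context}: {word.decode()}')

# Grouped: one spell-checker process for the whole file. Much less overhead,
# but the per-block context is gone.
for word in check(b'\n'.join(content for content, _ in blocks)):
    print(word.decode())
```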
The draw of writing this was that I could process content in advanced ways via third-party libraries without having to write low-level parsing for various file types; I could just use off-the-shelf libs. On the other hand, you run into the issue that the filters are not highly specialized and super-optimized, but they are flexible and powerful.
Anyway, to answer your question more directly: I am aware of the performance issues and have considered making improvements, but it has been lower priority, as not many people have complained about it yet and it has been sufficient for my needs up until now. It does what it is supposed to do, but yes, the speed on very large projects has been a mild annoyance for me.
@facelessuser There are perhaps some easy wins to be had by parallelising certain tasks. With some unpolished use of `multiprocessing` (see https://github.com/facelessuser/pyspelling/compare/master...benjamb:pyspelling:pool) to run spell-checker processes in parallel, I'm able to cut the time at least in half.
Time on laptop without changes:
```
$ time pyspelling
Spelling check passed :)
real 2m12.400s
user 1m54.921s
sys 0m17.336s
```
Time on laptop with changes:
```
$ time pyspelling -j 4
Spelling check passed :)
real 1m2.981s
user 2m21.077s
sys 0m24.269s
```
I see much better numbers when running against CI infra (where single-core performance is not so great), dropping from ~9 minutes to ~2 minutes.
I did get slightly better performance when wrapping the entirety of the `_spelling_pipeline()` method with a `concurrent.futures.ProcessPoolExecutor`, but that was a more invasive change (certain methods needed to be converted to, or duplicated as, `@staticmethod`s to avoid pickling errors).
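Roughly, the constraint looks like this (a sketch, not the actual patch): a bound method of an object holding unpicklable state can't be shipped to worker processes, so the work function has to be something picklable, such as a module-level function or a `@staticmethod`.

```python
# Minimal illustration of the pickling constraint, not the actual patch.
from concurrent.futures import ProcessPoolExecutor


def check_file(path):
    """Worker: lives at module level so it can be pickled for child processes.

    Everything the worker needs (filters, spell-checker handles) is created
    inside the worker process rather than captured from the parent object.
    """
    return path, []  # placeholder: (file, misspelled words)


class Checker:
    def __init__(self):
        self.log = open('spellcheck.log', 'w')  # not picklable

    def run(self, paths, jobs=4):
        with ProcessPoolExecutor(max_workers=jobs) as pool:
            # Passing a bound method like `self.check` here would try to
            # pickle `self` (and its open file handle) and fail;
            # `check_file` pickles fine.
            for path, words in pool.map(check_file, paths):
                self.log.write(f'{path}: {words}\n')


if __name__ == '__main__':
    Checker().run(['docs/index.md', 'docs/usage.md'], jobs=4)
```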
I assume this makes a bit of a mess when it comes to logging? I imagine you would have spelling errors from different files interwoven.
I will state, I'm not against the idea of multiprocessing. If we did this, I think it would be something that you could optionally turn on and specify the number of processes. Additionally, we'd need a sane way to output results. Interleaving results of multiple files may be pretty painful.
@benjamb I tried running your changes locally. If we were to do this, we'd have to make sure it was pretty solid. Currently, this breaks unit tests. While the first error may just be an "out of order" issue (not sure yet), the second is much more troubling:
> ForkingPickler(file, protocol).dump(obj) E RecursionError: maximum recursion depth exceeded while calling a Python object
```
➜ pyspelling git:(master) ✗ python3 -m pytest
==================================================================================== test session starts ====================================================================================
platform darwin -- Python 3.11.3, pytest-7.1.3, pluggy-1.0.0
rootdir: /Users/facelessuser/Code/github/pyspelling
plugins: xonsh-0.13.4
collected 120 items
tests/test_config.py .............. [ 11%]
tests/test_versions.py .... [ 15%]
tests/filters/test_context.py .. [ 16%]
tests/filters/test_cpp.py ............. [ 27%]
tests/filters/test_html.py .....................................F [ 59%]
tests/filters/test_javascript.py ..... [ 63%]
tests/filters/test_markdown.py .. [ 65%]
tests/filters/test_odf.py F..... [ 70%]
tests/filters/test_ooxml.py .... [ 73%]
tests/filters/test_python.py .................. [ 88%]
tests/filters/test_stylesheets.py .... [ 91%]
tests/filters/test_text.py .... [ 95%]
tests/filters/test_url.py .. [ 96%]
tests/filters/test_xml.py ... [ 99%]
tests/flow_control/test_wildcard.py . [100%]
========================================================================================= FAILURES ==========================================================================================
_____________________________________________________________________________ TestHTMLContext.test_html_context _____________________________________________________________________________
self = <tests.filters.test_html.TestHTMLContext testMethod=test_html_context>
def test_html_context(self):
"""Test HTML context."""
template = self.dedent(
"""
<html>
<head>
</head>
<body>
<ol>
<li>
<p>Some <code>dkdd</code> text
flga</p>
<div><pre><code>kdjk</code></pre></div>
helo
</li>
</ol>
</html>
"""
)
self.mktemp('test.txt', template, 'utf-8')
> self.assert_context(
'.html.yml',
[
'test.txt: html>body>ol>li',
'test.txt: html>body>ol>li>p'
]
)
tests/filters/test_html.py:1553:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
tests/util.py:169: in assert_context
self.assertEqual([(f'{self.tempdir}/{i}' if i else i) for i in expected[:]], context)
E AssertionError: Lists differ: ['@te[34 chars]dy>ol>li', '@test_35606_tmp_dir/test.txt: html>body>ol>li>p'] != ['@te[34 chars]dy>ol>li>p', '@test_35606_tmp_dir/test.txt: html>body>ol>li>p']
E
E First differing element 0:
E '@test_35606_tmp_dir/test.txt: html>body>ol>li'
E '@test_35606_tmp_dir/test.txt: html>body>ol>li>p'
E
E - ['@test_35606_tmp_dir/test.txt: html>body>ol>li',
E + ['@test_35606_tmp_dir/test.txt: html>body>ol>li>p',
E ? ++
E
E '@test_35606_tmp_dir/test.txt: html>body>ol>li>p']
----------------------------------------------------------------------------------- Captured stdout call ------------------------------------------------------------------------------------
Using hunspell to spellcheck html
Running Task: html...
> Processing: @test_35606_tmp_dir/test.txt
b'helo '
Command: ['/usr/local/bin/hunspell', '-l', '-i', 'utf-8', '-d', 'en_US']
b'Some text\n flga '
Command: ['/usr/local/bin/hunspell', '-l', '-i', 'utf-8', '-d', 'en_US']
Context: @test_35606_tmp_dir/test.txt: html>body>ol>li>p
Context: @test_35606_tmp_dir/test.txt: html>body>ol>li>p
__________________________________________________________________________________ TestODFFilter.test_fodt __________________________________________________________________________________
self = <tests.filters.test_odf.TestODFFilter testMethod=test_fodt>
def test_fodt(self):
"""Test `fodt` files."""
config = self.dedent(
"""
matrix:
- name: fodt
sources:
- 'tests/**/*.fodt'
aspell:
lang: en
d: en_US
hunspell:
d: en_US
pipeline:
- pyspelling.filters.odf
"""
).format(self.tempdir)
self.mktemp('.fodt.yml', config, 'utf-8')
> self.assert_spellcheck('.fodt.yml', ['tihs', 'smoe', 'txet'])
tests/filters/test_odf.py:48:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
tests/util.py:204: in assert_spellcheck
for results in spellcheck(
pyspelling/__init__.py:688: in spellcheck
for result in spellchecker.run_task(task, source_patterns=sources):
pyspelling/__init__.py:345: in run_task
yield from self._spelling_pipeline(sources, options, personal_dict)
pyspelling/__init__.py:161: in _spelling_pipeline
with Pool(self.jobs, initializer=self._run_checks, initargs=(inputs, outputs)):
/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/multiprocessing/context.py:119: in Pool
return Pool(processes, initializer, initargs, maxtasksperchild,
/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/multiprocessing/pool.py:215: in __init__
self._repopulate_pool()
/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/multiprocessing/pool.py:306: in _repopulate_pool
return self._repopulate_pool_static(self._ctx, self.Process,
/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/multiprocessing/pool.py:329: in _repopulate_pool_static
w.start()
/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/multiprocessing/process.py:121: in start
self._popen = self._Popen(self)
/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/multiprocessing/context.py:288: in _Popen
return Popen(process_obj)
/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/multiprocessing/popen_spawn_posix.py:32: in __init__
super().__init__(process_obj)
/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/multiprocessing/popen_fork.py:19: in __init__
self._launch(process_obj)
/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/multiprocessing/popen_spawn_posix.py:47: in _launch
reduction.dump(process_obj, fp)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
obj = <SpawnProcess name='SpawnPoolWorker-138' parent=35606 initial daemon>, file = <_io.BytesIO object at 0x10501a2a0>, protocol = None
def dump(obj, file, protocol=None):
'''Replacement for pickle.dump() using ForkingPickler.'''
> ForkingPickler(file, protocol).dump(obj)
E RecursionError: maximum recursion depth exceeded while calling a Python object
/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/multiprocessing/reduction.py:60: RecursionError
----------------------------------------------------------------------------------- Captured stdout call ------------------------------------------------------------------------------------
Using hunspell to spellcheck fodt
Running Task: fodt...
> Processing: tests/test_files/odf/test.fodt
================================================================================== short test summary info ==================================================================================
FAILED tests/filters/test_html.py::TestHTMLContext::test_html_context - AssertionError: Lists differ: ['@te[34 chars]dy>ol>li', '@test_35606_tmp_dir/test.txt: html>body>ol>li>p'] != ['@t...
FAILED tests/filters/test_odf.py::TestODFFilter::test_fodt - RecursionError: maximum recursion depth exceeded while calling a Python object
========================================================================= 2 failed, 118 passed in 63.74s (0:01:03) ==========================================================================
```
I've spawned an issue to more seriously look into multiprocessing. I think it is a good idea, though I imagine it may require some refactoring to do properly.
We could potentially utilise the `map()` method of either `Pool` or `ProcessPoolExecutor` to guarantee ordered results. Assuming you don't beat me to it, I'll look into this later in the week.
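To illustrate the ordering guarantee (a toy example, not the real pipeline): `map()` returns results in the same order as its inputs, regardless of which worker finishes first.

```python
# Illustrative only: map() yields results in input order, so per-file output
# never interleaves even though the checks run in parallel.
from multiprocessing import Pool


def check_file(path):
    # Placeholder for running the real filter pipeline over one file.
    return path, ['teh', 'helo']


if __name__ == '__main__':
    files = ['docs/a.md', 'docs/b.md', 'docs/c.md']
    with Pool(processes=4) as pool:
        for path, words in pool.map(check_file, files):
            print(path, words)
```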
I'm not quite sure at what level the multiprocessing is happening in your example. I imagine we want it to be per file, so all the results in a given file would be ordered. I care less about the files themselves being in a particular order, as long as the results for each file are aggregated in a payload and returned together. There are definitely things to explore.
@facelessuser I've pushed my example of using `ProcessPoolExecutor` with `map()`. Results are ordered, but as previously noted it's somewhat more invasive: https://github.com/facelessuser/pyspelling/compare/master...benjamb:pyspelling:concurrent
I ran the tests locally with hunspell and Python 3.12: they all passed.
@benjamb It does seem to allow the tests to pass. I will likely explore this a bit more. I'm not thrilled by how invasive the approach is, and I imagine the code could be made more multiprocess-aware with some refactoring, but I think this is a good experiment showing that there is a benefit to digging deeper into this area.
Sure, I've intentionally not submitted either approach as a pull request as they are both quite rough around the edges :sweat_smile:
Yep, ideally I think I'd like to see files being processed in parallel, not blocks within a file. I imagine each process needs to be operating on its own instance of plugins independently. But I like the general idea 🙂.
I think we may be able to have some faster processing in the next version. I've been playing around with threading and multiprocessing and I think I'm starting to converge on something I am happy with.
In the next version, 2.10, you will be able to set parallel jobs via the global `jobs` option in `.pyspelling.yaml` or via the command line with `--jobs 8` or `-j 8`, etc. The command line always overrides what is in the config.
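For example, something along these lines (the exact placement of the key here is just for illustration):

```yaml
# Example only: a global `jobs` setting alongside the usual matrix
# (placement assumed from the "global option" description above).
jobs: 8

matrix:
- name: markdown
  sources:
  - 'docs/**/*.md'
  pipeline:
  - pyspelling.filters.markdown
```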
For now, I'm going to consider this closed via https://github.com/facelessuser/pyspelling/pull/186. This should greatly decrease processing time when used.
@facelessuser Amazing! Works brilliantly. Thanks for following up on this!
@benjamb You're welcome! Using multi-processing is definitely a great improvement, so thanks for the great idea. It cut execution time in one of my larger projects by about a third using `-j 8`. Obviously, mileage will vary depending on the system and payload.
Now, when a task is being processed, it is broken up into jobs, each job being its own file. Each job generates its own spelling class object to handle it. This allows multiple files to be processed simultaneously.
One important thing is that we made sure `-j 1` skips multi-processing, as that extra overhead made no sense for a single-threaded, single-process job and was a detriment to default performance.
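In rough shape, that looks something like this (illustrative only, not the actual 2.10 code):

```python
# Illustrative only, not the actual 2.10 implementation.
from concurrent.futures import ProcessPoolExecutor


def check_one_file(path):
    """Each job builds its own spelling pipeline objects for a single file."""
    # ... construct filters and the spell checker for this file, run them ...
    return path, []  # placeholder: (file, misspelled words)


def run_task(files, jobs=1):
    if jobs == 1:
        # Serial path: no pool at all, so the default single-job run pays
        # no multiprocessing overhead.
        return [check_one_file(f) for f in files]
    with ProcessPoolExecutor(max_workers=jobs) as pool:
        return list(pool.map(check_one_file, files))
```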
I played around with both threading and multi-processing, and while both were faster than the default, true concurrency with multi-processing obviously won out.
Hi there,
we, the Launchpad team at Canonical, have started using `pyspelling` and we are very grateful for your work. Our project currently consists of about 50 pages and will grow to several hundred.
With 50 pages, spellchecking already takes almost a minute.
Do you see any chance of improving the speed? Or are there any intentions to do so?
Thank you!