nthykier commented 2 months ago

Hi,

Would you be open to supporting a public stable python API for codespell. Ideally for me, one where I as a consumer can feed codespell with words I want spellchecked and then is given back a valid or invalid, here are the corrections available. If you auto-correct, use this choice (or "Not safe for auto-correcting" if that is the data available).

My use case is that I am working on a Language Server (LSP, https://microsoft.github.io/language-server-protocol/specifications/lsp/3.17/specification) and I want to provide spell checking. So far, I have relied on hunspell because it had Python bindings but its accuracy on technical documents with the dictionary I found leaves a bit to be wanted.

In my use case, being able to identify the exact range of the problem is an absolute necessity as my code need to provide the text range for the editor, such that it can show the user exactly where in the open document the problem is. If word-by-word checking is not supported, then API can be pass lines of text to be spellchecked provided the result identifies exactly in the range where the problem is (that is, I need start + end index).

In this case, I would probably also need the API docs to state a bit about why the line by line text is important, because I might need to extract the underlying text from formatting to create the a synthetic line to be spellchecked. As an example, my current code tries to identify and hide common code/file references like usr/share/foo. In a word-by-word check, I just skip the the spellcheck call for that word. But if I need to pass a line of text to codespell, I would need to removed the ignored and here it is relevant to know how to do that replacement such that the user does not get a false positive because I attempted to avoid another false-positive.

Alternatively, a parser for the dictionaries plus the underlying dictionaries might also be an option a "light weight API" assuming they are easier to keep stable.

I have noted that codespell can do spell checking from stdin to stdout. However, that is a bit too heavy handed for me to easily replace my hunspell integration.

That is my wishlist. :) Would this be something that you would be open to supporting?

Note: By stable API, I assumed nothing was stable since __all__ = ["_script_main", "main", "__version__"] does not look it contains reusable APIs.

DimitriPapadopoulos commented 2 months ago

Exposing a function to check words or lines, one that is actually used by codespell itself, might affect performance. On the other hand:

the performance hit is just speculation without actual tests to measure performance,
exposing functions would probably reduce the complexity of parse_file(), a long-overdue target (see #3276).

Do you agree refactoring is a necessary step to expose functions that could be added to a public API later on ? If so, could you have a look at #3276, see if it's a step in your direction, and create a new pull request from it? A test to measure performance would be a great addition. I am not the maintainer, but I think these steps might convince @larsoner.

I think you would need to refactor near the code that starts line processing: https://github.com/codespell-project/codespell/blob/0b09d75a330b38c012265a243a3c2f0c988e1d8c/codespell_lib/_codespell.py#L967 and near the code that splits lines into words: https://github.com/codespell-project/codespell/blob/0b09d75a330b38c012265a243a3c2f0c988e1d8c/codespell_lib/_codespell.py#L989

Alternatively, the lightweight API to access dictionaries sounds like a good alternative indeed.

nthykier commented 2 months ago

Thanks for the feedback and the suggestions.

I could be convinced to do a PR for this change, but I would prefer to have alignment up front what the expectations would be to determine if that is within my capacity to deal with it and matches the amount of time I am willing to invest. I am hoping we can meet in the middle obviously, but I would rather know upfront so there is no disappointment on either side down the line. Like, if there is going to be a performance test, what is expectations on "before vs. after" performance, what kind of corpus would you consider a "valid performance test" (are we talking micro-benchmark or real-life use-cases or both; which versions of python count, etc.)

For clarity, I read phrases like "would be nice" or "a great addition" as "optional and would not prevent merge if omitted".

I guess that means I am waiting for @larsoner to respond on the expectations. :)

larsoner commented 2 months ago

I think as long as the changes are reviewable and there isn't much performance hit we're okay. Something like a 10% performance hit would be kind of a bummer. But I would also be surprised if there were one based on briefly thinking about the code.

One approach would be to make a quick attempt at this with the necessary refactoring and some new Python API, and some basic tests in a new codespell_lib/tests/test_api.py or similar. Hopefully that's not too much work based on what @DimitriPapadopoulos outlined above? And then we can check to make sure performance isn't too bad, then iterate on the API as proposed in the PR.

DimitriPapadopoulos commented 2 months ago

@larsoner I wonder whether this will eventually require some file/directory renaming. All files currently live under codespell_lib. However, Python modules usually put Python code under codespell or src/codespell. I wonder why this is not the case here, and whether we should change that.

larsoner commented 2 months ago

Can't remember why it was codespell_lib. It could be a reasonable time to mkdir src && git mv codespell src/ and adjust pyproject.toml. I think to change this we'd probably want the release to be 3.0 but that would be okay.

It can/should be done in a separate PR though

nthykier commented 2 months ago

Ok, thanks for the input so far. Up to 10% performance is quite a leeway indeed and like larsoner, I doubt we will hit it. Nevertheless, I will establish a baseline that I will use.

I did not see any references to how we decide the baseline. For now, I will be using https://sherlock-holm.es/stories/plain-text/cano.txt as a base text. Since what we are concerned with is the cost of refactoring into having a spellcheck_word function that is called per word (consider the name a description more than a target at this point), I did not think it relevant to look at different file types. Let me know if you have reason to believe otherwise.

I am describing my method below. Ideally, none of this would be new to you if you have prior experience with performance measurements of this kind.

The setup:

mkdir performance-corpus/
cd performance-corpus
wget https://sherlock-holm.es/stories/plain-text/cano.txt
# Create some extra copies (a single file takes less than 0.5 second; adding a bit more makes it easier to test)
for i in $(seq 1 20); do cp -a cano.txt cano-$i.txt; done

# Setup `_version.py`, so we can run directly from git checkout.
echo '__version__ = "do not crash, please"' > codespell_lib/_version.py

The resulting corpus is 79329 bytes. The wc command line tool estimates that each file has 76764 lines (including many empty ones) and 657438 word (in wc definition of a word). The codespell code flags about 175 words as possible typos in this text (that is, per file).

The python version I used for these numbers:

$ python3 --version
Python 3.11.9

Running the test:

# Test, repeated at least 3 times, fastest result is chosen;
# Usually slower times are caused by unrelated jitter - at least on laptops/desktops that
# also does other work
time PYTHONPATH=. python3 -m codespell_lib performance-corpus/ >/dev/null

# One run with `-m profile` to see where the bottlenecks are.
# Takes considerably longer (x10-x20), so it is not done on the full corpus
PYTHONPATH=. python3 -m profile -o baseline.pstat  -m codespell_lib performance-corpus/cano.txt

Baseline results

On my hardware, the performance test on the baseline is recorded as 5.6 seconds (0m5.600s).

The bottleneck profiling has the following key numbers as far as I can tell:

Total time 3.030s as baseline (2196160 function calls (2195259 primitive calls) in 3.030 seconds)
The code spend 0.503s (~1/6) in build_dict (split into 2 calls)
The code spend 2.472s (~5/6) in parse_file in one call [only one file was examined].
About of half of parse_file's runtime is spend in parse_file (1.178/.2.472). The other half is then assumed to come from callers.
The time spent in "".lower() is 0.433s (~1/6) and another ~1/6 is spend in regex operations (group() + search() + finditer()).

These numbers were extracted by an interactive session with the following commands (as examples):

import pstats
s = pstats.Stats("baseline.pstat")
s.sort_stats(2).print_stats(20)
s.print_callees('parse_file')

Types of corrections

Per file, we get 175 suggestions. These are split into the following groups:

Cased (capitalized or all upper): 41 (~23%)
Non-cased: 134 (~77%)
Corrections with multiple candidates: 77 (44%)
- Cased with multiple candidate corrections: 37 (21% of the 175 corrections)
- Non-cased with multiple candidate corrections: 40 (~23% of the 175 corrections)
Corrections with reason: 0

Some ratios:

Cased / non-cased: ~30% (40/134).
- Strong bias towards non-cased (seems reasonable since most words are lower-cased)
Multiple candidates cased / Multi candidate non-cased corrections: 93% (37/40).
- This one is probably too close to 1:1. Given the strong bias towards non-cased, one would expect a similar bias here. This is a potential weakness with the chosen corpus.
Multiple candidates / Single candidate corrections: 79% (77/98).
- This one is close to 1:1, so we get decent coverage of both cases.

DimitriPapadopoulos commented 1 month ago

~~Perhaps it would be worth checking with more files, so that application startup time remains small compared to individual file checking.~~

I guess 20 files ought to be enough;

for i in $(seq 1 20); do cp -a cano.txt cano-$i.txt; done

codespell-project / codespell

Python API support #3419

The setup:

Running the test:

Baseline results

Types of corrections