nthykier commented 3 months ago

Adding a Spellchecker API

The proposed PR is a follow up to #3419 that introduces a Spellchecker class. This PR deliberate does not close #3419, since the public package name should probably be changed first and an __all__ provided in its __init__,py. In its simplest form, a consumer would do something like this:

# Per #3419, we may want to shuffle around the top-level name so leaving this one "undefined" at the moment
from ... import Spellchecker

# Loads "clear" + " rare" by default, use `Spellchecker(builtin_dictionaries=[...])` for choose.
s = Spellchecker()
text = """\
some
long text
with
tpyos
"""
# Provide a tokenizer (simple one for the sake of the example)
# The codespell tokenizer is a bit more involved, see `line_tokenizer_factory`
tokenizer = re.compile(r"[^ ]+").finditer
lines = text.splitlines()
for line in lines:
    s.spellcheck_line(line, tokenizer)

There is also a check_lower_cased_word method for single word checks which as a part of refactoring. I left it in because it has a simpler API if you have words you want to have checked rather than lines.

Reviewing this PR

I have optimized the review the PR by reviewing each commit individually with the aim of understanding how we got to the end result. It comes with the following commits:

$ git log origin/master..HEAD | git shortlog
Niels Thykier (11):
      Refactor: Move some code to new files for reuse
      Replace `data: str` with `candidates: Sequence[str]`
      Refactor dictionary into a new `Spellchecker` class
      Refactor line tokenization to simplify an outer loop
      Rewrite line spellchecking and move most of it into the `Spellchecker`
      De-indent loop body (whitespace-only change)
      Make `Spellchecker()` load builtin dictionaries by default
      Support non-regex based tokens for `spellcheck_line`
      Speed up spellchecking by ignoring whitespace-only lines
      Move `codespell:ignore` check into `Spellchecker`
      Speed up `codespell:ignore` check by skipping the regex in most cases

Each commit is written with the intend to be a small self-contained change. This also means that some commits are just moving code around or even seem a bit sub-optimal with following commits just using them as stepping stone. If you see something in a commit that you want changed, please note that it might be improved in a later commit. I still kept this as one PR since some of the commits ends up with a considerable performance regression in that commit, which is then resolved in later commits. This is all a trade off between keeping commits as standalone and reviewable as possible vs. being able to cut the PR at any given point time. I preferred the former.

Note that GitHub's PR review function (like Approve or Request Changes) operates on the entire PR even though you review it commit by commit.

How the API interfaces with codespell related activities

Dictionary loads

The API by default just loads the default builtin dictionaries. The constructor accepts a sequence/list of names such as clear, rare. After that, the caller can manually load custom dictionaries via load_dictionary_from_file. This is the default mode and aimed at the API consumers to quickly get started with the common case.

The codespell command is not a common case. Here, dictionaries can be loaded in any other. To facilitate that, the following flow can be used:

s = Spellchecker(builtin_dictionaries=None)
s.load_dictionary_from_file("my-early-loaded-custom-dictionary.txt")
s.load_builtin_dictionaries(["clear", "rare", "informal"])
s.load_dictionary_from_file("my-late-loaded-custom-dictionary.txt")

Various exclusions or regex based matches/ignores

Exclusions features supported directly in the API:

Inline ignores (codespell:ignore). Fully handled by default, caller can opt-out.
Ignore words, ignore words cased. Partially supported. Caller must parse/provide these to the spellchecker either at initialization or the relevant API call.

Not supported directly by the API in its current form. Caller must facilitate these:

Exclude lines (-X). Caller should load exclude the line instead.
All regex based matches / ignores. The caller should set these up and provide them in the tokenizer. (The codespell tokenizer "factory" plus related default regexes could become part of the API if desired)
--uri-ignore-words-list: Hidden inside the tokenizer.

What is to be included in the API

These are the items that I would see go into the __all__ of the codespell API in a later commit / PR.

Spellchecker (introduced in this PR). All its methods would become public API.
Misspelling (existing class). All its data fields would become public API (or at least,candidates + fix would be neede for me). Also, we should probably rename it to Correction or DictionaryResult, etc., since it is not a misspelling but the codespell correctional data.
- I would personally recommend that this class becomes read-only at an API level. However, that would require that codespell uses a different way to remember its interactively chosen corrections.
DetectedMisspelling (introduced in this PR). Subject to renaming of the class or its properties.
LineTokenizer + Token + their generic constraint (T) for typing purposes.

That is the basic API I envisioned and I would need. This API would still have a considerable amount of "bring your own" code, notably the tokenizer is something people will probably struggle with. My project has its own with its own rules for how things should be split into tokens (like ignoring quoted words), so this was not a goal for me. However, it could still be important for you, since it will affect more casual API consumers that did not solve this problem ahead of time. :)

An alternative to more API would to have more examples of how to emulate the codespell command line tool.

Performance cost of the API

It surprised me a bit that at some point the total regression exceeded 10% on the corpus I used for testing (see #3419 for the details on the performance test setup). It was restored to an ~8% regression before the first Speed up commit.

The first Speed up commit reduces the regression to <= 1.5% (as in, the ~0.060s ballpark) The second Speed up commit was "unnecessary show off" to get below the baseline. :laughing:

These commits speed up commits are technically a "compensation" changes. As an example, I believe you would get a similar boost of performance by retrofitting the second speed up commit on the existing code base for a similar win. The first speed up commit is partly a compensation, partly a counter of the regression since the "overhead per line" got measurably higher (accordingly, reducing the number of redundant lines now makes the improvement considerably more worth it).

The end result is 5.6s -> 4.9s with an API and 1½-2 performance related compensations.

(Side note for GitHub: This also closes #3434.)

larsoner commented 2 months ago

@nthykier let me know when I should look. I at least just approved CI runs.

One option to make it so that I don't have to approve them would be to make some other small PR that we merge. For example it could be just your couple of speed improvements. It would make an ugly rebase issue here, though, so understandable if you'd rather make a different simple PR (or just keep waiting fro me to click "approve" on the runs).

nthykier commented 2 months ago

@nthykier let me know when I should look. I at least just approved CI runs.

One option to make it so that I don't have to approve them would be to make some other small PR that we merge. For example it could be just your couple of speed improvements. It would make an ugly rebase issue here, though, so understandable if you'd rather make a different simple PR (or just keep waiting fro me to click "approve" on the runs).

Thanks.

This PR should be ready for review now. :)

DimitriPapadopoulos commented 2 months ago

After this nice refactoring work, I suspect the maximum value of some complexity metrics might be lowered: https://github.com/codespell-project/codespell/blob/3760e619e2a7b7fa8467debb2889d21a34651853/pyproject.toml#L169-L174

nthykier commented 2 months ago

After this nice refactoring work, I suspect the maximum of some complexity metrics might be lowered:

https://github.com/codespell-project/codespell/blob/3760e619e2a7b7fa8467debb2889d21a34651853/pyproject.toml#L169-L174

Quite possibly. What do you (as a project) expect here? That I lower them to the default values for pylint, see if that works and re-add the keys needed for the CI to pass lowered the the exact amount required to have the CI pass?

DimitriPapadopoulos commented 2 months ago

Quite possibly. What do you (as a project) expect here? That I lower them to the default values for pylint, see if that works and re-add the keys needed for the CI to pass lowered the the exact amount required to have the CI pass?

Exactly. The default values won't work, but pylint will show you new lower values.

That can of course be deferred to a later PR or left to other maintainers - at your convenience.

codespell-project / codespell

Add Python spellchecking API #3425