refactor: modernisation of proselint

Nytelife26 commented 6 months ago

this is a follow on from #1361. credit to @orgua for the initial work here.

following a request that no work from the initial refactoring effort should be used, preserved below, the oxidation of proselint begins. it can be observed here that launching proselint, parsing CLI options, finding config paths for both JSON and TOML, and deserializing them now takes less time than loading the CLI options alone did previously. it is worth noting that the previous measurements were taken even after a highly optimised refactor that involved replacing click with a simplified parse function.

it may take some time to get this all up and running. i would like to thank any onlookers for bearing with me.

keeping this here so i don't forget:

- moka for caching (?): few libraries support TTL in an LRU, which is the style used in proselint
- postcard for cache persistence, due to its relatively good performance and efficient output size
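
for anyone unfamiliar with the "TTL in an LRU" style mentioned above, here is a minimal python sketch of the idea. it is not proselint's actual cache code and says nothing about moka's API; the class name `TTLLRUCache` and its parameters are made up for illustration.

```python
import time
from collections import OrderedDict


class TTLLRUCache:
    """Hypothetical LRU cache whose entries also expire after `ttl` seconds."""

    def __init__(self, max_size, ttl, clock=time.monotonic):
        self.max_size = max_size
        self.ttl = ttl
        self.clock = clock  # injectable so expiry can be tested without sleeping
        self._entries = OrderedDict()  # key -> (expires_at, value)

    def get(self, key, default=None):
        entry = self._entries.get(key)
        if entry is None:
            return default
        expires_at, value = entry
        if self.clock() >= expires_at:
            # the entry outlived its TTL: drop it and report a miss
            del self._entries[key]
            return default
        self._entries.move_to_end(key)  # mark as most recently used
        return value

    def put(self, key, value):
        self._entries[key] = (self.clock() + self.ttl, value)
        self._entries.move_to_end(key)
        if len(self._entries) > self.max_size:
            self._entries.popitem(last=False)  # evict the least recently used
```

an entry can leave the cache in two ways: by being the least recently used when the cache is full, or by going past its time to live, which is the combination few libraries offer.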

orgua commented 5 months ago

I hereby make use of my copyright and revoke the permission to use any of my code (commits & branches). Good thing is that nothing has been merged yet. I'm licensing it under a private license that specifically forbids usage in proselint.

Nytelife26 commented 5 months ago

i can respect your wishes there quite easily. i would, however, like to note that this (along with your other recent behaviour) is quite unprofessional. additionally, under D.5 and D.6 of github's terms of service, as i am sure you're aware, that would not work if i were genuinely a bad actor.

orgua commented 5 months ago

I am the unprofessional one, of course. I just wanted to coexist and contribute. To stay with your top-notch analogy: The keys were handed to us both. You decided to change the locks. You made an overstepping and backstabbing move for power. I'll not be thrown under the bus. So i'll go and take my copyright with me. Read again through your paragraphs. I did not contribute - nothing was merged - and i revoked my consent for any future merges of my work.

Thanks for respecting my wishes. bye

Nytelife26 commented 4 months ago

work will now resume following completion of my university exams for the year. current status is as follows:

- general check registry system is being implemented. this will eventually be used as the core for extensibility
- following implementation of the general registry for proselint in its current state, check key path splitting will be implemented. this will enable you to have more granular control over checks, including at the per-check level requested in #1375, but also for entire categories as well as submodules.
- finally, the check functionality in its entirety will be replaced with a dispatch system. this will provide access to the original check functions, but additionally make it easier to write checks and modify internals without breaking everything. the intended format is detailed below.

porting efforts have been halted, and work is resuming in python for now. when i have completed the plans i have for proselint's current structure, they will continue.

however, many of these ideas came from deciding how best to implement proselint in other languages. i devised the following model for checks, which can be collected into a registry and dispatched sequentially or concurrently.

each check is a metadata object:
- a check type enumeration is used, along with a data store for type specific information. this is used to determine which function the check specification will be dispatched to:
  - consistency (word pairs)
  - preferred forms (items dict, padding)
  - preferred forms simple (items dict, padding)
  - existence (items list, unicode switch, dotall switch, exceptions list)
  - existence simple (item list, unicode switch, exceptions list)
  - reverse existence (allowed item list)
  - reverse existence simple? (allowed item list)
- it will contain the full check path (e.g. misc.illogic.conclusion), with property computations for aforementioned path splitting and the name. optionally, this may be computed to ensure consistency and remove the need to manually normalise check names
- it will contain an ignore_case property, because this switch exists for all check types
- it will contain an offset tuple, because offsets exist for all check types
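
to make the model above concrete, here is a rough python sketch of a check metadata object and a dispatch table. this is an illustration under assumptions, not the code in this pull request: the names `Check`, `CheckType`, `DISPATCH` and `run_checks`, the example check paths, and the simplified matching logic are all hypothetical, and only two of the listed check types are shown.

```python
import re
from dataclasses import dataclass
from enum import Enum, auto


class CheckType(Enum):
    CONSISTENCY = auto()
    EXISTENCE = auto()


@dataclass(frozen=True)
class Check:
    path: str              # full check path, e.g. "misc.illogic.conclusion"
    check_type: CheckType  # decides which function the check is dispatched to
    data: dict             # type-specific store (items, word pairs, ...)
    ignore_case: bool = True
    offset: tuple = (0, 0)

    @property
    def path_segments(self):
        return tuple(self.path.split("."))

    @property
    def name(self):
        return self.path_segments[-1]


def _flags(check):
    return re.IGNORECASE if check.ignore_case else 0


def run_existence(text, check):
    # flag every occurrence of any listed item
    pattern = "|".join(re.escape(item) for item in check.data["items"])
    for m in re.finditer(rf"\b(?:{pattern})\b", text, _flags(check)):
        yield (m.start() + check.offset[0], m.end() + check.offset[1], check.path)


def run_consistency(text, check):
    # when both spellings of a pair appear, flag the less frequent one
    for first, second in check.data["word_pairs"]:
        hits_first = list(re.finditer(re.escape(first), text, _flags(check)))
        hits_second = list(re.finditer(re.escape(second), text, _flags(check)))
        if hits_first and hits_second:
            minority = hits_first if len(hits_first) < len(hits_second) else hits_second
            for m in minority:
                yield (m.start() + check.offset[0], m.end() + check.offset[1], check.path)


DISPATCH = {
    CheckType.EXISTENCE: run_existence,
    CheckType.CONSISTENCY: run_consistency,
}


def run_checks(text, registry):
    """Dispatch every check in the registry sequentially and collect results."""
    results = []
    for check in registry:
        results.extend(DISPATCH[check.check_type](text, check))
    return results
```

because each check is plain data plus a type tag, the loop in `run_checks` could equally hand the checks to a pool of workers, which is what makes the "sequentially or concurrently" part straightforward.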

metadata implementation of limitations, such as limit_results and ppm_threshold, remains to be determined. this may happen with flags.

the current registry implementation introduces a performance regression of roughly 100ms, which is far from ideal. i will aim to clean this up promptly. however, the desired granularity has been achieved: the addition of partial keys makes it possible to specify key components, like simply airlinese, while the registry system makes it possible to skip checks on a per-function basis.
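
as a rough illustration of how partial keys could select checks, here is a python sketch. it is not the registry code itself: the function names, the "most specific key wins" rule, the "enabled unless disabled" default, and the example paths other than misc.illogic.conclusion are assumptions made for the example.

```python
def key_matches(key, check_path):
    """True if the key's segments appear as a contiguous run in the check path."""
    key_parts = key.split(".")
    path_parts = check_path.split(".")
    width = len(key_parts)
    return any(
        path_parts[i:i + width] == key_parts
        for i in range(len(path_parts) - width + 1)
    )


def enabled_checks(check_paths, config):
    """Filter check paths using a {key: bool} config.

    Assumption: the longest matching key decides, and checks that no key
    matches stay enabled.
    """
    enabled = []
    for path in check_paths:
        matching = [key for key in config if key_matches(key, path)]
        if matching:
            most_specific = max(matching, key=lambda k: len(k.split(".")))
            if not config[most_specific]:
                continue
        enabled.append(path)
    return enabled


paths = [
    "airlinese.misc",
    "misc.illogic.conclusion",
    "misc.illogic.coin",
    "typography.symbols",
]

# a single key component disables everything under it
print(enabled_checks(paths, {"airlinese": False}))

# a submodule is disabled, but one check inside it is switched back on
print(enabled_checks(paths, {"misc.illogic": False, "misc.illogic.coin": True}))
```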

status report: i have made it up to proselint.checks.misc so far with the new dispatch system. this will resolve many issues incurred by the provisional implementation, and with some planned additions, like the flag system and a context accessor for custom check functions that do not conform to one of the provided check types, it will represent the final version of proselint's internal check system for the foreseeable future. i aim to have this finished, pending testing, by the end of the week.

Nytelife26 commented 4 months ago

final edit:

links.broken is still the slowest check to load, at ~20ms.

here are current benchmarks for proselint, from start to finish, with the demo file:

Benchmark 1 (uncached): python3 -m proselint --demo
  Time (mean ± σ):     253.4 ms ±  11.2 ms    [User: 667.4 ms, System: 139.2 ms]
  Range (min … max):   238.1 ms … 271.2 ms    10 runs

Benchmark 2 (cached): python3 -m proselint --demo
  Time (mean ± σ):     142.6 ms ±   4.6 ms    [User: 126.5 ms, System: 15.1 ms]
  Range (min … max):   137.6 ms … 159.5 ms    21 runs

tecosaur commented 1 month ago

This looks fantastic @Nytelife26! I see the last commit here is from July, is there much more that needs to be done?

Nytelife26 commented 1 month ago

> is there much more that needs to be done?

I am in the process of porting proselint in its entirety, which became a necessity after an unfortunate conflict arose with the original author of the refactor. I see this as the best possible path forward for proselint - a fresh start, which will come with performance benefits, long-overdue housekeeping, and a good chance to implement some features people have been asking for for a long time.

This effort has not been easy, and while I'm back at university, things have been a hard balance. However, I have many of the internals done already (configuration, core parts of the command line, specification structures), and I feel good progress is being made.

I will be making this effort more public once I have a solid foundation in place. For now, the latest commits to this pull request mark the last Python version of the project, unless some major breakthrough in communication happens.

Let me know if you have any further questions. As always, I am incredibly grateful and happy to see that people are still interested in proselint. Things were rough, with stagnant development after communications ceased some years ago, but I am excited to finally have the chance to revive this project.

Nytelife26 commented 1 month ago

Small victories are showing - the command line, configuration parser, and check primitives are operational. As an additional bonus, all of the check specifications will be evaluated and stored at compile-time, entirely eliminating runtime discovery costs from the Python version.

Shown here is the first test with an actual regex specification from the original code.

Some things are not yet possible for reasons beyond my control, like consistent case-insensitive matching without using hacky mode modifiers (blocked by fancy-regex#132). I also have yet to implement parallelization, but that will not be a priority until much of the other major work is complete; although, it should be as trivial as adding Rayon and adjusting the dispatch iterators.

Nytelife26 commented 2 weeks ago

I committed a preview version today so any onlookers can see how things are coming along. Be advised that at present, things are messy, quite inefficient, and there are still traces of Python-esque design patterns lying around.

However, results speak for themselves. With 51 of ~180 checks implemented, an uncached serial run in release mode is, by my measurements, at least 10 times faster than an uncached parallel run of the previous implementation. Assuming this performance will scale in a linear fashion with some pessimistic padding, I would expect the finished port to be no worse than 2 times faster.

nytelife26@[lilium-2] » proselint git:(dev) ± hyperfine
Benchmark 1: ./proselint-rs/target/debug/proselint check --demo
  Time (mean ± σ):     529.2 ms ±   6.0 ms    [User: 520.3 ms, System: 7.0 ms]
  Range (min … max):   520.4 ms … 542.0 ms    10 runs

Benchmark 2: ./proselint-rs/target/release/proselint check --demo
  Time (mean ± σ):      77.0 ms ±   4.8 ms    [User: 73.9 ms, System: 1.9 ms]
  Range (min … max):    70.3 ms …  85.2 ms    10 runs

Benchmark 3: pdm run proselint check --demo
  Time (mean ± σ):     939.7 ms ±  15.3 ms    [User: 1174.2 ms, System: 232.7 ms]
  Range (min … max):   920.6 ms … 972.2 ms    10 runs

  Warning: Ignoring non-zero exit code.

Summary
  ./proselint-rs/target/release/proselint check --demo ran
    6.88 ± 0.43 times faster than ./proselint-rs/target/debug/proselint check --demo
   12.21 ± 0.78 times faster than pdm run proselint check --demo
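
For reference, the linear-scaling estimate can be reproduced from the serial figures above. This is only a back-of-the-envelope sketch: it treats the whole 77.0 ms release run as check time (ignoring fixed startup cost, which makes it pessimistic), takes ~180 as the exact check count, and the function name `projected_speedup` is made up for the example.

```python
def projected_speedup(rust_ms, python_ms, implemented, total):
    """Scale the measured Rust time linearly to the full check count."""
    projected_ms = rust_ms * total / implemented
    return projected_ms, python_ms / projected_ms


# mean times from the serial hyperfine run: release build vs. pdm run
projected_ms, speedup = projected_speedup(77.0, 939.7, implemented=51, total=180)
print(f"{projected_ms:.1f} ms, {speedup:.2f}x")  # prints "271.8 ms, 3.46x"
```

A naive projection of roughly 3.5 times faster leaves some room for padding before reaching the 2 times floor stated earlier.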

Updated results with a parallel iterator via rayon:

Benchmark 1: ./proselint-rs/target/debug/proselint check --demo
  Time (mean ± σ):     128.8 ms ±  19.7 ms    [User: 725.1 ms, System: 68.3 ms]
  Range (min … max):   103.3 ms … 162.5 ms    10 runs

Benchmark 2: ./proselint-rs/target/release/proselint check --demo
  Time (mean ± σ):      38.5 ms ±   3.2 ms    [User: 76.1 ms, System: 33.7 ms]
  Range (min … max):    33.2 ms …  43.4 ms    10 runs

Benchmark 3: pdm run proselint check --demo
  Time (mean ± σ):     838.4 ms ±  17.4 ms    [User: 1006.9 ms, System: 185.3 ms]
  Range (min … max):   813.6 ms … 865.9 ms    10 runs

  Warning: Ignoring non-zero exit code.

Summary
  ./proselint-rs/target/release/proselint check --demo ran
    3.34 ± 0.58 times faster than ./proselint-rs/target/debug/proselint check --demo
   21.76 ± 1.89 times faster than pdm run proselint check --demo

All check specifications are registered at compile-time. Things that remain to be done include message templating, output formats, deciding whether I'd like CheckType to remain as an enum or become a trait to emulate the flexibility of Python's unions, implementation of a new ExistenceFancy check type, and general housekeeping.

amperser / proselint

refactor: modernisation of proselint #1371