Parallelize checking for 10x performance win

aatifsyed commented 11 months ago

What does this PR accomplish?

Improves checking times by ~10x

comparison with cargo-spellcheck 0.12.4 by running cargo spellcheck on https://github.com/rust-lang/rust/tree/22e9fe644ea710eec50cb0aabcae7fa8dd9fd675

(see below note for why I've done -j $(nproc)

$ hyperfine "cargo spellcheck -j $(nproc)" "~/code/cargo-spellcheck/target/release/cargo-spellcheck -j $(nproc)" --export-markdown hyperfine.md 
Benchmark 1: cargo spellcheck -j 16
  Time (mean ± σ):     356.515 s ± 11.933 s    [User: 409.724 s, System: 155.652 s]
  Range (min … max):   340.755 s … 374.962 s    10 runs

Benchmark 2: ~/code/cargo-spellcheck/target/release/cargo-spellcheck -j 16
  Time (mean ± σ):     32.953 s ±  2.922 s    [User: 353.613 s, System: 13.498 s]
  Range (min … max):   26.296 s … 36.698 s    10 runs

Summary
  '~/code/cargo-spellcheck/target/release/cargo-spellcheck -j 16' ran
   10.82 ± 1.03 times faster than 'cargo spellcheck -j 16'

Command	Mean [s]	Min [s]	Max [s]	Relative
`cargo spellcheck -j 16`	356.515 ± 11.933	340.755	374.962	10.82 ± 1.03
`~/code/cargo-spellcheck/target/release/cargo-spellcheck -j 16`	32.953 ± 2.922	26.296	36.698	1.00

🩹 Bug Fix
🦚 Feature

Closes # .

Changes proposed by this PR:

Parallelise running the checkers

Notes to reviewer:

Not sure why the previous checking code was single threaded - it was definitely attempting to be parallel, but misunderstood the stream APIs.

There are probably a bunch more perf improvements to make - suspect the next bottleneck is document collection.
Would be nice to use tracing, and have better (parallel) profiling tools.
Not sure why we use the number of physical cpus - hunspell is using only half the threads it can. We should use num_cpus::get instead of num_cpus::get_physical (or std::thread::available_parallelism).
clippy is very unhappy with doc-chunks
lots of methods on Document should probably be removed

📜 Checklist

[x] Works on the ./demo sub directory
[x] Test coverage is excellent and passes No change
[x] Documentation is thorough No change

drahnr commented 11 months ago

Thank you for your PR!

I indeed misread the API documentation of buffered.

Limiting to the physical CPUs was intended, most people run this on their local machines and still want to be able to work while the (then very long) checks were running.

Let me bench the PR, and then let's discuss :)

drahnr commented 11 months ago

My results appear to be very different on 40k loc project using the following command:

cargo b --release; hyperfine --show-output -r 2 "./target/release/cargo-spellcheck check -j16 --code 0 ../secret/project"

Current main branch

# git 8e4cfcce7f8cf3984664feba44e90dabe9d9fcf1
Benchmark 1: ./target/release/cargo-spellcheck check -j16 --code 0 ../gensyn/node
  Time (mean ± σ):     191.1 ms ±   5.7 ms    [User: 246.0 ms, System: 55.1 ms]
  Range (min … max):   187.1 ms … 195.2 ms    2 runs

# git 6c50e07efa303aa37fff73be1d7b187b36dc3e22
Benchmark 1: ./target/release/cargo-spellcheck check -j16 --code 0 ../gensyn/node
  Time (mean ± σ):     184.6 ms ±   3.2 ms    [User: 236.8 ms, System: 58.1 ms]
  Range (min … max):   182.4 ms … 186.9 ms    2 runs

which aligns with the CPU utilization I see - it's only one core regardless.

drahnr commented 11 months ago

Command	Mean [ms]	Min [ms]	Max [ms]	Relative
`./cargo-spellcheck-8e4cfcce7f8cf3984664feba44e90dabe9d9fcf1 -j 16 --code 0 ../gensyn/node`	180.1 ± 6.3	170.1	191.2	1.01 ± 0.05
`./cargo-spellcheck-6c50e07efa303aa37fff73be1d7b187b36dc3e22 -j 16 --code 0 ../gensyn/node`	178.5 ± 6.7	171.0	191.0	1.00

hyperfine --show-output "./cargo-spellcheck-8e4cfcce7f8cf3984664feba44e90dabe9d9fcf1 -j 16 --code 0 ../gensyn/node" "./cargo-spellcheck-6c50e07efa303aa37fff73be1d7b187b36dc3e22 -j 16 --code 0 ../gensyn/node" --export-markdown hyperfine.md 
Benchmark 1: ./cargo-spellcheck-8e4cfcce7f8cf3984664feba44e90dabe9d9fcf1 -j 16 --code 0 ../gensyn/node
  Time (mean ± σ):     180.1 ms ±   6.3 ms    [User: 238.3 ms, System: 59.6 ms]
  Range (min … max):   170.1 ms … 191.2 ms    15 runs

Benchmark 2: ./cargo-spellcheck-6c50e07efa303aa37fff73be1d7b187b36dc3e22 -j 16 --code 0 ../x
  Time (mean ± σ):     178.5 ms ±   6.7 ms    [User: 235.5 ms, System: 57.5 ms]
  Range (min … max):   171.0 ms … 191.0 ms    15 runs

Summary
  ./cargo-spellcheck-6c50e07efa303aa37fff73be1d7b187b36dc3e22 -j 16 --code 0 ../gensyn/node ran
    1.01 ± 0.05 times faster than ./cargo-spellcheck-8e4cfcce7f8cf3984664feba44e90dabe9d9fcf1 -j 16 --code 0 ../x

I do understand that your code should be faster, but practically it's just about even. I think it'd be worth looking into a) why our outcomes are different and b) why there is only one CPU core used.

drahnr commented 11 months ago

@aatifsyed see the above monologue :)

aatifsyed commented 11 months ago

Interesting - just to clarify my understanding, on this branch you're seeing only a single core used?

drahnr commented 11 months ago

Interesting - just to clarify my understanding, on this branch you're seeing only a single core used?

Yes, exactly. The pattern differs, but there is only ever one CPU used at 100% with this branch. I suspected it being IO bound or a lock in stdout (hypothesis).

aatifsyed commented 11 months ago

That's very surprising.

I've prepared tracing on two branches for you: https://github.com/aatifsyed/cargo-spellcheck/tree/parallel-traced (my code) https://github.com/aatifsyed/cargo-spellcheck/tree/single-thread-traced (pre-existing code)

Running each will give you a trace-<unix_seconds>.json that you can view at https://ui.perfetto.dev, or chome://tracing in a chromium browser.

Here are the traces on my machine (for a bare cargo spellcheck invocation), on https://github.com/ChainSafe/forest/:

$ loc
--------------------------------------------------------------------------------
 Language             Files        Lines        Blank      Comment         Code
--------------------------------------------------------------------------------
 Rust                   313        51683         5975         5028        40680
 Markdown                27         4669          823            0         3846
...

parallel-traced (my code) you can see the multiple threads of execution, taking ~6s trace-parallel.json.txt

single-thread-traced (pre-existing code) you can see that there only a single thread, taking ~45s trace-single-threaded.json.txt

Want to try showing the traces from runs on your machine?

aatifsyed commented 11 months ago

I suspected it being IO bound or a lock in stdout (hypothesis).

I'd be surpised if this was the case!

drahnr commented 11 months ago

parallel.json.txt vanilla.json.txt

I am on fedora 38 AMD 64 with a 16 core processor. Are you on Mac OS X?

drahnr commented 11 months ago

forst-parallel.json.txt forest-vanilla.json.txt

When running on forest, I indeed see the ~10x speedup.

I'll have to debug why this is not the case for the private repo, but I don't see any obstacles to merging this as is.

Could you open up follow up PR introducing the tracing?

drahnr commented 11 months ago

Alright, I think I figured the root cause, it's rather unintuitive cli flag interpretation.

drahnr / cargo-spellcheck