Instagram / Fixit

Advanced Python linting framework with auto-fixes and hierarchical configuration that makes it easy to write custom in-repo lint rules.
https://fixit.rtfd.io/en/latest/

Fixit can be incredibly slow. Are there any recommendations available for profiling checks? #457

Open jarshwah opened 2 months ago

jarshwah commented 2 months ago

As per the title, we're seeing incredibly long run times when running fixit==2.1.0.

Our code base is approximately 4M lines of Python code. On a CircleCI Large instance (4 CPU / 8 GB), it takes up to an hour to run across the entire code base. We've worked around this by only running on changed files, which takes approximately 1 minute, except when we edit the fixit config or any of our custom rules, in which case we run the entire suite.

I'm keen to try to understand why the runs may be slow by profiling in some way. Is there any tooling or logging available within fixit itself that might help track down where some of the problems lie? To be clear, I'm coming at this from the assumption that the way we've written our rules is mostly to blame, and I'd like to identify the worst offenders.

While I'm not really able to share our implementations, I can provide some details of our current usage, in case it prompts a thought of "oh, of course that would be slow!":

We make heavy use of QualifiedNameProvider and FilePathProvider. We have lots of helper functions that presumably construct lots of instances, like so:

def is_django_model_class(visitor: LintRule, node: cst.ClassDef) -> bool:
    model_class = QualifiedName(name="django.db.models.Model", source=QualifiedNameSource.IMPORT)
    for base_class in node.bases:
        if QualifiedNameProvider.has_name(visitor, base_class.value, model_class):
            return True
...

def visit_ClassDef(self, node: cst.ClassDef) -> None:
    is_django_model_class = helpers.is_django_model_class(self, node)
    ...

Still, I'm more interested in being able to analyse performance and profiling data myself rather than guessing, so anything in this space would be incredibly helpful.

jarshwah commented 2 months ago

Somewhat answering my own question: fixit can generate debug logs showing how long each execution of each plugin hook takes. But the format of those logs isn't very useful with a large number of files (and, tbh, I didn't try to figure out how to configure the logging framework to output them).
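For completeness, one way to surface those debug logs is to configure root logging before fixit is imported and then hand control to its CLI entry point. This is just a sketch: I'm assuming fixit emits the timings through the stdlib logging module and that the fixit console script points at fixit.cli:main; the message format is an internal detail.

# run_fixit_debug.py -- configure DEBUG logging, then run fixit's CLI so its
# debug messages (including per-hook timings) are printed to stderr.
import logging
import sys

logging.basicConfig(
    level=logging.DEBUG,
    format="%(asctime)s %(name)s %(levelname)s %(message)s",
    stream=sys.stderr,
)

from fixit.cli import main  # assumed entry point behind the `fixit` script

if __name__ == "__main__":
    main()  # parses sys.argv, e.g. `python run_fixit_debug.py lint src`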

LintRunner.collect_violations provides a timings_hook that can be used to collect timings, but it isn't exposed in the main CLI. Installing fixit locally as an editable package, patching it, and running it provides a lot more detail:

diff --git a/src/fixit/engine.py b/src/fixit/engine.py
index 2e56dda..7a5f5ef 100644
--- a/src/fixit/engine.py
+++ b/src/fixit/engine.py
@@ -48,6 +48,20 @@ def diff_violation(path: Path, module: Module, violation: LintViolation) -> str:
     )

+def timing_hook(timings: Timings) -> None:
+    """
+    Log timings for each rule.
+    """
+    split_timings = {}
+    for name, duration in timings.items():
+        rule, hook = name.split(".")
+        split_timings.setdefault(rule, {})[hook] = duration
+    for rule, durations in split_timings.items():
+        for hook, duration in durations.items():
+            print(f'"{rule}","{hook}",{duration}\n', end="")
+
+
+
 class LintRunner:
     def __init__(self, path: Path, source: FileContent) -> None:
         self.path = path
@@ -67,6 +81,8 @@ class LintRunner:
         lint rules have finished running, passing in a dictionary of
         ``RuleName.visit_function_name`` -> ``duration in microseconds``.
         """
+        if timings_hook is None:
+            timings_hook = timing_hook

         @contextmanager
         def visit_hook(name: str) -> Iterator[None]:

Then run fixit and redirect the printed timings to a CSV file:

$ fixit lint src > ~/Documents/fixit-rule-timing.csv

Then analyse with pandas:

import pandas as pd

data = pd.read_csv("~/Documents/fixit-rule-timing.csv", names=["plugin", "hook", "duration"])
# Convert duration to seconds
data["duration"] = data["duration"].div(1000000).round(2)
# Agg by plugin+hook
agg = data.groupby(["plugin", "hook"])["duration"].sum()
top_N = agg.nlargest(20)
print(top_N)
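For what it's worth, the same dataframe can also separate rules that are slow everywhere from rules that are only slow on a few pathological files:

# Aggregate total, mean and call count per plugin/hook from the same CSV.
summary = data.groupby(["plugin", "hook"])["duration"].agg(["sum", "mean", "count"])
print(summary.sort_values("sum", ascending=False).head(20))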

thatch commented 2 months ago

There are a few common paths that I know are slow, but you might be hitting some others depending on what your rules are using. I appreciate #458 with the good repro, I'll give some general thoughts here and then try to look deeper on that one.

I hacked up a quick draft of tracing that works for single-process mode only so far. Here's a screenshot of a sample run (pip install keke; fixit --trace=... --stats fix src/fixit) so we have something to discuss.

[screenshot: keke trace of a sample fixit run]

Most of the per-file time with this configuration is not spent in rules (the tiny unnamed bars below visit_batched), which was surprising to me. It's spent in libcst internals (the unattributed time below visit_batched), and a fairly large amount of time is also spent in GC (the yellow).

I suspect this is in metadata calculation (e.g. scope or qualified name) but that is probably more of a design inefficiency in libcst than something that's immediately fixable. Giving @zsol a heads-up that a keke PR for libcst might be on the way.

Good news (if you want to look at it that way) is that this is cpu-bound (the green at the top is actually a graph, pegged at 100%) and that there doesn't immediately appear to be any external bottleneck like I/O.

For a larger number of files, you probably also start to notice the trailrunner.walk call and perhaps gitignore filtering, which run serially before any rules start, as well as the time to diff or transfer the results from multiprocessing. For small numbers of files (more than 1, fewer than 10) you will probably also notice the time to create processes (it uses spawn, which has to redo imports).

For fixit fix (vs fixit lint) you will also only be running single-threaded. This could be improved for fixit fix -a mode, but I'd want more tests before landing a trivial one-line PR to do this.

zsol commented 2 months ago

Happy to merge a keke PR for libcst, I was about to release a new version tomorrow but can wait a bit

jarshwah commented 2 months ago

Thanks for the response @thatch 👍 keke is new to me; I should have just reached for a profiler, which would have sped up some of my digging and patching.
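For anyone else digging into this, here's roughly what I mean: a sketch that uses only the standard-library profiler and libcst directly to separate parse time from metadata-resolution time on a single file (the path and the choice of QualifiedNameProvider are just examples):

import cProfile
import pstats
from pathlib import Path

import libcst as cst
from libcst.metadata import MetadataWrapper, QualifiedNameProvider

source = Path("src/some_module.py").read_text()  # example file

with cProfile.Profile() as prof:
    module = cst.parse_module(source)
    # Resolving QualifiedNameProvider also resolves its dependencies
    # (e.g. scope analysis), which is where rules like ours spend time.
    MetadataWrapper(module).resolve(QualifiedNameProvider)

pstats.Stats(prof).sort_stats("cumulative").print_stats(20)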

Good news (if you want to look at it that way) is that this is cpu-bound

Yes, I agree, and that aligns with most of my testing too.

For a larger number of files, you probably also start to notice the trailrunner.walk call time and perhaps gitignore filtering

Yeah, 6 seconds for 40,000 files with gitignore filtering is not a horrible one-off cost, though I did notice that trailrunner will still traverse into the top level of ignored directories like __pycache__ (the gitignore filtering only appears to apply when matching against files). I'm not sure yet whether it keeps descending into subdirectories (think node_modules), but I believe it does, since we ended up adding an explicit exclusion rule for our node_modules directories.
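That kind of discovery cost is easy to sanity-check on its own; a sketch, assuming trailrunner's module-level walk API:

# Time just the file-discovery step that fixit delegates to trailrunner.
import time
from pathlib import Path

import trailrunner

start = time.perf_counter()
paths = list(trailrunner.walk(Path("src")))
print(f"{len(paths)} files discovered in {time.perf_counter() - start:.2f}s")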

I've updated https://github.com/jarshwah/experiments-fixit today with some more analysis, which essentially agrees that the bulk of the work is occurring within fixit_bytes (half the time, at least). There is still some medium hanging fruit outside of that though:

  1. Configs are generated for every path. I have a WIP change to pre-generate the configs, though I'm not sure there'll be much benefit since it involves re-walking the tree of files.
  2. We re-compute the rules for every path. These could be cached per config if we compute the types once and instantiate them later.
  3. We can chunk the tasks with multiprocessing (a rough sketch of the idea is below), but I didn't get many gains from that. https://github.com/jarshwah/Fixit/pull/1 covers options 2/3 (sort of...)
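To illustrate what I mean by chunking, a generic sketch (not fixit's internals; lint_one_file is a hypothetical stand-in for whatever runs the rules on a single path):

# Hand each worker a batch of paths via chunksize instead of one task per file,
# so per-task serialization/IPC overhead is paid less often.
import multiprocessing
from pathlib import Path
from typing import List


def lint_one_file(path: Path) -> List[str]:
    # Hypothetical: parse the file, run the rules, return violation summaries.
    return []


def lint_all(paths: List[Path]) -> List[str]:
    results: List[str] = []
    with multiprocessing.Pool() as pool:
        for violations in pool.imap_unordered(lint_one_file, paths, chunksize=16):
            results.extend(violations)
    return results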

thatch commented 1 month ago

I have further trace info available if you install

keke-tracing/keke (branch tmp-multiprocessing)
thatch/libcst (branch tmp-multiprocessing)
thatch/fixit (branch tmp-multiprocessing)

I'm not sure any of these are ready to merge, but they provide a starting point to understand where the time comes from.

Responding to your theories:

  1. configs for every path -- only 4ms on my machine per file, not the slow part
  2. rules for every path -- only 2ms on my machine per file, not the slow part
  3. chunk with multiprocessing -- I don't follow, likely not the slow part

Where I think the time is spent:

[screenshot: trace showing where the per-file time is spent]

My current thoughts:

  1. parse_module is about 400ms in this view for one file
  2. the various metadata providers (the things under _gen_batchable -- they are a chain and can't actually be batched, about 75% of the per-file time)
  3. _get_visitor_methods being recalculated for every file, about 70ms in this view, the magenta under cyan just to the right of the last _gen_batchable

Note I don't currently have GC time spans working for multiprocessing workers; I'd guess GC accounts for ~10% of the total time (and the "NoR" oddity near the right side is likely a GC pause). In addition, collect_rules in each process accounts for about the first 550ms (with spawn, the imports have to be redone in each child, but that cost is only paid once per process).

jarshwah commented 1 month ago

Responding to your theories:

configs for every path -- only 4ms on my machine per file, not the slow part

4ms per file, for 40,000 files, is still 160 seconds.

rules for every path -- only 2ms on my machine per file, not the slow part

As above, this is per-file, so accounts for 80 seconds total.

chunk with multiprocessing -- I don't follow, likely not the slow part

As per https://github.com/jarshwah/experiments-fixit?tab=readme-ov-file#multiprocessing-pool, chunking the tasks across the multiprocessing pool rather than serializing/deserializing per task can be more performant in some cases, though I didn't see much of an improvement here, likely because so much time is spent elsewhere. It might be worth re-examining once other avenues are sorted.

While I agree these are not the primary bottlenecks and investigation is best spent on the functions you've identified, the first two are still doing lots more work than they need to, and could be tackled separately (which I'm happy to help with where time permits).

thatch commented 1 month ago

4ms per file, for 40,000 files, is still 160 seconds.

User time, sure. But not wall time -- this should parallelize well. If you have ways to improve this, go for it.

jarshwah commented 2 weeks ago

Sorry, coming back to this:

  1. parse_module is about 400ms in this view for one file

I can't replicate this, even with a much larger module or with the pure Python parser.

import os
import pathlib
from libcst import parse_module

os.environ["LIBCST_PARSER_TYPE"] = "native"  # no-op, but it isn't "pure"
path = pathlib.Path("libcst/_parser/entrypoints.py")

def open_and_parse(path):
    content = path.read_bytes()
    module = parse_module(content)
    return module

%timeit mod = open_and_parse(path)
3.68 ms ± 42.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

os.environ["LIBCST_PARSER_TYPE"] = "pure"  # pure python parser

%timeit mod = open_and_parse(path)
10.6 ms ± 353 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

I suspect that 400ms parse time is for all 40,000 files? For a relatively simple but non-empty module, I see approximately 3-4ms per file with the Rust-based parser.