PyCQA / isort

A Python utility / library to sort imports.
https://pycqa.github.io/isort/
MIT License
6.49k stars 580 forks source link

skip_gitignore=true will break ignoring directories defined in skip during filesystem traversal #1912

Open richardebeling opened 2 years ago

richardebeling commented 2 years ago

Description

When starting isort with --skip_gitignore, directories that should not be traversed because they are configured in skip will be traversed.

From what I've seen, this is purely a performance problem, so all files that should be skipped will be filtered out at some later point.

Instructions to reproduce

Minimal setup (note: "node_modules" is in the default skip values.)

# assuming isort 5.10.1 is installed (`pip3 install isort`)
mkdir test && cd test
git init
mkdir -p node_modules/this_should_not_be_accessed
touch test.py
strace isort . 2>&1 | grep this_should_not  # as expected, not queried
strace isort --skip-gitignore . 2>&1 | grep this_should_not  # suddenly, stat and open syscalls appear

On a side note: Add more files, and the node_modules folder will be queried more often:

touch test2.py
strace isort --skip-gitignore . 2>&1 | grep this_should_not  # 2 stat and 2 open calls
touch test3.py
strace isort --skip-gitignore . 2>&1 | grep this_should_not  # 3 stat and 3 open calls

adding node_modules to a .gitignore doesn't help, even though the directory should then be double-ignored by isort (this is #1895).

echo "node_modules" >> .gitignore
strace isort --skip-gitignore . 2>&1 | grep this_should_not  # 3 stat and 3 open calls

Environment & Versions

Run on ubuntu21.10 with python3.9.7. Actual syscalls probably depend on the python interpreter, if you can't reproduce, I'll probably be able to figure out what python code causes the access.

Related Issues

Closely related to #1895, but if I undestand that issue correctly, it does not talk about skip at all, but just files from .gitignore. ~It seems to me that the current PR for #1895, #1900, would fix this.~ Possibly related #1762 and #1777, both read as if they were fixed.

Background

Yesterday, I realized that isort takes >90 seconds to run on one project where it processes 107 files with approximately 10k lines of code. Turns out, it doesn't need 90 seconds to process, but rather 89s to traverse what feels like the entire file system, only to then do its actual work in ~1s:

time isort --show-files . > isort_output.txt
real  1m35.831s

time cat isort_files.txt | xargs isort
real 0m0.896s

which is absurd, considering that black also finds the code files and finishes in 3 seconds. I wasn't able to get isort to skip reading directories it really doesn't need to access (venv, node_modules, web server static files, ...)

bmalehorn commented 2 years ago

Yeah, this is highly related to https://github.com/PyCQA/isort/issues/1895 and probably resolved by https://github.com/PyCQA/isort/pull/1900.

It sounds like your specific issue is:

  1. isort has a default ignore list, including node_modules
  2. when you pass in --skip-gitignore, it actually runs slower because it reads every file then filters afterwards
  3. isort is still functionally correct when you pass in --skip-gitignore, it just takes too long
richardebeling commented 2 years ago

1900 is now merged, but iirc, we still have cases where we traverse through directories that are ignored, see my comment on the PR and the response to it.

adamjstewart commented 2 years ago

Just encountered this same problem with isort 5.10.1. I have a data directory with millions of files. My .gitignore includes /data/ as one of the directories to skip. I see the following performance metrics:

$ isort --skip-gitignore .  # 23 min
$ isort --extend-skip data .  # 43 sec

If I delete the data directory, isort . takes < 2 seconds. Both 23 min and 43 sec are prohibitively long, I usually end up just not using isort and manually sorting my imports 😢