Globing or file tree walking is not working

darioackermann commented 9 months ago

OS: Windows Python: 3.10

Setting up as described in the Readme leads to relint ** parsing absolutely no files.

glob.escape leads to glob.iglob returning with an empty set
a check verifying that path is a file is missing (add ... if os.path.isfile(path) i.e., modified main looks like this now

import os.path
# ...
    paths = {
        path
        for file in args.files
        for path in glob.iglob(file, recursive=True)
        if os.path.isfile(path)
    }
# ....

Still, when culprits are found, throws AttributeError: type object 'Syntax' has no attribute 'guess_lexer' It looks like there has been a major change to rich, and this package is not locking a major version of it.

codingjoe commented 8 months ago

Hi there 👋,

Thanks for reaching out. Can you elaborate a bit more on the issue you are experiencing? I appreciate that you dove directly into troubleshooting, but I am struggling to understand where that journey started.

Best Joe

ppmathis commented 7 months ago

@codingjoe I stumbled across a similar issue today while attempting to run relint ** locally - to make it clear, I'm not affected by the rich issue, but relint was also not matching any files. As pointed out by @darioackermann, currently the following code snippet is used when running relint:

https://github.com/codingjoe/relint/blob/64b402c639b3d729297640d8adae1e7cd15ed357/relint/__main__.py#L75-L79

As can be clearly seen, glob.escape is called on every file before being passed to glob.iglob. I do not understand the reason for using glob.escape though, as this will make it impossible to actually have any form of globbing when specifying files for relint.

The README says that relint ** will scan all files recursively, but this is actually not true and depends on the operating system and shell that has been used, as relint itself does not do any kind of globbing. In other words:

:x: - Case 1: When literally calling relint ** outside of a shell context, relint will have an argv[1] of **, escape it to [*][*] and pass that to glob.iglob which will look for a literal file called **, ending up doing nothing

:x: - Case 2: When calling relint ** in a shell which does not support double-asterisk globbing or has said option disabled (e.g. bash when not setting globstar), the shell will either pass a literal ** (= same result as case 1) or simply act as if * was passed, meaning that the shell will do a non-recursive lookup and execute relint like relint a.py b.py c.py .... relint will check some files, but only those at the current directory level.

✅ - Case 3: When calling relint ** in a shell which does support double-asterisk globbing, all files will be recursively looked up and relint will be called like relint a/a.py a/b.py a/c.py b/test1.py b/test2.py .... Here relint will do what ** was supposed to do, as it is now explicitly passed all the files to check.

This is IMHO rather unintuitive and not what I'd expect from a linter, as now the behaviour can differ depending on which context the linter is being called in. I would expect this behaviour from a linter:

relint . should recursively lint all files starting in the current directory (the preferred way, should be in the docs)
relint a.py b.py should just scan these two files which have been passed
relint some/folder should recursively lint all files in some/folder
relint crazy/pat?ern/*.py should use recursive globbing to lookup files
relint **, while a bit unusual, will continue to work and is now even recursive on "unsupported" shells

Replacing the code snippet further above with this one would implement exactly said functionality - this could obviously be split up into a helper function, but for brevity's sake, I made a single expression:

    paths = {
        file
        for path in args.paths
        for file in (
            glob.iglob(os.path.join(path, '**'), recursive=True)
            if os.path.isdir(path)
            else glob.iglob(path, recursive=True)
        )
    }

Please note that I also renamed args.files to args.paths, as the latter is a more fitting expression for specifying such paths. Now calling relint is no longer dependent on the specific operating system / shell that is used, and supports specifying files, directories or glob patterns.

Last but not least, I noticed that running relint on a huge project wastes a lot of time, as lint_file will always attempt to open a file, then check if the file pattern actually matches. This is very inefficient and I would propose this code instead:

def lint_file(filename, tests):
    for test in tests:
        if test.file_pattern.match(filename):
            break
    else:
        return

    try:
        with open(filename) as fs:
            content = fs.read()
    except (IsADirectoryError, UnicodeDecodeError):
        pass
    else:
        for test in tests:
            for match in test.pattern.finditer(content):
                line_number = match.string[: match.start()].count("\n") + 1
                yield filename, test, match, line_number

This provides a noticeable speedup on my system when running across a repository which also contains node_modules and similar directories with many small files. The original implementation usually takes around ~12 seconds to complete, whereas the improved version only takes around ~2.5 seconds.

Could you maybe check this out and, if you agree, implement these or similar changes to relint? It's a great tool, especially in combination with pre-commit, but the usability is a bit tricky with the current behaviour.

ppmathis commented 7 months ago

@codingjoe One more thing, somewhat related to this - the pre-commit hook is also semi-broken, as --git-diff is always passed to relint based on the current hook config.

Given that pre-commit itself will always pass the files which should be linted, either those which are about to be committed or all of them in batches, in case --all-files was specified, I would propose removing the args: [--git-diff] option. This is a much safer approach, as then relint is guaranteed to run on the same files that pre-commit expects, and there is no need for relint to duplicate the logic when being called from pre-commit.

As of today though, running pre-commit run --all-files simply does nothing in a project, as the --git-diff argument passed to relint causes it to simply do nothing instead of linting as its supposed to. Simply removing this one line from the hook config would restore proper functionality in all cases.

codingjoe commented 6 months ago

Thanks for the detailed response and the deep-dive. Honestly, I don't recall why the pattern is being escaped. It does defeat the purpose. That being said, since it never worked, I'd favor simply removing tree walking all together and relying on users to pass files names. IMHO, it's best to keep the scope of this package small and on point.

@ppmathis you make a point about pre-commit, but I'd encourage you to open a separate discussion for that.

codingjoe commented 6 months ago

Resolved in v 3.1.1

codingjoe commented 6 months ago

Resolved in v 3.1.1

codingjoe / relint

Globing or file tree walking is not working #70