facelessuser / wcmatch

Wilcard File Name matching library
https://facelessuser.github.io/wcmatch/
MIT License
131 stars 13 forks source link

Windows separators not recognized in matching patterns on Windows by default #194

Closed kdeldycke closed 2 years ago

kdeldycke commented 2 years ago

Looking for TOML files on my macOS filesystem, I'm using the following code:

Python 3.10.6 (main, Aug 30 2022, 04:58:14) [Clang 13.1.6 (clang-1316.0.21.2.5)] on darwin

>>> import wcmatch
>>> wcmatch.__version__
'8.4'

>>> from wcmatch.glob import glob, GLOBSTAR
>>> glob("**/*.toml", flags=GLOBSTAR)
['pyproject.toml', '.github/dependabot.yaml',
(...)

Works as intended. Now the same from the root of the filesystem:

>>> glob("/**/*.toml", flags=GLOBSTAR)
['/System/Volumes/Data/Users/me/furo/pyproject.toml', '/System/Volumes/Data/opt/homebrew/Cellar/rust/1.63.0/lib/rustlib/src/rust/library/rustc-std-workspace-core/Cargo.toml',
(...)

Also works fine. 👍

Let's switch to Windows:

Python 3.10.7 (tags/v3.10.7:6cc6b13, Sep  5 2022, 14:08:36) [MSC v.1933 64 bit (AMD64)] on win32

>>> import wcmatch
>>> wcmatch.__version__
'8.4'

>>> from wcmatch.glob import glob, GLOBSTAR
>>> glob("**/*.toml", flags=GLOBSTAR)
['pyproject.toml', '.github\\dependabot.yaml',
(...)

Also seems to work. Even if I used a / unix-separator in the pattern. Note how a double-slash \\ is used in the results.

But if I use this notation in the pattern, it is not interpreted as a separator:

>>> glob("**\\*.toml", flags=GLOBSTAR)
[]

I was surprised by the result as my first expectation was for the pattern to be platform-dependent, and adopting the platform convention by default. Thinking of it I'm pretty sure this is not a bug, as it allows for a single code base to set a matching pattern using only one syntax (the Unix flavor), and let wcmatch sort out the specifics of the platform it is run onto.

Feel free to close this issue if you confirm this is not a bug. Still wanted to report this behaviour just in case.

One last nitpick, specifying a drive on Windows using Unix separators works, but the drive slash is not normalized:

>>> glob("c:/**/*.toml", flags=GLOBSTAR)
['c:/Users\\me\\AppData\\Local\\pypoetry\\Cache\\virtualenvs\\click-extra-wXuJs52Z-py3.10\\Lib\\site-packages\\tabulate-stubs\\METADATA.toml', 'c:/Users\\me\\click-extra\\pyproject.toml',
(...)

See how all path starts with c:/Users\\me, where I was expecting c:\\Users\\me (i.e. what pathlib.Path normalizes to).

facelessuser commented 2 years ago
>>> glob("**\\*.toml", flags=GLOBSTAR)
[]

Did you forget to use a raw string?

Remember, wcmatch supports escaping characters with backslash on both Linux/Unix systems and Windows, but also remember that normal Python strings will treat \\ as a single \. So r'\[' will match a literal [ and r'a\\b' will match a literal separator on Windows, but notice I'm using a raw string. 'a\\b' is a normal string and python will treat it as a\b and wcmatch will think you are trying to escape b.

We note this as the first tip under syntax: https://facelessuser.github.io/wcmatch/glob/#syntax.

We later go on to say:

  • Slashes on Windows are normalized. / will match both / and \\. There is no need to explicitly use \\ in patterns on Windows, but if you do, they must be escaped to specify a literal \\. If a backslash is escaped, it will match all valid windows separators, just like / does.

I don't think we necessarily do a good enough job driving this point home though, so maybe I should add some actual examples for Windows.

Hopefully, that makes sense.

facelessuser commented 2 years ago

Just to further clarify, it would have worked if you used double escapes or used raw strings:

>>> glob(r"**\\*.toml", flags=GLOBSTAR)
>>> glob("**\\\\*.toml", flags=GLOBSTAR)
facelessuser commented 2 years ago

For the sake of clarity, we'll add a dedicated section to Windows separators with the following text and examples:

Windows Separators

On Windows, it is not required to use backslashes for path separators as / will match path separators for all systems. The following will work on Windows and Linux/Unix systems.

glob.glob('docs/.*')

With that said, you can match Windows separators with backslashes as well. Keep in mind that Wildcard Match allows escaped characters in patterns, so to match a literal backslash separator, you must escape the backslash. It is advised to use raw strings when using backslashes to make the patterns more readable, but either of the below will work.

glob.glob(r'docs\\.*')
glob.glob('docs\\\\.*')

Hopefully, this more clearly brings attention to and describes the expected behavior.

facelessuser commented 2 years ago

Documentation has been updated with more explicit information regarding Windows separators. If you still feel the behavior is unclear or buggy, please do let me know, I'm more than happy to discuss things further.

facelessuser commented 2 years ago

I've created a separate issue to track the Windows drive normalization of separators. I almost missed that. I'll verify and provide a fix as appropriate.

kdeldycke commented 2 years ago

Documentation is now crystal clear. It was a good idea to update it.

The Windows drive normalization is not an urgent issue, at least on my side, as I was feeding the resulting path to pathlib.Path anyway.

Thanks @facelessuser for the incredible fast reply! wcmatch is extremely helpful and saved me a lot of trouble and work!

facelessuser commented 2 years ago

Yep, no worries. Drive normalization is definitely cosmetic, but it should be a fairly easy thing to patch. I'm surprised I never noticed it before because that is definitely something that would normally bother me 😅.