dazinator / DotNet.Glob

A fast globbing library for .NET / .NETStandard applications. Outperforms Regex.
MIT License
363 stars 27 forks source link

Lots of bugs #88

Closed aiurovet closed 2 years ago

aiurovet commented 2 years ago

Summary (I did not check on Linux, but hopefully, matching there is case-sensitive):

Details (YES = all good):

Current Directory: C:\Users\AlexIurovetski\Projects\Dabble\Dabble\bin\Debug\net6.0

Glob:  c:\a\b\..\b\*.cs
Path:  c:\a\b\c.cs
Match: -- NO --

Glob:  **/*.cs
Path:  c:\a\b\c.cs
Match: YES

Glob:  **\*.cs
Path:  c:\a\b\c.cs
Match: YES

Glob:  **.cs
Path:  c:\a\b\c.cs
Match: -- NO --

Glob:  \**\*.cs
Path:  c:\a\b\c.cs
Match: -- NO --

Glob:  *.{cs,txt}
Path:  c:\a\b\c.cs
Match: -- NO --

Glob:  **/{*.md,src/*.cs}
Path:  c:\a\b\c.cs
Match: -- NO --

Glob:  **[/\]*.cs
Path:  c:\a\b\c.cs
Match: -- NO --
aiurovet commented 2 years ago

Here are the launch settings:

{
  "profiles": {
    "Dabble": {
      "commandName": "Project",
      "commandLineArgs": "-f c:\\a\\b\\c.cs -p c:\\a\\b\\..\\b\\*.cs -p **/*.cs -p **\\*.cs -p **.cs -p \\**\\*.cs -p *.{cs,txt} -p **/{*.md,src/*.cs} -p **[/\\]*.cs"
    }
  }
}

Here is the code I used:

using DotNet.Globbing;

class Program
{
    static string ParseValue(string[] args, ref int argNo, bool isOption = false)
    {
        if (isOption)
        {
            if ((++argNo) >= args.Length)
            {
                throw new Exception($"No value not found beyond arg #{argNo - 1}");
            }
        }
        else if (argNo >= args.Length)
        {
            throw new Exception($"Value expected at arg #{argNo}");
        }

        return args[argNo];
    }

    public static void Main(string[] args)
    {
        try
        {
            Run(args);
        }
        catch (Exception ex)
        { 
            Console.WriteLine($"\n*** ERROR: {ex.Message}");
        }
    }

    public static void Run(string[] args)
    {
        Console.WriteLine($"Current Directory: {Directory.GetCurrentDirectory()}\n");

        var argc = args.Length;

        if (argc < 2)
        {
            Usage();
        }

        var glob = Glob.Parse(" ");
        var path = string.Empty;

        for (int argNo = 0; argNo < argc; argNo++)
        {
            var arg = args[argNo];

            switch (arg.ToLower())
            {
                case "-p":
                    glob = Glob.Parse(ParseValue(args, ref argNo, isOption: true));
                    break;
                case "-f":
                    path = ParseValue(args, ref argNo, isOption: true);
                    break;
                default:
                    path = ParseValue(args, ref argNo);
                    break;
            }

            ShowResult(glob, path);
        }
    }

    public static void ShowResult(Glob glob, string path)
    {
        if (!string.IsNullOrWhiteSpace(glob.ToString()) && !string.IsNullOrEmpty(path))
        {
            Console.WriteLine($"Glob:  {glob}\nPath:  {path}\nMatch: {(glob.IsMatch(path) ? "YES" : "-- NO --")}\n");
        }
    }

    public static void Usage()
    {
        Console.WriteLine("Usage: dabble -p pattern1 -f file1 [-f file2 [-p pattern2] [-p pattern3 ...]");
        Environment.Exit(1);
    }
}
dazinator commented 2 years ago

Hi - thanks for opening an issue. Some / most of these are expected to fail and I don't consider them bugs, but I'll address each below.

Glob: c:\a\b\..\b\*.cs Path: c:\a\b\c.cs Match: -- NO --

DotNet.Glob does not interprete '.' or '..' inside patterns as "navigation". It treats thems literally. This is intended.

The idea behind DotNet.Glob is it does pattern matching against a string, but does not assume the string is a file system string and is somewhat decoupled from the file system/s for that reason. For the same reason it won't interpret ~ in patterns or any other "file system shortcut" either. A number of users use it for matching windows registry paths for example, or even url paths. Each of these systems of navigation can technically have its own rules so I believe DotNet.Glob is not the right place to evaluate them.

Why: separation of concerns. Glob pattern is for matching strings, file system paths are for evaluation by a file system which has its own rules and shortcuts. Solution: In this specific case, the glob pattern should be c:\a\b\*.cs e.g the file system navigational stuff should be applied so the pattern is "normalised".

Glob: **.cs Path: c:\a\b\c.cs Match: -- NO --

I think this is expected too. Looking at the readme.md inside this repo, ** if used must be the only content of a segment. Solution: The "correct" pattern according to rules implemented would be: **\*.cs

Glob: \**\*.cs Path: c:\a\b\c.cs Match: -- NO --

I'll double check this one as I believe there is some history/ justification around it. I know if you omit the starting slash - so use **\*.cs it will work. I believe the answer is, if you start the pattern with a directory separator, DotNet.Glob always expects to match that first, (so it expects to find a directory seperator at the start of any string being matched) before evaluating the rest of the pattern. In this case the string being matched doesn't start with any directory separator- e.g c:\.

Glob: *.{cs,txt} Path: c:\a\b\c.cs Match: -- NO --

Glob: **/{*.md,src/*.cs} Path: c:\a\b\c.cs Match: -- NO --

For the above two cases - this is braces are not supported. There is an an issue about adding this as a feature but not yet a great deal of demand. So this is expected and not a bug per say, more lack of a requested feature.

Workaround for now would be to expand this out to the multiple distinct glob patterns it contains and evaluate your target string through these using "OrElse" style boolean logic... e.g *.{cs,txt} becomes two different glob patterns *.cs and *.txt and these glob patterns can be invoked to match the target string one then the other - only if atleast one succeeds is the match considered a success in your application.

Glob: **[/\]*.cs Path: c:\a\b\c.cs Match: -- NO --

Same issue I believe as another above - ** can only be used as full content of a segment. In this case you've got the full pattern in one single segment as [/\] is matching literals. I'm not sure what you are trying to achieve with this pattern so perhaps you can elaborate a little on this one so I can be more certain of any response.

IsPathSeparator(...) is correct for Windows, but wrong for Linux, as back-slashes are not considered separators. There is even a possibility that the forward-slash is escaped (preceded with the odd number of back-slashes: 1, 3, 5...), hence not considered as a separator

There is no issue that I am aware of on Linux. Tests are run on Linux. Slash direction difference is obviously something I am aware of. The tokenizer produces a Path seperator token when it parses a directory seperator character and this can be a Slash in either direction as far as I recall. If you can repro any issue around this please share and I'll look further.

aiurovet commented 2 years ago

Thanks for reviewing my objection points. I understand that you prefer separation of concerns, and therefore, you separate glob matching from filesystem searches. But actually, you don't, as you treat / and \ as path separators and give ** special treatment.

On the other hand, my point is that glob patterns are used primarily for such searches, and the library does not work as expected. For instance, no command-line utility which expects glob patterns to filter files/directories will work with the library without significant workarounds. And no support for braces in this aspect is a huge drawback. If you announce glob pattern matching you should ensure all file searches are working as per standard.

By the way, the tilde ~ in Linux is not considered as a shorthand for home directory anywhere except in certain shells. It gets expanded to the home directory (when not enclosed in quotes) before being passed as an argument to shell scripts or binary executables. The same applies to single * and ?, but not on Windows. Unlike wildcards though, the tilde is not used on Windows at all and does not require special support.

According to the standard, **/*.txt should find somedir/abc.txt, but not abc.txt. To achieve finding both **.txt should be used. See globstar at https://www.gnu.org/software/bash/manual/html_node/The-Shopt-Builtin.html

About IsPathSeparator(): if tests under Linux work fine, it means that either I misunderstood the code, or the tests don't cover all cases.

In the source file GlobStringReader.cs I saw:

...
private static readonly char[] PathSeparators = { '/', '\\' };
...
public static bool IsPathSeparator(char character)
{

    var isCurrentCharacterStartOfDelimiter = character == PathSeparators[0] ||
                                             character == PathSeparators[1];

    return isCurrentCharacterStartOfDelimiter;

}
...

In my opinion, this means that both characters are treated as path separators on both kinds of filesystems. However, this is wrong for Linux: the string like "abc\\def.txt" there does not represent directory and filename, but rather just a filename. On the other hand, on Windows, both "abc\\def.txt" and "abc/def.txt" represent the same directory and filename.

dazinator commented 2 years ago

But actually, you don't, as you treat / and \ as path separators and give ** special treatment.

If it helps I used this as a guide: https://en.m.wikipedia.org/wiki/Glob_(programming)#Syntax

As you can see I've implemented basic stuff following this as a sort of spec to help work out what are essential "agreed upon" globbing features.

** was a feature I chose to add as its is very common and useful, and is mentioned on that wiki under 'globstar'.

As you can also read the documentation is littered with "normally" or "traditionally" etc etc. Basically there appear to be many variations at different levels. For example it's traditional that patterns don't match dotfiles on Linux unless the pattern starts with a .. However I have to make judgement calls over delivering a functional glob library that works in predictable way accross platforms and not coupled to just file system paths. Where there is some sort of expected behaviour like this for a certain platform or experience- I may chose to add it as a feature later behind some sort of feature flag, like how I deal with enabling or disabling case sensitivity option for example.

Note from the wiki, this section with respect to brace expansion and parenthesis etc

Some shells (such as the C shell and Bash) support additional syntax known as alternation or brace expansion. Because it is not part of the glob syntax, it is not provided in case. It is only expanded on the command line before globbing.

The Bash shell also supports the following extensions:[9]

Extended globbing (extglob): allows other pattern matching operators to be used to match multiple occurrences of a pattern enclosed in parentheses, essentially providing the missing kleene star and alternation for describing regular languages. It can be enabled by setting the extglob shell option. This option came from ksh93.[10] The GNU fnmatch and glob has an identical extension.[3]

What I took away from this is that these features are not "core" to globbing but I may chose to add support for some of them in future. This is why I've opened an issue to support them in future.

With respect to linux, thanks for elaborating. That does sound like a bug so I will look into that and add some test cases.

According to the standard, */.txt should find somedir/abc.txt, but not abc.txt. To achieve finding both **.txt should be used. See globstar at https://www.gnu.org/software/bash/manual/html_node/The-Shopt-Builtin.html

Take a look at this (like the Bash example) as it seems contradictory to your assertion?: https://github.com/fish-shell/fish-shell/issues/7222#issuecomment-663890930

I've mentioned in the readme that my implementation of ** must be the only content of a segment in a pattern so **.txt for example is not valid according to that rule. Quite where I got that rule from: I am unsure now... pretty sure I read it somewhere but the reference material now eludes me. However in the issue linked above they also make reference to this constraint being a common one.

dazinator commented 2 years ago

Just to circle back to this point because this seems the crux of your expectations and concerns:-

On the other hand, my point is that glob patterns are used primarily for such searches, and the library does not work as expected. For instance, no command-line utility which expects glob patterns to filter files/directories will work with the library without significant workarounds. And no support for braces in this aspect is a huge drawback.

The library not working as expected does also greatly depend on what your expectations are.. People can often have varying expectations. If you read the readme (which tries to set expectations by describing how to use the library and the globbing syntax / patterns it supports, then it may more closely align with your expectations when you try to consume it.

The support for braces is part of the extended globbing syntax issue / feature that I plan to address in future. On first glances I know from reading up that its possible that some shells used to expand the contents of braces and end up passing the individual glob patterns to the underlying command. So in the case you mention, the command line application would receive the multiple glob patterns from the shell and not the braces.. Due to this kind of thing it didn't initially seem to me to be an inherent feature of glob matching itself but more of a feature of a shell or application above the glob matching library.. so this wasn't on my critical path as a feature when rolling out this library. However I can see how this would be a beneficial feature to support so hopefully I'll get to looking at that one rainy day in the future.

If you announce glob pattern matching you should ensure all file searches are working as per standard.

I've "announced" a glob pattern matching library along with specifically what it does support and how to use it, with tests to help ensure it works as i've stated. "You should ensure all file searches working as per standard" is a very broad statement. It really depends on factors such as:-

  1. What patterns / syntax do you consider should be supported as "standard"
  2. Are there patterns that behave slightly differently on different platforms / shells etc - how do you reconcile that conflicting behaviour to deliver what users will most expect.
  3. Do you need to enable support of all patterns of all globbing use cases in existence ? Or start with most useful / canonical ones and slowly add features over time in a carefully considered mannor?
  4. Sometimes a spec may say one thing, but if many popular tools implement it slightly differently - users expect the common behaviours despite what the spec said. E.g need to be pragmatic about things and test how things behave in the wild.

There are more factors, but essentially, I've taken the approach that to produce something useful I'll deliver on the basic canonical stuff first and go slowly - leaving things in the wild for long periods, letting time for user feedback or issues to come in, tests to be extended - allowing discussions - like this one, that can inuence / shape how things might evolve in future or what things need attention next!

aiurovet commented 2 years ago

Mainly, I wanted to draw attention to the issues which arise when using glob patterns for filtering files. This is primary use case. Something like a command-line application which expects files to process: customapp ..\otherdir\*.txt or ./customapp "/rootdir/lib*/**.{a,obj}" and so on. You can close the ticket.

dazinator commented 2 years ago

Thanks it's definitely value feedback. I will consider things I can do to enable this scenario better in future. I'll open a more targeted issue for now about the Linux filename slash issue.