dazinator / DotNet.Glob

A fast globbing library for .NET / .NETStandard applications. Outperforms Regex.
MIT License
363 stars 27 forks source link

More Performance Improvement Ideas and Heurisitcs #10

Closed dazinator closed 7 years ago

dazinator commented 7 years ago

When parsing a glob pattern I can do the following:

  1. Calculate the minimum required char length that a string needs in order to match the pattern overall.

For example, this pattern, would require a string atleasts 4 character's long in order to match. **/*.txt

This would require 9: */[a-z][!1-9]/f.txt

This is because certain tokens require atleast a single character in order to match, so you can sum that total up when analysing the glob pattern, and get a minimum required length for any set of tokens that a string needs to be in order to have the possibility of matching.

Some tokens (* and **) will match against 0 or many characters so they won't add any weight to the minimum required length.

With this information available, when matching strings against the Glob using Glob.IsMatch(somestring) I can allow the glob to fail much faster on certain strings using a simple string length check.

For example:

var glob = Glob.Parse("*some/fol?er/p*h/file.*")
Glob.IsMatch("aaaaaaaasome/foo")

That can fail pretty much straight away, becuase the computed min length of a matching string is 20 chars, and the test string is only 16 chars long. I can fail this without even attempting to match any of the tokens.

I expect to use this new length information for the IsMatch() evaluation, where you just want a bool result quickly. The Match() method is different because it actually returns more in depth analysis about the match, including which tokens failed to match and why. For example, in the case of a string aaaaaaaasome/foo and a pattern *some/fol?er/p*h/file.* it might be important to know that "some/" actually matches "aaaaaaaasome/" but that fol fails to match "foo" , and the closest it came to matching was "fo". This kind of in-depth match analysis can only be returned if the match is actually attempted, which means not failing fast due to length checks. However failing early is desirable when doing IsMatch() because a boolean result is all you want to know.

Once I have the min required length computed, I can also put in some improvements for the Wildcard(*) evaluator and WildcardDirectory(**) evaluators.

Those evaluators will now be able to match against characters only within a range where it doesn't take them past minimum length required for the remaining tokens to be matched.