dazinator / DotNet.Glob

A fast globbing library for .NET / .NETStandard applications. Outperforms Regex.
MIT License
363 stars 27 forks source link

Spanification #48

Closed cocowalla closed 5 years ago

cocowalla commented 6 years ago

With the recent 'spanification' of .NET Core, in particular the addition of the Span-based FileSystemEnumerable and System.IO.Path Span-based methods, I'm wondering if it would be possible to add Span-based methods to Glob? The point would be to avoid allocating strings when performing Glob-based matching.

dazinator commented 6 years ago

Interesting idea. I'd have to do some research on this. Always open to any PRs that improve perf.

dazinator commented 6 years ago

Few things that spring to mind.

DotNet.Glob uses the string length in a few places to shortcut processing. Suppose you ask it to match a glob pattern that has a min length requirement of 5 - e.g *12345* and you match this glob pattern against a few thousand strings. It won't bother evaluating any of those strings that are less than 5 in length - it can immediately reject them because they fail the min length heuristics. If we add support for span then need to consider if we lose this length informatioin and the ability to do this short circuiting. Could that lead to a perf loss in some circumstances?

Also I was thinking, perhaps a Span could be used to load each string into the same segment of memory, match against it there, then clear the span, and load in the next string into the same span. The idea is, can this be done, and would this save memory consumption when comparing against a large amount of strings / file names.

cocowalla commented 6 years ago

Not sure what you mean about losing the length information?

ReadOnlySpan<char> mySpan = "myergen".AsSpan();
var len = mySpan.Length; <--- = 7
dazinator commented 6 years ago

Ah right I see. Im missing something though. Could you spell out for me where the perf improvement would come into play - doesnt that still take an allocated string and then access it it in a span ? So in that sense its not avoiding any string allocation. Or am I missing something..

cocowalla commented 6 years ago

A typical use for globbing is matching paths in a file system, where you'd recurse down through a directory and call Glob.IsMatch on each path to see if it's a match. Until recently, that meant allocating rather a lot of strings - but with the new Span-based System.IO.Path APIs, it's actually possible to do this with very little allocations, which means that callers may have a span, rather than a string. I was thinking of overloading Glob.IsMatch:

public bool IsMatch(string subject)
public bool IsMatch(ReadOnlySpan<char> subject)

There would be quite a bit of work to get that working behind the scenes though.

I'm sure it's obvious, but I should add that this is a 'nice to have', rather than a 'must have' - Span and friends are in .NET Core 2.1, but won't make it into Standard until .NET Standard 2.1. There are also some serious issues with the new, low allocating file system enumeration that means it's not practical to use it yet (disappointingly, due to known non-production ready code being shipped in .NET Core 2.1).

dazinator commented 6 years ago

@cocowalla thanks for the detailed explaination - this makes sense. It sounds like this would be a good future feature to add when things are a bit more stable then. Let me know when you think that point is reached and we can re-look at this.

dazinator commented 6 years ago

Some related info here: https://blogs.msdn.microsoft.com/jeremykuhne/2018/03/09/custom-directory-enumeration-in-net-core-2-1/

pwiens commented 5 years ago

I added span support and an additional target framework. #62 Once netstandard2.1 ships netcoreapp2.1 can be changed. @dazinator I'm actually using this library with FileSystemEnumerable right now haven't tested the perf different with Span but I suspect it helps.

dazinator commented 5 years ago

Thanks @pwiens - have merged the PR, can close this one now ;-)

dazinator commented 5 years ago

I am going to do a little bit more work on this.

dazinator commented 5 years ago

I have:

  1. Added a test that uses ReadOnlySpan<char> on .NET Core 2.1.
  2. Added .NET Standard 2.1 as a compilation target, in addition to .NET Core 2.1 to that as other platforms implement .net standard they will also be able to consume this new API.
  3. Adjusted the compilation / preprocessor directives a bit to my liking! 4.. Removed the overload at the evaluator level, one less virtual method in the call chain should be better.
  4. Updated the README.

I'll release version 3.0.0 shortly.

dazinator commented 5 years ago

This has been released in version 3.0.0 - thanks for contributing!

pwiens commented 5 years ago

Thanks man! First meaningful contribution to open source after 11 years in the industry. :)