Closed danmoseley closed 2 years ago
copied danmosemsft's initial post:
Right now System.Text.RegularExpression.* only works against strings. Matching against buffers requires a string be created from the buffer. That string might not be needed again (especially if there's no match).
Motivating example: a grep tool that enumerates files in a directory tree, and scans their content against a pattern. It should not need to create strings except when a line in the buffer matches the pattern.
@stephentoub mentioned that while Regex.IsMatch could likely have a Span
This proposal obviously isn't ready for review.
@ViktorHofer @JeremyKuhne
@ViktorHofer could you take on making this into a reviewable proposal? It seems to be a fit for your regex perf mission..
@JeremyKuhne presumably your regex filter for GetFiles could operate more efficiently if IsMatch could accept a span?
Yes I already started thinking about it. Will need bit more investigation.
Even if we can do this in phases it would be awesome. Presumably it wouldn't be too hard to change RegexRunner.runtext to ReadOnlySpan
+1
This would really help Select-String in PowerShell. Tried to parallelize it and ran into GC death on string allocations for the inputs.
So +1
Just as a summary of the Select-String
use case:
Run the Match function on every line of a file, given an encoding.
ReadOnlyMemory
seems like a good start, but there needs to be some transform from the readonly memory to a enumeration of the decoded char/rune of the specified encoding.
Maybe also provide
IEnumerable<Match> MatchInFile(string fullname, string pattern, Encoding encoding)
IEnumerable<Match> ReplaceInFile(string fullname, string pattern, string replacement, Encoding encoding)
cc @JeremyKuhne
Some discussion in https://github.com/dotnet/corefx/issues/27267
@powercode for your MatchInFile
proposal, please open a new issue using the regular API propsal format. It would need to add significant value over a simple loop over getting lines from a filestream. I'm not sure that it does. If Regex could consume an actual stream, it might. Some discussion in dotnet/corefx#27267.
Design in progress but will miss 2.1
Presumably it wouldn't be too hard to change RegexRunner.runtext to ReadOnlySpan and that would open up hitting statics on Regex? Haven't looked too deeply, but the key is not the pattern, but the input.
RegexRunner.runtext is exposed (internal protected) and therefore can't be changed easily without breaking already compiled assemblies.
We should only expose these APIs if we can handle both, interpreted as well as compiled regexes, because otherwise we'd make compiled regexes even slower than today (because we need to copy the ReadOnlyMemory<char>
/ReadOnlySpan<char>
to a string).
The current API shape has two modes IsMatch
which doesn't return any results and Match
which returns class instances that point back to the original input. Ideally, we'd like the former to be able operate on spans and the latter over memories. However, this isn't easily doable without hacks (e.g. MemoryOrPinnedSpan
) as they both use the same infratructure.
We don't like the MemoryOrPinnedSpan
type and having to pass pointers internally as this seems like a farm for future AVs.
Jan proposed that we add public ref structs like CaptureValue
and MatchValue
so that we can return spans, but this wouldn't be straight forward to do with the current API shape we where have to return collections of these. This would become messy or force a persistent regex engine where the caller can enumerate the matches.
It seems the most sensbile model for the current API shape of regex is to only accept memories. Yes, it's less safe than spans, but it's more flexible and more suitable for the current API shape without crazy hacks in the implementation.
If we can change our infastructure to use ref structs internally, we can add an overload for IsMatch
that operates over a span instead of a memory.
It might be worthwhile looking into what a brand new ideal API surface for Regex would look like (both in terms of reducing allocations as well as using a better internal engine). If that shape is close to what we have, bolting it on top might be best. If it isn't it seems to be better to offer a brand new API in a new namespace.
Any progress on this?
Just my 2 cents but I'm excited for this to be a reality since it seems so logical to expose .ValueSpan
as an option on classes and sub classes of Capture. @terrajobst: I wish I had a better understanding of why this could be bad with compiled Regex.
Isn't the real difference this?
string Value => Text.Substring(Index, Length);
vs
ReadOnlySpan<char> ValueSpan => Text.AsSpan().Slice(Index, Length);
I'm definitely not a pro at Spans, but I get the underlying benefit.
Right I agree, the ValueSpan
change should go into 3.0 if possible. That said, we need an api-review for it first.
I have another use case for this proposal.
I'm writing a compiler for a simple scripting language. Provided how simple the language is, I'm planning to just use a simple regex-based tokenizer (an LL-recursive descent lexer or something would be overkill).
Now, I have exactly 25,3 MB worth of text to process. Having to create substrings for every single token means millions of allocations.
(Still, this isn't exactly a major problem for me. Performance isn't critical. I can easily allow corert to spend a few seconds allocating/deallocating strings. I ended up here mainly because I "wanted to do the right thing" and was curious to see why Regex API doesn't accept spans.)
Are there any plans to look at adding ReadOnlySpan
@michaelherndon there are not plans for .NET 6. If anyone is passionate about this, the next step is to finalize an API proposal. That would probably need a fair bit of thought -- see detailed comments by @terrajobst above.
One other thing to consider is that there are 2 regex engines (the mode that compiles at runtime and the interpreter) and any new API would need implementing in all engines. We are considering adding others -- a source generator one, which would be a variation of compiled but at build time using C# instead of ref-emit, and possibly a DFA based engine which would provide stronger execution time guarantees.. Neither of these will be in .NET 6, but the first one is probably funded for post .NET 6, possibly as an out of band release. So with 2, 3 or 4 engines, each based on quite different mechanisms, implementing new API would be some work.
@danmoseley thanks for the update. That definitely sounds like a bit of work.
@michaelherndon could you say more about your scenario?
I was debating about trying to rewrite C# code based on rails Inflector logic for pluralization and singularization. Inflector uses a fair amount of regular expressions. Humanizer has similar logic. Using ReadOnlySpan
Transforming something like Inflector into a parser to leverage ReadOnlySpan
Four years passed and the issue is still open. I hope it's implemented in NET6.
At this point it seems unlikely, given that we still don't have a design.
Uhg. Why! Why can't this happen?! It's such a logical step. Why can't we just provide a .ValueSpan
(readonly) as part of groups/captures?
See my comment above. It's not that it can't, just that it requires design work. And generally speaking the runway for .NET 6 is closing.
As the rando internet person that bumped the thread: I get the frustration of a 4-year-old feature request and obviously I do see the benefits of Regex having lower allocations. Conversely, I also get that only so many things can go into the release of a widely used framework that values backwards compatibility. Even for a company like Microsoft, there are still only so many resources and hours.
Personally, I'd still take the improvements in System.Security, the new Metrics API, or the elusive Assembly Neutral Interfaces over a new Regex API surface.
Ok... So I have to admit that I'm a bit frustrated at this one.
I've observed that it would be easy to write extensions for this, BUT Capture.Text
is internal
.
If it was exposed, then implementing my own .AsSpan()
would obviously be easy. Right now, I'm crippled to passing the original source in order to get a ReadOnlySpan<char>
or ReadOnlyMemory<char>
.
Is there a serious reason the original .Text
property is hidden?
Why not just write these simple extensions in .NET now?
This is probably not the solution you're looking for, but since the source is available, you could do some surgery:
public static class CaptureSurgery
{
private static readonly PropertyInfo _textProperty = typeof(Capture).GetProperty("Text", BindingFlags.Instance | BindingFlags.NonPublic);
public static ReadOnlySpan<char> AsSpan(this Capture capture)
{
return ((string)_textProperty.GetValue(capture)).AsSpan();
}
}
Although since the point is improved performance, this is hardly a solution.
@BenjaBobs If you're worried about performance, create and cache a delegate to the getter. Also, you need to slice the text to get the value of the capture:
public static class CaptureSurgery
{
private static readonly Func<Capture, string> _textDelegate =
typeof(Capture).GetProperty("Text", BindingFlags.Instance | BindingFlags.NonPublic)
.GetGetMethod(nonPublic: true)
.CreateDelegate<Func<Capture, string>>();
public static ReadOnlySpan<char> AsSpan(this Capture capture) =>
_textDelegate.Invoke(capture).AsSpan(capture.Index, capture.Length);
}
@BenjaBobs @svick I didn't know that was possible! That's awesome to know. Thank you!
And just a note: Performance? YES (obviously that's important). Superior memory usage? YES (that's the whole point of this).
@svick What version of .NET has the generic call for .CreateDelegate<T>()
? I'm implementing this in .NET Standard.
@electricessence As of right now only .NET, i.e. 5 and 6+.
@silkfire Got it to work in .NET Standard without the generic methods. I wonder if there's a boxing cost.
@joperezr, this overlaps with the span work you already did for 7 and with https://github.com/dotnet/runtime/issues/65011. We should determine if there are any other APIs from this we want to add in 7 as part of the span effort or if it's all covered and can be closed.
edit by @ViktorHofer, moved initial post down.
Spanifying Regex removes (a) unnecessary string allocations that tend to decrease perf and (b) allows different types of Memory to be processed. API proposal and implementation: https://github.com/ViktorHofer/corefx/pull/1
Proposed APIs
This diff contains the Memory overloads and the MatchEvaluator overloads. See discussion above if we should introduce new ref types for Match, Group & Capture.
Discussion points
Ref struct for Match and siblings (Capture & Group).
I had a discussion with Jan offline and he pointed out that we might want to introduce a
ref struct MatchValue
type that is returned by APIs that take Span/Memory as an input.The issues with that is that we currently have the following hiearchy: Match --> Group --> Capture and that Groups and Match contain collections of Captures/Groups.
startat overload
Should we add these startat convenience overloads for Span also? If yes, this commit should be reverted https://github.com/ViktorHofer/corefx/pull/1/commits/bf7d7f986faaa212476ceeac09ab71c8ab6c7fd5
RegexSplitEnumerator RTL yield order
If you call the Span version of Regex.Split and pass
RegexOptions.RightToLeft
to it the yield order of the enumerator will also be right to left as we start looking for matches from right to left. The current implementation (which is not an enumerator!) reverses the captured strings before returning.RegexSplitEnumerator GetEnumerator (see ref diff)