dotnet / msbuild

The Microsoft Build Engine (MSBuild) is the build platform for .NET and Visual Studio.
https://docs.microsoft.com/visualstudio/msbuild/msbuild
MIT License
5.24k stars 1.35k forks source link

Explore regex source generator #7598

Open rainersigwald opened 2 years ago

rainersigwald commented 2 years ago

.NET 7 introduces a regular expression source generator that can generate a bunch of code at compile time to make a more optimal matcher.

Since MSBuild uses many regular expressions and knows a bunch of them at compile time, we should adopt this.

Possible locations:

(non-exhaustive)

AraHaan commented 2 years ago

It would be nice if the .NET Framework version of msbuild could go away for .NET Core based ones and have Visual Studio use the one from only the .NET SDK so it does not have to bundle a version of msbuild as well to where the .NET Core msbuild can compile both .NET SDK style projects and non-sdk style projects.

This would also help improve the quality of msbuild (and Visual Studio) as a whole.

AR-May commented 2 years ago

@rainersigwald How much of perf. win do you expect here?

danmoseley commented 2 years ago

It is mostly a startup win. Note that the build for .NET Framework would still need to use the existing pattern, as alluded above. The generator is only for targeting .NET 7+

rainersigwald commented 2 years ago

I wouldn't expect a huge win on anything, but some of the pattern matches are in warm inner loops:

  1. Validating every property/item name (so hundreds/project evaluation + many at execution time).
  2. Scanning every line of ToolTask output.

As Dan mentioned, startup time is reduced and there may be possibilities for inlining/JIT magic.

I wish we could use the generator for our .NET Framework target for unification, but alas.

danmoseley commented 2 years ago

Cc @stephentoub fyi

stephentoub commented 2 years ago

I wish we could use the generator for our .NET Framework target for unification, but alas.

Yeah, you'll end up wanting to do something like:

#if NET7_0_OR_GREATER
[RegexGenerator(Pattern)]
public static partial Regex AwesomeSauce();
#else
public static Regex AwesomeSauce() => s_regex;
private static readonly Regex s_regex = new Regex(Pattern, RegexOptions.Compiled);
#endif
Forgind commented 2 years ago

Just noting this is blocked on https://github.com/dotnet/msbuild/pull/7790

Eli-Black-Work commented 2 years ago

Looks like #7790 has been merged, so this is unblocked? 🙂

rainersigwald commented 2 years ago

@Bosch-Eli-Black Yes, but to be clear it remains low-priority for the core team.

Eli-Black-Work commented 2 years ago

@rainersigwald Understood 🙂

danmoseley commented 2 years ago

@Bosch-Eli-Black I expect they'd accept a PR? (and related https://github.com/dotnet/msbuild/issues/7499 too)

Eli-Black-Work commented 2 years ago

@danmoseley Haha, ah, yes, I'd love to! 🙂 I've just begun working through my company's procedures for getting authorization to contribute to open source projects, though, and I'm afraid that may take a very long time indeed.

nikijohnson77 commented 2 years ago

Do you need any help I don't mind helping so just let me know if you do cuz I just did one of myself last night my project I got done with it last night

Forgind commented 2 years ago

@nikijohnson77 We'd love some! Thanks for volunteering; I just assigned it to you. Submit a PR whenever you're ready, and feel free to ask questions.

JaynieBai commented 1 year ago
#if NET7_0_OR_GREATER
[RegexGenerator(Pattern)]
public static partial Regex AwesomeSauce();
#else
public static Regex AwesomeSauce() => s_regex;
private static readonly Regex s_regex = new Regex(Pattern, RegexOptions.Compiled);
#endif

Do you have a definition to distinguish different target currently? such as NET7_0_OR_GREATER. @Forgind

ViktorHofer commented 1 year ago

NET7_0_OR_GREATER comes via the SDK, so every project that uses the Microsoft.NET.Sdk should have access to it.

AraHaan commented 1 year ago

NET7_0_OR_GREATER comes via the SDK, so every project that uses the Microsoft.NET.Sdk should have access to it.

only if the project targets .NET 7 in its list of target frameworks otherwise the maximum defined is NET6_0_OR_GREATER if up to .NET 6 is in it.

rainersigwald commented 1 year ago

In the MSBuild codebase we generally try to avoid generic ifdefs like NET7_0_OR_GREATER in favor of specific ones like FEATURE_REGEX_GENERATOR, defined like

https://github.com/dotnet/msbuild/blob/ee6f71d15eea77c970815d9105ab9520876c5042/src/Directory.BeforeCommon.targets#L88-L95

This is helpful because it self-documents why the code is conditionally compiled; while I don't expect it in this case it's conceivable that the regex source generator could someday work on .NET Framework 4.7.2 targets, and if it did adopting it would mean searching for the specific feature, rather than having to look at all net7+ stuff.

stephentoub commented 1 year ago

while I don't expect it in this case it's conceivable that the regex source generator could someday work on .NET Framework 4.7.2 targets

Highly unlikely ;-)

rainersigwald commented 1 year ago

But if I keep whining about it in this thread . . . 😇

We got most of our benefit from the granular ifdefs in the other direction: there were lots of features that were not in .NET Core 1.0 that got ported later, and the ifdefs let us gradually reunify a bunch of code. I still prefer the granular-ifdef strategy for its "documentation" of why things are conditional.

stephentoub commented 1 year ago

But if I keep whining about it in this thread . . . 😇

Your voice will get tired. 😛

AraHaan commented 1 year ago

Alternatively, one can't stop them from defining their own version of the .NET 7 attribute for the source generator and install the generator as a nuget package on non-.NET 7 targets, right? If so, that might also be an option to reunify the code.

stephentoub commented 1 year ago

Alternatively, one can't stop them from defining their own version of the .NET 7 attribute for the source generator and install the generator as a nuget package on non-.NET 7 targets, right? If so, that might also be an option to reunify the code.

The generator emits code that uses APIs new to .NET 7, including how the generated code plugs into Regex itself.

danmoseley commented 1 year ago

If the pattern is relatively straightforward and stable another option is to inspect the generated code and adjust it to make your own hardcoded pattern matcher, no Regex involved.

rainersigwald commented 1 year ago

Just noting that I edited the OP to add a new location, where startup time might be a noticeable win: in solution parsing

https://github.com/dotnet/msbuild/blob/01532a937629682bb2d397ffa3e0b49ef4492ac5/src/Build/Construction/Solution/SolutionFile.cs#L37-L70

Since that's one of the first things we do in a dotnet build foo.sln scenario, the compiled-already-ness would likely remove regex compilation from a critical path.

danmoseley commented 1 year ago

Reducing JITting is also helpful if Native AOT is ever a goal. (But given MSBuild reflects over assemblies it dynamically discovers, perhaps that would be challenging..)

danmoseley commented 1 year ago

@nikijohnson77 how is it going on this?

AraHaan commented 1 year ago

Wait why is the regex used for solution parsing, if only there was a library that msbuild could use that uses a much better way of parsing solution files that is just as fast as .NET 7's regex.

danmoseley commented 1 year ago

@AraHaan the change to use a source generated regex is quite localized and straightforward. If you're proposing replacing the component, that is best discussed in an issue specific to that (which I'm guessing exists)

danmoseley commented 1 year ago

OK, so I played with this and it's quite doable, but a little less than straightforward because of the complexity of MSBuild regexes, and the fact that they are built up from strings (which sometimes themselves are built up from strings).

As an extreme example, take this regex https://github.com/dotnet/msbuild/blob/a6f6699d1f70bf79db82030938d2c5e52d1e4d2e/src/Build/Evaluation/Expander.cs#L3040-L3049 which includes strings that are themselves built up, eg., https://github.com/dotnet/msbuild/blob/a6f6699d1f70bf79db82030938d2c5e52d1e4d2e/src/Build/Evaluation/Expander.cs#L3060

The generator can work on this, creating

[GeneratedRegex("((?<=@\\(\\s*[A-Za-z_][A-Za-z_0-9\\-]*\\s*->\\s*'[^']*)%\\(\\s*(?<ITEM_SPECIFICATION>(?<ITEM_TYPE>[A-Za-z_][A-Za-z_0-9\\-]*)\\s*\\.\\s*)?(?<NAME>[A-Za-z_][A-Za-z_0-9\\-]*)\\s*\\)(?![^']*'(\\s*,\\s*'[^']*')?\\s*\\)))|((?<!@\\(\\s*[A-Za-z_][A-Za-z_0-9\\-]*\\s*->\\s*'[^']*)%\\(\\s*(?<ITEM_SPECIFICATION>(?<ITEM_TYPE>[A-Za-z_][A-Za-z_0-9\\-]*)\\s*\\.\\s*)?(?<NAME>[A-Za-z_][A-Za-z_0-9\\-]*)\\s*\\)(?=[^']*'(\\s*,\\s*'[^']*')?\\s*\\)))|((?<!@\\(\\s*[A-Za-z_][A-Za-z_0-9\\-]*\\s*->\\s*'[^']*@)%\\(\\s*(?<ITEM_SPECIFICATION>(?<ITEM_TYPE>[A-Za-z_][A-Za-z_0-9\\-]*)\\s*\\.\\s*)?(?<NAME>[A-Za-z_][A-Za-z_0-9\\-]*)\\s*\\)(?![^']*'(\\s*,\\s*'[^']*')?\\s*\\)))",RegexOptions.ExplicitCapture|RegexOptions.Compiled|RegexOptions.IgnorePatternWhitespace)]

It's probably worth converting to a verbatim string (put a @ at the front and replace \ with ) in order to make them more readable. I opened suggestion https://github.com/dotnet/runtime/issues/79895 for that.

now I have

[GeneratedRegex(@"((?<=@\(\s*[A-Za-z_][A-Za-z_0-9\-]*\s*->\s*'[^']*)%\(\s*(?<ITEM_SPECIFICATION>(?<ITEM_TYPE>[A-Za-z_][A-Za-z_0-9\-]*)\s*\.\s*)?(?<NAME>[A-Za-z_][A-Za-z_0-9\-]*)\s*\)(?![^']*'(\s*,\s*'[^']*')?\s*\)))|((?<!@\(\s*[A-Za-z_][A-Za-z_0-9\-]*\s*->\s*'[^']*)%\(\s*(?<ITEM_SPECIFICATION>(?<ITEM_TYPE>[A-Za-z_][A-Za-z_0-9\-]*)\s*\.\s*)?(?<NAME>[A-Za-z_][A-Za-z_0-9\-]*)\s*\)(?=[^']*'(\s*,\s*'[^']*')?\s*\)))|((?<!@\(\s*[A-Za-z_][A-Za-z_0-9\-]*\s*->\s*'[^']*@)%\(\s*(?<ITEM_SPECIFICATION>(?<ITEM_TYPE>[A-Za-z_][A-Za-z_0-9\-]*)\s*\.\s*)?(?<NAME>[A-Za-z_][A-Za-z_0-9\-]*)\s*\)(?![^']*'(\s*,\s*'[^']*')?\s*\)))",RegexOptions.ExplicitCapture|RegexOptions.Compiled|RegexOptions.IgnorePatternWhitespace)]

The names of the substring variables had been acting as self documentation, so potentially one could manually add those as comments, something like this:

            [GeneratedRegex(@"
    ((?<=@\(\s*[A-Za-z_][A-Za-z_0-9\-]*\s*->\s*'[^']*        # ItemVectorWithTransformLHS
)
    %\(\s* (?<ITEM_SPECIFICATION>(?<ITEM_TYPE>[A-Za-z_][A-Za-z_0-9\-]*)\s*\.\s*)? (?<NAME>[A-Za-z_][A-Za-z_0-9\-]*) \s*\)        # ItemMetadataSpecification
(?!
    [^']*'(\s*,\s*'[^']*')?\s*\)        # ItemVectorWithTransformRHS
)) | ((?<!
    @\(\s*[A-Za-z_][A-Za-z_0-9\-]*\s*->\s*'[^']*        # ItemVectorWithTransformLHS
)
    %\(\s* (?<ITEM_SPECIFICATION>(?<ITEM_TYPE>[A-Za-z_][A-Za-z_0-9\-]*)\s*\.\s*)? (?<NAME>[A-Za-z_][A-Za-z_0-9\-]*) \s*\)        # ItemMetadataSpecification
(?=
    [^']*'(\s*,\s*'[^']*')?\s*\)        # ItemVectorWithTransformRHS
)) | ((?<!
    @\(\s*[A-Za-z_][A-Za-z_0-9\-]*\s*->\s*'[^']*        # ItemVectorWithTransformLHS
@)
    %\(\s* (?<ITEM_SPECIFICATION>(?<ITEM_TYPE>[A-Za-z_][A-Za-z_0-9\-]*)\s*\.\s*)? (?<NAME>[A-Za-z_][A-Za-z_0-9\-]*) \s*\)        # ItemMetadataSpecification
(?!
    [^']*'(\s*,\s*'[^']*')?\s*\)        # ItemVectorWithTransformRHS      
))
", RegexOptions.ExplicitCapture | RegexOptions.Compiled | RegexOptions.IgnorePatternWhitespace)]

TBH I'm not sure that's clearer. I don't think I would bother. It would be better to just memorialize the existing fragments in comments, rather than in the regex.

Note one other wrinkle - a number of these files compile for older versions than .NET 7. So it is necessary to wrap these in #if, so something like

            internal static readonly Lazy<Regex> ItemMetadataPattern = new Lazy<Regex>(
                () => new Regex(ItemMetadataSpecification,
                    RegexOptions.IgnorePatternWhitespace | RegexOptions.ExplicitCapture | RegexOptions.Compiled));

becomes

            internal static readonly Lazy<Regex> ItemMetadataPattern = new Lazy<Regex>(() => 
#if NET7_0_OR_GREATER
                GetItemMetadataPatternRegex());
#else
                new Regex(ItemMetadataSpecification,
                    RegexOptions.IgnorePatternWhitespace | RegexOptions.ExplicitCapture | RegexOptions.Compiled));
#endif

#if NET7_0_OR_GREATER
            [GeneratedRegex("%\\(\\s* (?<ITEM_SPECIFICATION>(?<ITEM_TYPE>[A-Za-z_][A-Za-z_0-9\\-]*)\\s*\\.\\s*)? (?<NAME>[A-Za-z_][A-Za-z_0-9\\-]*) \\s*\\)", RegexOptions.ExplicitCapture | RegexOptions.Compiled | RegexOptions.IgnorePatternWhitespace)]
            private static partial Regex GetItemMetadataPatternRegex();
#endif

These #if's aren't pretty, but it's worth it to eliminate all this JIT work at runtime when running on .NET 7+.

I note that most of the Regexes in the code are constructed lazily, since https://github.com/dotnet/msbuild/pull/1199 to delay generating the IL. I don't think that's necessary for the source generated ones, as constructing them is cheap. However, it's probably simplest to still put them inside the lazy construction, as I've done above, so that all the places that use them can remain unchanged.

danmoseley commented 1 year ago

I also hit a couple of issues using the fixer https://github.com/dotnet/runtime/issues/79891 --- this is partially fixed in .NET 8 builds https://github.com/dotnet/runtime/issues/79892 --- I dealt with this by just partially reverting the changes.

Neither stop it working, just make a little more manual tuning necessary.

@nikijohnson77 does this make sense? LMK if you need more help.

KalleOlaviNiemitalo commented 1 year ago

Doesn't the regex source generator support references to named string constants in the attribute argument?

AraHaan commented 1 year ago

I feel like the regex string can be further simplified and made smaller (by defining common parts in each as const and then use string interpolation). So then even in the .NET 7's attribute code, it would use the constant used for the input regex string.

danmoseley commented 1 year ago

If that works - great.

danmoseley commented 1 year ago

the issues I linked above are now mostly fixed. Is anyone (@AraHaan @KalleOlaviNiemitalo?) interested in doing this? might start with just enabling one or two to see how it goes.

AraHaan commented 1 year ago

I been busy lately but I could see what could be done on this when I have more time.

nikijohnson77 commented 1 year ago

Ty

nikijohnson77 commented 1 year ago

Done