[API Proposal]: System.Diagnostics.CodeAnalysis.StringSyntaxAttribute

stephentoub commented 2 years ago

Background and motivation

VS provides nice colorization of regular expressions when passed to known methods on Regex, e.g.

It also supports putting a comment before any string to mark the language being used in that string:

But there's currently no way to define an arbitrary API and annotate a string parameter as to what language is expected, e.g. you don't get this colorization today for RegexGenerator, because VS hasn't been taught about this API specifically:

While these examples are about regex, more generally this applies to arbitrary domain-specific languages. If VS adds colorization for JSON, for example, it'd be nice if every API that accepts JSON could be annotated appropriately.

API Proposal

namespace System.Diagnostics.CodeAnalysis
{
    [AttributeUsage(AttributeTargets.Parameter | AttributeTargets.Field | AttributeTargets.Property, AllowMultiple = false, Inherited = false)]
    public sealed class StringLanguageAttribute : Attribute
    {
        public StringLanguageAttribute(string language);
        public string Language { get; }
    }
}

API Usage

public sealed class RegexGeneratorAttribute : Attribute
{
    public RegexGeneratorAttribute([StringLanguage("regex")] string pattern);
}

Alternative Designs

No response

Risks

No response

dotnet-issue-labeler[bot] commented 2 years ago

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

ghost commented 2 years ago

Tagging subscribers to this area: @tommcdon See info in area-owners.md if you want to be subscribed.

Issue Details

### Background and motivation VS provides nice colorization of regular expressions when passed to known methods on Regex, e.g.

It also supports putting a comment before any string to mark the language being used in that string:

But there's currently no way to define an arbitrary API and annotate a string parameter as to what language is expected, e.g. you don't get this colorization today for RegexGenerator, because VS hasn't been taught about this API specifically:

While these examples are about regex, more generally this applies to arbitrary domain-specific languages. If VS adds colorization for JSON, for example, it'd be nice if every API that accepts JSON could be annotated appropriately. ### API Proposal ```C# namespace System.Diagnostics.CodeAnalysis { [AttributeUsage(AttributeTargets.Parameter)] public sealed class StringLanguageAttribute : Attribute { public StringLanguageAttribute(string language); public string Language { get; } } } ``` ### API Usage ```C# public sealed class RegexGeneratorAttribute : Attribute { public RegexGeneratorAttribute([StringLanguage("regex")] string pattern); } ``` ### Alternative Designs _No response_ ### Risks _No response_

Author:	stephentoub
Assignees:	-
Labels:	`api-suggestion`, `area-System.Diagnostics`, `untriaged`
Milestone:	7.0.0

stephentoub commented 2 years ago

cc: @CyrusNajmabadi, @jasonmalinowski, @terrajobst

stephentoub commented 2 years ago

cc: @joperezr

CyrusNajmabadi commented 2 years ago

FWIW, i would love this. Note: we're also leaving design space open with Raw String Literals to allow for a direct way in the language to do it. Specifically:

var x = """regex
        \G(\w+\s?\w*),?
        """;

Embedded languages being a first class concept would be great. We would also like to do https://github.com/dotnet/roslyn/issues/42634 so that 3rd parties could then implement the LS support for a particular embedded language (instead of Roslyn being responsible for it).

alexrp commented 2 years ago

AttributeTargets.Parameter

Though I can't think of any concrete use cases off-hand, my gut reaction is that Field and Property seem like reasonable additions since this is meant to be more general than just Regex.

stephentoub commented 2 years ago

Though I can't think of any concrete use cases off-hand, my gut reaction is that Field and Property seem like reasonable additions since this is meant to be more general than just Regex.

Yeah, I had that initially, but it seemed like there would be various places where it'd be challenging to reflect that in a tool, e.g. passing such a field by ref would lose that information, and potentially ambiguities as to whether it should be applied, e.g. on a property, is that meant to impact assigning to the property and/or the code in the property returning a string. Combined with my not having scenarios in mind for it, I left it off. But, I'd leave this up to @CyrusNajmabadi and @jasonmalinowski and friends, if they expect it would be useful and feasible initially. It could always be added later.

stephentoub commented 2 years ago

so that 3rd parties could then implement the LS support for a particular embedded language

Cool. We have APIs in runtime that accept regexes, XML, and JSON; I expect we'd want to annotate many of those with this, e.g. https://github.com/dotnet/runtime/blob/12a8819eee9865eb38bca6c05fdece1053102854/src/libraries/System.Text.Json/src/System/Text/Json/Serialization/JsonSerializer.Read.String.cs#L117 We'll just need to standardize on a set of names for at least the languages we use, although maybe we would just snap to something like what GitHub uses for markdown language names. Thankfully, in general, the naming choice is obvious.

jasonmalinowski commented 2 years ago

potentially ambiguities as to whether it should be applied, e.g. on a property, is that meant to impact assigning to the property and/or the code in the property returning a string. Combined with my not having scenarios in mind for it, I left it off.

Allowing it to be set on a property/field so the IDE can directly use that for the assignment/initializer. I'm guessing there's an API out there somewhere like:

rules.Add(new Rule { RegEx = "..." });

And the IDE could process that too. I don't see a huge risk in allowing the attribute against field/property even if we don't immediately support it, in the IDE; there's absolutely ecosystem risk of not allowing it against field/property until "later" and the forcing libraries to have to wait until later .NET versions before they can target that, or they have to start #ifdefing more stuff to deal with downlevel.

terrajobst commented 2 years ago

Should we have a place where we document known languages as const fields?

stephentoub commented 2 years ago

Should we have a place where we document known languages as const fields?

You mean like adding:

public sealed class StringLanguageAttribute
{
    ...
    public const string Regex = "regex";
    public const string Xml = "xml";
    public const string Json = "json";
}

?

terrajobst commented 2 years ago

Yes. Not sure I'd want them on this particular class (because the class name is quite lengthy and AFAIK the field wouldn't be in scope when applying the attribute, so users would have to qualify the field references).

FiniteReality commented 2 years ago

Yes. Not sure I'd want them on this particular class (because the class name is quite lengthy and AFAIK the field wouldn't be in scope when applying the attribute, so users would have to qualify the field references).

How about System.Diagnostics.CodeAnalysis.KnownStringLanguages? That way it's closely related to StringLanguageAttribute, but still separate.

stephentoub commented 2 years ago

Not sure I'd want them on this particular class (because the class name is quite lengthy

I don't have a strong opinion here. But for the number of places this will be used, I personally would be fine with e.g. [StringLanguage(StringLanguageAttribute.Json)]. If we want to spend another type on it and can make it much shorter and still discoverable, that's fine, too.

CyrusNajmabadi commented 2 years ago

I agree with @jasonmalinowski here. Let the attribute appear in these places. Ide can then add support as necessary.

terrajobst commented 2 years ago

Video

We think we should rename "language" to "syntax" to avoid any connotations to localization
We should only expose the language constants as we're adding support for them

namespace System.Diagnostics.CodeAnalysis
{
    [AttributeUsage(AttributeTargets.Parameter |
                    AttributeTargets.Field |
                    AttributeTargets.Property,
                    AllowMultiple = false,
                    Inherited = false)]
    public sealed class StringSyntaxAttribute : Attribute
    {
        public StringSyntaxAttribute(string syntax);
        public string Syntax { get; }

        public const string Regex = "regex";

        // As we're adding more support we can add new languages like:
        // public const string Xml = "xml";
        // public const string Json = "json";
    }
}

deeprobin commented 2 years ago

@stephentoub Would like to get involved with this. Feel free to assign it to me.

deeprobin commented 2 years ago

In my opinion, we should add a code analyzer for it, which gives a warning if you pass an unsupported language.

stephentoub commented 2 years ago

In my opinion, we should add a code analyzer for it, which gives a warning if you pass an unsupported language

How do you define "unsupported"? VS isn't the only possible consumer, nor are we the arbiters for all possible languages. Someone should be able to write a library that accepts strings for any language they like and author tooling components to recognize those names.

deeprobin commented 2 years ago

In my opinion, we should add a code analyzer for it, which gives a warning if you pass an unsupported language

How do you define "unsupported"? VS isn't the only possible consumer, nor are we the arbiters for all possible languages. Someone should be able to write a library that accepts strings for any language they like and author tooling components to recognize those names.

You are right :). I assumed that only mentioned languages should be supported.

stephentoub commented 2 years ago

@CyrusNajmabadi, @jasonmalinowski, are you happy with the approved api? Any roadblocks to roslyn moving forward with that name/shape?

jasonmalinowski commented 2 years ago

Looks generally fine to me; I can't say I'm excited for putting the strings in the attribute type due to the repetition but it seems like that's the least bad option. "Language" seems easy for me to understand, but hey, I'm biased and I work on the language teams. :smile:

Something I didn't think about earlier: what does this look like if you also want to carry along language-specific info? For example, would JSON or XML want to have some way the schema could be specified, so we could also squiggle errors in usage? Or Regexes too: there's diferent Regex options that change what various tokens mean -- if we had a way to know the generator example that @stephentoub started with was going to pass a string with RegexOptions.None, then we could tweak the messages accordingly. Those aren't huge dealbreakers, but something to think about is how this might grow if we want to do those things -- is the expectation that the domain-specific info is carried in a separate attribute (maybe one that derives from this attribute), or do we do something else?

jasonmalinowski commented 2 years ago

(if the assumption is that are doing inheritence, then it means we might just have [RegexSyntax] as an attribute which also is conveniently shorter than [StringLanguage(StringLanguageAttribute.Regex)])

CyrusNajmabadi commented 2 years ago

@CyrusNajmabadi, @jasonmalinowski, are you happy with the approved api? Any roadblocks to roslyn moving forward with that name/shape?

Jason raises a good point about inheritance.

I will also mention that this may be non-trivial for us to support any attribute story. Today we operate by looking very specifically for certain APIs with certain names. e.g. new Regex("..."). This allows us to shortcircuit from doing analysis on most strings.

We would now have to always do teh work with any string-literal parameter to check the call and see if there is such an attribute on the parameter.

I don't anticipate this being a problem. But i also can't promise that it's certain to be ok. Before you guys ship this, would it make sense for roslyn to at least prototype our side of supporting such an API to make sure its feasible?

stephentoub commented 2 years ago

Before you guys ship this, would it make sense for roslyn to at least prototype our side of supporting such an API to make sure its feasible?

Yes, please, that's why I asked :-)

then it means we might just have [RegexSyntax] as an attribute which also is conveniently shorter

How would that work when Roslyn can't necessarily see the argument passed to the base ctor?

I'm not personally concerned about verbosity with usage of the attribute.

jasonmalinowski commented 2 years ago

How would that work when Roslyn can't necessarily see the argument passed to the base ctor?

Ah, you're right, that won't.

CyrusNajmabadi commented 2 years ago

Yes, please, that's why I asked :-)

:)

@jinujoseph for visibility. A strong ask from Runtime team to have a new attribute they define that can help us light up embedded language scenarios (like telling us that an API takes a regex so that we turn on our special regex features for that API). This takes us away from teh current situation where Roslyn hardcodes in all the APIs of the interest. But it means roslyn will have to do more work dynamically to figure out if we shoudl be running.

I can spend a couple of days next week prototyping our side of it so that i can benchmark and see if there are any concerns here.

@sharwell as well for thoughts.

stephentoub commented 2 years ago

Something I didn't think about earlier: what does this look like if you also want to carry along language-specific info? For example, would JSON or XML want to have some way the schema could be specified, so we could also squiggle errors in usage? Or Regexes too: there's diferent Regex options that change what various tokens mean -

This is a good point. For example, RegexOptions influence the way a regex is parsed. If we wanted an analyzer that flagged arguments to attributed parameters as being invalid and that same method also took an options argument, the analyzer should really know about those options. Or if the api author hardcoded the options used, the analyzer would ideally be able to pick those up from the attribute. This may not impact VS colorization today, but it certainly could in the future. And definitely other tooling. So, yeah, it needs more thought. Maybe the attribute should just have an additional Arguments property that's interpreted according to rules defined per syntax, e.g. for regex it could either be another parameter name or a RegexOptions value.

CyrusNajmabadi commented 2 years ago

This is a good point. For example, RegexOptions influence the way a regex is parsed.

Yup. And it's worth noting that Roslyn respects this. So if you do new Regex("...", RegexOptions.IgnorePatternWhitespace) then we understand and do the right thing.

Also, for Json, my prototype understood if you're using Newtonsoft and/or using 'strict mode' (which changes a lot of what is allowed).

for regex it could either be another parameter name or a RegexOptions value.

Yup. Those seem to eb the two main scenarios we'd care about. Either the API will hardcode different options itself (and the attribute can let us know), or the user is supposed to pass them along, and we need to know which parameter matters.

We could be a little more loosygoosy and for the latter case just look at the rest of hte arguments to see if any options are passed along, presuming that they would be for the string. It might not be the right thing in all cases, but it would likely be the right thing in most cases.

jasonmalinowski commented 2 years ago

We could be a little more loosygoosy and for the latter case just look at the rest of hte arguments to see if any options are passed along

That's fine if the options are strongly typed -- do we have any interesting examples where it might just be a string? My schema examples for XML/JSON it'd be hard to know which might be the extra parameter, but I admit those examples are a bit more contrived than I'd like.

CyrusNajmabadi commented 2 years ago

Note: another case where Roslyn IDE lights up is in DateTime strings. There we show things like completion for format strings like so:

So it would be good to have a lang name for this @stephentoub

FiniteReality commented 2 years ago

Note: another case where Roslyn IDE lights up is in DateTime strings. There we show things like completion for format strings like so:

So it would be good to have a lang name for this @stephentoub

To extend on this, wouldn't it be possible to include this on the "well known" ToString formats? e.g. int.ToString("E3") - Maybe the "lang name" should be something like ToString:Type?

terrajobst commented 2 years ago

@jasonmalinowski @stephentoub

How would that work when Roslyn can't necessarily see the argument passed to the base ctor?

Ah, you're right, that won't.

We could make it convention-based such that the syntax name is inferred from the type name. For instance, XxxSyntaxAttribute would be considered Xxx if XxxSyntaxAttribute extends StringSyntaxAttribute. This would also get rid of the list of constants as this would be represented in the docs as "types deriving from StringSyntaxAttribute

From a readability perspective, I'd prefer that but I'm not feeling super strongly about that.

@CyrusNajmabadi

How about DateTimeFormat?

CyrusNajmabadi commented 2 years ago

Did a quick spike on this. Very easy for roslyn to support. Perf seemed fine:

CyrusNajmabadi commented 2 years ago

@terrajobst

How about DateTimeFormat

Can you clarify what you mean here?

terrajobst commented 2 years ago

That was my suggestion for the name of the language

CyrusNajmabadi commented 2 years ago

MOre info. When it comes to 'regex options' we always did it as a heuristic anyways. Given an argument we detected to be a regex-string, we just look at sibling arguments to see if any have a constant type whose symbol type is RegexOptions. If so, we presume that that must be controlling the regex-arg-string.

So i'd be fine continuing that strategy. We would need some way to still say what the options are if we want to make it so that no extra arg is passed along and instead the attribute just declares this. in that case, we likely would want to have the attribute-subclass approach where the user would then do:

RegexLanguage(RegexOptions.Whatever | ...)

We'd use immo's approach of matcing hte XxxLanguage name. And we'd then use the rest of the args to determine things like options.

CyrusNajmabadi commented 2 years ago

DateTimeFormat

That was my suggestion for the name of the language

Ah. yes, i think that's right. We don't want to call it DateTime as that's ambiguous (is this for parsing string literals, or formatting them?). So DateTimeFormat makes sense to me.

stephentoub commented 2 years ago

We would need some way to still say what the options are if we want to make it so that no extra arg is passed along and instead the attribute just declares this. in that case, we likely would want to have the attribute-subclass approach where the user would then do

Should we have another ctor that just takes a params object[]?

public StringSyntaxAttribute(string syntax, params object[] arguments);
public object[] Arguments { get; }

That way we wouldn't require subclassing, though we could still make subclassing work if desired.

deeprobin commented 2 years ago

Should we have another ctor that just takes a params object[]?
public StringSyntaxAttribute(string syntax, params object[] arguments);
public object[] Arguments;

A theoretical approach of mine would also be to define the type of arguments using the new attribute generics (C# 10 Feature). Then you have a type safety and don't have to cast the objects.

public StringSyntaxAttribute(string syntax, params T[] arguments);
public T[] Arguments;

This, of course, includes a class without a generic type parameter.

public StringSyntaxAttribute(string syntax);

stephentoub commented 2 years ago

Then you have a type safety and don't have to cast the objects.

That would require the type to be generic, which would then require you to specify the type even when not providing any arguments at all, and it would necessitate every argument is of the same type. I don't see the benefit.

deeprobin commented 2 years ago

Then you have a type safety and don't have to cast the objects.

That would require the type to be generic, which would then require you to specify the type even when not providing any arguments at all, and it would necessitate every argument is of the same type. I don't see the benefit.

True. Then your suggestion is probably the best choice.

CyrusNajmabadi commented 2 years ago

I'll try that out Stephen. As long as we can grab that value out from teh roslyn API (which i think is doable) that should work.

stephentoub commented 2 years ago

I'll try that out Stephen. As long as we can grab that value out from teh roslyn API (which i think is doable) that should work.

Thanks. Then we would just need to decide, on a syntax-by-syntax basis, what the arguments mean. For regex, we've discussed two possibilities: argument name specifying the RegexOptions, and the hardcoded RegexOptions. With a params object[], we could say that if it's a string it's interpreted as an argument name and if it's a RegexOptions it's obvious. If no arguments are specified, Roslyn could continue to employ its heuristic.

CyrusNajmabadi commented 2 years ago

Thanks. Then we would just need to decide, on a syntax-by-syntax basis, what the arguments mean. For regex, we've discussed two possibilities: argument name specifying the RegexOptions, and the hardcoded RegexOptions. With a params object[], we could say that if it's a string it's interpreted as an argument name and if it's a RegexOptions it's obvious. If no arguments are specified, Roslyn could continue to employ its heuristic.

I was going to go simpler. If it's a RegexOptions in the attribute, we use it. Otherwise we fallback to the sibling arguments. It might be possible that this might not work in some cases... but those cases seem soooo niche. i feel like the above approach would be a 99.99% solution. Do you envision cases where that would not suffice?

stephentoub commented 2 years ago

Do you envision cases where that would not suffice?

No, I just like being explicit rather than relying on heurstics, especially when those heuristics might need to be replicated in multiple tools, e.g. Roslyn for colorization, an analyzer that's validating syntax as part of the build, etc.

CyrusNajmabadi commented 2 years ago

No, I just like being explicit rather than relying on heurstics,

That's fine with me. Supporting both a string-parameter-name, or RegexOptions being stored in the Arguments would be trivial i believe. Let me validate that. But given likely very little additional effort, and your own preferences here, i think this approach is fine :)

stephentoub commented 2 years ago

@bartonjs, opinions on the additional members? https://github.com/dotnet/runtime/issues/62505#issuecomment-1010122820

bartonjs commented 2 years ago

No major opinions. Public arrays are always a fun "is it shared, or is it yours?", but it's an attribute, so it's probably not going to be passed around.

The lack of static typing will mean that anything touching an arguments-based syntax highlighter needs to be prepared for garbage... but they're generally dealing with probably-not-compilable code as input, so hopefully they're nice and defensive. (Documentation will, of course, be a pain)

stephentoub commented 2 years ago

https://github.com/dotnet/runtime/pull/62995

dotnet / runtime