Closed stephentoub closed 2 years ago
I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.
Tagging subscribers to this area: @tommcdon See info in area-owners.md if you want to be subscribed.
Author: | stephentoub |
---|---|
Assignees: | - |
Labels: | `api-suggestion`, `area-System.Diagnostics`, `untriaged` |
Milestone: | 7.0.0 |
cc: @CyrusNajmabadi, @jasonmalinowski, @terrajobst
cc: @joperezr
FWIW, i would love this. Note: we're also leaving design space open with Raw String Literals to allow for a direct way in the language to do it. Specifically:
var x = """regex
\G(\w+\s?\w*),?
""";
Embedded languages being a first class concept would be great. We would also like to do https://github.com/dotnet/roslyn/issues/42634 so that 3rd parties could then implement the LS support for a particular embedded language (instead of Roslyn being responsible for it).
AttributeTargets.Parameter
Though I can't think of any concrete use cases off-hand, my gut reaction is that Field
and Property
seem like reasonable additions since this is meant to be more general than just Regex
.
Though I can't think of any concrete use cases off-hand, my gut reaction is that Field and Property seem like reasonable additions since this is meant to be more general than just Regex.
Yeah, I had that initially, but it seemed like there would be various places where it'd be challenging to reflect that in a tool, e.g. passing such a field by ref would lose that information, and potentially ambiguities as to whether it should be applied, e.g. on a property, is that meant to impact assigning to the property and/or the code in the property returning a string. Combined with my not having scenarios in mind for it, I left it off. But, I'd leave this up to @CyrusNajmabadi and @jasonmalinowski and friends, if they expect it would be useful and feasible initially. It could always be added later.
so that 3rd parties could then implement the LS support for a particular embedded language
Cool. We have APIs in runtime that accept regexes, XML, and JSON; I expect we'd want to annotate many of those with this, e.g. https://github.com/dotnet/runtime/blob/12a8819eee9865eb38bca6c05fdece1053102854/src/libraries/System.Text.Json/src/System/Text/Json/Serialization/JsonSerializer.Read.String.cs#L117 We'll just need to standardize on a set of names for at least the languages we use, although maybe we would just snap to something like what GitHub uses for markdown language names. Thankfully, in general, the naming choice is obvious.
potentially ambiguities as to whether it should be applied, e.g. on a property, is that meant to impact assigning to the property and/or the code in the property returning a string. Combined with my not having scenarios in mind for it, I left it off.
Allowing it to be set on a property/field so the IDE can directly use that for the assignment/initializer. I'm guessing there's an API out there somewhere like:
rules.Add(new Rule { RegEx = "..." });
And the IDE could process that too. I don't see a huge risk in allowing the attribute against field/property even if we don't immediately support it, in the IDE; there's absolutely ecosystem risk of not allowing it against field/property until "later" and the forcing libraries to have to wait until later .NET versions before they can target that, or they have to start #ifdefing more stuff to deal with downlevel.
Should we have a place where we document known languages as const fields?
Should we have a place where we document known languages as const fields?
You mean like adding:
public sealed class StringLanguageAttribute
{
...
public const string Regex = "regex";
public const string Xml = "xml";
public const string Json = "json";
}
?
Yes. Not sure I'd want them on this particular class (because the class name is quite lengthy and AFAIK the field wouldn't be in scope when applying the attribute, so users would have to qualify the field references).
Yes. Not sure I'd want them on this particular class (because the class name is quite lengthy and AFAIK the field wouldn't be in scope when applying the attribute, so users would have to qualify the field references).
How about System.Diagnostics.CodeAnalysis.KnownStringLanguages
? That way it's closely related to StringLanguageAttribute, but still separate.
Not sure I'd want them on this particular class (because the class name is quite lengthy
I don't have a strong opinion here. But for the number of places this will be used, I personally would be fine with e.g. [StringLanguage(StringLanguageAttribute.Json)]
. If we want to spend another type on it and can make it much shorter and still discoverable, that's fine, too.
I agree with @jasonmalinowski here. Let the attribute appear in these places. Ide can then add support as necessary.
namespace System.Diagnostics.CodeAnalysis
{
[AttributeUsage(AttributeTargets.Parameter |
AttributeTargets.Field |
AttributeTargets.Property,
AllowMultiple = false,
Inherited = false)]
public sealed class StringSyntaxAttribute : Attribute
{
public StringSyntaxAttribute(string syntax);
public string Syntax { get; }
public const string Regex = "regex";
// As we're adding more support we can add new languages like:
// public const string Xml = "xml";
// public const string Json = "json";
}
}
@stephentoub Would like to get involved with this. Feel free to assign it to me.
In my opinion, we should add a code analyzer for it, which gives a warning if you pass an unsupported language.
In my opinion, we should add a code analyzer for it, which gives a warning if you pass an unsupported language
How do you define "unsupported"? VS isn't the only possible consumer, nor are we the arbiters for all possible languages. Someone should be able to write a library that accepts strings for any language they like and author tooling components to recognize those names.
In my opinion, we should add a code analyzer for it, which gives a warning if you pass an unsupported language
How do you define "unsupported"? VS isn't the only possible consumer, nor are we the arbiters for all possible languages. Someone should be able to write a library that accepts strings for any language they like and author tooling components to recognize those names.
You are right :). I assumed that only mentioned languages should be supported.
@CyrusNajmabadi, @jasonmalinowski, are you happy with the approved api? Any roadblocks to roslyn moving forward with that name/shape?
Looks generally fine to me; I can't say I'm excited for putting the strings in the attribute type due to the repetition but it seems like that's the least bad option. "Language" seems easy for me to understand, but hey, I'm biased and I work on the language teams. :smile:
Something I didn't think about earlier: what does this look like if you also want to carry along language-specific info? For example, would JSON or XML want to have some way the schema could be specified, so we could also squiggle errors in usage? Or Regexes too: there's diferent Regex options that change what various tokens mean -- if we had a way to know the generator example that @stephentoub started with was going to pass a string with RegexOptions.None, then we could tweak the messages accordingly. Those aren't huge dealbreakers, but something to think about is how this might grow if we want to do those things -- is the expectation that the domain-specific info is carried in a separate attribute (maybe one that derives from this attribute), or do we do something else?
(if the assumption is that are doing inheritence, then it means we might just have [RegexSyntax] as an attribute which also is conveniently shorter than [StringLanguage(StringLanguageAttribute.Regex)])
@CyrusNajmabadi, @jasonmalinowski, are you happy with the approved api? Any roadblocks to roslyn moving forward with that name/shape?
Jason raises a good point about inheritance.
I will also mention that this may be non-trivial for us to support any attribute story. Today we operate by looking very specifically for certain APIs with certain names. e.g. new Regex("...")
. This allows us to shortcircuit from doing analysis on most strings.
We would now have to always do teh work with any string-literal parameter to check the call and see if there is such an attribute on the parameter.
I don't anticipate this being a problem. But i also can't promise that it's certain to be ok. Before you guys ship this, would it make sense for roslyn to at least prototype our side of supporting such an API to make sure its feasible?
Before you guys ship this, would it make sense for roslyn to at least prototype our side of supporting such an API to make sure its feasible?
Yes, please, that's why I asked :-)
then it means we might just have [RegexSyntax] as an attribute which also is conveniently shorter
How would that work when Roslyn can't necessarily see the argument passed to the base ctor?
I'm not personally concerned about verbosity with usage of the attribute.
How would that work when Roslyn can't necessarily see the argument passed to the base ctor?
Ah, you're right, that won't.
Yes, please, that's why I asked :-)
:)
@jinujoseph for visibility. A strong ask from Runtime team to have a new attribute they define that can help us light up embedded language scenarios (like telling us that an API takes a regex so that we turn on our special regex features for that API). This takes us away from teh current situation where Roslyn hardcodes in all the APIs of the interest. But it means roslyn will have to do more work dynamically to figure out if we shoudl be running.
I can spend a couple of days next week prototyping our side of it so that i can benchmark and see if there are any concerns here.
@sharwell as well for thoughts.
Something I didn't think about earlier: what does this look like if you also want to carry along language-specific info? For example, would JSON or XML want to have some way the schema could be specified, so we could also squiggle errors in usage? Or Regexes too: there's diferent Regex options that change what various tokens mean -
This is a good point. For example, RegexOptions influence the way a regex is parsed. If we wanted an analyzer that flagged arguments to attributed parameters as being invalid and that same method also took an options argument, the analyzer should really know about those options. Or if the api author hardcoded the options used, the analyzer would ideally be able to pick those up from the attribute. This may not impact VS colorization today, but it certainly could in the future. And definitely other tooling. So, yeah, it needs more thought. Maybe the attribute should just have an additional Arguments property that's interpreted according to rules defined per syntax, e.g. for regex it could either be another parameter name or a RegexOptions value.
This is a good point. For example, RegexOptions influence the way a regex is parsed.
Yup. And it's worth noting that Roslyn respects this. So if you do new Regex("...", RegexOptions.IgnorePatternWhitespace)
then we understand and do the right thing.
Also, for Json, my prototype understood if you're using Newtonsoft and/or using 'strict mode' (which changes a lot of what is allowed).
for regex it could either be another parameter name or a RegexOptions value.
Yup. Those seem to eb the two main scenarios we'd care about. Either the API will hardcode different options itself (and the attribute can let us know), or the user is supposed to pass them along, and we need to know which parameter matters.
We could be a little more loosygoosy and for the latter case just look at the rest of hte arguments to see if any options are passed along, presuming that they would be for the string. It might not be the right thing in all cases, but it would likely be the right thing in most cases.
We could be a little more loosygoosy and for the latter case just look at the rest of hte arguments to see if any options are passed along
That's fine if the options are strongly typed -- do we have any interesting examples where it might just be a string? My schema examples for XML/JSON it'd be hard to know which might be the extra parameter, but I admit those examples are a bit more contrived than I'd like.
Note: another case where Roslyn IDE lights up is in DateTime strings. There we show things like completion for format strings like so:
So it would be good to have a lang name for this @stephentoub
Note: another case where Roslyn IDE lights up is in DateTime strings. There we show things like completion for format strings like so:
So it would be good to have a lang name for this @stephentoub
To extend on this, wouldn't it be possible to include this on the "well known" ToString formats? e.g. int.ToString("E3")
- Maybe the "lang name" should be something like ToString:Type
?
@jasonmalinowski @stephentoub
How would that work when Roslyn can't necessarily see the argument passed to the base ctor?
Ah, you're right, that won't.
We could make it convention-based such that the syntax name is inferred from the type name. For instance, XxxSyntaxAttribute
would be considered Xxx
if XxxSyntaxAttribute
extends StringSyntaxAttribute
. This would also get rid of the list of constants as this would be represented in the docs as "types deriving from StringSyntaxAttribute
From a readability perspective, I'd prefer that but I'm not feeling super strongly about that.
@CyrusNajmabadi
How about DateTimeFormat
?
Did a quick spike on this. Very easy for roslyn to support. Perf seemed fine:
@terrajobst
How about DateTimeFormat
Can you clarify what you mean here?
That was my suggestion for the name of the language
MOre info. When it comes to 'regex options' we always did it as a heuristic anyways. Given an argument we detected to be a regex-string, we just look at sibling arguments to see if any have a constant type whose symbol type is RegexOptions. If so, we presume that that must be controlling the regex-arg-string.
So i'd be fine continuing that strategy. We would need some way to still say what the options are if we want to make it so that no extra arg is passed along and instead the attribute just declares this. in that case, we likely would want to have the attribute-subclass approach where the user would then do:
RegexLanguage(RegexOptions.Whatever | ...)
We'd use immo's approach of matcing hte XxxLanguage name. And we'd then use the rest of the args to determine things like options.
DateTimeFormat
That was my suggestion for the name of the language
Ah. yes, i think that's right. We don't want to call it DateTime as that's ambiguous (is this for parsing string literals, or formatting them?). So DateTimeFormat makes sense to me.
We would need some way to still say what the options are if we want to make it so that no extra arg is passed along and instead the attribute just declares this. in that case, we likely would want to have the attribute-subclass approach where the user would then do
Should we have another ctor that just takes a params object[]?
public StringSyntaxAttribute(string syntax, params object[] arguments);
public object[] Arguments { get; }
That way we wouldn't require subclassing, though we could still make subclassing work if desired.
Should we have another ctor that just takes a params object[]?
public StringSyntaxAttribute(string syntax, params object[] arguments); public object[] Arguments;
A theoretical approach of mine would also be to define the type of arguments using the new attribute generics (C# 10 Feature). Then you have a type safety and don't have to cast the objects.
public StringSyntaxAttribute(string syntax, params T[] arguments);
public T[] Arguments;
This, of course, includes a class without a generic type parameter.
public StringSyntaxAttribute(string syntax);
Then you have a type safety and don't have to cast the objects.
That would require the type to be generic, which would then require you to specify the type even when not providing any arguments at all, and it would necessitate every argument is of the same type. I don't see the benefit.
Then you have a type safety and don't have to cast the objects.
That would require the type to be generic, which would then require you to specify the type even when not providing any arguments at all, and it would necessitate every argument is of the same type. I don't see the benefit.
True. Then your suggestion is probably the best choice.
I'll try that out Stephen. As long as we can grab that value out from teh roslyn API (which i think is doable) that should work.
I'll try that out Stephen. As long as we can grab that value out from teh roslyn API (which i think is doable) that should work.
Thanks. Then we would just need to decide, on a syntax-by-syntax basis, what the arguments mean. For regex, we've discussed two possibilities: argument name specifying the RegexOptions, and the hardcoded RegexOptions. With a params object[]
, we could say that if it's a string it's interpreted as an argument name and if it's a RegexOptions it's obvious. If no arguments are specified, Roslyn could continue to employ its heuristic.
Thanks. Then we would just need to decide, on a syntax-by-syntax basis, what the arguments mean. For regex, we've discussed two possibilities: argument name specifying the RegexOptions, and the hardcoded RegexOptions. With a params object[], we could say that if it's a string it's interpreted as an argument name and if it's a RegexOptions it's obvious. If no arguments are specified, Roslyn could continue to employ its heuristic.
I was going to go simpler. If it's a RegexOptions in the attribute, we use it. Otherwise we fallback to the sibling arguments. It might be possible that this might not work in some cases... but those cases seem soooo niche. i feel like the above approach would be a 99.99% solution. Do you envision cases where that would not suffice?
Do you envision cases where that would not suffice?
No, I just like being explicit rather than relying on heurstics, especially when those heuristics might need to be replicated in multiple tools, e.g. Roslyn for colorization, an analyzer that's validating syntax as part of the build, etc.
No, I just like being explicit rather than relying on heurstics,
That's fine with me. Supporting both a string-parameter-name, or RegexOptions being stored in the Arguments would be trivial i believe. Let me validate that. But given likely very little additional effort, and your own preferences here, i think this approach is fine :)
@bartonjs, opinions on the additional members? https://github.com/dotnet/runtime/issues/62505#issuecomment-1010122820
No major opinions. Public arrays are always a fun "is it shared, or is it yours?", but it's an attribute, so it's probably not going to be passed around.
The lack of static typing will mean that anything touching an arguments-based syntax highlighter needs to be prepared for garbage... but they're generally dealing with probably-not-compilable code as input, so hopefully they're nice and defensive. (Documentation will, of course, be a pain)
Background and motivation
VS provides nice colorization of regular expressions when passed to known methods on Regex, e.g.
It also supports putting a comment before any string to mark the language being used in that string:
But there's currently no way to define an arbitrary API and annotate a string parameter as to what language is expected, e.g. you don't get this colorization today for RegexGenerator, because VS hasn't been taught about this API specifically:
While these examples are about regex, more generally this applies to arbitrary domain-specific languages. If VS adds colorization for JSON, for example, it'd be nice if every API that accepts JSON could be annotated appropriately.
API Proposal
API Usage
Alternative Designs
No response
Risks
No response