Add Grammars To Documentation

niemyjski commented 5 years ago

Would it be possible to add the language grammars for C# and VB.net to the available documentation?

References https://github.com/dotnet/roslyn/issues/3169 as it should never have been closed. We need to get the grammars updated; this is unacceptable to those third parties who are trying to improve the dev experience.

jcouv commented 5 years ago

There was such an attempt done (I think for C# 6): https://github.com/ljw1004/csharpspec/blob/gh-pages/csharp.g4

We also have something like a grammar (but looser) which is part of the implementation: https://github.com/dotnet/roslyn/blob/master/src/Compilers/CSharp/Portable/Syntax/Syntax.xml

billhenn commented 5 years ago

Microsoft, are you ever going to update the grammar specifications to be newer than C# 6 and VB 11?

What we NEED and have been asking for the past several years now is this kind of thing for the latest C# and VB versions: https://github.com/dotnet/csharplang/tree/master/spec

Specifically if you guys release C# 8, I would expect that a matching spec is part of the release cycle. Otherwise, third parties like us who rely on parsing C#/VB code using our own parsers and without Roslyn are just taking wild guesses at how you implemented the languages.

It's been very frustrating that updates to grammar specifications seem to have been completely forgotten after moving to the GitHub repos several years ago.

jcouv commented 5 years ago

@BillHenning Microsoft is a big company, but the C# team is pretty small. I would love to have a consolidated grammar and spec too, but that takes significant effort and tough trade-offs with other priorities. For C# 7 and 8 we have per-feature documentation (here are the speclets and design docs per language version) at the moment.

Are you looking at consuming the grammar programmatically, or using it as a reference/documentation? If the latter, I think we could probably take the C# 6 grammar (https://github.com/ljw1004/csharpspec/blob/gh-pages/csharp.g4) and update it to C# 8. The caveat is that it cannot be consumed programmatically, because it contains some shorthands like:

input_character
    : '<Any Unicode character except a new_line_character>'
    ;
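To make the caveat concrete: a tool cannot act on the quoted English description, so a consumer has to replace it with an explicit character class. Here is a minimal Python sketch of what that might look like. It assumes new_line_character means carriage return, line feed, next line (U+0085), line separator (U+2028) and paragraph separator (U+2029); the names NEW_LINE_CHARACTERS and INPUT_CHARACTER are made up for this illustration and are not part of csharp.g4.

```python
import re

# Assumption: the characters the spec treats as new_line_character.
NEW_LINE_CHARACTERS = "\r\n\u0085\u2028\u2029"

# Machine-readable stand-in for
# '<Any Unicode character except a new_line_character>'.
INPUT_CHARACTER = re.compile(f"[^{NEW_LINE_CHARACTERS}]")

def is_input_character(ch: str) -> bool:
    """Return True when ch is a single character that is not a line terminator."""
    return INPUT_CHARACTER.fullmatch(ch) is not None
```

Every shorthand of this kind in the grammar would need a similar hand translation before the file could be fed to a parser generator.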

billhenn commented 5 years ago

Hi @jcouv,

Thank you for the quick reply. The problem here is that we and others have written C#/VB parsers on our own that customers expect us to keep up to date with newer C#/VB features. In our particular case, our parser is used with a syntax highlighting code editor control so that we can provide C#/VB parsing and Intellisense in the editor very similar to the Visual Studio code editor. Customers can embed these UI controls in their desktop apps and have powerful scripting functionality. That is just our implementation, but there are others out there doing similar things.

The thing is that we rely on having the official grammar specification when building our own parser grammar. Otherwise say you add feature X. We have to completely guess at the syntax and what changed in the grammar. From C# v1 all the way until v6, the formal grammar at the end of the specification was always updated appropriately and we could diff it to see the changes between versions. That stopped as soon as C#/VB moved to GitHub, and ever since then, there seems to be no attention to keeping a formal grammar updated even though nearly all other popular languages do so, like (https://docs.python.org/3/reference/grammar.html).

I know you have a lot to do with a small team but documentation is important. Keeping a grammar updated as you make changes and commit them to a version isn't that hard because there's not much in the delta. But as many versions happen over time, it compounds and gets to a point where it's harder and harder to get into sync.

We don't need to consume the grammar programmatically. We just absolutely need a definitive reference exactly like what you have at (https://github.com/ljw1004/csharpspec/blob/gh-pages/csharp.g4). If that could be updated to C# 8 and similar for VB latest, it would help immensely. We and others have been asking for this for years in various GitHub issues.

Thank you for your consideration, and I'm hoping to finally get some momentum on this again.

billhenn commented 5 years ago

@jcouv - Has there been any momentum to upgrade the C# grammar from v6 to v8 like described above? A formal official reference is needed and has been requested by us and others for years now, ever since C# v7 started getting released without related formal grammar specification updates.

gafter commented 5 years ago

The repository for specifications is csharplang.

Korporal commented 4 years ago

@niemyjski

I agree, altering the syntax of a language as releases emerge over time without documenting the grammar formally is a little sloppy. The grammar is the basis of the parser and the basis of serious understanding of the language syntax.

Since the team do implement grammar changes often and the results (usually) work well, updating the grammar documentation after the fact should be a technicality.

Instead I often must trawl the web to find examples of newer language features and deduce the grammar from these examples.

Maintaining Roslyn is more challenging too when there is no solid, team reviewed official grammar defined.

Finally how can one say that a parser or some specific parsing is right or wrong without a specification of the grammar being parsed...

Furthermore why are implemented and released changes still referred to as "language proposals" in the language documentation here (left hand side of page)? these really should be renamed to "language changes", is that a lot to ask?

CyrusNajmabadi commented 4 years ago

Instead I often must trawl the web to find examples of newer language features and deduce the grammar from these examples.

It seems like it would be much simpler to just deduce the grammar from the roslyn Syntax nodes first, and optionally the parser if there was any questions after looking at the nodes.

These basically are the canonical description of what the defacto c# impl supports, so it would be easiest to just literally "go to the source" to determine things.

I even prefer this because it may be the case that the grammar simply is wrong (it's happened before). Whereas the actual syntax nodes tell you what the team actually reviewed and implemented as the syntax to be supported.

Korporal commented 4 years ago

Instead I often must trawl the web to find examples of newer language features and deduce the grammar from these examples.

It seems like it would be much simpler to just deduce the grammar from the roslyn Syntax nodes first, and optionally the parser if there was any questions after looking at the nodes.

These basically are the canonical description of what the defacto c# impl supports, so it would be easiest to just literally "go to the source" to determine things.

I even prefer this because it may be the case that the grammar simply is wrong (it's happened before). Whereas the actual syntax nodes tell you what the team actually reviewed and implemented as the syntax to be supported.

@CyrusNajmabadi - This is an interesting stance Cyrus but you wrote "...tell you what the team actually reviewed..." - well if the grammar wasn't documented what did the team review in order to write the code?

Also you wrote "...the grammar simply is wrong..." so how could the code get written if the specification was incorrect?

Your answer also implies we no longer need the ECMA standard, rather than proving the compiler conforms to the standard we just say "Hey, the compiler IS the standard".

Grammars are readily documented, there is a standard language for describing these grammars.

YairHalberstadt commented 4 years ago

Your answer also implies we no longer need the ECMA standard, rather than proving the compiler conforms to the standard we just say "Hey, the compiler IS the standard".

To a great extent that's true. When they were porting the old compiler to Roslyn, they found many differences between the compiler and the spec. In general, they chose to keep the old behaviour, rather than the spec. Since Roslyn is the only significant C# compiler, that effectively means it defines what C# is.

The spec is useful in that:

a) it's easier to read than the compiler
b) it defines how the language behaves at an abstract level, and what behaviour is guaranteed and what behaviour isn't. The compiler is therefore able to make any changes which conform to the spec, even if they could theoretically change behaviour. The spec defines what C# could be, the compiler what C# is.

Korporal commented 4 years ago

Your answer also implies we no longer need the ECMA standard, rather than proving the compiler conforms to the standard we just say "Hey, the compiler IS the standard".

To a great extent that's true. When they were porting the old compiler to Roslyn, they found many differences between the compiler and the spec. In general, they chose to keep the old behaviour, rather than the spec. Since Roslyn is the only significant C# compiler, that effectively means it defines what C# is.

The spec is useful in that:

a) it's easier to read than the compiler
b) it defines how the language behaves at an abstract level, and what behaviour is guaranteed and what behaviour isn't. The compiler is therefore able to make any changes which conform to the spec, even if they could theoretically change behaviour. The spec defines what C# could be, the compiler what C# is.

@YairHalberstadt - Sadly this is becoming the new "way" to implement software, no need to specify it, just write it.

The definition of a bug used to be "doesn't conform to the specification". Because good specifications are hard work and cost resources, this has increasingly been sidelined, to the extent that they are less and less prevalent.

A bug in the modern world is now "a person with more authority than me said this is a bug".

Is it any wonder the 737 Max crashed and killed hundreds of people when these kinds of practices are permeating the industry, perhaps airplane crashes and other "accidents" are the new way to debug complex systems.

CyrusNajmabadi commented 4 years ago

@CyrusNajmabadi - This is an interesting stance Cyrus but you wrote "...tell you what the team actually reviewed..." - well if the grammar wasn't documented what did the team review in order to write the code?

The native compiler.

Also you wrote "...the grammar simply is wrong..." so how could the code get written if the specification was incorrect?

Developers wrote code and checked it in. The team felt that the code met their expectations for what the feature was supposed to be, regardless of what it was spec'ed to be. It shipped for decades. Now we're here.

Your answer also implies we no longer need the ECMA standard, rather than proving the compiler conforms to the standard we just say "Hey, the compiler IS the standard".

You are welcome to make any interpretations you want as to ecma.

CyrusNajmabadi commented 4 years ago

@YairHalberstadt - Sadly this is becoming the new "way" to implement software, no need to specify it, just write it.

What do you mean by 'new'? This is how it's been for the lifetime of the C# compiler. So you're lamenting something that has been the case for around 20 years now.

A bug in the modern world is now "a person with more authority than me said this is a bug".

Yes. The C# team has final authority on deciding what is considered a bug, and what bugs need to be fixed. This is not "the modern world", this is literally the concept of project ownership.

You are welcome to fork roslyn and take a different ownership approach on your fork.

Korporal commented 4 years ago

@YairHalberstadt - Sadly this is becoming the new "way" to implement software, no need to specify it, just write it.

What do you mean by 'new'? This is how it's been for the lifetime of the C# compiler. So you're lamenting something that has been the case for around 20 years now.

A bug in the modern world is now "a person with more authority than me said this is a bug".

Yes. The C# team has final authority on deciding what is considered a bug, and what bugs need to be fixed. This is not "the modern world", this is literally the concept of project ownership.

You are welcome to fork roslyn and take a different ownership approach on your fork.

@CyrusNajmabadi - My remarks about the new way of developing software and the new definition of a bug were general Cyrus, not confined to the C# compiler.

CyrusNajmabadi commented 4 years ago

None of this seems particularly "new". Unless you're counting the last several decades as "new". In which case, this doesn't seem helpful as this all seems totally the norm for software development.

Korporal commented 4 years ago

@CyrusNajmabadi

Developing testing and releasing software without a specification as an input has become increasingly common over the past few decades. The degree to which this is done as a percentage of all projects underway at any point in time across the globe has been and is rising steadily.

The motives for this are financial, computer programmers used to consume specifications produced by systems analysts and produce software components. Today we have "developers" consuming requirements and often vaguely defined expectations and producing software components, thus where there were two distinct teams with a well defined interface we have one team.

These two approaches to building systems have their pros and cons, however the "new" approach seems to yield reduced system quality yet faster delivery time.

The fact is that quality is becoming more and more a secondary concern that's sacrificed for faster delivery time. The days of striving for reliability and correctness and dependability are sadly going to be a thing of the past, the "new" world (from my perspective) is one of huge volumes of tools, products, services, components, languages all being produced at a rapid rate yet with poor documentation, huge lists of issues and bugs and ever present frustration.

The state of Visual Studio 2019 on the date of its release is a case in point - it was poor yet was produced by the world's premier expert corporation so far as software development is concerned.

Personally I think this bodes ill, that's all I'm saying here, if you feel this is a desirable state of affairs then that's fine but I do not.

CyrusNajmabadi commented 4 years ago

You're welcome to opine on how development has changed over the decades. It just has no bearing or impact on what's going on with this particular issue.

Korporal commented 4 years ago

@CyrusNajmabadi - It isn't "change" I'm speaking of, it's degradation in quality. It has everything to do with this issue - the issue is the ongoing absence of a documented grammar which was not initiated by me but by @niemyjski @BillHenning and others, so you're welcome to opine about my remarks but try reading what these guys actually say and the challenges they face.

CyrusNajmabadi commented 4 years ago

but try reading what these guys actually say and the challenges they face.

I have. And i've given practical and concrete steps on how to address those challenges. The practical steps are also valuable because they get a reader what is probably the most likely thing they want: compatibility with the majority of the C# ecosystem.

Again, you're welcome to put forth your disappointment that the software community has moved en masse to a style of development that you don't like. But it's not going to change anything and it doesn't actually help solve any problems.

Korporal commented 4 years ago

but try reading what these guys actually say and the challenges they face.

I have. And i've given practical and concrete steps on how to address those challenges. The practical steps are also valuable because they get a reader what is probably the most likely thing they want: compatibility with the majority of the C# ecosystem.

Again, you're welcome to put forth your disappointment that the software community has moved en masse to a style of development that you don't like. But it's not going to change anything and it doesn't actually help solve any problems.

It would perhaps be helpful if a Microsoft EMPLOYEE would express an opinion here.

CyrusNajmabadi commented 4 years ago

Feel free to adhom :) My advice on a viable and effective way to move forward here still stands, regardless of my employment status. You'll note that i have not ever addressed your own points by caring one whit about your employment status :)

Roslyn is the defacto implementation of the C# language. The syntax.xml file and the generated syntax nodes are effectively the best way to know what the actual implementers believed the language is supposed to be. The antlr translation of that same file is a nice-to-have, but not a necessary thing. It's just another artifact to keep in sync, which may actually be incorrect, and which probably won't be as good for interested parties as the actual syntax model.

Korporal commented 4 years ago

@CyrusNajmabadi - Where did I refer to your employment status? this is paranoia. I merely called out the fact that nobody from Microsoft has bothered to express an official position on this question of defining the grammar.

Finally your suggestion is fine and dandy but is just a suggestion, there is no official response from Microsoft on the question of will they or won't they be producing a formal grammar definition for the latest version of C#.

Your opinion is of course an informed opinion but I'd like to see a formal answer from Microsoft to the OP's question, this is what I mean about software quality - uncertainty and confusion are the new normal.

HaloFour commented 4 years ago

@Korporal

As Cyrus used to be a Microsoft employee and member of the LDM that may have been interpreted differently than you had intended.

Honestly I think I'd rather hear from someone not from Microsoft who is working on an alternate C# parser as they likely face the biggest challenges. Any JetBrains employees in the house?

niemyjski commented 4 years ago

@HaloFour I know that @BillHenning owns and works on https://www.actiprosoftware.com and he needs this for his parsers...

billhenn commented 4 years ago

I made my points up higher in this issue thread chain if you want to review (https://github.com/dotnet/csharplang/issues/2640#issuecomment-489335471).

Right now I'm starting to try and update our parser to C# 7 as a start and it's a bit of a nightmare because some proposals have fragmented grammar changes and others have none at all. I'm left guessing at a lot.

I don't think it's unreasonable to have a formal grammar in documentation fully updated as the language is evolved. This was the case all the way through C# 6 and other popular languages do this.

CyrusNajmabadi commented 4 years ago

@BillHenning I'm curious how the syntax model nodes would compare to using the grammar. Could you clarify why those aren't viable?

billhenn commented 4 years ago

For one thing, it's hard to "diff" that between versions to know what's changed. Back when they published word documents with the specifications, I'd diff the two grammars to locate everything that was updated.
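The diffing workflow described here can be sketched in a few lines of Python with the standard difflib module. The two grammar excerpts below are illustrative stand-ins written for this example, not text copied from any published specification, and grammar_diff is a hypothetical helper name.

```python
import difflib

# Illustrative excerpts only; real grammar files would be read from disk.
OLD_GRAMMAR = """declaration_statement
    : local_variable_declaration ';'
    | local_constant_declaration ';'
    ;"""

NEW_GRAMMAR = """declaration_statement
    : local_variable_declaration ';'
    | local_constant_declaration ';'
    | local_function_declaration
    ;"""

def grammar_diff(old: str, new: str, old_label: str, new_label: str) -> list[str]:
    """Return unified-diff lines describing how the grammar text changed."""
    return list(
        difflib.unified_diff(
            old.splitlines(),
            new.splitlines(),
            fromfile=old_label,
            tofile=new_label,
            lineterm="",
        )
    )

changes = grammar_diff(OLD_GRAMMAR, NEW_GRAMMAR, "csharp6", "csharp7")
print("\n".join(changes))
```

The same helper would work on two revisions of any text artifact, which is why the format of the grammar matters less for diffing than having a single file that is kept current.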

billhenn commented 4 years ago

Another perfect example is that I'm working on adding C# 7 Local Functions right now. Luckily they have some grammar-like specs in their proposal for that here (https://github.com/dotnet/csharplang/blob/master/proposals/csharp-7.0/local-functions.md).

But if I didn't have that and if I had to look at syntax nodes alone, I'd have to examine things like LocalFunctionStatementSyntax. This has a bunch of properties on them but I don't know in what order they appear in grammar or any restrictions. Like in the above proposal, I can see modifiers can only be 'async' or 'unsafe'. I would never know that by looking at the syntax node alone. That's just one small example of difficulties I encounter without a grammar.

CyrusNajmabadi commented 4 years ago

This has a bunch of properties on them but I don't know in what order they appear in grammar or any restrictions

Roslyn syntax nodes appear in linear order. The children cannot be reordered. So based on this, the grammar is:

LocalFunctionStatementSyntax: Modifiers ReturnType Identifier TypeParameterList ParameterList ConstraintClauses Block-or-ExpressionBody SemicolonToken

In Syntax.xml you will see which of these can be optional. i.e.

<Field Name="TypeParameterList" Type="TypeParameterListSyntax" Optional="true"/>

I imagine that information will soon be in the actual syntactic nodes themselves.
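A rough Python sketch of that reading: walk the Field elements of one node in document order and mark optional fields with ? and list-typed fields with *. The XML fragment is hand-written for this example and only modelled on the fields quoted in this thread, so the real Syntax.xml entry may differ; node_to_rule is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Assumption: a simplified, hand-written fragment in the style of Syntax.xml.
SAMPLE_NODE = """
<Node Name="LocalFunctionStatementSyntax" Base="StatementSyntax">
  <Field Name="Modifiers" Type="SyntaxList&lt;SyntaxToken&gt;"/>
  <Field Name="ReturnType" Type="TypeSyntax"/>
  <Field Name="Identifier" Type="SyntaxToken"/>
  <Field Name="TypeParameterList" Type="TypeParameterListSyntax" Optional="true"/>
  <Field Name="ParameterList" Type="ParameterListSyntax"/>
  <Field Name="ConstraintClauses" Type="SyntaxList&lt;TypeParameterConstraintClauseSyntax&gt;"/>
  <Field Name="Body" Type="BlockSyntax" Optional="true"/>
  <Field Name="ExpressionBody" Type="ArrowExpressionClauseSyntax" Optional="true"/>
  <Field Name="SemicolonToken" Type="SyntaxToken" Optional="true"/>
</Node>
"""

def node_to_rule(node_xml: str) -> str:
    """Turn one Node element into a one-line, grammar-like rule."""
    node = ET.fromstring(node_xml)
    parts = []
    for field in node.findall("Field"):  # document order is child order
        field_type = field.get("Type", "")
        if field_type.startswith(("SyntaxList<", "SeparatedSyntaxList<")):
            suffix = "*"  # zero or more children
        elif field.get("Optional") == "true":
            suffix = "?"  # child may be absent
        else:
            suffix = ""
        parts.append(field.get("Name") + suffix)
    return f"{node.get('Name')}: " + " ".join(parts)

rule = node_to_rule(SAMPLE_NODE)
print(rule)
```

Note that a mechanical translation like this yields Body? ExpressionBody? rather than an alternation, because the XML attributes alone do not say that the two fields are mutually exclusive.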

CyrusNajmabadi commented 4 years ago

Like in the above proposal, I can see modifiers can only be 'async' or 'unsafe'. I would never know that by looking at the syntax node alone.

There's no guarantee a grammar would tell you this either. For example, it would be completely reasonable for the grammar to just say "modifiers" and then have the rules later on about what modifiers are actually allowed.

The language is replete with cases like this. Sometimes the grammar is used for convenience, many times we don't bother and it's just encoded as a rule somewhere outside the grammar. We also started this way in TypeScript, but then decided it was just a pain to try to put that all in the grammar, esp. as it could make the grammar start deviating from the actual impl model. So we started moving some grammar rules to be less specific and more in line with the syntax model. And we then added more rules outside of the grammar to list restrictions.

CyrusNajmabadi commented 4 years ago

For one thing, it's hard to "diff" that between versions to know what's changed. Back when they published word documents with the specifications, I'd diff the two grammars to locate everything that was updated.

Can you not diff https://github.com/dotnet/roslyn/blob/master/src/Compilers/CSharp/Portable/Syntax/Syntax.xml? You should be able to do this on github.com right?

billhenn commented 4 years ago

In the past the C# grammar was updated with specific keywords pretty well. I know there's no guarantee, I'm just saying it was done fairly well in the past.

I wasn't aware of the Syntax.xml files before. Thanks for pointing them out. That's better than nothing I suppose, but still has challenges since it's a lot harder to read through. And it misses some information as well. For instance in the local functions work I'm on, you had written above in your grammar a "Block-or-ExpressionBody" item, but where did you get that from?

All I see in the Syntax.xml file is this surrounded by other Fields:

<Field Name="Body" Type="BlockSyntax" Optional="true"/>
<Field Name="ExpressionBody" Type="ArrowExpressionClauseSyntax" Optional="true" />

There doesn't seem to be any indication in the file that it's one or the other. That's the kind of thing that is very clear in a formal grammar.

CyrusNajmabadi commented 4 years ago

That's the kind of thing that is very clear in a formal grammar.

It doesn't have to be either. For example, it's a completely legit grammar rule to say:

LocalFunction: Modifiers Type Id TypeParameters? ParameterList Constraints? Body? ArrowExpr? Semicolon?

And then say outside of the grammar: it's only legal for a correct program to supply this combination of nodes. Again, a grammar only indicates what is definitely not accepted. It doesn't indicate the totality of programs that are accepted.

For that, you need to go past the grammar. Note that this is common for C#. After all, there are literally infinite programs that match the grammar, an infinite number of which are illegal :)
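As an illustration of rules that live outside the grammar, here is a small Python sketch of a post-parse check over a node shaped like the loose LocalFunction rule above. The dictionary layout and the function name check_local_function are assumptions made for this example; the allowed modifier set is the 'async' / 'unsafe' pair mentioned earlier in the thread for C# 7.0, and the checks are deliberately simplified.

```python
# Assumption: modifiers allowed on a local function per the C# 7.0 proposal
# discussed in this thread.
ALLOWED_MODIFIERS = {"async", "unsafe"}

def check_local_function(node: dict) -> list[str]:
    """Apply the restrictions the loose grammar rule does not encode.

    node maps field names to parsed children; None means the child is absent.
    """
    errors = []
    has_block = node.get("Body") is not None
    has_arrow = node.get("ExpressionBody") is not None

    # The grammar allows Body? ArrowExpr?, but only one of them may appear.
    if has_block == has_arrow:
        errors.append("exactly one of Body or ExpressionBody is required")

    # An expression body has to be terminated by a semicolon.
    if has_arrow and node.get("SemicolonToken") is None:
        errors.append("expression body must be followed by ';'")

    for modifier in node.get("Modifiers") or []:
        if modifier not in ALLOWED_MODIFIERS:
            errors.append(f"modifier {modifier!r} is not allowed on a local function")
    return errors

print(check_local_function(
    {"Modifiers": ["public"], "Body": "{ }", "ExpressionBody": "=> 1", "SemicolonToken": ";"}
))
```

The parser accepts anything that fits the shape, and a separate pass like this one reports the combinations that are not legal.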

CyrusNajmabadi commented 4 years ago

I wasn't aware of the Syntax.xml files before. Thanks for pointing them out. That's better than nothing I suppose,

Yaay!

but still has challenges since it's a lot harder to read through.

If you prefer antlr style, might i suggest writing a tool that converts from Syntax.xml to antlr form? It should be super trivial to do. Now you have the best of both worlds. The real definitions that the actual language implementation thinks are the right things (not some grammar that might be totally inconsistent with what the impl does), and some sort of grammar syntax you prefer.

You could even set it up to hyperlink between the sections for you.

If you're not interested in writing this, it's something I might end up doing myself. However, i'm very busy (launching a product and planning a wedding ;-) ), so i can't promise it would be any time soon.

I do think it would only be like an hour of work to at least do the simple Syntax.xml -> text-antlr-grammar initially. Having one that spit out hyperlinked markdown/html would likely take a bit more time.

billhenn commented 4 years ago

The whole point of a formal grammar is to take ambiguity out of that kind of thing. And that's what they had done in all the versions through C# 6. The past grammars were very clear and wouldn't have potential completely incorrect syntax items like "Body? ArrowExpr?" in them. They would say "(Body | ArrowExp)" instead. Trust me, I've done this for years based on their past grammars and they were very helpful.

I appreciate your assistance in pointing out the syntax.xml files and for your thoughts. I'm not sure that outputting an ANTLR grammar from the syntax.xml file would help due to the issues like above, where the resulting syntax is effectively wrong.

I still feel like MS dropped the ball after open sourcing Roslyn and stopped updating a formal grammar properly. What is out there doesn't give third parties enough info to fully and accurately build a parser, and we are left guessing at the rest.

CyrusNajmabadi commented 4 years ago

where the resulting syntax is effectively wrong.

It's not wrong. It's just that there are rules applied that aren't encoded syntactically.

The past grammars were very clear and wouldn't have potential completely incorrect syntax items

I find that surprising and hard to believe. For example, i'm almost 100% certain that the previous grammars would allow something like this: public override void Foo<T>() where T : Whatever. This is not legal at all, but the grammar would allow this. **

Similarly, it would allow virtually any combination of modifiers, which is not legal either. Initially in TS we didn't do that. For example, we explicitly enumerated in the grammar the exact set of legal modifiers (in the only allowed order) for different members. But we eventually rolled that back because we decided it was just too much of a PITA to encode that all in the grammar. Instead, we put in far more lenient forms, even though they could produce illegal code, and we added the normative explanations later to state which combinations and orders of modifiers were legal.

CyrusNajmabadi commented 4 years ago

What is out there doesn't give third parties enough info to fully and accurately build a parser,

I disagree. I could fully and accurately build a parser based on syntax.xml. It would allow a superset of legal C# programs. But that has always been the case. Even in 1.0 any grammars we produced would allow parsers to be written that would accept a superset of the language.

The difference now is that instead of having multiple sources of syntactic truth (which can get out of sync), there is only one source. Eventually that may get encoded into a more palatable form for you. But, it's totally possible that new form may literally just be an automated tool running over syntax.xml. It would be 100% (and importantly, 100% accurate).

--

Note to self: write syntax.xml->to->antlr grammar generator.

CyrusNajmabadi commented 4 years ago

I find that surprising and hard to believe. For example, i'm almost 100% certain that the previous grammars would allow something like this: public override void Foo<T>() where T : Whatever. This is not legal at all, but the grammar would allow this. **

** I went back and checked, and i was correct on this. The previous grammars allow all sorts of syntactic forms that are absolutely illegal, and which (for example) the C# parser disallows. I looked at tons of nodes of moderate complexity, and this was true all over the place.

Do not look at the grammar to understand what is syntactically legal or to "fully and accurately build a parser". It's just not sufficient.

CyrusNajmabadi commented 4 years ago

What is out there doesn't give third parties enough info to fully and accurately build a parser,

As an 'aside' on this. I spent a lot of time on Roslyn (and TypeScript) updating the actual parser to try to use the following principles:

The goal of the parser is to take the user text and fit it into the specified syntax model.
The parser should always succeed at building a tree that roundtrips to the original source text (barring things out of its control, like not enough memory).
The only case for parser diagnostics are:
- There were tokens required that were not there. In which case the parser should produce a missing-token for the tree, and emit a diagnostic stating at least one (or more) of the tokens it expected.
- There were extra tokens that weren't valid and couldn't be consumed by introducing prior missing tokens to resync on it. In which case, the parser should add the extra token as a skipped-token and emit a diagnostic for it.

Turns out these simple rules help build incredibly robust and viable parsers for both compilation and IDE needs. First, the rules help make it 100% clear what the purview of the parser is. It's not there to make semantic determinations. It's just there to fit the original text into the specified syntactic model. Second, it makes adding features (like robust error correction or incremental parsing) much simpler. That's because those new features don't have to know all the intricacies about how the different parts of the parser may behave and how that may actually impact invariants they want to assume.
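A toy Python sketch of those rules, for a made-up one-statement language of the form IDENT = NUMBER ; (nothing here is taken from the Roslyn or TypeScript sources, and every name is hypothetical). The parser always returns a tree, inserts zero-width missing tokens, keeps unexpected tokens as skipped tokens, and the tree roundtrips to the original text.

```python
import re
from dataclasses import dataclass

@dataclass
class Token:
    kind: str              # "ident", "number", "equals", "semicolon", "unknown", "eof"
    text: str              # exact source text, including leading whitespace
    missing: bool = False  # synthesized by the parser, zero width
    skipped: bool = False  # present in the source but not valid here

TOKEN_RE = re.compile(
    r"\s*(?:(?P<ident>[A-Za-z_]\w*)|(?P<number>\d+)|(?P<equals>=)"
    r"|(?P<semicolon>;)|(?P<unknown>\S))"
)
EXPECTED = ["ident", "equals", "number", "semicolon"]

def tokenize(source: str) -> list[Token]:
    tokens, pos = [], 0
    while (m := TOKEN_RE.match(source, pos)) is not None:
        tokens.append(Token(m.lastgroup, m.group(0)))
        pos = m.end()
    tokens.append(Token("eof", source[pos:]))  # holds trailing whitespace
    return tokens

def parse_assignment(source: str) -> tuple[list[Token], list[str]]:
    tokens, tree, diagnostics, i = tokenize(source), [], [], 0
    for kind in EXPECTED:
        current = tokens[i]
        # Extra token: skip it only if that lets us resync on the expected kind.
        if current.kind not in (kind, "eof") and tokens[i + 1].kind == kind:
            current.skipped = True
            diagnostics.append(f"unexpected token {current.text.strip()!r}")
            tree.append(current)
            i += 1
        if tokens[i].kind == kind:
            tree.append(tokens[i])
            i += 1
        else:
            # Required token not there: add a missing token and report it.
            tree.append(Token(kind, "", missing=True))
            diagnostics.append(f"expected {kind}")
    while tokens[i].kind != "eof":  # leftovers become skipped tokens
        tokens[i].skipped = True
        diagnostics.append(f"unexpected token {tokens[i].text.strip()!r}")
        tree.append(tokens[i])
        i += 1
    tree.append(tokens[i])
    return tree, diagnostics

def to_source(tree: list[Token]) -> str:
    return "".join(token.text for token in tree)
```

Because missing tokens have empty text and skipped tokens keep theirs, to_source(tree) reproduces the input exactly, whether or not diagnostics were reported.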

I was able to apply this approach very successfully to the TypeScript parser (which tries to follow these rules about 98% of the time). The C# parser was far away from this for a while but i was able to move it closer and closer to this point through a series of several dozens of PRs over the years. It's still a long standing hope of mine that i can get it to that 98% or above point and that any and all extra stuff it happens to do now are just pulled out entirely and there's a simple, lean-and-mean, parser that is trivial to understand and verify and which follows these rules steadfastly.

IMO, this is something C# and Roslyn could try to commit to in terms of providing information for 3rd parties. I think asking for information beyond that is certainly something that someone could do. But i would personally not think of it as something i would expect the language/impl to do.

Korporal commented 4 years ago

@CyrusNajmabadi @BillHenning

Cyrus, are you saying Microsoft are unable or unwilling to unambiguously define the C# programming language other than the implementation itself?

What is so special about C# 7 or 8 that it cannot be defined in a standards document as was the case for C# 6?

You're right of course that certain things like allowed optional keywords and so on are not purely grammar but a language standard usually covers these separately.

C# is not a specification language so why propose that the compiler itself be the specification?

From what I've seen here the past few years, bugs in the parser are reported by users running early versions who perceive something is wrong not by users observing behavior that deviates from a standard.

May I ask, is the C# parser source code generated from a grammar description of some type?

CyrusNajmabadi commented 4 years ago

Cyrus, are you saying Microsoft are unable or unwilling to unambiguously define the C# programming language other than the implementation itself?

I'm saying that producing a secondary grammar on top of the existing grammar is not likely to be helpful (unless the secondary one is generated from the first). At best they'll just be in sync, but you'll have two artifacts that have to be maintained. At worst, they'll have deviations, in which case following the secondary artifact will likely be wrong and will lead to accepting/rejecting code that would otherwise be legal. Since Roslyn doesn't want to break back compat, we would likely go with the primary form, and that would then be what any other parser would want to accept.

What is so special about C# 7 or 8 that it cannot be defined in a standards document as was the case for C# 6?

I mean, like i said, you're welcome to do this. I'm going to see if i can write you such a tool tonight.

C# is not a specification language so why propose that the compiler itself be the specification?

I propose the actual place where we've defined the syntax in a structured manner to be a good place to be able to expect to get the authoritative view on what is allowed/rejected.

I'm happy to give you that in antlr form as well if that's helpful.
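
For readers following along, here is a minimal Python sketch of the "secondary grammar generated from the first" idea discussed above. It is not the tool mentioned in this comment and it does not read the real Syntax.xml: the XML fragment, its `Text` and `Optional` attributes, and the helper names `snake`, `field_to_term` and `model_to_rules` are all hypothetical, chosen only to show how grammar rules can be derived mechanically from a single syntax definition so the two cannot drift apart.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified syntax model. The real Syntax.xml
# schema is richer; these element and attribute names are assumptions.
MODEL = """
<Tree>
  <Node Name="ReturnStatement">
    <Field Name="ReturnKeyword" Type="Token" Text="return"/>
    <Field Name="Expression" Type="Expression" Optional="true"/>
    <Field Name="SemicolonToken" Type="Token" Text=";"/>
  </Node>
  <Node Name="ThrowStatement">
    <Field Name="ThrowKeyword" Type="Token" Text="throw"/>
    <Field Name="Expression" Type="Expression" Optional="true"/>
    <Field Name="SemicolonToken" Type="Token" Text=";"/>
  </Node>
</Tree>
"""


def snake(name):
    """Convert a PascalCase name to snake_case for use as a rule name."""
    out = []
    for i, ch in enumerate(name):
        if ch.isupper() and i > 0:
            out.append("_")
        out.append(ch.lower())
    return "".join(out)


def field_to_term(field):
    """Render one field of a node as a term of a grammar rule."""
    if field.get("Type") == "Token":
        term = "'" + field.get("Text") + "'"   # literal token
    else:
        term = snake(field.get("Type"))        # reference to another rule
    if field.get("Optional") == "true":
        term += "?"
    return term


def model_to_rules(xml_text):
    """Emit one grammar rule per node, in model order."""
    root = ET.fromstring(xml_text.strip())
    rules = []
    for node in root.findall("Node"):
        terms = " ".join(field_to_term(f) for f in node.findall("Field"))
        rules.append(f"{snake(node.get('Name'))} : {terms} ;")
    return rules


RULES = model_to_rules(MODEL)
for rule in RULES:
    print(rule)
```

Because the rules are regenerated from the model each time, there is only one artifact to maintain; the printed grammar is a view of it, not a second definition.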

CyrusNajmabadi commented 4 years ago

May I ask, is the C# parser source code generated from a grammar description of some type?

It is not. Hence why the syntax model serves as the authority. The parser cannot produce a tree that the syntax does not allow (at least not without skipped/missing tokens which cause diagnostics). Specifically, the syntax has no way to represent anything that isn't specified in the syntax model (outside of skipped/missing tokens). So you either fit the syntax model, without diagnostics. Or you don't fit it, but you have diagnostics.

Everything else you would have to infer from the spec as it's my intention (which i did almost entirely with the TS parser, and i plan to do for the same level with the C# parser) to not have the parser do anything but fit to the syntax model. See https://github.com/dotnet/csharplang/issues/2640#issuecomment-517426805 for more details.

Effectively, i want as near as possible, one source of truth for the syntax allowed by the language. If i can get the parser down to being as simple as i mentioned in that post, then we can probably get to a literal single source of truth.

Korporal commented 4 years ago

May I ask, is the C# parser source code generated from a grammar description of some type?

It is not.

Why not? this is a little surprising to me.

Hence why the syntax model serves as the authority. The parser cannot produce a tree that the syntax does not allow (at least not without skipped/missing tokens which cause diagnostics).

You mean that parser cannot produce a tree that it cannot produce! The phrase "syntax does not allow" makes no sense here since "allow" is (according to what you've been saying) by definition what the parser allows - there is no other reference.

Everything else you would have to infer from the spec as it's my intention (which i did almost entirely with the TS parser, and i plan to do for the same level with the C# parser) to not have the parser do anything but fit to the syntax model. See #2640 (comment) for more details.

Effectively, i want as near as possible, one source of truth for the syntax allowed by the language. If i can get the parser down to being as simple as i mentioned in that post, then we can probably get to a literal single source of truth.

Well I've only ever worked on one compiler, that was an implementation for PL/I on Windows and is based on the ANSI Standard. (This was a hand crafted recursive descent parser written in C).

That document (sadly not freely available, I have a printed copy) is a superb example of how to define a language lexically, syntactically and semantically.

CyrusNajmabadi commented 4 years ago

Why not? this is a little surprising to me.

Primarily because there aren't existing parser generator toolkits that generate parsers with the capabilities that are needed. Especially, but not limited to, the areas around performance, error tolerance and recovery as well as the need for extremely efficient incremental parsing.

CyrusNajmabadi commented 4 years ago

You mean that parser cannot produce a tree that it cannot produce!

No, i mean it cannot accept source code that doesn't match the syntax model without issuing diagnostics.

Korporal commented 4 years ago

Why not? this is a little surprising to me.

Primarily because there aren't existing parser generator toolkits that generate parsers with the capabilities that are needed. Especially, but not limited to, the areas around error tolerance and recovery as well as the need for extremely efficient incremental parsing.

@CyrusNajmabadi - Hmm, I wonder how much of that is attributable to the messed up C, C++ and Java grammar that was chosen in the first place, there may well have been better grammars that would have been amenable to automatic parser generation - this is speculation but one does wonder.

Korporal commented 4 years ago

You mean that parser cannot produce a tree that it cannot produce!

No, i mean it cannot accept source code that doesn't match the syntax model without issuing diagnostics.

@CyrusNajmabadi - but what is the "syntax model" - I thought you were saying that there is no syntax standard or definition and the compiler itself is the definition - perhaps I misunderstood.

CyrusNajmabadi commented 4 years ago

@CyrusNajmabadi - Hmm, I wonder how much of that is attributable to the messed up C, C++ and Java grammar that was chosen in the first place

I'm unaware of any languages that don't have the same problems here. PL/I, in particular, is exceptionally difficult to do well because effectively any keyword can be an identifier somewhere else. As such, many common types of intermediary user editing states (i.e. where you may be lacking punctuation, or whitespace, or newlines) will be very problematic for determining what is going on.

I stay pretty well informed on parser toolkit tech. I've even explored writing some myself, and it's something i'd like to move roslyn to in the future. Right now, there really don't seem to be any viable choices out there. Pretty much all of them fail on at least one of those aspects I mentioned. And that's just a few core things we care about. There are many others as well, but i don't think it's valuable to spend the time enumerating those all here now.

CyrusNajmabadi commented 4 years ago

@CyrusNajmabadi - but what is the "syntax model" -

The model defined by the https://github.com/dotnet/roslyn/blob/master/src/Compilers/CSharp/Portable/Syntax/Syntax.xml file that has been linked several times in this thread. This is literally something the impl cannot override. We literally generate nodes 1:1 with this (as such, it serves as the generator that produces the syntactic nodes that the parser can produce).

The impl must fit the source code into this. Because of certain roslyn invariants (for example, that the tree and the original source must be 100% roundtrippable), this means that source code can either fit into that model without diagnostics, or it has too little or too much in it, and it will fit in with diagnostics.

As such, the syntactic model here serves as the best authority on what is actually allowed at the syntactic level or not.
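
The "fits without diagnostics, or fits with diagnostics" behavior can be illustrated with a toy parser. The Python sketch below is not Roslyn code and covers only a made-up one-statement language (`return identifier? ;`); the names `Token`, `ReturnStatement` and `parse_return_statement` are hypothetical. It assumes, as described above, that absent tokens are represented as zero-width "missing" tokens, that unexpected tokens are kept as skipped text attached to the next token, and that the tree must reproduce the source exactly.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Token:
    kind: str
    text: str
    leading: str = ""      # whitespace and skipped text before the token
    missing: bool = False  # True when the parser had to invent the token

    def full_text(self):
        return self.leading + self.text


@dataclass
class ReturnStatement:
    """The whole 'syntax model' of this toy language: one fixed shape."""
    return_keyword: Token
    expression: Optional[Token]
    semicolon: Token
    end_of_file: Token

    def to_full_string(self):
        parts = [self.return_keyword, self.expression,
                 self.semicolon, self.end_of_file]
        return "".join(t.full_text() for t in parts if t is not None)


TOKEN_RE = re.compile(r"(\s*)(return\b|[A-Za-z_]\w*|;|\S)")


def tokenize(source):
    tokens, pos = [], 0
    for m in TOKEN_RE.finditer(source):
        ws, text = m.group(1), m.group(2)
        if text in ("return", ";"):
            kind = text
        elif re.fullmatch(r"[A-Za-z_]\w*", text):
            kind = "identifier"
        else:
            kind = "unknown"
        tokens.append(Token(kind, text, ws))
        pos = m.end()
    tokens.append(Token("eof", "", source[pos:]))  # keeps trailing whitespace
    return tokens


def parse_return_statement(source):
    tokens, diagnostics = tokenize(source), []
    i, skipped = 0, ""

    def take(kind, optional=False):
        nonlocal i, skipped
        tok = tokens[i]
        if tok.kind == kind:
            i += 1
            tok.leading, skipped = skipped + tok.leading, ""
            return tok
        if optional:
            return None
        diagnostics.append(f"expected '{kind}'")
        return Token(kind, "", missing=True)  # zero-width placeholder

    def skip_until(*kinds):
        nonlocal i, skipped
        while tokens[i].kind not in kinds:
            diagnostics.append(f"unexpected '{tokens[i].text}'")
            skipped += tokens[i].full_text()  # kept, so nothing is lost
            i += 1

    ret = take("return")
    expr = take("identifier", optional=True)
    skip_until(";", "eof")
    semi = take(";")
    skip_until("eof")
    return ReturnStatement(ret, expr, semi, take("eof")), diagnostics


for text in ["return x;", "return x", "return x y ;"]:
    node, diags = parse_return_statement(text)
    assert node.to_full_string() == text  # round-trip invariant
    print(repr(text), diags)
```

Every input produces a `ReturnStatement`, because that is the only shape the model allows. Inputs that do not match it exactly still round-trip, but only by way of a missing or skipped token, and each of those records a diagnostic.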

CyrusNajmabadi commented 4 years ago

FYI @Korporal you're welcome to take this discussion to gitter. I don't think it's particularly helpful in the context of this issue.
