AutoRest fails to produce the expected XML documentation in .NET libraries

mmacy commented 5 years ago

ℹ️ Updated: 02 January 2023

Problem

When AutoRest generates C# XML documentation comments in .NET libraries from OpenAPI description files whose description fields contain HTML markup tags, the tags are merely escaped by transforming the tag delimiters into their HTML character entity equivalents. This leads to downstream tools like IntelliSense, DocFx, and other documentation processors rendering unexpected (incorrect) output that doesn't reflect OpenAPI description file author's intended formatting.

For example, the angle brackets in HTML tags like <b>, <ul>, and <li> that appear in the description field of OpenAPI descriptions (aka "Swagger files") are transformed into their character entity equivalents < and > in the XML documentation comments generated by AutoRest.

Duplicate issues:

Expected

AutoRest generates well-formed XML in C# doc comments per the C# language specification.
AutoRest generates doc comments that reflect the OpenAPI description author's intent.
AutoRest has markup support parity with OpenAPI 2.0 and 3.0 specifications.
- Specifically for this issue is parity support for handling the markup that appears in the description field of applicable entity types defined in an OpenAPI description file.

Expected behavior: two options

There are a couple options I can think of that might be considered as a fix:

Expected behavior option 1: atomic tag transform 👈 preferred
- Transform each HTML tag found in the OpenAPI description file's description fields into their XML-compliant equivalents like <para> and <list> when generating C# XML documentation comments in .NET libraries.
- More info in comment https://github.com/Azure/autorest.csharp/issues/208#issuecomment-461630640
Expected behavior option 2: CDATA
- Enclose the entire contents of the OpenAPI description file's description fields within XML CDATA tags when generating C# XML documentation comments in .NET libraries.
- More info in comment https://github.com/Azure/autorest.csharp/issues/208#issuecomment-1046056381

Encountered

OpenAPI description (Swagger file) - description field of adminUsername from Microsoft.Compute/stable/2022-03-01/ComputeRP/computeRPCommon.json#L1360

"description": "Specifies the name of the administrator account. <br><br> This property cannot be updated after the VM is created. <br><br> **Windows-only restriction:** Cannot end in \".\" <br><br> **Disallowed values:** \"administrator\", \"admin\", \"user\", \"user1\", \"test\", \"user2\", \"test1\", \"user3\", \"admin1\", \"1\", \"123\", \"a\", \"actuser\", \"adm\", \"admin2\", \"aspnet\", \"backup\", \"console\", \"david\", \"guest\", \"john\", \"owner\", \"root\", \"server\", \"sql\", \"support\", \"support_388945a0\", \"sys\", \"test2\", \"test3\", \"user4\", \"user5\". <br><br> **Minimum-length (Linux):** 1  character <br><br> **Max-length (Linux):** 64 characters <br><br> **Max-length (Windows):** 20 characters."

AutoRest-generated source code - OSProfile.cs

/// <summary> Specifies the name of the administrator account. &lt;br&gt;&lt;br&gt; This property cannot be updated after the VM is created. &lt;br&gt;&lt;br&gt; **Windows-only restriction:** Cannot end in &quot;.&quot; &lt;br&gt;&lt;br&gt; **Disallowed values:** &quot;administrator&quot;, &quot;admin&quot;, &quot;user&quot;, &quot;user1&quot;, &quot;test&quot;, &quot;user2&quot;, &quot;test1&quot;, &quot;user3&quot;, &quot;admin1&quot;, &quot;1&quot;, &quot;123&quot;, &quot;a&quot;, &quot;actuser&quot;, &quot;adm&quot;, &quot;admin2&quot;, &quot;aspnet&quot;, &quot;backup&quot;, &quot;console&quot;, &quot;david&quot;, &quot;guest&quot;, &quot;john&quot;, &quot;owner&quot;, &quot;root&quot;, &quot;server&quot;, &quot;sql&quot;, &quot;support&quot;, &quot;support_388945a0&quot;, &quot;sys&quot;, &quot;test2&quot;, &quot;test3&quot;, &quot;user4&quot;, &quot;user5&quot;. &lt;br&gt;&lt;br&gt; **Minimum-length (Linux):** 1  character &lt;br&gt;&lt;br&gt; **Max-length (Linux):** 64 characters &lt;br&gt;&lt;br&gt; **Max-length (Windows):** 20 characters. </summary>

How it's rendered Intellisense:

Rendered on docs.microsoft.com - OSProfile.AdminPassword:

shahabhijeet commented 5 years ago

@mmacy

the example you have shown for GFM is going away for Open API 3.0, so any support we add for GFM is limited until Open API 2.0
For more clarification, can you provide an example as to what do you expect in generated code for C# to look like with those GFM style description?
1. Are you suggesting GFM should be passed in unescaped (that causes an additional warning for each violation in the generated C# code)
2. Will need to verify with the documentation team, if they can generate documentation using generated xml documentation file which will have badly formed xml comments.
3. Also we cannot go beyond than what is supported in C# for xml comments which is adhering to W3C XML Recommendation

mmacy commented 5 years ago

@shahabhijeet

Good info, thanks for the clarifications on what is/isn't supported. It seems we have a bit of a conflict between what's supported in Open API 2.0 versus that which is supported in XML comments in C#.

Per the W3C XML Recommendation, the ampersand left angle bracket must not appear in their literal form:

The ampersand character (&) and the left angle bracket (<) must not appear in their literal form, except when used as markup delimiters, or within a comment, a processing instruction, or a CDATA section.

Further, it also states that such characters must be escaped, which AutoRest is currently performing (emphasis mine):

If they are needed elsewhere, they must be escaped using either numeric character references or the strings "&" and " <" respectively. The right angle bracket (>) may be represented using the string ">", and must, for compatibility, be escaped using either ">" or a character reference when it appears in the string "]]>" in content, when that string is not marking the end of a CDATA section.

@dendeli-msft AutoRest is adhering to the specifications set forth by the W3C XML Recommendation. As such, the < and > escape sequences appearing in C# XML docs are as designed. What recourse might we have, if any, to handle proper rendering of XML code comments that include HTML tags with escaped < and > characters?

lmazuel commented 5 years ago

@shahabhijeet For my own curiosity as a non-C# developer, what is the expected content type inside <summary>?

For instance, I see link that way:

/// used in this field, see [Selecting User Names for Linux on
/// Azure](https://docs.microsoft.com/azure/virtual-machines/virtual-machines-linux-usernames?toc=%2fazure%2fvirtual-machines%2flinux%2ftoc.json)

which suggest the content type is Markdown. I guess there is a specification / best practices in the C# world to describe the content type expected from inside the <summary>? For instance in Java, this is definitely HTML (text/html)

mmacy commented 5 years ago

@lmazuel The C# language specification states that only valid XML is supported:

The text within documentation comments must be well formed according to the rules of XML (https://www.w3.org/TR/REC-xml). If the XML is ill formed, a warning is generated and the documentation file will contain a comment saying that an error was encountered.

However, we apparently handle some Markdown in OPS platform to handle that Markdown and render it on docs. ~~According to the language spec, though, it's apparently not "legal" to have Markdown in XML code comments.~~ EDIT: If I understand correctly, Markdown that doesn't contain HTML tags would, in fact, be valid in XML comments (as long as that Markdown doesn't include < or >).

The C# Programming Guide has information on the various recommended tags in XML comments; you can find the article on <summary> here:

\<summary> (C# Programming Guide)

In that same guide, you can see other formatting tags like <para> and <list> which can be used within a <summary> section to provide richer formatting. Perhaps functionality needs to be added to the .NET generator for AutoRest that converts the ~~Markdown and~~ HTML in a Swagger file description field into those XML doc equivalents?

lmazuel commented 5 years ago

Ok, thought "summary" was a place holder for TEXT (CDATA like) but it's actually a proper XML language.

For that I see, there need to be a translation from Markdown to this XML format.

Example:

/// used in this field, see [Selecting User Names for Linux on
/// Azure](https://docs.microsoft.com/azure/virtual-machines/virtual-machines-linux-usernames?toc=%2fazure%2fvirtual-machines%2flinux%2ftoc.json)

should be (ref)

/// used in this field, see <see href="https://docs.microsoft.com/azure/virtual-machines/virtual-machines-linux-usernames?">
/// Selecting User Names for Linux on Azure</see>

or this:

text 1 <br/> text 2

should be (ref from ScottHa)

<para>text 1</para>
<para>text 2</para>

Again, I'm not a C# dev but this is my understanding of the XML spec.

And yes, it's not trivial, I never said that...

lmazuel commented 5 years ago

Plan B would be doc team process "summary" as a big bloc of Markdown/HTML instead of the expected summary XML format. That could be a pragmatic way to solve it, but then we should keep in mind that we tight a particular MS-specific output to a particular specific MS-specific doc input.

And I throw myself under the bus for Python as well, this is the only reason why Python doc works: I write Markdown in my DocString (where I should write RST only), and it works just because Doc team happen to parse it by chance. So again, I really don't say it's trivial to do this transformation.

mmacy commented 5 years ago

@shahabhijeet

For more clarification, can you provide an example as to what do you expect in generated code for C# to look like with those GFM style description?

I'd expect AutoRest to generate the following, which uses C#'s <para>, <list>, and <item> tags in place of the HTML tags currently in place in the Swagger file's description field for this member. This example is not ideal (see the extra <para> tags within the <list>, for example) but I've kept in mind how the code generator would see and then possibly convert it. For side-by-side comparison, see this Gist.

/// <summary>
/// Gets or sets specifies the name of the administrator account.
/// <para>**Windows-only restriction:**
/// Cannot end in "." </para><para>**Disallowed
/// values:** "administrator", "admin", "user", "user1", "test",
/// "user2", "test1", "user3", "admin1", "1", "123", "a", "actuser",
/// "adm", "admin2", "aspnet", "backup", "console", "david", "guest",
/// "john", "owner", "root", "server", "sql", "support",
/// "support_388945a0", "sys", "test2", "test3", "user4", "user5".</para>
/// <para>**Minimum-length (Linux):** 1
/// character</para><para>**Max-length
/// (Linux):** 64 characters </para><para>
/// **Max-length (Windows):** 20 characters</para>
/// <para><list type="bullet"><item> For root
/// access to the Linux VM, see [Using root privileges on Linux virtual
/// machines in
/// Azure](https://docs.microsoft.com/azure/virtual-machines/virtual-machines-linux-use-root-privileges?toc=%2fazure%2fvirtual-machines%2flinux%2ftoc.json)</item></para><para><list type="bullet"><item>
/// For a list of built-in system users on Linux that should not be
/// used in this field, see [Selecting User Names for Linux on
/// Azure](https://docs.microsoft.com/azure/virtual-machines/virtual-machines-linux-usernames?toc=%2fazure%2fvirtual-machines%2flinux%2ftoc.json)</para></item>
/// </summary>
[JsonProperty(PropertyName = "adminUsername")]
public string AdminUsername { get; set; }

Are you suggesting GFM should be passed in unescaped (that causes an additional warning for each violation in the generated C# code)

Originally, yes. However, as described my previous comment above, GFM (or, in this case, the HTML tags supported by GFM) should instead be replaced by their C# XML comment-supported equivalents.

Will need to verify with the documentation team, if they can generate documentation using generated xml documentation file which will have badly formed xml comments.

We should avoid badly formed XML comments.

Also we cannot go beyond than what is supported in C# for xml comments which is adhering to W3C XML Recommendation

Thanks, this was good info.

Please let me know if you need any additional information.

Cc: @dendeli-msft

mmacy commented 5 years ago

Ping @shahabhijeet :bell:

Is there anything else you need here?

supernova-eng commented 5 years ago

@shahabhijeet the issue is still active for our documentation. Do you happen to have any updates?

mimisasouvanh commented 5 years ago

Hi @shahabhijeet Would you please give me an update on where your team is on this issue? Do you need more info from us?

mimisasouvanh commented 4 years ago

ping.. any update on this?

mmacy commented 2 years ago

@lmazuel @jsquire @mimisasouvanh In https://github.com/Azure/autorest.csharp/issues/208#issuecomment-461630640, I had responded to Abhijeet with a proposed fix for this:

I'd expect AutoRest to generate the following, which uses C#'s <para>, <list>, and <item> tags in place of the HTML tags currently in place in the Swagger file's description field for this member. *snip*

However, I think there might be ~~a much better~~ another way. ~~than both that and the current AutoRest behavior of escaping (encoding) individual characters.~~

Alternate fix proposal: CDATA

What if, instead of AutoRest swapping out the individual HTML tags for their XML equivalents as I'd suggested in https://github.com/Azure/autorest.csharp/issues/208#issuecomment-461630640, we surrounded the entire description field value in a CDATA block, and stuff that into the <summary> block of the XML doc comments in the C# code?

Given this description field value from Microsoft.Compute/compute.json:

"description": "Specifies how the virtual machine should be created.<br><br> Possible values are:<br><br> **Attach** \\u2013 This value is used when you are using a specialized disk to create the virtual machine.<br><br> **FromImage** \\u2013 This value is used when you are using an image to create the virtual machine. If you are using a platform image, you also use the imageReference element described above. If you are using a marketplace image, you  also use the plan element previously described.",

AutoRest would render this valid XML in the triple-slash XML doc comments:

/// <summary>
/// <![CDATA[
/// Specifies how the virtual machine should be created.<br><br> Possible values are:<br><br> **Attach** \\u2013 This value is used when you are using a specialized disk to create the virtual machine.<br><br> **FromImage** \\u2013 This value is used when you are using an image to create the virtual machine. If you are using a platform image, you also use the imageReference element described above. If you are using a marketplace image, you  also use the plan element previously described.
/// ]]>
/// </summary>

Enclose in CDATA by default?

Per 2.7 CDATA sections in the W3C XML spec:

[Definition](): CDATA sections may occur anywhere character data may occur; they are used to escape blocks of text containing characters which would otherwise be recognized as markup. CDATA sections begin with the string "<![CDATA[" and end with the string "]]>"

There might be an obvious reason we're not already doing this, and it would need investigation and testing to determine whether it'd break anything, but we ostensibly could by default enclose the entire description field value in CDATA tags.

Benefits
- Easier to implement than swapping individual markup tags, thus less error-prone.
- Not as brittle because would handle all markup in the description field, including:
- GFM (OpenAPI 2.0)
- CommonMark (OpenAPI 3.0)
- HTML (OpenAPI 2.0 and 3.0)
Risks
- IntelliSense and likely any other downstream tools still wouldn't render the intended formatting.

danieljurek commented 2 years ago

If my read of this issue is correct...

When converting from a description written in the swagger where the description contains HTML formatting characters we're having trouble rendering those outputs to look as intended in downstream processes. Since XML comments in C# code are treated as XML text, the presence of character sequences that look like XML tags is not allowed and the characters must be converted into escaped sequences (e.g. <br> -> <br>)...

If this is the case then a few obvious options are:

Changes to how we generate C# code and comments in autorest to accommodate generating these tags in a way that the docs will be rendered correctly on the docs site or in other places like Visual Studio tooltips (e.g. translate HTML into accepted doc comment tags like <para>).
Changes to processes downstream of the code (e.g. docfx, docs yaml generation) to decode these sequences back into HTML and render them as expected. We'd also have to consider how these look when rendered in a tool like Visual Studio tooltips.
Changes to code generation and possibly downstream docs (docs.microsoft.com, Visual Studio tooltips) to render CDATA properly
Changes to swagger fields like description to not use HTML tags and use a different approach to formatting (this may impact rendering for docs in other languages which use different tools to generate reference documentation)

Who can help decide what the right path forward is here?

cc @scottaddie, @mimisasouvanh

lmazuel commented 2 years ago

@danieljurek only option1 is sustainable to me.

Swagger description are written in MD.
C# docstring is some form of XML doc
Thefore, autorest needs to be able to transform MD to the right XML doc. Autorest is just a smart SDK writer, it should write actual good C#, including doc.

For instance with Python:

Swagger are MD
Python Docstring are RST
Therefore we plugged a "m2r" (MD to RST) plugin that we apply to all description string

I feel C# needs something like that.

CC @AlexanderSher

danieljurek commented 2 years ago

Here's an example of this happening in a track 2 package today @AlexanderSher

Sandido commented 2 years ago

@mmacy , @danieljurek , Did the CDATA solution progress beyond theory?

mmacy commented 2 years ago

@Sandido Not that I'm aware of, no. I haven't worked directly with OpenAPI specs for several years and have sadly never found the time to test it out. 🙁

jsquire commented 2 years ago

I'm classifying this as v3, as any fixes will be applied to this version.

m-nash commented 2 years ago

@danieljurek can we validate that manually making those encoding changes yields the expected result on the docs page?

danieljurek commented 1 year ago

@mmacy -- Did you do some testing around these changes that could validate @m-nash's question?

mmacy commented 1 year ago

Sorry @danieljurek, no, I have not manually tested the proposed changes.

Apologies for the delay in response, as well. I moved on from Microsoft at the end of October and got caught up in the switch-to-a-new-company-and-new-job thing, and your question got lost in the fray.

On that note, there'll likely be no change in testing status from me on this issue, but I'll certainly continue following it. 🙂

Cc: @m-nash

Azure / autorest.csharp