KirillOsenkov / XmlParser

A Roslyn-inspired full-fidelity XML parser with no dependencies and a simple Visual Studio XML language service
Apache License 2.0
328 stars 49 forks source link

feat: 2.2.12 [XML] Section 3.3.3, Attribute-Value Normalization #45

Closed workgroupengineering closed 1 year ago

workgroupengineering commented 1 year ago

Address spec

Currently the TestParser.ExhaustiveSubstring test fails, but I don't understand why. Do you have an idea?

workgroupengineering commented 1 year ago

What is the purpose of "TextParser.ExhaustiveSubstring"? I do not understand. This PR is required for using this parser in Avalonia XamlC. To do it fautled tollerance.

KirillOsenkov commented 1 year ago

The purpose of this parser is to provide a fully roundtrippable tree representation of a text document that may or may not contain valid XML. It should try its best to also represent partially valid XML and be error tolerant. Strictly speaking it should be able to correctly parse any plain text, but if it is valid XML it should build a tree over it that associated every character in the text with some location in the tree.

Being fully roundtrippable means you can walk the resulting tree and exactly reconstruct the source text, character by character identical to the original. It uses nodes, tokens and trivia to represent every character in the text. For XmlTextToken tokens, Text represents a span in the original text exactly as is. Replacing the Text with a normalized value of the attribute would defeat the purpose of being roundtrippable.

The ExhaustiveSubstring test validates that we can parse any substring of a typical XML document and return a parse tree that a) covers every character and b) correctly reconstructs the exact source text from the tree. Once you return an XmlTextToken whose Text doesn't exactly represent the source code, the test correctly fails.

I can totally see adding XmlAttributeSyntax.NormalizedValue that returns the normalized value, or directly normalizing the XmlAttributeSyntax.Value. I think just .Replace("\r", " ").Replace("\n", " ").Replace("\t", " ").Replace(" ", " ").Trim() inside the Value getter should be good enough.

KirillOsenkov commented 1 year ago

BTW Roslyn stores both the Text and the Value on SyntaxToken: https://github.com/dotnet/roslyn/blob/c5d20779f421caafbd4e18e3bfc5c0320a0a29f8/src/Compilers/Core/Portable/Syntax/SyntaxToken.cs#L144-L156

I can imagine taking the value text from Scratch in the scanner and storing it together with the Text on the XmlTextToken. However this would increase memory consumption of the syntax trees.

I think it's better to normalize the text on the fly in the getter of Value.

KirillOsenkov commented 1 year ago

I've published the new NuGet versions: