AngleSharp / AngleSharp

:angel: The ultimate angle brackets parser library parsing HTML5, MathML, SVG and CSS to construct a DOM based on the official W3C specifications.
https://anglesharp.github.io
MIT License

Detect closing style #1107

Closed by SebastianStehle 1 year ago

SebastianStehle commented 1 year ago

New Feature Proposal

Description

I am using AngleSharp together with AngleSharp.Diff. I would like to detect how an element is actually closed, so I have built a very simple example:

using System;
using AngleSharp.Dom;
using AngleSharp.Html.Parser;

namespace ConsoleApp1
{
    internal class Program
    {
        static void Main(string[] args)
        {
            var input1 = "<input >";
            var input2 = "<input />";

            var doc1 = new HtmlParser().ParseDocument(input1);
            var doc2 = new HtmlParser().ParseDocument(input2);

            Print(doc1, 0);
            Print(doc2, 0);
        }

        static void Print(IParentNode node, int indent)
        {
            if (node is IElement element)
            {
                Console.Write(new string(' ', indent));
                Console.WriteLine($" - {element.TagName} - {element.Flags}");
            }

            foreach (var child in node.Children)
            {
                Print(child, indent + 2);
            }
        }
    }
}

Is it possible to detect the difference here? The flags are identical for both elements.

Background

I have created an MJML renderer for .NET: https://github.com/SebastianStehle/mjml-net ... and we need to check whether our implementation produces the same output as the reference implementation in Node. So far, the comparison cannot detect how elements are closed, which makes a huge difference for some of the older email clients.

Specification

In case of updates that adhere to specification changes, please reference the used specification.

FlorianRappl commented 1 year ago

This is not possible via the DOM, because in HTML the implicit self-closing form (<input>) is equivalent to the tolerated XML-style self-closing form (<input />) for the self-closing elements. For all other elements the XML-style form is forbidden anyway, i.e., you cannot write <div />, as this is just treated as <div>.

In general, how an element is closed should not matter in an HTML5 context: AngleSharp is an HTML5-compliant parser, not a parser for HTML4 and its dialects, which might be what email clients use. As mentioned, both forms are equivalent in HTML5 (though the implicit one is the default and the explicit one is merely tolerated).

However, if you really would like to know, you can look at the original token, which is available via the element's SourceReference when the parser is configured to keep source references.

Example:

async Task Main()
{
    var parser = new HtmlParser(new HtmlParserOptions
    {
        IsKeepingSourceReferences = true,
    });
    var document = await parser.ParseDocumentAsync("<input>");
    var input = document.QuerySelector("input");

    // Dump() is a LINQPad extension method that prints the object's properties.
    input.SourceReference.Dump();
}

Output:

[image: dump of the SourceReference for <input>]

Now the same with <input />:

[image: dump of the SourceReference for <input />]

SebastianStehle commented 1 year ago

Awesome. This seems to help :)