RobertDober / earmark_parser

The Markdown to AST part of Earmark.
Apache License 2.0
68 stars 26 forks source link

Support unquoted attrs without space before closing > #127

Closed sebastianseilund closed 1 year ago

sebastianseilund commented 1 year ago

I had a real-world Markdown document that looked like this:

...

<meta name=”referrer” content=”no-referrer”>

...

Note that the meta tag's attributes are not using reqular quotes, but left/right double quotes (&ldquo; and &rdquo; in HTML speak). The attributes are therefore unquoted.

Earmark would raise an error:

     ** (CaseClauseError) no case clause matching: nil
     code: assert Training.extract_snippet(input) == expected
     stacktrace:
       (earmark_parser 1.4.26) lib/earmark_parser/helpers/html_parser.ex:44: EarmarkParser.Helpers.HtmlParser._parse_tag_tail/3
       (earmark_parser 1.4.26) lib/earmark_parser/helpers/html_parser.ex:10: EarmarkParser.Helpers.HtmlParser.parse_html/1
       (earmark_parser 1.4.26) lib/earmark_parser/ast/renderer/html_renderer.ex:18: EarmarkParser.Ast.Renderer.HtmlRenderer.render_html_oneline/3
       (earmark_parser 1.4.26) lib/earmark_parser/ast_renderer.ex:23: EarmarkParser.AstRenderer._render/3
       (earmark_parser 1.4.26) lib/earmark_parser.ex:526: EarmarkParser.as_ast/2
       (earmark 1.4.27) lib/earmark/internal.ex:42: Earmark.Internal.as_html/2

This is bad Markdown/HTML, but i think Earmark should handle it anyway.

I narrowed the issue down, and found out that it would fail without a space before the ending > in the tag. Something as simple as this would fail:

iex(21)> Earmark.as_html!("<img src=image.png>")
** (CaseClauseError) no case clause matching: nil
    (earmark_parser 1.4.26) lib/earmark_parser/helpers/html_parser.ex:44: EarmarkParser.Helpers.HtmlParser._parse_tag_tail/3
...

I attempted to fix the issue here by making the unquoted attribute scanning regex stop when encountering a >.

RobertDober commented 1 year ago

@sebastianseilund Thank you for your contribution

HTML support is quite clumsy anyway so I am not too shocked by maybe accepting bad HTML, the problem here is however that I would not like to introduce a behaviour which would go away in 1.5 again, potentially breaking ex_doc for some projects if they rely on such bad HTML (probably more by chance than design, but these things happen)

Right now I feel I should reject your PR, but I will give it some thought.

It is indeed sad that I refuse a PR which would allow parsing more but I feel the philosophy: "Be lenient about what you accept, and strict about what you emit" has not worked out very well for HTML of all examples.

sebastianseilund commented 1 year ago

Thanks for taking a look so quickly :)

Was it clear from my original message that Earmark crashes with a “no case clause matching” error when encountering this HTML? It’s not that it returns something incorrect, it’s that it doesn’t return at all. I can’t think of a way anyone relies on it to crash.

When converting user provided Markdown to HTML, the only other alternative I would have is to wrap Earmark.as_html in try/rescue, which doesn’t seem right to me.

Let me know what you think. Thanks!

RobertDober commented 1 year ago

Sorry I did not catch that it crashed, so I'll definitely fix the crash, I'll look again how you did it :)

I have another PR that I merged that I need to verify, and you will be next

Thank you for your work!

RobertDober commented 1 year ago

Ok finally got it, you just fixed the crash and we already accepted that kind of attributes.

Great job, gonna release tonight!!!

sebastianseilund commented 1 year ago

Thanks! :)