JohannesKaufmann / html-to-markdown

⚙️ Convert HTML to Markdown. Even works with entire websites and can be extended through rules.
https://html-to-markdown.com
MIT License

'### Heading' expected for '<h3>Heading</h3>', but got '**Heading**' #103

Closed mehrvarz closed 4 months ago

mehrvarz commented 4 months ago

HTML input <h3>Heading</h3> should generate ### Heading. And it usually does.

But sometimes I see **Heading** being generated instead. What could be causing this?

JohannesKaufmann commented 4 months ago

@mehrvarz can you post a reproducible example?


Are these headings inside links? If yes, here is the reason:

If you want a heading inside a link, that does not work. While the # heading is a block element, the [link](href) is an inline element. And it is invalid to have block elements inside inline elements. https://html-to-markdown.com/docs/heading-in-link

The best alternative is rendering the heading as bold text instead (see source).

mehrvarz commented 4 months ago

Thank you for your response. There are two anchor elements right in front of the <h3> element. But they are closed. Can they still influence the <h3> element in such a way? (I have added some newlines to this 3rd party HTML for clarity:)

<?xml version='1.0' encoding='utf-8'?>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
    <link href="content.78.css" rel="stylesheet" type="text/css"/>
    <title>Some title</title>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
</head>
<body style="background-color: #ffffff;">
<div>
<a id="d15e12847"/>
<a id="navpoint.d15e11188"/>
<h3 class="p_l-h3">Header text</h3>

Edit: This markup comes from a file that has .xhtml in its file name.

JohannesKaufmann commented 4 months ago

You can try to Parse and then Render your snippet using the "golang.org/x/net/html" package.

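It is very likely that the h3 tag actually ends up inside the a tag. The a tag is not a void element, so it cannot be self-closing in HTML5: the parser ignores the trailing slash and treats <a .../> as an opening tag, and the elements that follow become its children.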

mehrvarz commented 4 months ago

Holla! I can fix the issue by dynamically converting all <a id="..."/> to <a id="..."></a> in a preprocessing step. Feels a little expensive, but it may be cheaper than a full Parse and Render. Need to think about it... Ideally, your code would act differently based on xmlns. Right? For now, I can live with this. Thank you very much!

// Convert self-closing "<a .../>" to "<a ...></a>" (needs the "strings" import).
pos := 0 // start of the part of htmlStr that has not been scanned yet
for {
    start := strings.Index(htmlStr[pos:], "<a ")
    if start < 0 {
        break
    }
    start += pos // absolute index of "<a "
    end := strings.Index(htmlStr[start:], ">")
    if end < 0 {
        break
    }
    end += start // absolute index of the closing '>'
    if htmlStr[end-1] == '/' {
        // Replace "/>" with "></a>".
        htmlStr = htmlStr[:end-1] + "></a>" + htmlStr[end+1:]
        end += 3 // the replacement is 3 bytes longer than "/>"
    }
    pos = end + 1
}
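
For comparison, the same preprocessing step can be sketched with a regular expression from the standard library. This is only an illustration: the function name expandSelfClosingAnchors is hypothetical and not part of html-to-markdown, and the pattern assumes that attribute values contain no '<' or '>' characters.

```go
package main

import (
	"fmt"
	"regexp"
)

// selfClosingAnchor matches "<a", at least one attribute, and a trailing "/>".
// Assumption: attribute values never contain '<' or '>'.
var selfClosingAnchor = regexp.MustCompile(`<a(\s[^<>]*?)\s*/>`)

// expandSelfClosingAnchors (hypothetical helper) rewrites every
// self-closing "<a .../>" into an explicit "<a ...></a>" pair, so that an
// HTML5 parser does not treat it as an unclosed opening tag.
func expandSelfClosingAnchors(htmlStr string) string {
	return selfClosingAnchor.ReplaceAllString(htmlStr, "<a${1}></a>")
}

func main() {
	input := `<a id="d15e12847"/>
<a id="navpoint.d15e11188"/>
<h3 class="p_l-h3">Header text</h3>`

	fmt.Println(expandSelfClosingAnchors(input))
}
```

The result can then be passed to the converter as before. Regular anchors such as <a href="...">text</a> are left untouched because the pattern only matches tags that end in "/>".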