JohannesKaufmann / html-to-markdown

⚙️ Convert HTML to Markdown. Even works with entire websites and can be extended through rules.
https://html-to-markdown.com
MIT License

🐛 How to use converter.Remove? #111

Closed kai687 closed 1 month ago

kai687 commented 1 month ago

Describe the bug

Not sure if it's a bug, but I can't get converter.Remove to work.

HTML Input

<strong>Hello</strong>

Generated Markdown

**Hello**

Expected Markdown

(Empty output, since the <strong> element should be removed.)

Additional context

Using v1.6.0 of html-to-markdown and this code:

package main

import (
    "fmt"

    md "github.com/JohannesKaufmann/html-to-markdown"
)

func main() {
    converter := md.NewConverter("", true, nil)
    converter.Remove("strong")
    html := `<strong>Hello</strong>`

    markdown, err := converter.ConvertString(html)
    if err != nil {
        panic(err)
    }

    fmt.Println(markdown)
}

JohannesKaufmann commented 1 month ago

In V1 the behaviour is a bit unexpected: the remove logic is only part of the fallback logic, which runs when no rule is found for the tag.

Since there is already a rule registered for <strong>, that rule takes precedence and converter.Remove("strong") has no effect.

Adding a custom rule that returns an empty string would probably work.
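
To make the precedence described above concrete, here is a minimal sketch of the lookup order. It is a simplified model and not the library's actual code: `toyConverter`, `ruleFunc`, `AddRule` and `convertElement` are hypothetical names invented for this illustration, and the model assumes one rule per tag where the latest registration wins.

```go
package main

import "fmt"

// ruleFunc is a hypothetical stand-in for a conversion rule: it receives the
// text content of an element and returns the Markdown for it.
type ruleFunc func(content string) string

// toyConverter is a simplified model of the V1 lookup order described above.
// It is NOT the real html-to-markdown implementation.
type toyConverter struct {
	rules  map[string]ruleFunc // assumption: one rule per tag, latest wins
	remove map[string]bool     // tags that were passed to Remove
}

func newToyConverter() *toyConverter {
	return &toyConverter{
		rules: map[string]ruleFunc{
			// Built-in rule, comparable to the default <strong> handling.
			"strong": func(content string) string { return "**" + content + "**" },
		},
		remove: map[string]bool{},
	}
}

// Remove marks a tag for removal. In this model it only matters in the fallback.
func (c *toyConverter) Remove(tag string) {
	c.remove[tag] = true
}

// AddRule registers (or overrides) the rule for a tag.
func (c *toyConverter) AddRule(tag string, r ruleFunc) {
	c.rules[tag] = r
}

// convertElement looks for a rule first and consults the remove list only
// when no rule exists for the tag.
func (c *toyConverter) convertElement(tag, content string) string {
	if rule, ok := c.rules[tag]; ok {
		return rule(content) // a registered rule takes precedence
	}
	if c.remove[tag] {
		return "" // fallback: drop the element
	}
	return content // fallback: keep the plain content
}

func main() {
	c := newToyConverter()

	// Remove has no effect here because a rule for "strong" already exists.
	c.Remove("strong")
	fmt.Printf("%q\n", c.convertElement("strong", "Hello"))

	// A tag without a rule reaches the fallback, so Remove works.
	c.Remove("script")
	fmt.Printf("%q\n", c.convertElement("script", "alert(1)"))

	// Workaround: a custom rule that returns an empty string.
	c.AddRule("strong", func(string) string { return "" })
	fmt.Printf("%q\n", c.convertElement("strong", "Hello"))
}
```

In this model the first call still yields the bold Markdown, while the other two yield an empty string. With the real V1 library the workaround would presumably be a custom rule registered for "strong" through the converter's rule mechanism, with a replacement that returns an empty string.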


In the new V2 (see the v2 branch) the remove logic runs early, so it also takes effect for tags that have a rule:

package main

import (
    "fmt"

    "github.com/JohannesKaufmann/html-to-markdown/v2/converter"
    "github.com/JohannesKaufmann/html-to-markdown/v2/plugin/commonmark"
)

func main() {
    if err := run(); err != nil {
        panic(err)
    }
}

func run() error {
    input := `<p>This <strong>bold</strong> and <i>italic</i> text</p>`

    conv := converter.NewConverter(
        converter.WithPlugins(
            commonmark.NewCommonmarkPlugin(),
        ),
    )
    conv.Register.TagStrategy("strong", converter.StrategyRemoveNode)

    output, err := conv.ConvertString(input)
    if err != nil {
        return err
    }

    fmt.Println(output)
    return nil
}

The above code outputs "This and *italic* text"

kai687 commented 1 month ago

Ooooh. Indeed I didn't expect that. Thanks for the explanation!