JohannesKaufmann / html-to-markdown

⚙️ Convert HTML to Markdown. Even works with entire websites and can be extended through rules.
MIT License
891 stars 85 forks

Fix punctuation with rules? #67

Closed lologhi closed 10 months ago

lologhi commented 1 year ago

Hi!

As I'm writing a scraper for a website, I'd like to fix some minor punctuation issues before saving the text, such as stray spaces next to parentheses, like: Lorem ( ipsum dolor) sit amet or consectetur (adipiscing ) elit.

Do you think writing a converter rule (converter.AddRules) is the right way to remove this kind of error? I'd also like to replace some quotation marks and italicize quotations…

Hoping it's the right place for this kind of question! Best, Laurent

JohannesKaufmann commented 1 year ago

@lologhi can you post an example HTML snippet that causes problems with the parenthesis?

lologhi commented 1 year ago

Yes! Something like this, for example:

<p>Au centre de l’Evangile de la liturgie d’aujourd’hui se trouvent les Béatitudes ( cf. Lc 6, 20-23).</p>

Instead of this markdown: Au centre de l’Evangile de la liturgie d’aujourd’hui se trouvent les Béatitudes ( cf. Lc 6, 20-23).

I’d like to have this (just one space removed, because no space is needed after an opening parenthesis or before a closing one): Au centre de l’Evangile de la liturgie d’aujourd’hui se trouvent les Béatitudes (cf. Lc 6, 20-23).

Kind of a punctuation linter.

JohannesKaufmann commented 1 year ago

@lologhi You are right, a custom rule would work for that. Alternatively you could also register an After hook. But the rule is probably better...

By registering the rule for "#text" you can change all the text nodes in the document. With a regex (or by going through the characters manually) you can change the text for your use cases.

Note that the content variable will be empty, and you instead get the raw text from the node. Also note that the example below does not handle all edge-cases, see the default Commonmark rule for that.


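package main

import (
    "fmt"
    "log"
    "regexp"
    "strings"

    md "github.com/JohannesKaufmann/html-to-markdown"
    "github.com/PuerkitoBio/goquery"
)

// r matches an opening parenthesis followed by one or more whitespace characters.
var r = regexp.MustCompile(`\(\s+`)

func main() {
    html := `<p>Au centre de l’Evangile de la liturgie d’aujourd’hui se trouvent les Béatitudes ( cf. Lc 6, 20-23).</p>`

    changeText := md.Rule{
        Filter: []string{"#text"},
        Replacement: func(content string, selec *goquery.Selection, opt *md.Options) *string {
            // content is empty for text nodes, so read the raw text from the node.
            text := selec.Text()
            if trimmed := strings.TrimSpace(text); trimmed == "" {
                return md.String("")
            }

            // Drop the whitespace that follows an opening parenthesis.
            text = r.ReplaceAllString(text, "(")

            // NOTE: See the #text rule for commonmark for all the
            // other logic that should happen here...

            return md.String(text)
        },
    }

    conv := md.NewConverter("", true, nil)
    conv.AddRules(changeText)

    markdown, err := conv.ConvertString(html)
    if err != nil {
        log.Fatal(err)
    }
    fmt.Println(markdown)
}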

Let me know if that works! And also what punctuation you ended up changing. Could be a cool plugin for the V2...
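
The rule above only trims whitespace after an opening parenthesis, while the original request also covers whitespace before a closing one. A minimal sketch of a helper that handles both sides is shown here, using only the standard library. The name normalizeParens is hypothetical and not part of html-to-markdown. It also assumes that the parenthesis and the stray whitespace sit in the same text node, which may not hold when inline elements such as <em> are involved.

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	// afterOpen matches "(" followed by one or more whitespace characters.
	afterOpen = regexp.MustCompile(`\(\s+`)
	// beforeClose matches one or more whitespace characters followed by ")".
	beforeClose = regexp.MustCompile(`\s+\)`)
)

// normalizeParens is a hypothetical helper (not part of html-to-markdown)
// that removes whitespace just inside a pair of parentheses.
func normalizeParens(text string) string {
	text = afterOpen.ReplaceAllString(text, "(")
	return beforeClose.ReplaceAllString(text, ")")
}

func main() {
	fmt.Println(normalizeParens("Lorem ( ipsum dolor) sit amet"))
	fmt.Println(normalizeParens("consectetur (adipiscing ) elit"))
}
```

Inside the Replacement function, the call to r.ReplaceAllString could be swapped for normalizeParens(text); the rest of the rule would stay the same.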

lologhi commented 1 year ago

Thanks a lot for this detailed example! I'm currently writing some punctuation rules (specifically for French, where we add a non-breaking space before what we call "ponctuation double": !, ?, ; and :). I'll show you the result after some tests.

JohannesKaufmann commented 1 year ago

@lologhi were you able to create some logic for the french punctuation rules?

lologhi commented 1 year ago

Hey! I've worked a little bit on it, yes; it's here, with the tests here. But it's more complicated than I thought. Many of the regexes would need to match something that is missing, and I'm unable to do that. For example, matching every ; that does not have a non-breaking space before it. So instead I match all of them, add a non-breaking space, and then remove the extras when there are too many spaces…
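
Go's regexp package (RE2 syntax) has no lookbehind, so "a ; without a non-breaking space before it" cannot be matched directly. One possible workaround, sketched here as an assumption and not taken from the code linked in this thread, is to match the punctuation together with whatever spacing precedes it and rewrite the whole match in one pass. The name fixDoublePunctuation is hypothetical. The sketch is deliberately naive: it would also touch colons in URLs or times such as 12:30, and it replaces an existing narrow no-break space (U+202F) with U+00A0.

```go
package main

import (
	"fmt"
	"regexp"
)

// nbsp is the non-breaking space (U+00A0).
const nbsp = "\u00a0"

// beforeDouble captures a non-space character, then any run of spaces, tabs,
// or no-break spaces, then one of the French "double punctuation" marks.
var beforeDouble = regexp.MustCompile(`([^\s\x{00A0}\x{202F}])[ \t\x{00A0}\x{202F}]*([;:!?])`)

// fixDoublePunctuation is a hypothetical helper that ensures exactly one
// non-breaking space sits before ; : ! and ?, whatever spacing was there.
func fixDoublePunctuation(text string) string {
	return beforeDouble.ReplaceAllString(text, "${1}"+nbsp+"${2}")
}

func main() {
	// %q makes the inserted non-breaking spaces visible in the output.
	fmt.Printf("%q\n", fixDoublePunctuation("Vraiment ? Oui; non : ok !"))
}
```

Because the existing spacing is part of the match, running the function twice gives the same result as running it once, so no separate cleanup pass for extra spaces should be needed.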