JohannesKaufmann / html-to-markdown

⚙️ Convert HTML to Markdown. Even works with entire websites and can be extended through rules.
MIT License

🐛 Bug - Basic escape slashes innocent vertical pipes #86

Closed acook closed 2 weeks ago

acook commented 7 months ago

Vertical pipe characters are escaped no matter where they are in the line even though most (all?) markdown parsers don't treat this case specially.

I'm not sure whether some parsers treat any pipe as the start of a table. If so, maybe a new escape mode is needed. If not, it should be possible to escape pipes only at the beginning of a line.

HTML Input

Foo | Bar

Generated Markdown

Foo \| Bar

Expected Markdown

Foo | Bar


JohannesKaufmann commented 7 months ago

@acook yeah, you are totally right! It would be great if the escaping of the pipe character were omitted in this case.

But with the "basic" escaping that is just not possible at the moment 🤷‍♂️


V2 will have "smart" escaping, but that requires implementing some of the logic of a Markdown parser. Basically, we would only escape a character if a parser would actually mistake it for syntax.
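To make the idea concrete, here is a minimal sketch of context-aware pipe escaping. It is not the v2 implementation. It assumes a GFM-style parser that only starts a table when a line containing a pipe is directly followed by a delimiter row such as `---|---`. The names `delimiterRow`, `isDelimiterRow`, and `escapePipes` are hypothetical, and the regular expression only approximates the delimiter-row grammar.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// delimiterRow loosely matches a GFM table delimiter row such as "---|---"
// or "|:--|--:|". It is an approximation, not the full spec grammar.
var delimiterRow = regexp.MustCompile(`^\s*\|?\s*:?-+:?\s*(\|\s*:?-+:?\s*)*\|?\s*$`)

// isDelimiterRow requires at least one pipe so that a plain "---"
// (a thematic break or setext underline) is not treated as a table row.
func isDelimiterRow(line string) bool {
	return strings.Contains(line, "|") && delimiterRow.MatchString(line)
}

// escapePipes escapes "|" only on a line that is directly followed by a
// delimiter row, and on that delimiter row itself. All other lines are
// returned unchanged.
func escapePipes(text string) string {
	lines := strings.Split(text, "\n")
	out := make([]string, len(lines))
	copy(out, lines)
	for i := 0; i+1 < len(lines); i++ {
		if strings.Contains(lines[i], "|") && isDelimiterRow(lines[i+1]) {
			// Always escape from the original line to avoid double escaping.
			out[i] = strings.ReplaceAll(lines[i], "|", `\|`)
			out[i+1] = strings.ReplaceAll(lines[i+1], "|", `\|`)
		}
	}
	return strings.Join(out, "\n")
}

func main() {
	fmt.Println(escapePipes("Foo | Bar"))
	fmt.Println(escapePipes("Foo | Bar\n---|---"))
}
```

With this sketch, the lone `Foo | Bar` line from the report stays untouched, while the same line followed by `---|---` gets its pipes escaped. A real implementation would also have to consider code spans, existing escapes, and parsers with different table rules.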

acook commented 7 months ago

The more I thought about it, the more I realized that if the parser is very straightforward and efficient, it probably wouldn't have a concept of state for more than duplicates, so it might require some effort to rework.

Looking at the code now, my first impression is that it is kind of doing tr-style escaping, and you're right: a more robust parser would be needed to handle these weird edge cases.

While I don't necessarily recommend this approach, it would be possible to set a flag by checking for a match for a pattern of pipe characters so that various table formats would be respected.

// A raw string literal is needed here: "\|" is not a valid escape in a
// double-quoted Go string.
r := regexp.MustCompile(`\|.*\n.*\|`)
isTable := r.MatchString(cursorAtCurrentLine)

if isTable {
  // do normal escaping
}

With this basic test case, the first entry is not matched but the second and third are, so the pipe in the first entry can safely be left unescaped. There might be issues with other table formats, but this is a start, regardless of which approach is eventually taken!

Foo | Bar

Foo | Bar
---|---

|Foo | Bar
|---|---

JohannesKaufmann commented 2 weeks ago

The "v2" branch contains a lot of improvements, including much better logic for escaping Markdown characters.

It is still experimental but feel free to give it a try. Happy to hear about your experience 😊

I am going to close this issue. If you find anything with the new version, please open a new issue!