JohannesKaufmann / html-to-markdown

⚙️ Convert HTML to Markdown. Even works with entire websites and can be extended through rules.
MIT License
845 stars 82 forks source link

🐛 Bug: spurious whitespace insertion #95

Closed ashishb closed 2 weeks ago

ashishb commented 5 months ago

Describe the bug A clear and concise description of what the bug is.

HTML Input

<p><a href="https://example.com">Hello world</a>. "<a href="https://example.com">Example 1</a>".</p>

Generated Markdown

[Hello world](https://example.com). " [Example 1](https://example.com)".

Expected Markdown

[Hello world](https://example.com). "[Example 1](https://example.com)".

Additional context Spurious whitespace is inserted before links

Ref: https://github.com/ashishb/wp2hugo/issues/11

JohannesKaufmann commented 5 months ago

Hi @ashishb

yeah you are totally right, that looks like a bug. It is probably related to the AddSpaceIfNessesary function 🤔

However I am currently not actively working on V1 of the library.

My focus is on the next version of the library. With V2 this will be fixed (see demo). But it is not ready yet, for example the outside API is not finished & most plugins are still missing.

Will keep this issue open for now. If you want to work on this in the meantime, I happily accept PRs!

Greetings, Johannes

ashishb commented 5 months ago

@JohannesKaufmann happy to send a PR to fix things, can you give me some more guidance on what might be happening here? Alternatively, should I switch to v2 instead?

JohannesKaufmann commented 5 months ago

happy to send a PR to fix things

That would be amazing!

more guidance on what might be happening here?

The <a> node uses the function AddSpaceIfNessesary to look at the previous & next nodes to decide if an extra space before & after the brackets is needed.

Right now it just checks for unicode.IsSpace but maybe we could also check for unicode.IsPunct.

// AddSpaceIfNessesary adds spaces to the text based on the neighbors.
// That makes sure that there is always a space to the side, to recognize the delimiter.
func AddSpaceIfNessesary(selec *goquery.Selection, markdown string) string {

You can slightly change the logic (for example: what happens if AddSpaceIfNessesary is not used for <a> nodes) and then run go test -update to see if the golden files changed (more info).

It is important to add a few more testcases & also make sure that there aren't any regressions...

Alternatively, should I switch to v2 instead?

In the v2 I found a better way to solve this problem. So the AddSpaceIfNessesary function is not used anymore.

But it won't be released in the next few months...

ashishb commented 2 months ago

@JohannesKaufmann I would have gladly sent a PR but I dislike the idea of fixing V1 if V2 of your awesome library is around the corner.

ashishb commented 1 month ago

@JohannesKaufmann when is the new library being planned to be released on?

JohannesKaufmann commented 1 month ago

It's in the final stretch. But I can't say exactly when it will be published

JohannesKaufmann commented 2 weeks ago

On the "v2" branch are a lot of improvements — including a fix to this bug.

It is still experimental but feel free to give it a try. Happy to hear about your experience 😊

I am going to close this issue. If you find anything with the new version, please open a new issue!