crosstype / node-html-markdown

Fast HTML to markdown converter for NodeJS or the browser
163 stars 28 forks source link

fix: Percent-encode Markdown reserved symbols in URLs #26

Closed AndrejsK closed 2 years ago

AndrejsK commented 2 years ago

While these symbols are generally considered valid in URLs, they are a part of Markdown syntax. This can cause problems in some MD renderers, like in Discord embeds, where URLs containing these symbols sometimes break.

Code example:

import { NodeHtmlMarkdown } from 'node-html-markdown'
import { NodeHtmlMarkdown as nhmFork } from 'node-html-markdown-fork'

const element1 = "<a href='https://lv.wikipedia.org/wiki/C%C4%93su_kauja_(1578)'>TEST</a>";
const element2 = "<i><a href=\"https://lv.wikipedia.org/wiki/Chandrayaan_1\" title=\"Chandrayaan 1\">Chandrayaan 1</a></i>";

console.log("Current master\n");
console.log(NodeHtmlMarkdown.translate(element1));
console.log(NodeHtmlMarkdown.translate(element2));

console.log("\nMy fork\n")
console.log(nhmFork.translate(element1));
console.log(nhmFork.translate(element2));

Output:

Current master

[TEST](https://lv.wikipedia.org/wiki/C%C4%93su_kauja_(1578))
_[Chandrayaan 1](https://lv.wikipedia.org/wiki/Chandrayaan1 "Chandrayaan 1")_

My fork

[TEST](https://lv.wikipedia.org/wiki/C%C4%93su%5Fkauja%5F%281578%29)
_[Chandrayaan 1](https://lv.wikipedia.org/wiki/Chandrayaan%5F1 "Chandrayaan 1")_

Currently the second example is just plain wrong - an underscore between Chandrayaan and 1 has gone missing and the URL is broken. The first example is less obvious, but the brackets at the end confuse Discord on Android, while the encoded version works just fine.

Since I was already making this change, I also chose to encode the * symbol, as it is also significant in Markdown and could cause problems.

Test results:

yarn run v1.22.10
$ jest
 PASS  test/default-tags-codeblock.test.ts
 PASS  test/options.test.ts
 PASS  test/default-tags.test.ts
 PASS  test/special-cases.test.ts

Test Suites: 4 passed, 4 total
Tests:       54 passed, 54 total
Snapshots:   0 total
Time:        2.456 s, estimated 3 s
Ran all test suites.
Done in 3.56s.

Benchmark:

yarn run v1.22.10
$ cd benchmark && yarn run benchmark quick
$ node execute.js quick

-----------------------------------------------------------------------------

node-html-makrdown (reused instance): 16.0300 ms/file ± 9.15999 (5.8 MB/s)
node-html-markdown                  : 16.3114 ms/file ± 9.81729 (5.7 MB/s)
turndown (reused instance)          : 34.5944 ms/file ± 18.5378 (2.69 MB/s)
turndown                            : 34.9459 ms/file ± 18.7913 (2.66 MB/s)

-----------------------------------------------------------------------------

Total Files: 25
Avg. file size: 95.21 kB

-----------------------------------------------------------------------------

Estimated processing times (fastest to slowest):

  [node-html-makrdown (reused instance)]
    100 kB:  17ms
    1 MB:    172ms
    50 MB:   8.62sec
    1 GB:    2min, 57sec
    50 GB:   2hr, 27min, 7sec

  [node-html-markdown]
    100 kB:  17ms
    1 MB:    175ms
    50 MB:   8.77sec
    1 GB:    2min, 60sec
    50 GB:   2hr, 29min, 42sec

  [turndown (reused instance)]
    100 kB:  36ms
    1 MB:    372ms
    50 MB:   18.60sec
    1 GB:    6min, 21sec
    50 GB:   5hr, 17min, 30sec

  [turndown]
    100 kB:  37ms
    1 MB:    376ms
    50 MB:   18.79sec
    1 GB:    6min, 25sec
    50 GB:   5hr, 20min, 44sec

-----------------------------------------------------------------------------

Speed comparison - node-html-makrdown (reused instance) is: 

  1.02 times as fast as node-html-markdown
  2.16 times as fast as turndown (reused instance)
  2.18 times as fast as turndown

-----------------------------------------------------------------------------

Done in 3.80s.
AndrejsK commented 2 years ago

Requested changes made.

nonara commented 2 years ago

Released in v1.1.3