best-of-lists / best-of-generator

🏆 Generates a ranked list of awesome libraries and tools.
https://best-of.org
GNU General Public License v3.0

Non-English category causes invalid markdown link #65

Closed. YDX-2147483647 closed this issue 1 year ago.

YDX-2147483647 commented 1 year ago

Note

Most GitHub projects use the Latin alphabet, so this issue can be given low priority.

Describe the bug:

If I have a non-English category (e.g. 类), then the TOC entry is generated as [类](#), whose href is empty.

Expected behaviour:

Render it as [类](#类) or [类](#category-id).

Steps to reproduce the issue:

(I've described it above)

Generated TOC entry: [网站](#).

GitHub's anchor for that heading: #网站.

Technical details:

Possible Fix:

Our process_md_link differs from GitHub's.

https://github.com/best-of-lists/best-of-generator/blob/4e07c02a36d964c28ceab6de53c74be84a633286/src/best_of/generators/markdown_list.py#L486-L488
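To illustrate how an empty href can arise, here is a minimal sketch that assumes the link processor keeps only ASCII letters, digits, and hyphens. The function name ascii_only_anchor and its regex are hypothetical and written for illustration only; the permalink above shows the project's actual code.

```python
import re


def ascii_only_anchor(text: str) -> str:
    """Hypothetical ASCII-only slugifier (an assumption, not the project's code)."""
    text = text.lower().replace(" ", "-")
    # Drop everything except ASCII letters, digits, and hyphens.
    return re.sub(r"[^a-z0-9-]", "", text)


for title in ["Web Tools", "网站", "类"]:
    # Non-Latin titles lose every character, so the href collapses to "#".
    print(f"[{title}](#{ascii_only_anchor(title)})")
```

Under that assumption, a Latin title keeps a usable anchor while a CJK title is reduced to an empty string, which matches the symptom described above.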

GitHub's algorithm is not documented, but people have discussed it at https://gist.github.com/asabaylus/3071099. In short, GitHub keeps CJK and other non-ASCII Unicode letters in the anchor instead of stripping them.

https://gist.github.com/asabaylus/3071099?permalink_comment_id=1593627#gistcomment-1593627

The code that creates the anchors:

  1. It downcases the string.
  2. It removes anything that is not a letter, number, space, or hyphen (see the source for how Unicode is handled).
  3. It changes any space to a hyphen.
  4. If the result is not unique, it appends "-1", "-2", "-3", ... to make it unique.
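The four steps quoted above can be sketched in Python roughly as follows. This is an approximation under stated assumptions, not GitHub's implementation: the function name github_style_anchors is made up, and Python's Unicode-aware \w stands in for "letter or number" (it also keeps underscores).

```python
import re
from collections import Counter


def github_style_anchors(headings):
    """Approximate the four quoted steps; not GitHub's actual code."""
    seen = Counter()
    anchors = []
    for heading in headings:
        slug = heading.lower()                # step 1: downcase
        # Step 2: keep word characters (Unicode-aware in Python 3), hyphens, spaces.
        slug = re.sub(r"[^\w\- ]", "", slug)
        slug = slug.replace(" ", "-")         # step 3: spaces become hyphens
        count = seen[slug]
        seen[slug] += 1
        # Step 4: the first occurrence is kept as is, repeats get -1, -2, ...
        anchors.append(slug if count == 0 else f"{slug}-{count}")
    return anchors


print(github_style_anchors(["网站", "Web Tools!", "网站"]))
```

The uniqueness step here is deliberately naive: it does not check whether a suffixed slug such as "a-1" collides with a heading that is literally named "a-1".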

https://gist.github.com/asabaylus/3071099?permalink_comment_id=2563127#gistcomment-2563127

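    import re

    def github_anchor(text):  # hypothetical wrapper name; the quoted snippet shows only the body
        text = text.lower().replace(" ", "-")
        # Triple-quoted raw string, because the character class contains both " and '.
        text = re.compile(r"""[`~!@#$%^&*()+=<>?,./:;"'|{}\[\]\\–—]""").sub("", text)
        text = re.compile(r"[ 。?!,、;:“”【】()〔〕[]﹃﹄“”‘’﹁﹂—…-~《》〈〉「」]").sub("", text)  # CJK (fullwidth) punctuation
        return text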

Additional context:

Relying on GitHub's tricky, undocumented algorithm may be a bad idea. Instead, we could set explicit heading IDs based on the category IDs.

<h2 id='category-id'>Category Title</h2>

[Category Title](#category-id)

(The <a id /> trick does not work.)
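A minimal sketch of how the generator could emit such a heading together with its TOC entry, assuming each category has an ID and a title. The function render_category and its parameters are hypothetical and do not exist in best-of-generator.

```python
from html import escape
from typing import Tuple


def render_category(category_id: str, title: str) -> Tuple[str, str]:
    """Return (heading, toc_entry) for one category; all names are hypothetical."""
    # The explicit id means the link target no longer depends on how
    # GitHub derives anchors from the heading text.
    heading = f"<h2 id='{escape(category_id, quote=True)}'>{escape(title)}</h2>"
    toc_entry = f"[{title}](#{category_id})"
    return heading, toc_entry


heading, toc_entry = render_category("websites", "网站")
print(heading)
print(toc_entry)
```

With this approach the title can contain any script, as long as the category ID itself is a valid, unique fragment identifier.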

I tried to make a PR, but the parameter title_md_prefix: str = "##" is not compatible with <h2>. However, no caller of the functions linked below passes title_md_prefix. Can I make it private?

https://github.com/best-of-lists/best-of-generator/blob/4e07c02a36d964c28ceab6de53c74be84a633286/src/best_of/generators/markdown_list.py#L334-L336

https://github.com/best-of-lists/best-of-generator/blob/4e07c02a36d964c28ceab6de53c74be84a633286/src/best_of/generators/markdown_list.py#L437-L439