executablebooks / mdformat

CommonMark compliant Markdown formatter
https://mdformat.rtfd.io
MIT License

Needless URL encoding of link destinations #312

Open hukkin opened 2 years ago

hukkin commented 2 years ago
# From
- [Rénovateur de pannes](Rénovateur-de-pannes)
# To
- [Rénovateur de pannes](R%C3%A9novateur-de-pannes)

Changing the link character format is unneeded IMHO and makes the link less readable and less practical to correct if needed.

Originally posted by @mdeweerd in https://github.com/executablebooks/mdformat/issues/112#issuecomment-1055485853

ericholscher commented 2 years ago

I'm hitting this as well, and it's breaking my usage of the feature. I'm trying to use pelican and it's encoding things that break the syntax. For example:

As our [publisher policy]({filename}../publisher-policy.md) lays out:

Becomes

As our [publisher policy](%7Bfilename%7D../publisher-policy.md) lays out:

I'd love a no encoding option.
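The effect described above can be reproduced with a minimal sketch. It uses Python's standard `urllib.parse.quote` as a stand-in for the encoder; this is an assumption for illustration, not mdformat's actual code path. The `safe="/{}"` argument is likewise only a hypothetical workaround, not an existing mdformat option.

```python
from urllib.parse import quote

dest = "{filename}../publisher-policy.md"

# Generic percent-encoding treats curly braces as unsafe and escapes them,
# which destroys the Pelican-style placeholder.
encoded = quote(dest)
print(encoded)  # %7Bfilename%7D../publisher-policy.md

# Hypothetical workaround: declare the placeholder delimiters safe
# so the destination passes through unchanged.
preserved = quote(dest, safe="/{}")
print(preserved)  # {filename}../publisher-policy.md
```

A "no encoding" option would amount to skipping this step entirely rather than widening the safe set.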

sanmai-NL commented 1 year ago
- [_MoSCoW_ 🗳️](#moscow-%EF%B8%8F)
- [_Task ✔️_](#task-%EF%B8%8F)
- [_🛡️ Security_](#%EF%B8%8F-security)

These emoji get %EF%B8%8F appended for some reason under mdformat 0.7.16.
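One plausible reading, offered as an assumption rather than a confirmed diagnosis: `%EF%B8%8F` is the UTF-8 percent-encoding of U+FE0F (VARIATION SELECTOR-16), an invisible code point that follows many emoji. If the original anchor already contained that invisible character, the encoding would not append anything; it would only make the character visible. A short check with the Python standard library:

```python
import unicodedata
from urllib.parse import quote, unquote

# Decode the mysterious suffix back into a character.
hidden = unquote("%EF%B8%8F")
print(len(hidden), hex(ord(hidden)))  # 1 0xfe0f
print(unicodedata.name(hidden))       # VARIATION SELECTOR-16

# The ballot-box emoji from the example is two code points:
# the pictograph itself followed by the variation selector.
emoji = "\U0001F5F3\uFE0F"
print(quote(emoji).endswith("%EF%B8%8F"))  # True
```

Under this reading, a slug generator that drops the pictograph but keeps the selector would produce an anchor such as `#moscow-` followed by an invisible U+FE0F.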

kdeldycke commented 1 year ago

Same thing here, trying to format the Chinese translation of my awesome-iam project, which ends up like this:

- [Bloom Filter](https://zh.wikipedia.org/wiki/%E5%B8%83%E9%9A%86%E8%BF%87%E6%BB%A4%E5%99%A8)
diff --git readme.md readme.md
index 2ec585b..462846c 100644
--- readme.md
+++ readme.md
@@ -38,46 +38,46 @@

 <!-- mdformat-toc start --slug=github --no-anchors --maxlevel=6 --minlevel=2 -->

-- [概述](#概述)
-- [安全](#安全)
-- [账户管理](#账户管理)
-- [密码学](#密码学)
-  - [标识符](#标识符)
-- [零信任网络](#零信任网络)
-- [认证](#认证)
-  - [基于密码](#基于密码)
-  - [无密码](#无密码)
-  - [安全密钥](#安全密钥)
-  - [多因素](#多因素)
-  - [基于短信](#基于短信)
-  - [公钥基础设施](#公钥基础设施)
+- [概述](#%E6%A6%82%E8%BF%B0)
+- [安全](#%E5%AE%89%E5%85%A8)
+- [账户管理](#%E8%B4%A6%E6%88%B7%E7%AE%A1%E7%90%86)
+- [密码学](#%E5%AF%86%E7%A0%81%E5%AD%A6)
+  - [标识符](#%E6%A0%87%E8%AF%86%E7%AC%A6)
+- [零信任网络](#%E9%9B%B6%E4%BF%A1%E4%BB%BB%E7%BD%91%E7%BB%9C)
+- [认证](#%E8%AE%A4%E8%AF%81)
+  - [基于密码](#%E5%9F%BA%E4%BA%8E%E5%AF%86%E7%A0%81)
+  - [无密码](#%E6%97%A0%E5%AF%86%E7%A0%81)
+  - [安全密钥](#%E5%AE%89%E5%85%A8%E5%AF%86%E9%92%A5)
+  - [多因素](#%E5%A4%9A%E5%9B%A0%E7%B4%A0)
+  - [基于短信](#%E5%9F%BA%E4%BA%8E%E7%9F%AD%E4%BF%A1)
+  - [公钥基础设施](#%E5%85%AC%E9%92%A5%E5%9F%BA%E7%A1%80%E8%AE%BE%E6%96%BD)
   - [JWT](#jwt)
   - [OAuth2 & OpenID](#oauth2--openid)
   - [SAML](#saml)
-- [授权](#授权)
-  - [策略模型](#策略模型)
-  - [开源策略框架](#开源策略框架)
-  - [AWS 策略工具](#AWS-策略工具)
+- [授权](#%E6%8E%88%E6%9D%83)
+  - [策略模型](#%E7%AD%96%E7%95%A5%E6%A8%A1%E5%9E%8B)
+  - [开源策略框架](#%E5%BC%80%E6%BA%90%E7%AD%96%E7%95%A5%E6%A1%86%E6%9E%B6)
+  - [AWS 策略工具](#AWS-%E7%AD%96%E7%95%A5%E5%B7%A5%E5%85%B7)
   - [Macaroons](#macaroons)
-- [秘密管理](#秘密管理)
-  - [硬件安全模块 (HSM)](#硬件安全模块-hsm)
-- [信任与安全](#信任与安全)
-  - [用户身份](#用户身份)
-  - [欺诈](#欺诈)
+- [秘密管理](#%E7%A7%98%E5%AF%86%E7%AE%A1%E7%90%86)
+  - [硬件安全模块 (HSM)](#%E7%A1%AC%E4%BB%B6%E5%AE%89%E5%85%A8%E6%A8%A1%E5%9D%97-hsm)
+- [信任与安全](#%E4%BF%A1%E4%BB%BB%E4%B8%8E%E5%AE%89%E5%85%A8)
+  - [用户身份](#%E7%94%A8%E6%88%B7%E8%BA%AB%E4%BB%BD)
+  - [欺诈](#%E6%AC%BA%E8%AF%88)
   - [Moderation](#moderation)
-  - [威胁情报](#威胁情报)
-  - [验证码](#验证码)
-- [黑名单](#黑名单)
-  - [主机名和子域](#主机名和子域)
-  - [邮件](#邮件)
-  - [保留的 ID](#保留的-ID)
-  - [诽谤](#诽谤)
-- [隐私](#隐私)
-  - [匿名化](#匿名化)
+  - [威胁情报](#%E5%A8%81%E8%83%81%E6%83%85%E6%8A%A5)
+  - [验证码](#%E9%AA%8C%E8%AF%81%E7%A0%81)
+- [黑名单](#%E9%BB%91%E5%90%8D%E5%8D%95)
+  - [主机名和子域](#%E4%B8%BB%E6%9C%BA%E5%90%8D%E5%92%8C%E5%AD%90%E5%9F%9F)
+  - [邮件](#%E9%82%AE%E4%BB%B6)
+  - [保留的 ID](#%E4%BF%9D%E7%95%99%E7%9A%84-ID)
+  - [诽谤](#%E8%AF%BD%E8%B0%A4)
+- [隐私](#%E9%9A%90%E7%A7%81)
+  - [匿名化](#%E5%8C%BF%E5%90%8D%E5%8C%96)
   - [GDPR](#gdpr)
 - [UX/UI](#uxui)
-- [竞争分析](#竞争分析)
-- [历史](#历史)
+- [竞争分析](#%E7%AB%9E%E4%BA%89%E5%88%86%E6%9E%90)
+- [历史](#%E5%8E%86%E5%8F%B2)

 <!-- mdformat-toc end -->

Source: https://github.com/kdeldycke/awesome-iam/pull/100/files#diff-109f56ef9f23fd7bfdbf2e2c9a28b45bbe8160c71c6ee1f0f1439e0ea22103be

sanmai-NL commented 1 year ago

Please note that URIs cannot contain non-ASCII characters, so the fix is correct. But hopefully there's some middle ground or work-around.

https://bugs.ruby-lang.org/issues/12852

kdeldycke commented 1 year ago

Yes, maybe adding a --allow-iri or --allow-unicode-links option to allow Internationalized Resource Identifiers (IRIs) instead of normalizing everything to URIs.

Unicode characters are extremely user-friendly for international content, both for readers and maintainers.

Note that Wikipedia renders all URLs with ASCII % escape codes in HTML, but lets its links in MediaWiki syntax (like [[统一资源定位符]]) be written with Unicode. You can check this by trying to edit any non-English Wikipedia page.

I guess it is not unreasonable to let links and URLs in Markdown (a markup syntax) contain Unicode, and let the rendering engine apply the appropriate URL encoding depending on the target (HTML, LaTeX, etc.).
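The split proposed here, Unicode in the source and escaping at render time, might look like the following sketch. The helper name `encode_for_html` and the exact safe set are assumptions made for illustration; they are not taken from mdformat or any renderer.

```python
from urllib.parse import quote

# Characters with syntactic meaning in a URL are left untouched.
# "%" is included so destinations that are already encoded are not
# encoded a second time (at the cost of not escaping a literal "%").
URL_SAFE = ":/?#[]@!$&'()*+,;=%"

def encode_for_html(dest: str) -> str:
    """Percent-encode a link destination only when emitting HTML."""
    return quote(dest, safe=URL_SAFE)

# The Markdown source keeps the readable form...
source_links = ["#概述", "https://zh.wikipedia.org/wiki/布隆过滤器"]

# ...and the escaped form exists only in the rendered output.
for link in source_links:
    print(link, "->", encode_for_html(link))
```

With this arrangement a formatter would never need to touch the destination text; only the HTML (or other) backend decides which escaping its target requires.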

mdeweerd commented 1 year ago

IMHO one of the goals of Markdown is to keep the source readable.

And mdformat helps to keep the source somewhat normalized.

One could argue that if the source is recognized by CommonMark reference implementations, then the source is acceptable.

Testing the Chinese "links" at https://spec.commonmark.org/dingus/ shows that the reference implementation still renders a link. Our browsers are smart enough to URL-encode the links before they are actually used, so a target server will receive URL-encoded links.

Converting the links specified in Markdown to URL-encoded links is technically correct because that is what a browser will do, but it makes the Markdown source less readable and not as easy for a human to adjust. There is IMHO no technical need to URL-encode the links in the Markdown source.

We can also look at the CommonMark specification, and more specifically at its examples. In Example 31 we can see that HTML encoding is accepted in the link and that it is only URL-encoded in the HTML rendering.

The CommonMark specification for a link destination does not require that links are URL encoded.

So in the end it's a matter of taste, as both approaches are technically valid. Personally I would not convert the link representation, and I would prefer options that modify that behavior to urlencode or urldecode links. I would probably prefer to urldecode links that are urlencoded, to make them more readable to humans.
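A urldecode pass of the kind preferred above could be sketched as follows. The function name `decode_non_ascii` and its conservative policy are assumptions for illustration only: it decodes a run of percent-escapes only when the run is valid UTF-8 and contains no ASCII characters, so escapes such as `%20` or `%2F`, which may carry syntactic meaning, stay as they are.

```python
import re

# One or more consecutive percent-escapes, e.g. "%E6%A6%82".
_PCT_RUN = re.compile(r"(?:%[0-9A-Fa-f]{2})+")

def decode_non_ascii(dest: str) -> str:
    """Make a link destination readable without changing its meaning."""

    def repl(match: re.Match) -> str:
        raw = bytes.fromhex(match.group(0).replace("%", ""))
        try:
            text = raw.decode("utf-8")
        except UnicodeDecodeError:
            return match.group(0)  # not UTF-8 text: keep the escapes
        if any(ord(ch) < 128 for ch in text):
            return match.group(0)  # run contains ASCII: keep it whole
        return text

    return _PCT_RUN.sub(repl, dest)

print(decode_non_ascii("R%C3%A9novateur-de-pannes"))  # Rénovateur-de-pannes
print(decode_non_ascii("#%E6%A6%82%E8%BF%B0"))        # #概述
print(decode_non_ascii("a%20b"))                      # a%20b
```

Because every decoded character is non-ASCII, re-encoding the result at render time yields the original escapes again, so the transformation is reversible for the destinations it touches.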