birdie-github closed 4 years ago
The cleaned link is:
https://www.cnbc.com/2018/10/28/ibm-to-acquire-red-hat-in-deal-valued-at-34-billion.html:-q99Da4jh5BlAfPTV7GrgJ4rKaU
and the desired result is:
https://www.cnbc.com/2018/10/28/ibm-to-acquire-red-hat-in-deal-valued-at-34-billion.html
The colon : is a reserved character in URLs, so we should be able to add it as a delimiter to strip after .html.
The offending code:
I've had a closer look into what characters are allowed, from RFC 1738:
httpurl = "http://" hostport [ "/" hpath [ "?" search ]]
hpath = hsegment *[ "/" hsegment ]
hsegment = *[ uchar | ";" | ":" | "@" | "&" | "=" ]
search = *[ uchar | ";" | ":" | "@" | "&" | "=" ]
[...]
alpha = lowalpha | hialpha
digit = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"
safe = "$" | "-" | "_" | "." | "+"
extra = "!" | "*" | "'" | "(" | ")" | ","
national = "{" | "}" | "|" | "\" | "^" | "~" | "[" | "]" | "`"
punctuation = "<" | ">" | "#" | "%" | <">
reserved = ";" | "/" | "?" | ":" | "@" | "&" | "="
hex = digit | "A" | "B" | "C" | "D" | "E" | "F" | "a" | "b" | "c" | "d" | "e" | "f"
escape = "%" hex hex
unreserved = alpha | digit | safe | extra
uchar = unreserved | escape
So all the unreserved and escape characters are allowed, and the reserved characters are either used as delimiters (/?) or explicitly listed as allowed (;:@&=). This means the only forbidden characters in a URL are: {}|\^~[]`<>#".
In particular, this means that https://www.cnbc.com/2018/10/28/ibm-to-acquire-red-hat-in-deal-valued-at-34-billion.html:-q99Da4jh5BlAfPTV7GrgJ4rKaU
is a fully valid URL.
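To double-check that reading of the grammar, here is a minimal sketch in Python that transcribes the hsegment rule quoted above into a regular expression. The function name is_valid_hpath is made up for illustration and is not part of the add-on; the check only covers the path production, not the host or query.

```python
import re

# hsegment = *[ uchar | ";" | ":" | "@" | "&" | "=" ]
# uchar    = unreserved | escape
# unreserved = alpha | digit | safe ($-_.+) | extra (!*'(),)
HSEGMENT = re.compile(r"(?:[A-Za-z0-9$\-_.+!*'(),;:@&=]|%[0-9A-Fa-f]{2})*")


def is_valid_hpath(path: str) -> bool:
    """Return True if every "/"-separated segment matches the hsegment rule.

    Hypothetical helper: it checks the path against RFC 1738 only.
    """
    return all(HSEGMENT.fullmatch(segment) is not None for segment in path.split("/"))
```

With this transcription, the path of the CNBC link (including the trailing :-q99Da4jh5BlAfPTV7GrgJ4rKaU) passes, while a path containing an unencoded { or a malformed escape such as %zz does not.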
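When #27 is done and we have proper per-domain rules, we can probably match disq.us links and remove a trailing :-[0-9A-Za-z._-]+ from the path (the token in the example contains letters outside the hex range, such as q and j, so the class has to cover the whole alphabet).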
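A rough sketch of what such a per-domain rule could look like, in Python rather than the add-on's own code. The rule table PATH_SUFFIX_RULES and the function apply_domain_rule are hypothetical names, and the pattern is an assumption: it uses a full alphanumeric class because the sample token contains letters outside a-f.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Hypothetical rule table: wrapper host -> pattern to drop from the end of
# the target's path. The character class is an assumption based on the one
# sample token in this issue.
PATH_SUFFIX_RULES = {
    "disq.us": re.compile(r":-[0-9A-Za-z._-]+$"),
}


def apply_domain_rule(wrapper_host: str, target: str) -> str:
    """Strip the per-domain suffix from the target URL's path, if a rule exists."""
    rule = PATH_SUFFIX_RULES.get(wrapper_host)
    if rule is None:
        return target  # no rule for this wrapper: leave the target untouched
    parts = urlsplit(target)
    return urlunsplit(parts._replace(path=rule.sub("", parts.path)))
```

Because the rule is keyed on the wrapping host, colons in targets extracted from other sites are left alone.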
This is currently solved by treating an unencoded colon : as an invalid character. If there are URLs where this character is used unencoded, they will break, and we’ll have to look into that.
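For reference, a small Python sketch of what "colon as an invalid character" amounts to, and of the breakage it risks. This is an assumption about the behaviour, not the add-on's actual code; truncate_at_colon is a made-up name.

```python
def truncate_at_colon(target: str) -> str:
    """Cut the target at the first colon that appears after the authority.

    Hypothetical illustration: the scheme separator and any host:port colon
    are preserved; everything from the first later colon onward is dropped.
    """
    scheme, sep, rest = target.partition("://")
    if not sep:
        return target  # not an absolute URL, leave it alone
    authority, slash, tail = rest.partition("/")
    return scheme + sep + authority + slash + tail.split(":", 1)[0]
```

This cleans the CNBC link as desired, but a legitimate URL with an unencoded colon in its path, for example https://en.wikipedia.org/wiki/Help:Contents, is cut down to https://en.wikipedia.org/wiki/Help.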
Here's an example link:
https://disq.us/url?url=https%3A%2F%2Fwww.cnbc.com%2F2018%2F10%2F28%2Fibm-to-acquire-red-hat-in-deal-valued-at-34-billion.html%3A-q99Da4jh5BlAfPTV7GrgJ4rKaU&cuid=1319929
Note that Disqus uses ":" (encoded as %3A) as a URL delimiter, which doesn't quite work with the Clean Links add-on.
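To make the problem concrete, here is a short Python sketch (not the add-on's code) that pulls the embedded target out of the example link using the standard library. The function name embedded_target is made up for illustration; the parameter name url comes from the example link itself.

```python
from typing import Optional
from urllib.parse import parse_qs, urlsplit


def embedded_target(link: str) -> Optional[str]:
    """Return the percent-decoded value of the "url" query parameter, if any."""
    values = parse_qs(urlsplit(link).query).get("url")
    return values[0] if values else None
```

Decoding the example link this way turns %3A into a literal colon, so the extracted target ends in .html:-q99Da4jh5BlAfPTV7GrgJ4rKaU rather than .html, which is exactly the cleaned link reported in this issue.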