Properly cleaning disqus comments outgoing links

birdie-github commented 6 years ago

Here's an example link:

https://disq.us/url?url=https%3A%2F%2Fwww.cnbc.com%2F2018%2F10%2F28%2Fibm-to-acquire-red-hat-in-deal-valued-at-34-billion.html%3A-q99Da4jh5BlAfPTV7GrgJ4rKaU&cuid=1319929

Note that Disqus appends its own token to the target URL, using ":" (encoded as %3A) as the delimiter, and the Clean Links add-on does not handle this correctly.

Cimbali commented 6 years ago

The cleaned link is: https://www.cnbc.com/2018/10/28/ibm-to-acquire-red-hat-in-deal-valued-at-34-billion.html:-q99Da4jh5BlAfPTV7GrgJ4rKaU and the desired result is: https://www.cnbc.com/2018/10/28/ibm-to-acquire-red-hat-in-deal-valued-at-34-billion.html
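To make the problem concrete, here is a minimal sketch of how the `url` parameter decodes. The helper name `extractDisqusTarget` is made up for illustration and is not the add-on's actual code; it only relies on the standard `URL` API.

```javascript
// Hypothetical helper, not CleanLinks' real implementation.
// Returns the percent-decoded value of the "url" query parameter.
function extractDisqusTarget(link) {
  const outer = new URL(link);
  // searchParams.get() decodes %3A to ":" and %2F to "/"
  return outer.searchParams.get('url');
}

const link = 'https://disq.us/url?url=https%3A%2F%2Fwww.cnbc.com%2F2018%2F10%2F28%2Fibm-to-acquire-red-hat-in-deal-valued-at-34-billion.html%3A-q99Da4jh5BlAfPTV7GrgJ4rKaU&cuid=1319929';
const target = extractDisqusTarget(link);

// The Disqus token is still attached after ".html"
console.log(target);
```

The decoded value still ends in `:-q99Da4jh5BlAfPTV7GrgJ4rKaU`, so plain decoding is not enough.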

The colon : is a reserved character in URLs, so we should be able to add it as a delimiter and strip everything after .html. The code in question:

https://github.com/Cimbali/CleanLinks/blob/78a6b8e31db5e3f4fba1c178d65a0cd8d82f2e2c/addon/cleanlink.js#L295-L300

Cimbali commented 5 years ago

I've had a closer look into what characters are allowed, from RFC 1738:

httpurl        = "http://" hostport [ "/" hpath [ "?" search ]]
hpath          = hsegment *[ "/" hsegment ]
hsegment       = *[ uchar | ";" | ":" | "@" | "&" | "=" ]
search         = *[ uchar | ";" | ":" | "@" | "&" | "=" ]

[...]

alpha          = lowalpha | hialpha
digit          = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"
safe           = "$" | "-" | "_" | "." | "+"
extra          = "!" | "*" | "'" | "(" | ")" | ","
national       = "{" | "}" | "|" | "\" | "^" | "~" | "[" | "]" | "`"
punctuation    = "<" | ">" | "#" | "%" | <">

reserved       = ";" | "/" | "?" | ":" | "@" | "&" | "="
hex            = digit | "A" | "B" | "C" | "D" | "E" | "F" | "a" | "b" | "c" | "d" | "e" | "f"
escape         = "%" hex hex

unreserved     = alpha | digit | safe | extra
uchar          = unreserved | escape

So all the unreserved and escaped characters are allowed, and the reserved characters are either used as delimiters (/?) or explicitly listed as allowed (;:@&=). This means the only characters from these lists that are forbidden in an HTTP URL are the national and punctuation ones: {}|\^~[]`<>#" (and % unless it introduces an escape).
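As a rough check of that reading, the `hsegment` rule can be transcribed into a regular expression. This is only a sketch of the RFC 1738 grammar quoted above; the names `HSEGMENT` and `isValidHsegment` are invented here.

```javascript
// hsegment = *[ uchar | ";" | ":" | "@" | "&" | "=" ]
// uchar    = unreserved | escape
// unreserved = alpha | digit | safe ($-_.+) | extra (!*'(),)
const HSEGMENT = /^(?:[A-Za-z0-9$\-_.+!*'(),;:@&=]|%[0-9A-Fa-f]{2})*$/;

// Hypothetical helper: true if the string is a valid RFC 1738 path segment.
function isValidHsegment(segment) {
  return HSEGMENT.test(segment);
}

const withToken = isValidHsegment(
  'ibm-to-acquire-red-hat-in-deal-valued-at-34-billion.html:-q99Da4jh5BlAfPTV7GrgJ4rKaU'
);
const withTilde = isValidHsegment('a~b');   // "~" is in the national set
const bareEscape = isValidHsegment('100%'); // "%" without two hex digits
console.log(withToken, withTilde, bareEscape);
```

Under this grammar, the segment with the Disqus token passes, which is why the colon cannot be rejected on validity grounds alone.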

In particular, this means that https://www.cnbc.com/2018/10/28/ibm-to-acquire-red-hat-in-deal-valued-at-34-billion.html:-q99Da4jh5BlAfPTV7GrgJ4rKaU is a fully valid URL.

When #27 is done and we have proper per-domain rules, we can probably match disq.us links and remove a trailing :-[0-9a-zA-Z._-]+ from the path (the token in the example contains letters outside the hex range).
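A per-domain rule of that kind might look roughly like the following. This is a sketch, not what #27 actually implements: `cleanDisqusLink` and `DISQUS_TOKEN` are hypothetical names, and the character class is widened to all ASCII letters on the assumption that the token is not strictly hexadecimal (the example token contains `q`, `j`, `h`, and so on).

```javascript
// Assumption: the Disqus token is ":-" followed by letters, digits, ".", "_" or "-"
const DISQUS_TOKEN = /:-[0-9A-Za-z._-]+$/;

// Hypothetical per-domain rule: only touches links on disq.us.
function cleanDisqusLink(link) {
  const outer = new URL(link);
  if (outer.hostname !== 'disq.us') return link;

  const target = outer.searchParams.get('url');
  if (target === null) return link;

  // Strip the token from the path only, leaving any query string alone
  const inner = new URL(target);
  inner.pathname = inner.pathname.replace(DISQUS_TOKEN, '');
  return inner.href;
}

const cleaned = cleanDisqusLink(
  'https://disq.us/url?url=https%3A%2F%2Fwww.cnbc.com%2F2018%2F10%2F28%2Fibm-to-acquire-red-hat-in-deal-valued-at-34-billion.html%3A-q99Da4jh5BlAfPTV7GrgJ4rKaU&cuid=1319929'
);
console.log(cleaned);
```

Restricting the rule to the disq.us host keeps colons in other sites' paths untouched.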

Cimbali commented 4 years ago

This is currently solved by treating a colon : as an invalid character. If there are URLs where this character is used unencoded, they will break, and we’ll have to look into that.
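The trade-off can be illustrated with a deliberately simplified sketch. `cutAtColon` is a hypothetical function, not the add-on's actual logic: it simply truncates a URL at the first colon after the scheme separator.

```javascript
// Hypothetical simplification: treat any colon after "://" as the end of the URL.
function cutAtColon(url) {
  const schemeEnd = url.indexOf('://');
  const start = schemeEnd === -1 ? 0 : schemeEnd + 3;
  const colon = url.indexOf(':', start);
  return colon === -1 ? url : url.slice(0, colon);
}

// The Disqus case is fixed...
const fixed = cutAtColon(
  'https://www.cnbc.com/2018/10/28/ibm-to-acquire-red-hat-in-deal-valued-at-34-billion.html:-q99Da4jh5BlAfPTV7GrgJ4rKaU'
);
// ...but a valid path containing an unencoded colon is truncated.
const broken = cutAtColon('https://example.org/wiki/Help:Contents');
console.log(fixed, broken);
```

In this naive form an explicit port such as `example.org:8080` would be cut as well, so a real implementation would need to handle the authority part separately.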

Cimbali / CleanLinks

Properly cleaning disqus comments outgoing links #49