Cimbali / CleanLinks

Converts obfuscated/nested links to genuine clean links.
https://addons.mozilla.org/en-GB/firefox/addon/clean-links-webext/
Mozilla Public License 2.0

`:` used in URLs #105

Closed Cimbali closed 4 years ago

Cimbali commented 4 years ago

Wikimedia uses colons (:) in its URLs, whereas Disqus appends a trailing : followed by a token to the embedded URL, as seen in #49. Two examples we have right now:

https://disq.us/url?url=https%3A%2F%2Fscholarlyoa.com%2Fpublishers%2F%3A-EibzAO-QGxTovjeNTBl4GVHW68&cuid=1072384

https://disq.us/url?url=https%3A%2F%2Fwww.cnbc.com%2F2018%2F10%2F28%2Fibm-to-acquire-red-hat-in-deal-valued-at-34-billion.html%3A-q99Da4jh5BlAfPTV7GrgJ4rKaU&cuid=1319929
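To make the trailing colon visible, here is a minimal Python sketch that decodes the `url` query parameter of the first example. It only illustrates the shape of the embedded URL and is not CleanLinks' own code; the helper name `embedded_url` is made up for this example.

```python
from urllib.parse import urlsplit, parse_qs


def embedded_url(link: str) -> str:
    """Return the percent-decoded `url` query parameter of a redirect link."""
    query = parse_qs(urlsplit(link).query)
    return query["url"][0]


link = (
    "https://disq.us/url?url=https%3A%2F%2Fscholarlyoa.com%2Fpublishers%2F"
    "%3A-EibzAO-QGxTovjeNTBl4GVHW68&cuid=1072384"
)
print(embedded_url(link))
# https://scholarlyoa.com/publishers/:-EibzAO-QGxTovjeNTBl4GVHW68
```

The decoded target ends in `:` plus a token, which is the part that has to go without touching colons that legitimately belong to the URL.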

So the current solution, which is to remove the colon from the allowed characters post-rewrite, breaks Wikimedia pages.

The main options are:

  1. a slightly amended path rewrite (very lightly used right now, only for Amazon and TripAdvisor) that applies search/replace on the path + query
  2. a new type of rule, e.g. search/replace on query parameters
  3. a different rule type, post-de-embedding, i.e. match on the containing link and apply to the embedded link
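
Option 3 could look roughly like the Python sketch below: the rule is selected by the host of the containing link and then applied to the embedded link only. The rule table `POST_EMBED_RULES`, the function `clean`, the suffix regex, and the Wikimedia URL are all assumptions made for illustration; none of them come from the CleanLinks source.

```python
import re
from urllib.parse import urlsplit, parse_qs

# Hypothetical rule table: host of the containing link -> pattern to strip
# from the END of the embedded link (assumed: ':' plus URL-safe characters).
POST_EMBED_RULES = {
    "disq.us": re.compile(r":[A-Za-z0-9_-]+$"),
}


def clean(link: str) -> str:
    """De-embed `link` and apply the post-de-embedding rule of its container."""
    parts = urlsplit(link)
    rule = POST_EMBED_RULES.get(parts.hostname or "")
    embedded = parse_qs(parts.query).get("url")
    if rule is None or not embedded:
        return link  # not a known container: colons are left alone
    return rule.sub("", embedded[0])


disqus = (
    "https://disq.us/url?url=https%3A%2F%2Ffedoraproject.org%2Fwiki%2FChanges"
    "%2FEnableEarlyoom%3AfLTwGESsvWMx9ZBSsJynUN_w0lE&cuid=1319929"
)
wikimedia = "https://commons.wikimedia.org/wiki/File:Example.jpg"  # illustrative URL

print(clean(disqus))     # https://fedoraproject.org/wiki/Changes/EnableEarlyoom
print(clean(wikimedia))  # unchanged, the colon after "File" survives
```

Because the pattern is anchored to the end of the embedded link and keyed on the container host, a Wikimedia link keeps its colons whether it is visited directly or wrapped by Disqus.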

birdie-github commented 4 years ago

I don't see :- being used any longer; however, all Disqus redirects contain ?cuid=PUBLISHER_ID.

Cimbali commented 4 years ago

Well that’s good news. It will solve the problem pretty easily.

Cimbali commented 4 years ago

Also, @birdie-github, can you provide an example of this cuid being used? Is it after being redirected or before? If it’s a redirect we should be skipping it anyway, no?

birdie-github commented 4 years ago

Just check out the comments section here, which contains a ton of links, or any news piece here.

Cimbali commented 4 years ago

Hmmm, even with all add-ons disabled I can’t seem to find any request containing cuid.

birdie-github commented 4 years ago

I don't know:

https://disq.us/url?url=https%3A%2F%2Fbugzilla.kernel.org%2Fshow_bug.cgi%3Fid%3D206175%3Ajl2Ouee8NpzLG3OiybfV3xqN9_U&cuid=1319929

https://disq.us/url?url=https%3A%2F%2Finterworks.com%2Fwp-content%2Fuploads%2Fsites%2Fdefault%2Ffiles%2Fblog%2Fu34%2FBlog-rdp-ss5.png%3A7M3TR8p5PbUMRlYCcRLh0Tr-wwk&cuid=1319929

http://disq.us/url?url=http%3A%2F%2FX.org%3AbDT8El1wWYoSCgh_DypT03lvl74&cuid=1319929

https://disq.us/url?url=https%3A%2F%2Fwww.reddit.com%2Fr%2Flinux%2Fcomments%2Fcmg48b%2Flets_talk_about_the_elephant_in_the_room_the%2F%3Aa67XTuqAJPIDvCZIzjy3rCskDso&cuid=1319929

https://disq.us/url?url=https%3A%2F%2Ffedoraproject.org%2Fwiki%2FChanges%2FEnableEarlyoom%3AfLTwGESsvWMx9ZBSsJynUN_w0lE&cuid=1319929

&cuid=NUMBER is used for every link.

birdie-github commented 4 years ago

It's even in your own original bug report :-)

Cimbali commented 4 years ago

Thanks! I took those links from previous bug reports.

I thought we were looking at requests, not links, so I was looking in the network log… my bad. Anyway, those cuid parameters don’t matter, as CleanLinks will redirect the disq.us request to the embedded URL.
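
As a rough illustration of why cuid is irrelevant once the request is redirected to the embedded URL, the Python sketch below separates the `url` parameter from everything else in the query. The helper `split_redirect` is hypothetical and only mimics the idea, not the add-on's actual logic.

```python
from urllib.parse import urlsplit, parse_qsl


def split_redirect(link: str):
    """Return (embedded target, leftover query parameters) for a redirect link."""
    params = dict(parse_qsl(urlsplit(link).query))
    target = params.pop("url", None)  # the only parameter the redirect needs
    return target, params             # leftovers such as cuid are dropped


target, leftovers = split_redirect(
    "http://disq.us/url?url=http%3A%2F%2FX.org%3AbDT8El1wWYoSCgh_DypT03lvl74"
    "&cuid=1319929"
)
print(target)     # http://X.org:bDT8El1wWYoSCgh_DypT03lvl74
print(leftovers)  # {'cuid': '1319929'}
```

The cuid value never reaches the destination; only the trailing `:` token inside the embedded URL still needs cleaning.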