Cimbali / CleanLinks

Converts obfuscated/nested links to genuine clean links.
https://addons.mozilla.org/en-GB/firefox/addon/clean-links-webext/
Mozilla Public License 2.0

Bing Result Rewrite #148

Open · OdinVex opened 2 years ago

OdinVex commented 2 years ago

Bing has taken to obfuscating their search results.

Example: https://www.bing.com/fd/ls/GLinkPing.aspx?IG=9AFD7A29FFBB46F1B9A81FF058C0640E&&ID=SERP,5206.1&url=https%3A%2F%2Fwww.bing.com%2Fck%2Fa%3F!%26%26p%3D0d3eae76a1129f5f677a93348d0d5d6ee2f5906c36c38d0cad86b467db7afa8aJmltdHM9MTY1MjkzNTY0NSZpZ3VpZD05YWZkN2EyOS1mZmJiLTQ2ZjEtYjlhOC0xZmYwNThjMDY0MGUmaW5zaWQ9NTIwNg%26ptn%3D3%26fclid%3Dc74e5d8f-d72e-11ec-a4f0-c181b1119cc1%26u%3Da1aHR0cHM6Ly93d3cueW91dHViZS5jb20vd2F0Y2g_dj1YSHp0a0ZDemJUVQ%26ntb%3D1

Becomes: https://www.bing.com/ck/a?!&&p=db4b3ebd2e9cc9b2ea00045df2468659404c3b0196457cff9450c3c8c5a7e73dJmltdHM9MTY1MjkxMTQ0MiZpZ3VpZD0yMGZhZDE5My0zNTMxLTRiNWEtOGFmOC1mYzdiODUyODRlOTEmaW5zaWQ9NTIyMA&ptn=3&fclid=6cea1929-d6f6-11ec-8509-a5dffd6b7157&u=a1aHR0cHM6Ly93d3cueW91dHViZS5jb20vd2F0Y2g_dj1YSHp0a0ZDemJUVQ&ntb=1

Should be: https://www.youtube.com/watch?v=XHztkFCzbTU

The section from halfway through the p parameter to the end is base64-encoded; it decodes to &imts=1652911442&iguid=20fad193-3531-4b5a-8af8-fc7b85284e91&insid=5220. They're stepping up their crap. Just strip the first two characters of the u parameter; the rest, aHR0cHM6Ly93d3cueW91dHViZS5jb20vd2F0Y2g_dj1YSHp0a0ZDemJUVQ, is the URL, base64-encoded. :)
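To illustrate the decoding described above, a quick sketch in Node.js syntax, using the example u= value from this report:

```javascript
// Bing's u= parameter from the example above, including its 2-character prefix.
const u = "a1aHR0cHM6Ly93d3cueW91dHViZS5jb20vd2F0Y2g_dj1YSHp0a0ZDemJUVQ";

// Drop the prefix, then map the URL-safe alphabet (RFC 4648 §5) back to the
// standard one so a standard base64 decoder accepts it.
const standard = u.slice(2).replace(/-/g, "+").replace(/_/g, "/");
const url = Buffer.from(standard, "base64").toString("utf8");

console.log(url); // https://www.youtube.com/watch?v=XHztkFCzbTU
```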

Edit: I know it's a bit beyond the scope of CleanLinks' typical parameter allow/block-listing, but I think it'd be helpful to evolve it.

OdinVex commented 2 years ago

I've determined that the base64 alphabet assumed by CleanLinks is incomplete. While RFC 4648 states that base 64 encoding uses A-Za-z0-9+/=, it also defines, in Section 5 (Base 64 Encoding with URL and Filename Safe Alphabet), a base64url variant that does not use + and /, but rather - and _ in their places (nothing else changes). You could extend const base64_encoded_url = /\b(?:aHR0|d3d3)[A-Z0-9+=/]+/iu to include those characters and handle both, though it might be better to separate them into base 64 encoding and base 64 url encoding, to be proper.
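A sketch of such an extended pattern (the regex name comes from the thread; the test strings are illustrative encodings of http://example.com and of the YouTube link from this report):

```javascript
// Extended to cover both the standard (+ /) and URL-safe (- _) alphabets.
const base64_encoded_url = /\b(?:aHR0|d3d3)[A-Za-z0-9+=/\-_]+/u;

// Standard-alphabet encoding of "http://example.com".
const std = "aHR0cDovL2V4YW1wbGUuY29t";
// URL-safe encoding (note the _) of the watch?v= link above.
const urlSafe = "aHR0cHM6Ly93d3cueW91dHViZS5jb20vd2F0Y2g_dj1YSHp0a0ZDemJUVQ";

console.log(base64_encoded_url.test(std));     // true
console.log(base64_encoded_url.test(urlSafe)); // true
```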

OdinVex commented 2 years ago

On a separate note, the pre-existing base 64 character class is not base 64: taken without the case-insensitivity flag, it does not include a-z, making it effectively base 32. See Sections 4, 5, 6, and 7 of RFC 4648, and, if you think it might be beneficial, Section 8 as well.

OdinVex commented 2 years ago

All in all, you should use /\b(?:aHR0|d3d3)[A-Za-z0-9+=/]+/u for base 64 encoding and /\b(?:aHR0|d3d3)[A-Za-z0-9\-=_]+/u for base 64 url encoding, and then decide whether you'd also like to support base 32 (which, judging by the character class alone rather than the regex flag, is what you're actually supporting right now; a kinda iffy way to write it), base 16, et cetera. (You're relying on case-insensitivity; I wonder if it might be faster to specify the characters explicitly instead. I don't know, I don't do much performance optimization.)

Cimbali commented 2 years ago

We’re matching base64 currently, otherwise the encoding for http and www wouldn’t start with aHR0 and d3d3. You might be confused by the fact that the regex is case insensitive. I traced that back to the original XUL source code − not sure why it’s there but certainly not a performance concern.

On the use of url-base64 instead of normal base64: that would make more sense, I agree, but CleanLinks has been built to decode what websites use − not what websites should use. Since browsers' atob and btoa − and likely other languages' base64 functions − use normal base 64 encoding, that's what was found in the wild.
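The point about atob is easy to check: it implements only the standard alphabet, so URL-safe characters make it throw. A minimal demonstration:

```javascript
// Standard base64 decodes fine.
console.log(atob("aHR0cA==")); // http

// The same kind of data in the URL-safe alphabet is rejected: "_" is not
// in atob()'s alphabet, so it throws (InvalidCharacterError in browsers).
let threw = false;
try {
  atob("d2F0Y2g_"); // "watch?" in URL-safe base64
} catch (e) {
  threw = true;
}
console.log(threw); // true
```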

It seems to be the case that Bing uses proper url-base64, in which case we need to:

  1. Add support for URL base 64: a new regex match, plus decoding along the lines of atob(base64match.replace(/-/g, '+').replace(/_/g, '/'))
  2. Add a rewrite rule on https://www.bing.com/ck/ to strip the first 2 characters of the u= parameter
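A minimal sketch of both steps, assuming the match and rule plumbing already exist (function names here are hypothetical, not CleanLinks' actual API):

```javascript
// Step 1: decode a URL-safe base64 string by mapping the RFC 4648 §5
// alphabet back to the standard one before handing it to atob(). The
// regexes must be global: String.replace with a plain-string pattern
// would only swap the first occurrence.
function decodeUrlBase64(s) {
  return atob(s.replace(/-/g, "+").replace(/_/g, "/"));
}

// Step 2: for https://www.bing.com/ck/ links, strip the 2-character
// prefix from the u= parameter before decoding.
function cleanBingU(u) {
  return decodeUrlBase64(u.slice(2));
}

const link = cleanBingU("a1aHR0cHM6Ly93d3cueW91dHViZS5jb20vd2F0Y2g_dj1YSHp0a0ZDemJUVQ");
console.log(link); // https://www.youtube.com/watch?v=XHztkFCzbTU
```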

That should be enough for the usual mechanisms to kick in. If you’re willing to write a PR I can make some time to look at it.

Frankly, I don’t think we can assume that any arbitrary string containing aHR0 is a link. CL is already breaking enough websites as it is, let’s not add more spurious behaviour.

OdinVex commented 2 years ago

We’re matching base64 currently, otherwise the encoding for http and www wouldn’t start with aHR0 and d3d3. You might be confused by the fact that the regex is case insensitive. I traced that back to the original XUL source code − not sure why it’s there but certainly not a performance concern.

Yes and no. Your regex relies on the case-insensitivity flag instead of spelling out the algorithm's limited character set directly. Relying on a flag separate from the actual character list to define the alphabet is slightly vague; you should turn off insensitivity and simply include the characters. It might also be slightly faster if the backend routines don't need to attempt case-insensitive matching. On performance, I'm only saying it might be slightly faster; the difference is probably minuscule, I'm merely trying to be conscious of it.

On the use of url-base64 instead of normal base64, that would make more sense I agree, but CleanLinks has been built to decode what websites use − not what websites should use. Since browser’s atob and btoa − and likely other languages base64 functions − use normal base 64 encoding, that’s what was found in the wild.

Bing is using it above, and it's the proper alphabet according to the RFC, so it wouldn't hurt to simply support it. I'm not sure the replace statements as written are safe, though; might they not throw off the decoding? You may need to verify that simply replacing the characters works.

That should be enough for the usual mechanisms to kick in. If you’re willing to write a PR I can make some time to look at it.

I've already attempted to write a rule, but I failed miserably; I have a hard time using this rules engine.

Frankly, I don’t think we can assume that any arbitrary string containing aHR0 is a link. CL is already breaking enough websites as it is, let’s not add more spurious behaviour.

I don't think it is spurious. If anything, it's more proper to use the correct URL- and filename-safe encoding than not.

Cimbali commented 2 years ago

Yes and no.

We match strings starting with aHR0 and d3d3, which correspond to the base64-encoded prefixes http and www. The base-32 prefixes would be NB2HI4 and O53XO, which we don’t match. Case insensitivity just means we accidentally match starts like AhR0 too. It doesn’t affect performance to a measurable degree, and it’s not been a concern so far; the regex itself is just legacy.
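For anyone wanting to check the prefixes quoted here, a throwaway RFC 4648 base32 encoder (Node has no built-in one; this is just for the demonstration, not addon code):

```javascript
// Minimal base32 encoder, enough to compute the prefixes of short strings.
function base32(buf) {
  const alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ234567";
  let bits = "";
  for (const byte of buf) bits += byte.toString(2).padStart(8, "0");
  // Pad to a multiple of 5 bits (ignoring '=' output padding here).
  bits += "0".repeat((5 - (bits.length % 5)) % 5);
  let out = "";
  for (let i = 0; i < bits.length; i += 5)
    out += alphabet[parseInt(bits.slice(i, i + 5), 2)];
  return out;
}

console.log(base32(Buffer.from("www")));  // O53XO
console.log(base32(Buffer.from("http"))); // NB2HI4A
```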

Bing is using it above.

Yes, so we can now support it. So far, websites just percent-encoded the base64-encoded string, including + and /, like any other string passed into a URL.

I've a hard time using this rules engine.

It’s been a challenge writing something that works, but I’m even worse with interfaces. Not sure how to make it more user-friendly, especially with the limited resources I can spare at the moment. I think the wiki is a good read for starters, but it’s still pretty limited and doesn’t cover the latest features like search-replace.

Also it seems the search-replace only applies to the path part of the URL and we’d be interested in modifying a parameter here.

I don't think it is spurious. If anything, it's more proper to use correct encoding for file/url than not.

Decoding random strings like FooBaraHR0bazQux into URLs would be spurious. Decoding the specific url-alphabet base-64 is fine.

OdinVex commented 2 years ago

We match strings starting with aHR0 and d3d3, which correspond to base64-encoded prefixes www and http. Base-32 prefixes are O53XO and NB2HI4, which we don’t match. Case insensitivity just means we accidentally match starts like AhR0 too. It doesn’t affect performance to a measurable degree, and it’s also not been a concern so far, the regex itself is just legacy.

Again, that's not what I'm talking about. Your regex character class, ignoring the flag, is base 32; it only functions as base 64 because the flag adds a-z. It's just poorly written. As for performance, I understand it's almost completely inconsequential; I merely thought to do the best job the first time around.

It’s been a challenge writing something that works but I’m even worse with interfaces. Not sure how to make it more user-friendly, especially with limited resources I can spare at the moment. I think the wiki is a good read for starters but still pretty limited and doesn’t include latest features like search-replace.

Oh, it's quite alright, I understand the whole 'I'm a programmer, not a writer' plea; I'm the same way. “Want documentation? Read the code!” Heh. I tried the wiki, but it was almost useless for me.

Decoding random strings like FooBaraHR0bazQux into URLs would be spurious. Decoding the specific url-alphabet base-64 is fine.

Oh, I thought you meant adding b64, b64url, b32, et cetera. My bad.

OdinVex commented 2 years ago

Curious... does CleanLinks get a final, normalized path to work with, e.g. any links that were HTTP becoming http, or not?

Cimbali commented 2 years ago

Both http://track.com/redir?u=aHR0cDovL2V4YW1wbGUuY29t and https://track.com/redir?u=aHR0cDovL2V4YW1wbGUuY29t convert to http://example.com/, if that’s what you’re asking about. If the embedded URL starts with www we inherit the connection, so http://track.com/redir?u=d3d3LmV4YW1wbGUuY29t converts to http://www.example.com/ and https://track.com/redir?u=d3d3LmV4YW1wbGUuY29t converts to https://www.example.com/.

OdinVex commented 2 years ago

Both http://track.com/redir?u=aHR0cDovL2V4YW1wbGUuY29t and https://track.com/redir?u=aHR0cDovL2V4YW1wbGUuY29t convert to http://example.com/, if that’s what you’re asking about. If the embedded URL starts with www we inherit the connection, so http://track.com/redir?u=d3d3LmV4YW1wbGUuY29t converts to http://www.example.com/ and https://track.com/redir?u=d3d3LmV4YW1wbGUuY29t converts to https://www.example.com/.

I meant, if someone wrote HtTp://, does CleanLinks get a nice http://?

OdinVex commented 1 year ago

Follow-Up: I ended up removing the atob/btoa routines; they can't handle RFC 4648 base64url unless you swap - and _ back to + and / respectively. See the Wikipedia article on Base64 for more on what I'm talking about. YouTube and Bing both use base64url, and CleanLinks hadn't been working properly on either for a while now. (YouTube had been screwing up for about three months, but I thought a removed video was redirecting to their home page; I so rarely use YouTube, I didn't notice.) Bing was partially working. All my links are fixed and no longer truncated, though the highlighting in the lists is still incorrect.

Edit: If you swap them, the strings decode correctly, and you can simply add the other two base64url characters to the regex. I went with (in testing; the whole Bing/YouTube thing could be handled better):

In /addon/modules/cleanlink.js:

const base64_encoded_url = /(?:aHR0|d3d3)[A-Z0-9+=/\-_]+/iu

Inside function decode_embedded_uri(link, rules, original_string):

const [base64match] = str.match(base64_encoded_url) || str.substring(2).match(base64_encoded_url) || [];
...
decoded = decodeURIComponent(atob(base64match.replace(/-/g, '+').replace(/_/g, '/')));

OdinVex commented 1 year ago

Bing has now taken to using relative links: L2ltYWdlcy9zZWFyY2g/cT0uLi4mRk9STT1IRFJTQzM= decodes to /images/search?q=...&FORM=HDRSC3, for example. If the forward slash is reliably always encoded as a leading 'L', perhaps that can be used to whittle down which parameters need to be tested for embedded URLs.
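A hedged sketch of that heuristic: since '/' is 0x2F and its first six bits are 001011 (index 11, the letter 'L'), any base64 chunk whose decoded value starts with '/' must begin with 'L'. The helper name below is hypothetical, not CleanLinks code:

```javascript
// If a parameter starts with "L" it *may* be a base64-encoded relative
// link: '/' (0x2F) always encodes to a leading 'L'. The converse doesn't
// hold ('L' also covers first bytes 0x2C-0x2E), so decode and confirm.
function maybeRelativeLink(param) {
  if (!param.startsWith("L")) return null;
  const decoded = Buffer.from(param, "base64").toString("utf8");
  return decoded.startsWith("/") ? decoded : null;
}

console.log(maybeRelativeLink("L2ltYWdlcy9zZWFyY2g/cT0uLi4mRk9STT1IRFJTQzM="));
// → /images/search?q=...&FORM=HDRSC3
```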