Open OdinVex opened 2 years ago
I've determined that the base64 character set assumed by Clean Links is incomplete. While RFC 4648 states that base 64 encoding uses `A-Za-z0-9+/=`, it also notes in Section 5 (Base 64 Encoding with URL and Filename Safe Alphabet) that base64url does not use `+` and `/`, but rather `-` and `_` (nothing else changes; `-` and `_` are simply used in their places). You could extend `const base64_encoded_url = /\b(?:aHR0|d3d3)[A-Z0-9+=/]+/iu` to include those characters and handle both, though it might be better to keep them separate as base 64 encoding and base 64 url encoding, to be proper.
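The relationship described above can be sketched as follows; this is just an illustration of the RFC 4648 §5 substitution, using Node's `Buffer` to produce bytes whose standard encoding actually contains `+` and `/`:

```javascript
// base64 → base64url is a pure character substitution (RFC 4648 §5):
// '+' ↔ '-' and '/' ↔ '_'; the bit math is otherwise identical.
const toBase64Url   = (b64)    => b64.replace(/\+/g, '-').replace(/\//g, '_');
const fromBase64Url = (b64url) => b64url.replace(/-/g, '+').replace(/_/g, '/');

// The bytes 0xFB 0xEF 0xFF encode to "++//" in the standard alphabet:
const std = Buffer.from([0xfb, 0xef, 0xff]).toString('base64'); // "++//"
toBase64Url(std);                // "--__"
fromBase64Url(toBase64Url(std)); // back to "++//"
```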
On a separate note, the pre-existing base 64 encoding method is not base 64; it is base 32, because its character class does not also include `a-z`. See RFC 4648 sections 4, 5, 6, 7, and, if you think it might be beneficial, 8 as well.
All in all, you should use `/\b(?:aHR0|d3d3)[A-Za-z0-9+=/]+/u` for base 64 encoding and `/\b(?:aHR0|d3d3)[A-Za-z0-9\-=_]+/u` for base 64 url encoding, and then decide whether you'd also like to support base 32 (which, going by the character set alone and ignoring the regex flag, is what you're actually supporting right now; a kinda iffy way to write it), base 16, et cetera. (You're using case-insensitivity, I'm sure; I wonder if it might be faster without it, specifying the characters explicitly. I don't know, I don't do much performance optimization.)
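To make the split concrete, here is a small check (runnable in Node) of how the two patterns proposed above behave on a url-safe payload; the sample string is a base64url-encoded YouTube URL containing `_`:

```javascript
// The two character classes proposed above, side by side:
const base64_encoding     = /\b(?:aHR0|d3d3)[A-Za-z0-9+=/]+/u;
const base64_url_encoding = /\b(?:aHR0|d3d3)[A-Za-z0-9\-=_]+/u;

// A base64url payload containing '_' (it encodes a YouTube watch URL):
const s = 'aHR0cHM6Ly93d3cueW91dHViZS5jb20vd2F0Y2g_dj1YSHp0a0ZDemJUVQ';

s.match(base64_url_encoding)[0] === s;         // true: whole payload matched
s.match(base64_encoding)[0].length < s.length; // true: match stops at the '_'
```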
We’re matching base64 currently, otherwise the encoding for http and www wouldn’t start with `aHR0` and `d3d3`. You might be confused by the fact that the regex is case insensitive. I traced that back to the original XUL source code; not sure why it’s there, but certainly not a performance concern.
On the use of url-base64 instead of normal base64: that would make more sense, I agree, but CleanLinks has been built to decode what websites use, not what websites should use. Since the browser’s `atob` and `btoa` (and likely other languages’ base64 functions) use normal base 64 encoding, that’s what was found in the wild.
It seems to be the case that Bing uses proper url-base64, in which case we need to:

- decode with `atob(base64match.replace(/-/g, '+').replace(/_/g, '/'))`
- add a rule for `https://www.bing.com/ck/` to strip the first 2 characters of the `u=` parameter

That should be enough for the usual mechanisms to kick in. If you’re willing to write a PR I can make some time to look at it.
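The character swap mentioned above can be sketched as a small helper; `atob` only understands the standard alphabet, so the url-safe characters are mapped back first (the padding restoration is my addition, since base64url typically omits trailing `=`):

```javascript
// Decode a base64url string with the browser's / Node's atob():
function decodeBase64Url(s) {
  // map the url-safe alphabet back to the standard one
  const std = s.replace(/-/g, '+').replace(/_/g, '/');
  // base64url usually drops '=' padding; restore it before decoding
  return atob(std + '='.repeat((4 - std.length % 4) % 4));
}
```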
Frankly, I don’t think we can assume that any arbitrary string containing `aHR0` is a link. CL is already breaking enough websites as it is, let’s not add more spurious behaviour.
> We’re matching base64 currently, otherwise the encoding for http and www wouldn’t start with `aHR0` and `d3d3`. You might be confused by the fact that the regex is case insensitive. I traced that back to the original XUL source code; not sure why it’s there, but certainly not a performance concern.
Yes and no. Your regex relies on case-insensitivity instead of properly defining the limited character set of the algorithm. It's slightly vague to rely on a regex flag, separate from the actual character list, to denote the character space. You should turn off insensitivity and simply include the characters. It might also be slightly faster if the backend routines don't need to attempt case-insensitive matching. As for performance, I'm only saying it might be slightly faster; the difference is probably minuscule, I'm merely trying to be conscious of it.
> On the use of url-base64 instead of normal base64: that would make more sense, I agree, but CleanLinks has been built to decode what websites use, not what websites should use. Since the browser’s `atob` and `btoa` (and likely other languages’ base64 functions) use normal base 64 encoding, that’s what was found in the wild.
Bing is using it above. It's also proper to use it according to the RFC, and it wouldn't hurt to simply have it. I don't know if you can use the `replace` statements, though; might it not throw off the math used to decode? You may need to verify that simply replacing works.
> That should be enough for the usual mechanisms to kick in. If you’re willing to write a PR I can make some time to look at it.
I've already attempted to write a rule, but my ability to make rules with Clean Links fails miserably; I've a hard time using this rules engine.
> Frankly, I don’t think we can assume that any arbitrary string containing `aHR0` is a link. CL is already breaking enough websites as it is, let’s not add more spurious behaviour.
I don't think it is spurious. If anything, it's more proper to use the correct encoding for files/URLs than not.
> Yes and no.
We match strings starting with `aHR0` and `d3d3`, which correspond to the base64-encoded prefixes `http` and `www`. The base-32 prefixes would be `NB2HI4` and `O53XO`, which we don’t match. Case insensitivity just means we accidentally match starts like `AhR0` too. It doesn’t affect performance to a measurable degree, and it’s also not been a concern so far; the regex itself is just legacy.
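As a quick sanity check on where the magic prefixes come from, they are simply the literal strings encoded (using the browser's/Node's `btoa`):

```javascript
// The detection prefixes fall out of encoding the literal strings:
btoa('http'); // "aHR0cA==" — so any http/https URL encodes to "aHR0…"
btoa('www');  // "d3d3"     — exactly 3 bytes, no padding
```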
> Bing is using it above.
Yes, so we can now support it. So far websites just percent-encoded the base64-encoded string, including `+` and `/`, just like they would any other string passed into a URL.
> I've a hard time using this rules engine.
It’s been a challenge writing something that works, and I’m even worse with interfaces. I’m not sure how to make it more user-friendly, especially with the limited resources I can spare at the moment. I think the wiki is a good read for starters, but it’s still pretty limited and doesn’t include the latest features like search-replace. Also, it seems the search-replace only applies to the path part of the URL, and we’d be interested in modifying a parameter here.
> I don't think it is spurious. If anything, it's more proper to use the correct encoding for files/URLs than not.
Decoding random strings like `FooBaraHR0bazQux` into URLs would be spurious. Decoding the specific url-alphabet base-64 is fine.
> We match strings starting with `aHR0` and `d3d3`, which correspond to the base64-encoded prefixes `http` and `www`. The base-32 prefixes would be `NB2HI4` and `O53XO`, which we don’t match. Case insensitivity just means we accidentally match starts like `AhR0` too. It doesn’t affect performance to a measurable degree, and it’s also not been a concern so far; the regex itself is just legacy.
Again, that's not what I'm talking about. Your regex statement, ignoring the flag, is base 32; with the extra `a-z` supplied by the flag, it functions as base 64. It's just poorly written. As for performance, I understand it's almost completely inconsequential; I merely thought to do the best job the first time around.
> It’s been a challenge writing something that works, and I’m even worse with interfaces. I’m not sure how to make it more user-friendly, especially with the limited resources I can spare at the moment. I think the wiki is a good read for starters, but it’s still pretty limited and doesn’t include the latest features like search-replace.
Oh, it's quite alright; I understand the whole 'I'm a programmer, not a writer' plea, I'm the same way. “Want documentation? Read the code!” Heh. I tried the wiki, but it was almost useless for me.
> Decoding random strings like `FooBaraHR0bazQux` into URLs would be spurious. Decoding the specific url-alphabet base-64 is fine.
Oh, I thought you meant adding b64, b64url, b32, et cetera. My bad.
Curious: does Clean Links get a final, cleaned path to work with, such as any links that were `HTTP` becoming `http`, or not?
Both `http://track.com/redir?u=aHR0cDovL2V4YW1wbGUuY29t` and `https://track.com/redir?u=aHR0cDovL2V4YW1wbGUuY29t` convert to `http://example.com/`, if that’s what you’re asking about. If the embedded URL starts with `www` we inherit the connection, so `http://track.com/redir?u=d3d3LmV4YW1wbGUuY29t` converts to `http://www.example.com/` and `https://track.com/redir?u=d3d3LmV4YW1wbGUuY29t` converts to `https://www.example.com/`.
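For reference, the embedded payloads in these examples decode exactly as described (a quick check with the browser's/Node's `atob`):

```javascript
// atob() applied to the embedded u= parameters from the examples above:
atob('aHR0cDovL2V4YW1wbGUuY29t'); // "http://example.com"
atob('d3d3LmV4YW1wbGUuY29t');     // "www.example.com"
```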
> Both `http://track.com/redir?u=aHR0cDovL2V4YW1wbGUuY29t` and `https://track.com/redir?u=aHR0cDovL2V4YW1wbGUuY29t` convert to `http://example.com/`, if that’s what you’re asking about. If the embedded URL starts with `www` we inherit the connection, so `http://track.com/redir?u=d3d3LmV4YW1wbGUuY29t` converts to `http://www.example.com/` and `https://track.com/redir?u=d3d3LmV4YW1wbGUuY29t` converts to `https://www.example.com/`.
I meant: if someone wrote `HtTp://`, does CleanLinks get a nice `http://`?
Follow-Up: I ended up removing the atob/btoa routines; they're not capable of handling RFC 4648 base64url unless you swap `-` and `_` back to `+` and `/` respectively. See the Wikipedia article on Base64 for more information on what I'm talking about. YouTube and Bing both use this, and CleanLinks hasn't been working properly on either for a while now. (YouTube has been screwing up for about three months, but I thought it was a removed video redirecting to their home page; I so rarely use YouTube that I didn't notice.) Bing was a partial failure. All my links are fixed and no longer truncated, though the highlighting in the lists is still incorrect.
Edit: If you swap them, they decode correctly, and you can simply add the other two base64url coding characters into the regex. I went with the following (in testing; the whole Bing/YouTube thing could be handled better).

In `/addon/modules/cleanlink.js`:

```js
const base64_encoded_url = /(?:aHR0|d3d3)[A-Z0-9+=/\-_]+/iu
```

Inside the function `decode_embedded_uri(link, rules, original_string)`:

```js
const [base64match] = str.match(base64_encoded_url) || str.substring(2).match(base64_encoded_url) || [];
...
decoded = decodeURIComponent(atob(base64match.replace(/-/g, '+').replace(/_/g, '/')));
```
Bing is now taking to using relative links: `L2ltYWdlcy9zZWFyY2g/cT0uLi4mRk9STT1IRFJTQzM=` => `/images/search?q=...&FORM=HDRSC3`, as an example. If the forward slash is reliably always encoded as `L`, perhaps that can be used to whittle down how much testing is needed on parameters to detect URLs.
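The leading-`L` observation does hold in general: `/` is byte `0x2F` = `00101111`, and the first base64 character encodes only its top six bits, `001011` = 11 = `L`, regardless of what follows. A quick check:

```javascript
// Any payload beginning with '/' encodes to a base64 string beginning 'L',
// because the first output character is determined by the top 6 bits of '/'.
btoa('/images/search?q=x')[0]; // 'L'
btoa('/totally/different')[0]; // 'L'
```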
Bing has taken to obfuscating their search results.

Example:

`https://www.bing.com/fd/ls/GLinkPing.aspx?IG=9AFD7A29FFBB46F1B9A81FF058C0640E&&ID=SERP,5206.1&url=https%3A%2F%2Fwww.bing.com%2Fck%2Fa%3F!%26%26p%3D0d3eae76a1129f5f677a93348d0d5d6ee2f5906c36c38d0cad86b467db7afa8aJmltdHM9MTY1MjkzNTY0NSZpZ3VpZD05YWZkN2EyOS1mZmJiLTQ2ZjEtYjlhOC0xZmYwNThjMDY0MGUmaW5zaWQ9NTIwNg%26ptn%3D3%26fclid%3Dc74e5d8f-d72e-11ec-a4f0-c181b1119cc1%26u%3Da1aHR0cHM6Ly93d3cueW91dHViZS5jb20vd2F0Y2g_dj1YSHp0a0ZDemJUVQ%26ntb%3D1`

Becomes:

`https://www.bing.com/ck/a?!&&p=db4b3ebd2e9cc9b2ea00045df2468659404c3b0196457cff9450c3c8c5a7e73dJmltdHM9MTY1MjkxMTQ0MiZpZ3VpZD0yMGZhZDE5My0zNTMxLTRiNWEtOGFmOC1mYzdiODUyODRlOTEmaW5zaWQ9NTIyMA&ptn=3&fclid=6cea1929-d6f6-11ec-8509-a5dffd6b7157&u=a1aHR0cHM6Ly93d3cueW91dHViZS5jb20vd2F0Y2g_dj1YSHp0a0ZDemJUVQ&ntb=1`

Should be:

`https://www.youtube.com/watch?v=XHztkFCzbTU`

The section from halfway through the `p` parameter to the end is base64-encoded: `&imts=1652911442&iguid=20fad193-3531-4b5a-8af8-fc7b85284e91&insid=5220`. They're stepping up their crap. Just strip the first two characters of the `u` parameter, and the rest, `aHR0cHM6Ly93d3cueW91dHViZS5jb20vd2F0Y2g_dj1YSHp0a0ZDemJUVQ`, is the URL, base64-encoded. :)

Edit: I know it's a bit beyond the scope of Clean Links' typical parameter allow/block-listing, but I think it'd be helpful to evolve it?
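A rule along these lines could handle Bing's `u=` parameter; here is a hypothetical sketch (the helper name is mine, not CleanLinks code), assuming the 2-character `a1` tag always prefixes the payload:

```javascript
// Hypothetical helper: extract Bing's u= parameter, drop its 2-char "a1"
// tag, then decode the base64url remainder with atob().
function decodeBingU(trackingUrl) {
  const u = new URL(trackingUrl).searchParams.get('u');
  if (!u) return null;
  // strip the tag, map the url-safe alphabet back, restore '=' padding
  const b64 = u.slice(2).replace(/-/g, '+').replace(/_/g, '/');
  return atob(b64 + '='.repeat((4 - b64.length % 4) % 4));
}
```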