diegocr / CleanLinks

Converts obfuscated/nested links to genuine clean links.

CleanLinks Pulling Result URL from 'referrer=' Value #147

Open ghost opened 8 years ago

ghost commented 8 years ago

CleanLinks changes

https://login.xmarks.com/?referrer=https%3A%2F%2Flastpass.com%2Ffeatures_joinpremiumxmarks2.php%3Flpuser%3Dlastpass%2540mailinator.com&append=1

to

https://lastpass.com/features_joinpremiumxmarks2.php?lpuser=lastpass%40mailinator.com

I would like CleanLinks to change the URL to

https://login.xmarks.com/

Though I would settle for

https://login.xmarks.com/?append=1
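
The two requested results can be sketched in Python. This is only an illustration of the behavior being asked for, not CleanLinks code; `strip_params` is a hypothetical helper, and it assumes the removal pattern must match a whole parameter name.

```python
import re
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode


def strip_params(url, name_pattern):
    """Drop every query parameter whose name fully matches name_pattern.

    Hypothetical helper: it only models the requested result.
    """
    parts = urlsplit(url)
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if not re.fullmatch(name_pattern, name)
    ]
    # urlunsplit omits the "?" when the remaining query string is empty.
    return urlunsplit(parts._replace(query=urlencode(kept)))


url = (
    "https://login.xmarks.com/?referrer=https%3A%2F%2Flastpass.com"
    "%2Ffeatures_joinpremiumxmarks2.php%3Flpuser%3Dlastpass%2540mailinator.com"
    "&append=1"
)

print(strip_params(url, r"referrer"))         # https://login.xmarks.com/?append=1
print(strip_params(url, r"referrer|append"))  # https://login.xmarks.com/
```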

For the patterns in the Remove From Links field, I think you use '=' as an implied terminator. So

(?:ref|aff)\w*

should match

referrer=

But I tried to provide a more specific pattern for

referrer=

To the end of the Remove From Links defaults, I added

|(?:ref(?:er(?:r?er)?)?)\w*

I added my pattern to the end so I would not change any of the patterns you provide as defaults. I included 'ref' in my pattern so that my pattern could stand on its own and still match 'ref'.
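
A quick check that both patterns do match the parameter name, written as a Python sketch. The way the fragment is wrapped (anchored at the start of a parameter and followed by '=') is my assumption about how the field is applied, not something confirmed from the CleanLinks source; `matches_param` is a hypothetical helper.

```python
import re

DEFAULT_FRAGMENT = r"(?:ref|aff)\w*"            # from the default patterns
CUSTOM_FRAGMENT = r"(?:ref(?:er(?:r?er)?)?)\w*"  # the pattern added at the end


def matches_param(fragment, query):
    """Return True if fragment matches a parameter name in the query string.

    Assumption: the fragment is anchored at the start of a parameter name
    and must be followed by "=" (the implied terminator).
    """
    wrapped = r"(?:^|[?&])(?:" + fragment + r")="
    return re.search(wrapped, query) is not None


query = "referrer=https%3A%2F%2Flastpass.com&append=1"

for fragment in (DEFAULT_FRAGMENT, CUSTOM_FRAGMENT, "referrer"):
    print(fragment, matches_param(fragment, query))
```

Under that assumption all three patterns match 'referrer=', so the pattern itself is not what keeps the parameter from being removed.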

With my pattern at the end, CleanLinks still changed the URL to

https://lastpass.com/features_joinpremiumxmarks2.php?lpuser=lastpass%40mailinator.com

I removed the other patterns. I inserted my pattern as the only pattern in Remove From Links. CleanLinks still changed the URL to

https://lastpass.com/features_joinpremiumxmarks2.php?lpuser=lastpass%40mailinator.com

I then inserted

referrer

as the only pattern in Remove From Links. No Regular Expression special characters to clutter up the works. CleanLinks still changed the URL to

https://lastpass.com/features_joinpremiumxmarks2.php?lpuser=lastpass%40mailinator.com

What is going on?

geokis commented 8 years ago

CL has an algorithm to detect nested and encoded links. In your case, CL detects the URL-encoded URL:

https%3A%2F%2Flastpass.com%2Ffeatures_joinpremiumxmarks2.php%3Flpuser%3Dlastpass%2540mailinator.com

and drops the rest of the original link.

The algorithm has a higher priority than the user pattern. The only part you can clean is lpuser=lastpass%40mailinator.com, by using lpuser as the pattern.
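
The ordering described here can be modeled with a small Python sketch. It is not the actual CleanLinks implementation; `find_nested_url` and `clean` are hypothetical names, and the nested-link test (a decoded parameter value starting with http:// or https://) is an assumption made for illustration.

```python
import re
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode


def find_nested_url(url):
    """Return the first query value that decodes to an http(s) URL, else None."""
    for _, value in parse_qsl(urlsplit(url).query, keep_blank_values=True):
        if re.match(r"https?://", value):
            return value
    return None


def clean(url, remove_pattern):
    """Nested-link detection runs first; the user pattern is applied afterwards."""
    nested = find_nested_url(url)
    # If a nested link is found, the outer link (including referrer= and
    # append=) is discarded before the user pattern is ever consulted.
    target = nested if nested is not None else url
    parts = urlsplit(target)
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if not re.fullmatch(remove_pattern, name)
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


url = (
    "https://login.xmarks.com/?referrer=https%3A%2F%2Flastpass.com"
    "%2Ffeatures_joinpremiumxmarks2.php%3Flpuser%3Dlastpass%2540mailinator.com"
    "&append=1"
)

print(clean(url, r"referrer"))  # pattern has no effect on the outer link
print(clean(url, r"lpuser"))    # only the nested link's parameters can be cleaned
```

With this ordering, a pattern of referrer changes nothing, because the parameter it names has already been thrown away together with the outer link.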

ghost commented 8 years ago

CL has an algorithm to detect nested and encoded links

No kidding. And for URLs like

https://login.xmarks.com/?referrer=https%3A%2F%2Flastpass.com%2Ffeatures_joinpremiumxmarks2.php%3Flpuser%3Dlastpass%2540mailinator.com&append=1

the algorithm is wrong.

The algorithm has a higher priority than the user pattern

That is another problem. But that is a design problem. It is not part of the incorrect extraction of the target URL by the internal algorithm.

Leave the filter priority problem for another Issue. Fix the internal algorithm.