Closed Godzil closed 5 years ago
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
I haven’t tried the latest version, but I suspect this issue still exists.
@Godzil didn't mean to close it.
I know, it was so the stale bot would not close it :D
This is a deobfuscated version of /cdn-cgi/scripts/********/cloudflare-static/email-decode.min.js
Here's a gist of the original: https://gist.github.com/pro-src/1819f43cc272d77596156e0410a74c67#file-email-decode-min-js
console.log(decodeEmail('5118151e1d1c1102051403', 0));
// IDOLM@STER
document.querySelectorAll('a').forEach(function (anchor) {
  var prefix = '/cdn-cgi/l/email-protection#';
  var index = anchor.href.indexOf(prefix);
  if (index !== -1) {
    // Update the anchor's href attribute
    console.log('mailto:' + decodeEmail(anchor.href, index + prefix.length));
  }
});
document.querySelectorAll('.__cf_email__').forEach(function (anchor) {
  var hexStr = anchor.getAttribute('data-cfemail');
  // Replace the anchor with a text node that contains the email
  console.log(decodeEmail(hexStr, 0));
});
function decodeEmail (hexStr, start) {
  var email = '', key = parseInt(hexStr.substr(start, 2), 16);
  for (var codePoint, i = start + 2; i < hexStr.length; i += 2) {
    codePoint = parseInt(hexStr.substr(i, 2), 16) ^ key;
    email += String.fromCharCode(codePoint);
  }
  return decodeURIComponent(escape(email));
}
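For use outside a browser, for example in a Node.js scraper, the same decoding can be sketched without the deprecated `escape` and `substr` calls. This is only a sketch under a few assumptions: `decodeCfEmail` is a hypothetical name (not part of cloudscraper or of Cloudflare's script), and Node's `Buffer` is assumed to be available. Unlike the version above, malformed UTF-8 is replaced with U+FFFD rather than throwing a `URIError`.

```javascript
// Hypothetical helper: decodes a Cloudflare-obfuscated hex string.
// The first byte is the XOR key; every following byte is XORed with it.
function decodeCfEmail (hexStr, start = 0) {
  // Buffer.from(..., 'hex') stops at the first non-hex character,
  // so trailing garbage is silently ignored.
  const bytes = Buffer.from(hexStr.slice(start), 'hex');
  if (bytes.length < 2) return '';
  const key = bytes[0];
  // XOR each payload byte with the key, then interpret the result as UTF-8.
  const decoded = Uint8Array.from(bytes.subarray(1), function (b) {
    return b ^ key;
  });
  return Buffer.from(decoded).toString('utf8');
}

console.log(decodeCfEmail('5118151e1d1c1102051403'));
// IDOLM@STER
```

The `start` parameter mirrors the original function, so an `href` value can be passed together with the offset just past the `/cdn-cgi/l/email-protection#` prefix.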
@codemanki We could add a `decodeEmails` boolean option (off by default). If the response content type is `text/html`, update the `href` attributes and replace anchor elements with the decoded emails.
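A rough sketch of what such an option could do on the raw response body, as an illustration of the idea rather than an actual cloudscraper implementation. Everything here is hypothetical: the `processBody`, `decodeEmails`, `decodeHex` and `escapeHtml` names, the `options` shape, and the regular expressions, which assume Cloudflare emits double-quoted `data-cfemail` and `href` attributes on `a` or `span` elements. A real HTML parser would be more robust than regexes.

```javascript
// Hypothetical sketch; none of these functions exist in cloudscraper.

// First byte is the XOR key, the rest is the XOR-encoded UTF-8 text.
function decodeHex (hexStr) {
  const bytes = Buffer.from(hexStr, 'hex');
  if (bytes.length < 2) return '';
  const key = bytes[0];
  const decoded = Uint8Array.from(bytes.subarray(1), function (b) {
    return b ^ key;
  });
  return Buffer.from(decoded).toString('utf8');
}

// Decoded text is inserted back into HTML, so escape the special characters.
function escapeHtml (text) {
  const map = { '&': '&amp;', '<': '&lt;', '>': '&gt;', '"': '&quot;' };
  return text.replace(/[&<>"]/g, function (ch) { return map[ch]; });
}

function decodeEmails (html) {
  return html
    // Replace whole elements carrying data-cfemail with the decoded text.
    .replace(
      /<(a|span)\b[^>]*\bdata-cfemail="([0-9a-fA-F]+)"[^>]*>[\s\S]*?<\/\1>/g,
      function (match, tag, hex) { return escapeHtml(decodeHex(hex)); }
    )
    // Rewrite protected links into mailto: links.
    .replace(
      /href="[^"]*\/cdn-cgi\/l\/email-protection#([0-9a-fA-F]+)"/g,
      function (match, hex) {
        return 'href="mailto:' + escapeHtml(decodeHex(hex)) + '"';
      }
    );
}

// Only touch HTML responses, and only when the option is switched on.
function processBody (body, contentType, options) {
  const enabled = Boolean(options && options.decodeEmails);
  const isHtml = /^text\/html\b/i.test(contentType || '');
  return enabled && isHtml ? decodeEmails(body) : body;
}

const sample = '<p><a href="/cdn-cgi/l/email-protection" class="__cf_email__" ' +
  'data-cfemail="5118151e1d1c1102051403">[email&#160;protected]</a></p>';
console.log(processBody(sample, 'text/html; charset=utf-8', { decodeEmails: true }));
// <p>IDOLM@STER</p>
```

Keeping the option off by default means existing users get the untouched body unless they opt in.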
No, I’m not a spammer. Their “email protection” works so badly that it matches text that is not a valid email address at all, and it broke some of my scraping, which causes an issue for the project I’m working on.
On Crunchyroll, they have activated Cloudflare's "email protection", and when I try to scrape some of their webpages I get "[email protected]" instead of the expected text.
Example: http://www.crunchyroll.com/the-idolmster-cinderella-girls-theater#
On that webpage, "DOLM@STER" is treated as an email address by Cloudflare (which means their regexp is fabulously wrong), and it screws up my scraping because I'm not expecting the ton of JavaScript they insert for that "protection".
It would be nice if cloudscraper could detect these insertions and parse them the same way it does for the browser detection.