Email protection is not parsed by cloudscraper

Godzil commented 7 years ago

Crunchyroll has activated Cloudflare's "email protection", and when I try to scrape some of their webpages I get "[email protected]" instead of the expected text.

Example: http://www.crunchyroll.com/the-idolmster-cinderella-girls-theater#

On that webpage, "IDOLM@STER" is treated as an email address by Cloudflare (which means their regexp is badly wrong), and it breaks my scraping because I'm not expecting all the JavaScript they insert for that "protection".

It would be nice if cloudscraper could detect these insertions and handle them the same way it handles the browser detection.

stale[bot] commented 5 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

Godzil commented 5 years ago

I haven’t tried the latest version, but I suspect this issue still exists.

codemanki commented 5 years ago

@Godzil didn't mean to close it.

Godzil commented 5 years ago

I know, it was for the stale bot, so that it would not close it :D

ghost commented 5 years ago

This is a deobfuscated version of /cdn-cgi/scripts/********/cloudflare-static/email-decode.min.js. Here's a gist of the original: https://gist.github.com/pro-src/1819f43cc272d77596156e0410a74c67#file-email-decode-min-js

console.log(decodeEmail('5118151e1d1c1102051403', 0));
// IDOLM@STER

document.querySelectorAll('a').forEach(function (anchor) {
  var prefix = '/cdn-cgi/l/email-protection#';
  var index = anchor.href.indexOf(prefix);
  if (index !== -1) {
    // Update the anchor's href attribute
    console.log('mailto:' + decodeEmail(anchor.href, index + prefix.length));
  }
});

document.querySelectorAll('.__cf_email__').forEach(function (anchor) {
  var hexStr = anchor.getAttribute('data-cfemail');
  // Replace the anchor with a text node that contains the email
  console.log(decodeEmail(hexStr, 0));
});

function decodeEmail (hexStr, start) {
  var email = '', key = parseInt(hexStr.substr(start, 2), 16);

  for (var codePoint, i = start + 2; i < hexStr.length; i += 2) {
    codePoint = parseInt(hexStr.substr(i, 2), 16) ^ key;
    email += String.fromCharCode(codePoint);
  }

  return decodeURIComponent(escape(email));
}

@codemanki We could add a decodeEmails boolean (off by default) option. If the response content type is text/html, update the href attributes and replace anchor elements with the decoded emails.
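
A rough sketch of what such an option might do on the Node side, where there is no DOM to query. The helper names decodeCfEmail and decodeCfEmailsInHtml are hypothetical and not part of cloudscraper. The regexes assume the markup matches what the browser script above looks for (a data-cfemail attribute on the placeholder element, and a relative /cdn-cgi/l/email-protection# href), so a real implementation would probably want a proper HTML parser instead.

```javascript
// Sketch only: these helpers are hypothetical, not cloudscraper API.

// The first byte of the hex string is the XOR key; each following byte is
// one UTF-8 byte of the original text XORed with that key.
function decodeCfEmail(hexStr, start = 0) {
  const key = parseInt(hexStr.slice(start, start + 2), 16);
  const bytes = [];
  for (let i = start + 2; i + 1 < hexStr.length; i += 2) {
    bytes.push(parseInt(hexStr.slice(i, i + 2), 16) ^ key);
  }
  // Buffer does the UTF-8 decoding that the browser script gets from
  // decodeURIComponent(escape(...)).
  return Buffer.from(bytes).toString('utf8');
}

// Escape decoded text before putting it back into HTML.
function escapeHtml(text) {
  return text
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');
}

// Assumed markup: <tag ... data-cfemail="HEX">placeholder</tag>
const PLACEHOLDER_RE =
  /<(\w+)\b[^>]*\bdata-cfemail="([0-9a-fA-F]+)"[^>]*>[\s\S]*?<\/\1>/g;
// Assumed markup: href="/cdn-cgi/l/email-protection#HEX" (relative URL)
const HREF_RE = /\/cdn-cgi\/l\/email-protection#([0-9a-fA-F]+)/g;

function decodeCfEmailsInHtml(html) {
  return html
    // Replace each placeholder element with the decoded text.
    .replace(PLACEHOLDER_RE, (match, tag, hex) => escapeHtml(decodeCfEmail(hex)))
    // Rewrite protected links into mailto: links.
    .replace(HREF_RE, (match, hex) => 'mailto:' + escapeHtml(decodeCfEmail(hex)));
}

const sample =
  '<p>THE <span class="__cf_email__" data-cfemail="5118151e1d1c1102051403">' +
  '[email&#160;protected]</span></p>';
console.log(decodeCfEmailsInHtml(sample));
// <p>THE IDOLM@STER</p>
```

A caller would only run this when the response content type is text/html and the option is enabled, so existing users see no change in behavior.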

Godzil commented 5 years ago

No, I’m not a spammer. Their “email protection” works so badly that it picks up text that is not a valid email address at all, and it broke some of my scraping, which causes issues for the project I’m working on.

codemanki / cloudscraper

Email protection is not parsed by cloudscraper #41